dbuf_whichblock() is not made to handle offsets beyond the block
end for single-block objects. Handle it in dmu_evict_range(),
similar to dmu_prefetch_by_dnode().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18399Closes#18489
For now make it only evict the specified data from the dbuf cache.
Even though dbuf cache is small, this may still reduce eviction of
more useful data from there, and slightly accelerate ARC evictions
by making the blocks there evictable a bit sooner.
On FreeBSD this also adds support for POSIX_FADV_NOREUSE, since the
kernel translates it into POSIX_FADV_DONTNEED after every read/write.
This is not as efficient as it could be for ZFS, but that is the only
way FreeBSD kernel allows to handle POSIX_FADV_NOREUSE now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18399
- For multilevel gang blocks it seemed possible to fallback from
normal to special class, since they don't have proper object type,
and DMU_OT_NONE is a "metadata". They should never fallback.
- Fix possible inversion with zfs_user_indirect_is_special = 0,
when indirects written to normal vdev, while small data to special.
Make small indirect blocks also follow special_small_blocks there.
- With special_small_blocks now applying to both files and ZVOLs,
make it apply to all non-metadata without extra checks, since there
are no other non-metadata types.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18208
When the path argument to "zfs list -Ho name <path>" (or any caller of
zfs_path_to_zhandle()) is a symlink that crosses a mount boundary, the
wrong dataset is returned. Instead of returning the dataset that owns
the symlink's target, getextmntent() matches the dataset containing the
symlink itself.
For example, given two ZFS datasets "tank/ds1" and "tank/ds2", and a
symlink "/tank/ds1/link" pointing into "/tank/ds2":
$ sudo zfs list -Ho name /tank/ds1/link
tank/ds1
The expected (and previous) behavior is to return "tank/ds2", since the
symlink's target resides in that dataset.
The problem is in getextmntent(), in lib/libspl/os/linux/mnttab.c. That
function calls statx() on the caller-supplied path to obtain its mnt_id
(used to match against the mnt_id of each entry in /proc/self/mounts),
and it passes AT_SYMLINK_NOFOLLOW to that statx() call. As a result,
the mnt_id returned reflects the symlink's location rather than the
symlink target's mount, and the wrong /proc/self/mounts entry is
matched.
The same function also calls stat64() on the caller-supplied path
(used as a fallback when STATX_MNT_ID is not available, and to populate
the statbuf out-parameter). stat64() always follows symlinks, so the
statx() and stat64() calls were inconsistent: one resolved the symlink,
the other didn't. The AT_SYMLINK_NOFOLLOW behavior may be appropriate
when statx() is called on a mount entry from /proc/self/mounts (which
is always a real directory), but it is wrong for caller-supplied paths,
which may be symlinks.
This bug was introduced by 523d9d6007 ("Validate mountpoint on
path-based unmount using statx"), which added the STATX_MNT_ID code
path. However, the bug was latent: config/user-statx.m4 omitted
"#define _GNU_SOURCE" when checking for STATX_MNT_ID in <sys/stat.h>,
so HAVE_STATX_MNT_ID was never defined, and the buggy statx() path was
never compiled in. getextmntent() always fell back to the dev_t
comparison via stat64(), which correctly follows symlinks.
The fix to that autoconf check, in 2b930f63f8 ("config: fix
STATX_MNT_ID detection"), caused HAVE_STATX_MNT_ID to be properly
defined on kernels that support it, activating the broken
AT_SYMLINK_NOFOLLOW path for the first time and exposing the
regression.
The fix is to drop AT_SYMLINK_NOFOLLOW from the statx() call so that
symlinks are followed, matching the behavior of stat64() on the same
path.
Verified with a minimal reproducer: created two ZFS datasets, placed a
symlink inside the first pointing into the second, and confirmed that
"zfs list -Ho name <symlink>" returns the dataset containing the
symlink's target rather than the dataset containing the symlink.
Signed-off-by: Prakash Surya <prakash.surya@perforce.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
The spa_sync thread waits on ->spa_txg_zio and will set ZIO_WAIT_DONE
before running the sync tasks. The dmu_tx_commit() call must be done
after we add the child zio to the ->spa_txg_zio parent otherwise its
possible the child is added after txg_sync has waited.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18276
Remove redundant dsl_pool variable and duplicate spa_get_dsl()
call in vdev_rebuild_thread.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes#18263
Update freebsd15-0s builder to freebsd15-1s and point it at the
15.1-PRERELEASE tag. The previous freebsd-15.0-STABLE images are
no longer available.
Additionally, add a freebsd15-0r stanza for the RELEASE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
- Add Fedora 44 to CI tests
- Fix build issues from the newer compiler. These are mostly 'char *'
to 'const char *' conversions.
- Fix threadsappend.c test waiting for the same thread TID twice.
This caused the test to hang on F44 (but strangely not other OSs?)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18478
The d_u union introduced in 3.18 is now anonymous, so we need to detect
it and decide the right way to name d_alias.
Note that we used to have support for both names to support kernels
before 3.18, so this commit is effectively reverting the commit that
removed that support, efc293e371.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18471
Only call txg_wait_synced() when rebuild IOs were issued for this
metaslab. This is a small optimization since in practice the first
metaslab is very likely to have allocations and cause vr_last_txg
to be initialized. After this point when processing empty metaslabs
txg_wait_synced() is called but with an already committed txg so it
will not wait. Still it's better not to call txg_wait_synced() at
all when it's not needed.
Reviewed-by: Andriy Tkachuk <atkachuk@wasabi.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18482
Currently, after rebuild (aka sequential resilver), checksum
errors can be seen sometimes on the spare vdev or draid spare.
On my laptop, it happens from 2 to 4 times of running
redundancy_draid_spare1 test in a loop for 100 times.
It looks like there's a race in vdev_rebuild_thread() when the
rebuild of space map ranges is finished and we re-enable
allocations from the metaslab too soon: a new allocations may
happen from that metaslab before txg with the rebuilt ranges is
sync-ed, causing undesirable interference.
Solution: wait for the txg to be sync-ed before enabling metaslab.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com>
Closes#18307Closes#18319Closes#18473
When sequentially resilvering a dRAID pool it's possible that a few
correctable checksum errors will be reported. This is a known issue
which is occasionally observed in the CI. Until it's resolved we
want the test case to tolerate a few checksum errors in this scenario
to prevent false positives in the CI.
This change also has the additional side effect of standardizing in
one location how the dRAID pool integrity is verified.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #18307
Issue #18319Closes#18436
Automake's default tar formats (v7 pre-1.18, ustar since) impose path
length limits that drop several long test filenames from the release
tarball when `make dist` runs. Pax format has no such limit and is
read by GNU tar 1.14+ and libarchive/bsdtar.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes: #17276Closes: #18465
- We've seen occasional 'ERROR 502: Bad Gateway' from the runner trying
to download an image with axel. Axel can open multiple connections for
a faster download, so maybe that's causing problems. This commit adds
in a fallback to curl if the axel download doesn't work.
- Update merge_summary.awk to print out killed tests in the summary.
We've seen cases where the summary page was red but there were no test
failures printed. This is because one of the VMs had too may
killed tests, which caused the total test time to run too long and
caused the runner to timeout qemu-6-test.sh. When the runner kills off
qemu-6-tests.sh, it means we never generate the nice summary page
for that VM listing the killed off tests. This commit parses the
partial test logs for killed off tests and includes them in the
merge_summary.awk output.
- Print an error message in the summary page if one of the VMs
didn't complete ZTS. This helps draw attention to a VM crash.
- FreeBSD sometimes has broken links to their CI image. When that
happens, select the newest nightly snapshot image as an alternative.
This is needed right now, since the current images in the FreeBSD 16
"current/" directory are returning 404 errors.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18460
Fix a bug where an cgroup-OOM-killed process can cause a panic:
usercopy: Kernel memory exposure attempt detected from vmalloc (offset
1007584, size 217120)!
kernel BUG at mm/usercopy.c:102!
This was caused by zfs_uiomove() not correctly returning EFAULT
for short copies.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15918Closes#18408
dmu_write_direct_done() passes dmu_sync_arg_t to
dmu_sync_done(), which updates the override state and
frees the completion context. The Direct I/O error path
then still dereferences dsa->dsa_tx while rolling the
dirty record back with dbuf_undirty(), resulting in a
use-after-free.
Save dsa->dsa_tx in a local variable before calling
dmu_sync_done() and use that saved tx for the error
rollback. This preserves the existing ownership model
for dsa and does not change the Direct I/O write
semantics.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: gality369 <gality369@example.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Closes#18440
Switch to incremental range tree processing in dnode_sync() to avoid
unsafe lock dropping during zfs_range_tree_walk(). This also ensures
the free ranges remain visible to dnode_block_freed() throughout the
sync process, preventing potential stale data reads.
This patch:
- Keeps the range tree attached during processing for visibility.
- Processes segments one-by-one by restarting from the tree head.
- Uses zfs_range_tree_clear() to safely handle ranges that may have
been modified while the lock was dropped.
- adds ASSERT()s to document that we don't expect dn_free_ranges
modification outside of sync context.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Issue #18186Closes#18235
zfs_range_tree_remove_impl() used a bare panic() when a segment to be
removed was not completely overlapped by an existing tree entry. Every
other consistency check in range_tree.c uses zfs_panic_recover(), which
respects the zfs_recover tunable and allows pools with on-disk
corruption to be imported and recovered. This one call was
inconsistent, making the partial-overlap case unrecoverable regardless
of zfs_recover.
Replace panic() with zfs_panic_recover() so that operators can set
zfs_recover=1 to import a corrupted pool and reclaim data, consistent
with all other range tree error paths.
Related-to: https://github.com/openzfs/zfs/issues/13483
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes#18255
The Makefile.am files from libshare, libtpool, libunicode, and libuutil
do not have SPDX lines. This is because those Makefiles only got SPDX
lines after the big Makefile merge in commits like 309006a0c and
0d44b58d7 (which have not been ported to this branch). Add the
Makefiles to the whitelist here so spdxcheck.pl passes.
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
When copy_file_range overwrites a recent truncation, subsequent reads
can incorrectly determine that it is read hole instead of reading the
cloned blocks.
This can happen when the following conditions are met:
- Truncate adds blkid to dn_free_ranges
- A new TXG is created
- copy_file_range calls dmu_brt_clone which override the block pointer
and set DB_NOFILL
- Subsequent read, given DB_NOFILL, hits dbuf_read_impl and
dbuf_read_hole
- dbuf_read_hole calls dnode_block_freed, which returns TRUE because the
truncated blkids are still in dn_free_ranges
This will not happen if the clone and truncate are in the same TXG,
because the block clone would update the current TXG's dn_free_ranges,
which is why this bug only triggers under high IO load (such as
compilation).
Fix this by skipping the dnode_block_freed call if the block is
overridden. The fix shouldn't cause an issue when the cloned block is
subsequently freed in later TXGs, as dbuf_undirty would remove the
override.
This requires a dedicated test program as it is much harder to trigger
with scripts (this needs to generate a lot of I/O in short period of
time for the bug to trigger reliably).
Assisted-by: Gemini:gemini-3.1-pro
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gary Guo <gary@kernel.org>
Closes#18412Closes#18421
zfsctl_snapshot_mount() holds z_teardown_lock(R) across
call_usermodehelper(), which spawns a mount process that needs
namespace_sem(W) via move_mount. Reading /proc/self/mountinfo holds
namespace_sem(R) and needs z_teardown_lock(R) via zpl_show_devname.
When zfs_suspend_fs (from zfs recv or zfs rollback) queues
z_teardown_lock(W), the rrwlock blocks new readers, completing the
deadlock cycle.
Fix by releasing z_teardown_lock(R) after gathering the dataset name
and mount path, before any blocking operation. Everything after the
release operates on local string copies or uses its own
synchronization. The parent zfsvfs pointer remains valid because the
caller holds a path reference to the automount trigger dentry.
Releasing the lock allows zfs_suspend_fs to proceed concurrently
with the mount helper, so dmu_objset_hold in zpl_get_tree can
transiently fail with ENOENT during the clone swap. The mount
helper fails, EISDIR is returned, and the VFS falls back to the
ctldir stub (empty directory) until the next access retries.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18415
When getzfsvfs() succeeds (incrementing s_active via
zfs_vfs_ref()), but z_unmounted is subsequently found to
be B_TRUE, zfsvfs_hold() returns EBUSY without calling
zfs_vfs_rele(). This permanently leaks the VFS superblock
s_active reference, preventing generic_shutdown_super()
from ever firing, which blocks dmu_objset_disown() and
makes the pool permanently unexportable (EBUSY).
Add the missing zfs_vfs_rele() call, guarded by
zfs_vfs_held() to handle the zfsvfs_create() fallback
path where no VFS reference exists. This matches the
existing cleanup pattern in zfsvfs_rele().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Closes#18309Closes#18310
- Remove line where we disable stdout at the end of qemu-1-setup.sh
- Fix comment switching the 2x75GB -> 1x150GB cases
- Add some more debug to the end of the script
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18441
When a VM fails to launch or is unreachable the qemu-7-prepare.sh
script will fail to collect the artifacts due to the missing vm*
directories. We want to collect as much diagnostic information as
possible, when missing create the directory to allow the subsequent
steps to proceed normally. Additionally, we don't want to fail
if the /tmp/summary.txt file is missing.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18438
We've seen some qemu-1-setup failures while trying to change the
runner's block device scheduler value to 'none':
We have a single 150GB block device
Setting up swapspace version 1, size = 16 GiB (17179865088 bytes)
no label, UUID=7a790bfe-79e5-4e38-b208-9c63fe523294
tee: '/sys/block/s*/queue/scheduler': No such file or directory
Luckily, we don't need to set the scheduler anymore on modern kernels:
https://github.com/openzfs/zfs/issues/9778#issuecomment-569347505
This commit just removes the code that sets the scheduler.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18437
Update the META file to reflect compatibility with the 7.0
kernel.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18435
Replace semicolons with && so build failures are not masked by the
subsequent lockfile cleanup. Use trap to ensure the lockfile is
removed on both success and failure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes#18206Closes#18424
Currently, when more than nparity disks get faulted during the
rebuild, only first nparity disks would go to faulted state, and
all the remaining disks would go to degraded state. When a hot
spare is attached to that degraded disk for rebuild creating the
spare mirror, only that hot spare is getting rebuilt, but not the
degraded device. So when later during scrub some other attached
draid spare happens to map to that spare, it will end up with
cksum error.
Moreover, if the user clears the degraded disk from errors, the
data won't be resilvered to it, hot spare will be detached almost
immediately and the data that was resilvered only to it will be
lost.
Solution: write to all mirrored devices during rebuild, similar
to traditional/healing resilvering, but only if we can verify
the integrity of the data, or when it's the draid spare we are
writing to, in which case we are writing to a reserved spare
space, and there is no danger to overwrite any good data.
The argument that writing only to rebuilding draid spare vdev is
faster than writing to normal device doesn't hold since, at a
specific offset being rebuilt, draid spare will be mapped to a
normal device anyway.
redundancy_draid_degraded2 automation test is added also to
cover the scenario.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com>
Closes#18414
The GH artifacts action now lets you disable auto-zipping your
artifacts. Previously, GH would always automatically put your
artifacts in a ZIP file. This is annoying when your artifacts
are already in a tarball.
Also update the following action versions
checkout: v4 -> v6
upload-artifact: v4 -> v7
download-artifact: v4 -> v8
Lastly, fix a issue where zfs-qmeu-packages now needs to power
cycle the VM.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18411
ztest can enable and disable the multihost property when testing.
This can result in a failure when attempting to import an existing
pool when multihost=on but no /etc/hostid file exists. Update the
workflow to use zgenhostid to create /etc/hostid when not present.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18413
When sequentially resilvering allow a dRAID child to be read
as long as the DTLs indicate it should have a good copy of the
data and the leaf isn't being rebuilt. The previous check was
slightly too broad and would skip dRAID spare and replacing
vdevs if one of their children was being replaced. As long
as there exists enough additional redundancy this is fine, but
when there isn't this vdev must be read in order to correctly
reconstruct the missing data.
A new test case has been added which exhausts the available
redundancy, faults another device causing it to be degraded,
and then performs a sequential resilver for the degraded device.
In such a situation enough redundancy exists to perform the
replacement and a scrub should detect no checksum errors.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18405
Similar to FreeBSD stop issuing prefetches on POSIX_FADV_SEQUENTIAL.
It should not have this semantics, only hint speculative prefetcher,
if access ever happen later. Instead after POSIX_FADV_WILLNEED
handling call generic_fadvise(), if available, to do all the generic
stuff, including setting f_mode in struct file, that we could later
use to control prefetcher as part of read/write operations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18395
Free 35GB of unused files, mostly from unused development environments.
This helps with the out of disk space problems we were seeing on
FreeBSD runners.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18400
A cleanup of opportunity. Since we already are modifying the contents of
zfs_mnt_t, we've broken any API guarantee, so we might as well go the
rest of the way and get rid of it, and just pass the osname and/or the
vfs_t directly.
It seems like zfs_mnt_t was never really needed anyway; it was added in
1c2555ef92 (March 2017) to minimise the difference to illumos, but
zfs_vfsops was made platform-specific anyway in 7b4e27232d.
We also remove setting SB_RDONLY on the caller's flags when failing a
read-write remount on a read-only snapshot or pool. Since 0f608aa6ca
the caller's flags have been a pointer back to fc->sb_flags, which are
discarded without further ceremony when the operation fails, so the
change is unnecessary and we can simplify the call further.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
Before Linux 5.8 (include RHEL8), a fixed set of "forbidden" options
would be rejected outright. For those, we work around it by providing
our own option parser to avoid the codepath in the kernel that would
trigger it.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
Adds zpl_parse_param and wires it up to the fs_context. This uses the
kernel's standard mount option parsing infrastructure to keep the work
we need to do to a minimum. We simply fill in the vfs_t we attached to
the fs_context in the previous commit, ready to go for the mount/remount
call.
Here we also document all the options we need to support, and why. It's
a lot of history but in the end the implementation is straightforward.
Finally, if we get SB_RDONLY on the proposed superblock flags, we record
that as the readonly mount option, because we haven't necessarily seen a
"ro" param and we still need to know for remount, the `readonly` dataset
property, etc.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
vfs_t is initially just parameters for the mount or remount operation,
so match them to the lifetime of the fs_context that represents that
operation.
When we actually execute the operation (calling .get_tree or .reconfigure),
transfer ownership of those options to the associated zfsvfs_t.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
We're working to replace this, and its easier to drop it outright while
we get set up.
To keep things compiling, the calls to zfsvfs_parse_options() are
replaced with zfsvfs_vfs_alloc(), though without any option parsing at
all nothing will work. That's ok, next commits are working towards it.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
In a few commits, we're going to need to allocate and free vfs_t from
zpl_super.c as well, so lets keep them uniform.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
Currently, it's possible that draid vdev asize would decrease
after disks replacements when the disk size is a little less than
all other disks in the pool. In such situations, import would
fail on this check in vdev_open():
/*
* Make sure the allocatable size hasn't shrunk too much.
*/
if (asize < vd->vdev_min_asize) {
vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_BAD_LABEL);
return (SET_ERROR(EINVAL));
}
Solution: fix vdev_draid_min_asize() so that it would round up
the required minimal disk capacity to the VDEV_DRAID_ROWHEIGHT.
This would refuse replacements with the disks whose size is less
than minimally required to avoid draid asize decrement.
Note: we also use VDEV_DRAID_ROWHEIGHT in vdev_draid_open() when
calculating asize, and thats why we need to round up min_size at
vdev_draid_min_asize() to avoid asize drops.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes#18380
Normally, kernel gives any LSM registering a `sb_eat_lsm_opts` hook a
first look at mount options coming in from a userspace mount request.
The LSM may process and/or remove any options. Whatever is left is
passed to the filesystem.
This is how the dataset properties `context`, `fscontext`, `defcontext`
and `rootcontext` are used to configure ZFS mounts for SELinux. libzfs
will fetch those properties from the dataset, then add them to the mount
options.
In 0f608aa6ca (#18216) we added our own mount shims to cover the loss of
the kernel-provided ones. It turns out that if a filesystem provides a
`.parse_monolithic callback`, it is expected to do _all_ mount option
parameter processing - the kernel will not get involved at all. Because
of that, LSMs are never given a chance to process mount options. The
`context` properties are never seen by SELinux, nor are any other
options targetting other LSMs.
Fix this by calling `security_sb_eat_lsm_opts()` in
`zpl_parse_monolithic()`, before we stash the remaining options for
`zfs_domount()`.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18376
This function was removed in c6442bd3b6: "Removing old code outside
of 4.18 kernsls", but fails at present on PowerPC builds due to the
recent inclusion of 6bc9c0a90522: "powerpc: fix KUAP warning in VMX
usercopy path" in the upstream kernel, which introduces a use of
cpu_feature_keys[], which is a GPL-only symbol. Removing the API
check as it doesn't appear necessary.
Signed-off-by: John Cabaj <john.cabaj@canonical.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Do a ZFS build inside of an ARM runner. This only does a simple
build, it does not run the test suite. The build runs on the
runner itself rather than in a VM, since nesting is not supported on
Github ARM runners.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18343
Allow restricting ZTS OS targets by setting the vars.ZTS_OS_OVERRIDE
repository variable (e.g. '["debian13"]') to reduce shared runner
contention when running the full OS matrix is unnecessary. When unset,
the existing ci_type-based OS selection is used unchanged.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18342
Target of opportunity; with no other callers, there's no need for it to
be a static function.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
Target of opportunity; with no other callers, there's no need for it to
be a static function.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
With the old API gone, there's no need to massage new-style calls into
its shape and call another function; we can just make those handlers
work directly.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
Removing the HAVE_FS_CONTEXT gates and anything that would be used if it
wasn't set.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
It turns out the kernel can also take directory leases, most notably in
the NFS server. Without a setlease handler on the directory file ops,
attempts to open a directory over NFS can fail with EINVAL.
Adding a directory setlease handler was missed in 168023b603. This fixes
that, allowing directories to be properly accessed over NFS.
Sponsored-by: TrueNAS
Reported-by: Satadru Pramanik <satadru@gmail.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Observed again in the CI. Put the maybe exception back in place
and reference a newly created issue for this sporadic failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18320
Update the redundancy_draid_spare1 exception to reference an issue
which describes the failure.
Remove the exception for the redundancy_draid_spare3 test. I have
not observed it in local testing. If it reproduces in the CI we
can create a new issue for it and put back the exception.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18308
statx(2) requires _GNU_SOURCE to be defined in order for sys/stat.h to
produce a definition for struct statx and the STATX_* defines. We get
that at compile time because we pass -D_GNU_SOURCE through to
everything, but in the configure check we aren't setting _GNU_SOURCE, so
we don't find STATX_MNT_ID, and so don't set HAVE_STATX_MNT_ID.
(This was fine before ccf5a8a6fc, because linux/stat.h does not require
_GNU_SOURCE).
Simple fix: in the check, define _GNU_SOURCE before including
sys/stat.h.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18312
Currently, when there there are several faulted disks with attached
dRAID spares, and one of those disks is cleared from errors (zpool
clear), followed by its spare being detached, the data in all the
remaining spares that were attached while the cleared disk was in
FAULTED state might get corrupted (which can be seen by running scrub).
In some cases, when too many disks get cleared at a time, this can
result in data corruption/loss.
dRAID spare is a virtual device whose blocks are distributed among
other disks. Those disks can be also in FAULTED state with attached
spares on their own. When a disk gets sequentially resilvered (rebuilt),
the changes made by that resilvering won't get captured in the DTL
(Dirty Time Log) of other FAULTED disks with the attached spares to
which the data is written during the resilvering (as it would normally
be done for the changes made by the user if a new file is written or
some existing one is deleted). It is because sequential resilvering
works on the block level, without touching or looking into metadata,
so it doesn't know anything about the old BPs or transactions groups
that it is resilvering. So later on, when that disk gets cleared
from errors and healing resilvering is trying to sync all the data
from its spare onto it, all the changes made on its spare during the
resilvering of other disks will be missed because they won't be
captured in its DTL. That's why other dRAID spares may get corrupted.
Here's another way to explain it that might be helpful. Imagine a
scenario:
1. d1 fails and gets resilvered to some spare s1 - OK.
2. d2 fails and gets sequentially resilvered on draid spare s2. Now,
in some slices, s2 would map to d1, which is failed. But d1 has s1
spare attached, so the data from that resilvering goes to s1, but
not recorded in d1's DTL.
3. Now, d1 gets cleared and its s1 gets detached. All the changes
done by the user (writes or deletions) have their txgs captured
in d1's DTL, so they will be resilvered by the healing resilver
from its spare (s1) - that part works fine. But the data which
was written during resilvering of d2 and went to s1 - that one
will be missed from d1's DTL and won't get resilvered to it. So
here we are:
4. s2 under d2 is corrupted in the slices which map to d1, because
d1 doesn't have that data resilvered from s1.
Now, if there are more failed disks with draid spares attached which
were sequentially resilvered while d1 was failed, d3+s3, d4+s4 and
so on - all their spares will be corrupted. Because, in some slices,
each of them will map to d1 which will miss their data.
Solution: add all known txgs starting from TXG_INITIAL to DTLs of
non-writable devices during sequential resilvering so when healing
resilver starts on disk clear, it would be able to check and heal
blocks from all txgs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes#18286Closes#18294
vdev_rebuild() is always called with spa_config_lock held in
RW_WRITER mode. However, when it tries to call dmu_tx_assign()
the latter may hang on dmu_tx_wait() waiting for available txg.
But that available txg may not happen because txg_sync takes
spa_config_lock in order to process the current txg. So we have
a deadlock case here:
- dmu_tx_assign() waits for txg holding spa_config_lock;
- txg_sync waits for spa_config_lock not progressing with txg.
Here are the stacks:
__schedule+0x24e/0x590
schedule+0x69/0x110
cv_wait_common+0xf8/0x130 [spl]
__cv_wait+0x15/0x20 [spl]
dmu_tx_wait+0x8e/0x1e0 [zfs]
dmu_tx_assign+0x49/0x80 [zfs]
vdev_rebuild_initiate+0x39/0xc0 [zfs]
vdev_rebuild+0x84/0x90 [zfs]
spa_vdev_attach+0x305/0x680 [zfs]
zfs_ioc_vdev_attach+0xc7/0xe0 [zfs]
cv_wait_common+0xf8/0x130 [spl]
__cv_wait+0x15/0x20 [spl]
spa_config_enter+0xf9/0x120 [zfs]
spa_sync+0x6d/0x5b0 [zfs]
txg_sync_thread+0x266/0x2f0 [zfs]
The solution is to pass txg returned by spa_vdev_enter(spa)
at the top of spa_vdev_attach() to vdev_rebuild() and call
dmu_tx_create_assigned(txg) which doesn't wait for txg.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes#18210Closes#18258
The autoconf checks are more than enough to decide whether or not we can
work with this kernel or not.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18295
When a namespace property is changed via zfs set, libzfs remounts the
filesystem to propagate the new VFS mount flags. The current approach
uses mount(2) with MS_REMOUNT, which reads all namespace properties
from ZFS and applies them together. This has two problems:
1. Linux VFS resets unspecified per-mount flags on remount. If an
administrator sets a temporary flag (e.g. mount -o remount,noatime),
a subsequent zfs set on any namespace property clobbers it.
2. Two concurrent zfs set operations on different namespace properties
can overwrite each other's mount flags.
Additionally, legacy datasets (mountpoint=legacy) were never remounted
on namespace property changes since zfs_is_mountable() returns false
for them.
Add zfs_mount_setattr() which uses mount_setattr(2) to selectively
update only the mount flags that correspond to the changed property.
For legacy datasets, /proc/mounts is iterated to update all
mountpoints. On kernels without mount_setattr (ENOSYS), non-legacy
datasets fall back to a full remount; legacy mounts are skipped to
avoid clobbering temporary flags.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18257
Checking for LD_VERSION in unreliable as not all distros define it on
the compiler's preprocessor.
Explicitly check it via autoconf.
This fixes support for Ubuntu 18.04 on arm64.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Closes#18262
This API has been available since kernel 5.2, and having it available
(almost) everywhere should give us a lot more flexibility for mount
management in the future.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18260
The traditional mount API has been removed, so detect when its not
available and instead use a small adapter to allow our existing mount
functions to keep working.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18216
Kernel devs noted that almost all callers to posix_acl_to_xattr() would
check the ACL value size and allocate a buffer before make the call. To
reduce the repetition, they've changed it to allocate this buffer
internally and return it.
Unfortunately that's not true for us; most of our calls are from
xattr_handler->get() to convert a stored ACL to an xattr, and that call
provides a buffer. For now we have no other option, so this commit
detects the new version and wraps to copy the value back into the
provided buffer and then free it.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18216
It does exactly the same thing, just inverts the return. Detect its
presence or absence and call the right one.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18216
On systems where `$kernelsrc` is different than `$kernelbuild`, the
objtool binary will be located in `$kernelbuild` as it's the result of
running `make prepare` during kernel build.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Louis Leseur <louis.leseur@gmail.com>
Closes#18248Closes#18249
The upcoming 7.0 kernel will no longer fall back to generic_setlease(),
instead returning EINVAL if .setlease is NULL. So, we set it explicitly.
To ensure that we catch any future kernel change, adds a sanity test for
F_SETLEASE and F_GETLEASE too. Since this is a Linux-specific test,
also a small adjustment to the test runner to allow OS-specific helper
programs.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18215
Currently, spa_dspace (base to calculate dataset AVAIL) only includes
the normal allocation class capacity, but dd_used_bytes tracks space
allocated across all classes. Since we don't want to report free
space of other classes as available (we can't promise new allocations
will be able to use it), report only allocated space, similar to how
we report space saved by dedup and block cloning.
Since we need deflated space here, make allocation classes track
deflated allocated space also. While here, make mc_deferred also
deflated, matching its use contexts. Also while there, use
atomic_load() to read the allocation class stats.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18190Closes#18222
ZFS can be built directly into the Linux kernel. Add a test build
of this to the CI to verify it works. The test build is only enabled
on Fedora runners (since they run the newest kernels) and is done in
parallel with ZTS. The test build is done on vm2, since it typically
finishes ~15min before vm1 and thus has time to spare.
In addition:
- Update 'copy-builtin' to check that $1 is a directory
- Fix some VERIFYs that were causing the built-in build to fail
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18234
Linux 6.19 added an AES-GCM VAES-AVX2 assembly implementation. It's
basically a translation from the BoringSSL perlasm syntax to macro
assembly. We're using the same source but the perlasm generated flat
assembly which shares some global function names with the former.
When building in-tree this results in the linker failing due to the
duplicate symbols.
To avoid the error we prepend `icp_` via a macro to our function
names.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Moch <mail@alexmoch.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#18204Closes#18224
Without this patch, the following crash can occur when
a file system is configured with "xattr=dir".
VNASSERT failed: locked not true at
/posix-acl/freebsd-rdma/sys/kern/vfs_subr.c:5786 (assert_vop_locked)
hold count flags ()
flags ()
lock type zfs: UNLOCKED
panic: zfs_dirent_lookup: vnode is not locked but should be
cpuid = 3
time = 1770520763
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b
vpanic() at vpanic+0x136/frame 0xfffffe00914c8270
panic() at panic+0x43/frame 0xfffffe00914c82d0
assert_vop_locked() at assert_vop_locked+0x78
zfs_dirent_lookup() at zfs_dirent_lookup+0x41
zfs_setattr_dir() at zfs_setattr_dir+0x123
zfs_setattr() at zfs_setattr+0x1389
zfs_freebsd_setattr() at zfs_freebsd_setattr+0x56b
VOP_SETATTR_APV() at VOP_SETATTR_APV+0x5d
setfown() at setfown+0xb1
kern_fchownat() at kern_fchownat+0x192
This patch fixes the problem by moving the vput() call for
attrzp to after the zfs_setattr_dir() call that takes it as
an argument.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes: #18188
This ensures that the in-memory state of the feature is recorded and
that `dsl_dataset_activate_feature` is not called when the feature
is already active.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes#18143Closes#18144
To avoid read errors with transaction open dmu_tx_check_ioerr()
is used to read everything required in advance. But there seems
to be a chance for the buffer to evicted from dbuf cache in
between, which result in immediate eviction from ARC, which may
require additional disk read later in a place where error handling
is problematic.
To partially workaround this introduce a new flag DMU_IS_PREFETCH,
relayed to ARC as ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH,
making ARC delay eviction by at least several seconds, or till the
actual read inside the transaction, that will promote it to demand
access.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18160
This change modifies the behavior of spa_sync_time_logger when
flushing the RRD database.
Previously, once the sync interval elapsed, a flush would always
be generated. On solid-state devices, especially when the pool was
otherwise idle, this caused disks to wake up solely to write RRD
data. Since RRD is best-effort telemetry, this behavior is
unnecessary and wasteful.
With this change, spa_sync_time_logger delays flushing until a TXG
that already contains data is being synced. The RRD update is
appended to that TXG instead of forcing the creation of
a new write-only TXG.
During pool export, flushing is forced regardless of whether
the TXG contains user data. At that stage, data durability takes
precedence and a write must be issued.
Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes#18082Closes#18138
When performing an incremental raw send with intermediates (-w -I),
the standard 'send' permission was incorrectly required instead of
allowing 'send:raw'. This was due to a strict boolean comparison on
the 'rawok' flag in zfs_secpolicy_send() with non-boolean value.
This change normalizes the 'rawok' variable to be strictly 0/1 and
updates the test suite to properly verify delegated raw send behavior.
Introduced-by: https://github.com/openzfs/zfs/pull/17543
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Marc Sladek <marc@sladek.dev>
Closes#18198Closes#18193
Wait for scrub_finish (as the comments in the code suggest) rather
than trim_finish in zed_synchronous_zedlet.ksh. This seems to
workaround the ZTS failures in #18192. Also, fix some typos.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18192Closes#18196
The Lustre filessytem calls a number of exported ZFS functions. Do a
test build on the Almalinux runners to make sure we're not breaking
Lustre. We do the Lustre build in parallel with the normal ZTS test
for efficiency, since ZTS isn't very CPU intensive. The full Lustre
build takes around 15min when run on its own.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18161
Because the `strerror` result doesn't include a newline, we need to add
one. Observed on a minimal system that doesn't have `man` installed,
which behaves like this before the fix:
```
[root@upper tim]# zpool help import
couldn't run man program: No such file or directory[root@upper tim]#
```
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Hatch <tim@timhatch.com>
Closes#18183
- mmp_concurrent_import: added test case to verify that concurrent
import correctness. The pool may only be imported once.
- mmp_exported_import: an activity check is now required for pools
which were cleanly exported if the system and pool hostids don't
match.
- mmp_inactive_import: an activity check is now required for any
pool which wasn't cleanly exported, even if the system and pool
hostids match.
- mmp_on_uberblocks: updated expected uberblocks to take in to account
the value MMP_INTERVAL_DEFAULT is set too.
- mmp_reset_interval: reduce the number of iterations from 10 to 3.
This is sufficient to verify functionality and significantly speeds
up the test.
- mmp_on_uberblocks: adjust the thresholds and increase the runtime
to avoid false positives observed in CI.
- Update tests to use 'zhack action idle' instead of ztest to improve
the reliability of the tests.
- Add additional log_note messages to test cases which have multiple
verification steps to make it clear which portion of a test failed
when reviewing the logs.
- Replace default_setup/cleanup_noexit calls with 'zpool create' and
'zpool destroy' calls to avoid additional unnecessary dataset
creation work.
- Update activity/noactivity check helper functions to use the
ZFS_LOAD_INFO_DEBUG information now available from 'zpool import'
to determine if this activity check ran and why. This is more
reliable in the CI than measuring the runtime.
- Removed all mmp tests from the zts-report.py exceptions list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
In order to reliably test the multihost protection we need two (or more)
systems attempting to import the pool at the same time. Historically, we've
used ztest running in userspace to simulate an active pool and attempted to
import the pool with the kernel modules. This works but ztest is a bit
unwieldy for this and if it crashes for unrelated reasons it can result
in false positives.
All we really need is the pool imported in userspace so the MMP thread is
active and writing out uberblocks. We can extend zhack which already knows
how to import the pool read/write and add an option to leave the pool open
and idle.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Add a -G option to zhack to dump the internal debug buffer on exit.
We were able to use the same code from zdb for this which was nice.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts. This check is
only required when the timing is such that there's no activity
for the the read-only tryimport check to detect. This extra
safety chceck operates as follows:
1. Repeats the following MMP check 10 times:
a. Write out an MMP uberblock with the best txg and a random
sequence id to all primary pool vdevs.
b. Verify a minimum number of good writes such that even if
the pool appears degraded on the remote host it will see
at least one of the updated MMP uberblocks.
c. Wait for the MMP interval this leaves a window for other
racing hosts to make similar modifications which can be
detected.
d. Call vdev_uberblock_load() to determine the best uberblock
to use, this should be the MMP uberblock just written.
e. Verify the txg and random sequeunce number match the MMP
uberblock written in 1a.
2. Restore the original MMP uberblocks. This allows the check
to be performed again if the pool fails to import for an
unrelated reason.
This change also includes some refactoring and minor improvements.
- Never try loading earlier txgs during import when the import
fails with EREMOTEIO or EINTER. These errors don't indicate
the txg is damaged but instead that its either in use on a
remote host or the import was interactively cancelled. No
rewind is also performed for EBADD which can result from a
stale trusted config when doing a verbatim import.
- Refactor the code for consistent logging of the multihost
activity check using spa_load_note() and console messages
indicating when the activity check was trigger and the result.
- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
modification of the sequence number in an uberblock.
- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
set to log to dump to stdout the spa_load_info nvlist returned
during import. This is used by the updated mmp test cases
to determine if an activity check was run and its result.
- Standardize the mmp messages similarly to make it easier to
find all the relevent mmp lines in the debug log.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Tryimport adds a unique prefix to the pool name to avoid name
collisions. This makes it awkward to log user-friendly info
during a tryimport. Add a spa_load_name() function which can
be used to report the unmodified pool name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Move the "Starting import" log message in to the import block so
it's matched with the "Fiinshed importing" debug message.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
For a cleanly exported pools there exists a small window where
both systems may determine it's safe to import the pool and skip
the activity check. Only allow the check to be skipped when the
last imported hostid matches the systems hostid and the pool was
cleanly exported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
The macro 'flush_dcache_page(...)' modifies the page flags, but in Linux
6.18 the type of the page flags changed from 'unsigned long' to the
struct type 'memdesc_flags_t' with a single member 'f' which is the page
flags field.
Signed-off-by: Erik Larsson <catacombae@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Linux upstream commit 56754f0f46f6: "objtool: Rename
--Werror to --werror" did just that, so we should check for
either "--Werror" or "--werror", else the build will fail
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: John Cabaj <john.cabaj@canonical.com>
Closes#18152
Add an Alpine Linux 3.23 runner to the CI chain to run OpenZFS builds
and tests against musl libc.
Currently, zfs_send_sparse is killed after 10 minutes on Alpine, causing
cascading EBUSY failures in the test suite. With zfs_send_sparse
disabled, the ZFS test suite reaches a pass rate of 94.62%.
This commit introduces the required Alpine-specific setup and a small
set of shell and cloud-init compatibility fixes that also apply to
existing Linux runners.
The Alpine runner is not enabled by default and is not executed for new
pull requests.
Sponsored-by: ERNW Research GmbH - https://ernw-research.de/
Signed-off-by: Alexander Moch <amoch@ernw.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Calling realpath(path, buf) can trigger fortified header wrappers that
allocate a PATH_MAX-sized temporary buffer on the stack, exceeding the
4 KiB frame limit on some systems. Use the heap-allocating
realpath(path, NULL) form instead.
Sponsored-by: ERNW Research GmbH - https://ernw-research.de/
Signed-off-by: Alexander Moch <amoch@ernw.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Not sure why this was not caught by CI; perhaps my shellcheck is new
enough to catch more things.
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
The qemu-test-repo-vm.sh script tests installs ZFS from different
repos. Have it test from the new 2.4.x repos as well.
Also add a checkbox to run in "lookup mode". This just does a
quick lookup to see what version is installed in each repo. It does
not do a test install and module load. It only takes 3min to run vs
over an hour for the full version.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18070
Newer versions of `shellcheck` and `checkbashism` finds more than
previous, so fix those.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
The `type` command is an optional feature in POSIX, so shouldn't be
used.
Instead, use `command -v`, which commit
e865e7809e
did, but it missed this file.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
There's no real documenation (which should probably be written!),
so instead document the code the best we can on what's going and
with the mounting of file systems to make future updates easier.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
More code standard changes, where if/then is on different lines.
To have it on the same, or on different lines, can be argued, but
we need to pick one, and try not to mix how to do things.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
The `ZFS_INITRD_ADDITIONAL_DATASETS` variable is used in the initrd
script to boot additional OS file systems besides the root file system.
But it wasn't included as an example in the config files.
The `ZFS_POOL_EXCEPTIONS` *was* included in the example defaults file,
but it was not exported, so not available in the initrd.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
The file `/etc/default/zfs` is already sourced by the `/etc/zfs/zfs-functions`,
so no need to source it again.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
When a pool is degraded, or needs special action, the `zpool import`
(without pool to import) line will report:
```
pool: rpool
id: 01234567890123456789
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
[..]
```
If the import with the pool name fails, it is supposed to try importing
using the pool ID.
However, the script is also getting the `action` line (and probably `scrub:`
if/when that's available):
pool; The pool can be imported using its name or numeric identifier.;config:;
which causes issues on consequent import attempts.
Cleanup the information by rewriting the `sed` command line.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
It's considered good practice to:
1) Wrap the variable name in `{}`.
As in `${variable}` instead of `$variable`.
2) Put variables in `"`.
Also some minor error message tuning.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
In a previous commit (e865e7809e), the
`local` keyword was removed in functions because of bashism.
Removing bashisms is correct, however this could cause variable overwrites,
since several functions use the same variable name.
So this commit make function variables unique in the (now) global name
space.
The problem from the original bug report (see #17963) could not be duplicated,
but it is still sane to make sure that variables stay unique.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#18000
Right now, the -v and -o options for `zpool list` work independently,
but when paired, the -v "wins out" and the -o effect is lost. This
commit fixes that problem.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Shreshth Srivastava <shreshthsrivastava2@gmail.com>
Closes#11040Closes#17839
- For whatever reason, the runner will now startup with either two 75GB
disks or one 150GB disk. Previously the runner was always booting
with two 75GB, but about a quarter of the time it now starts up
with a single 150GB disk. This caused qemu-1-setup.sh to fail
since it expected the two 75GB disks. This commit updates
qemu-1-setup.sh to work with either disk config.
- Remove the watchdog from qemu-1-setup.sh. It didn't turn out to be
useful.
- Remove the timestamps that zfs-qemu.yml added to the qemu-1-setup.sh
output. The timestamps were redundant, since you can already
download timestamped logs from the Github web interface.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18166
The final txgs are used only to clear out any remaining deferred
frees, and we cannot write new data to them. Make sure we do not
try to do so.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes#18139
* Lock db_mtx around arc_release() in dbuf_release_bp()
While this function is called only in sync context, the same buffer
can be touched by dbuf_hold_impl() in open context, creating races.
All other accesses to arc_release() are already protected by db_mtx,
so just take it here too.
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
* Lock db_mtx in sa_byteswap()
While SA code seems protected by sa_lock, there is a back door of
dmu_objset_userquota_get_ids(), that may hold and access the dbuf
without sa_lock, relying only on db_mtx. Taking db_mtx here should
protect both the arc_release() and the data for db_buf.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18146
This option is removed upstream in favour of plain INVARIANTS.
VNASSERT is always defined so I see no reason to use it conditionally.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#18136
The make symbols were never getting forwarded to the correct make
subprocess. As far as I can tell, this has never worked. Either that,
or something has changed in the behavior of make.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes#18131
`zpool create` is supposed to log the command to the new pool’s history,
as a special record that never gets evicted from the ring buffer. but
when you create a pool with `zpool create -t`, no such record is ever
logged (#18102). that bug may be the cause of issues like #16408.
`zpool create -t` (83e9986f6e) and `zpool
import -t` (26b42f3f9d) are both designed
to override the on-disk zpool property `name` with an in-core
“temporary” name, but they work somewhat differently under the hood.
importing with a temporary name sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` in ZFS_IOC_POOL_IMPORT, which tells
spa_write_cachefile() and spa_config_generate() to use the
ZPOOL_CONFIG_POOL_NAME in `spa->spa_config` instead of `spa->spa_name`.
creating with a temporary name permanently(!) sets the internal zpool
property `tname` (ZPOOL_PROP_TNAME) in the `zc->zc_nvlist_src` of
ZFS_IOC_POOL_CREATE, which tells zfs_ioc_pool_create()
(4ceb8dd6fd) and spa_create() to use that
name instead of `zc->zc_name`, then sets `spa->spa_import_flags |=
ZFS_IMPORT_TEMP_NAME` like an import.
but zfsdev_ioctl_common() fails to check for `tname` when saving the
pool name to `zfs_allow_log_key`, so when we call ZFS_IOC_LOG_HISTORY,
we call spa_open() on the wrong pool name and get ENOENT, so the logging
silently fails.
this patch fixes#18102 by checking for `tname` in zfsdev_ioctl_common()
like we do in zfs_ioc_pool_create().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: delan azabani <dazabani@igalia.com>
Closes#18118Closes#18102
Similar to BRT, DDT ZAP can be destroyed by sync context when it
becomes empty. Respectively similar to BRT introduce RW-lock to
protect open context methods from the destruction.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18115
This commit adds support for converting a file handle to its
parent dentry. This is called in exportfs_decode_fh_raw()
when subtree checking is enabled in NFS. Defining this and
handling the expanded filehandles allows the knfsd to succeed
in handling the file handle where it might otherwise fail
with ESTALE when trying to open by filehandle.
A side effect of this change is that name_to_handle_at(2)
and open_by_handle_at(2) now support AT_HANDLE_CONNECTABLE.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Andrew Walker <andrew.walker@truenas.com>
Closes#18099
Long ago, SPL atomics were implemented as a global spinlock over
conventional operations. In 5e9b5d832b (2009-10) they was converted to
proper atomics, with the spinlock retained as a fallback.
The switch to compile with the fallback was later removed in a91258913f
(2018-05), but the code it enabled wasn't. So lets do that.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#18117
On FreeBSD, linking the zfs kernel module with binutils ld 2.44 shows
the following warning:
ld: warning: aesni-gcm-avx2-vaes.o: missing .note.GNU-stack section
implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a
future version of the linker
Some of the `.S` files under `module/icp/asm-x86_64/modes` check whether
to emit the `.note.GNU-stack` section using:
#if defined(__linux__) && defined(__ELF__)
We could add `&& defined(__FreeBSD__)` to the test, but since all other
`.S` files in the OpenZFS tree use:
#ifdef __ELF__
it would seem more logical to use that instead. Any recent ELF platform
should support these note sections by now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes#18119
ZFS send streams include a feature flag DMU_BACKUP_FEATURE_LARGE_BLOCKS
to indicate the presence of large blocks in the dataset. On the sending
side, this flag is included if the `-L` flag is passed to `zfs send`
and the feature is active in the dataset. On the receive side, the
stream is refused if the feature is active in the destination dataset
but the stream does not include the feature flag.
The problem is the feature is only activated when a large block is
born. If a large block has been born in the destination, but never
the source, the send can't work. This can arise when sending streams
back and forth between two datasets.
This commit fixes the problem by always activating the large blocks
feature when receiving a stream with the large block feature flag.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Austin Wise <AustinWise@gmail.com>
Closes#18105
In #17180, we fixed an interesting bug that i believe i hit in one of my
pools, but as far as i can tell, there was no test for it.
this patch adds a regression test for #17180, minimised from my attempts
to reproduce the bug in a way that resembled the history of my pool.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: delan azabani <dazabani@igalia.com>
Closes#18109
For kernel builds on FreeBSD, we redefine `__printf__` to
`__freebsd_kprintf__`, to support FreeBSD kernel printf(9) extensions
with clang.
In OpenZFS various printf related functions are declared with
`__attribute__((format(printf, X, Y)))`, so these won't work with the
above redefinition. With clang 21 and higher, this leads to errors
similar to:
sys/contrib/openzfs/module/zfs/spa_misc.c:414:38: error: passing
'printf' format string where 'freebsd_kprintf' format string is
expected [-Werror,-Wformat]
414 | (void) vsnprintf(buf, sizeof (buf), fmt, adx);
| ^
Since attribute names can always be spelled with leading and trailing
double underscores, rename these instances.
Note that in the FreeBSD base system we usually use `__printflike` from
`<sys/cdefs.h>`, but that does not apply to OpenZFS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes#18095
This commit adds handling for the STATX_CHANGE_COOKIE so that
we can properly surface the ZFS znode sequence to NFS clients via
knfsd.
If knfsd does not have STATX_CHANGE_COOKIE in statx result then
it will synthesize the NFS change_info4 structure and related
change4id values algorithmically based on the ctime value of the
file. Since internally ZFS is using ktime_get_coarse_real_ts64()
for the timestamp calculation here it introduces the possiblity
that the change will not increment the change4id of directories
/ files causing a failure in the client to invalidate its attr
cache (among other things). See RFC 8881 Section 10.8 for
discussion of how clients may implement name and directory
caching.
Notable in this commit is that we are not initializing the
inode->i_version to the znode->z_seq number. The reason for this
is that we're intentionally not setting `SB_I_VERSION`. This
indicates that the filesystem manages its own i_version and
so it is not populated in the generic_fillattr.
The following compares tight loop of setattr over NFSv4
protocol while traching nfsd4_change_attribute.
Before change:
inode, change_attribute
4723, 7590032215978780890
4723, 7590032215978780890
4723, 7590032215978780890
4723, 7590032215982780865
4723, 7590032215982780865
After change:
inode, change_attribute
7602, 7590032992517123951
7602, 7590032992517123952
7602, 7590032992517123953
7602, 7590032992517123954
7602, 7590032992517123955
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Andrew Walker <andrew.walker@truenas.com>
Closes#18097
Scan time limits do not need precision beyond 1ms. Switching
scn_sync_start_time and spa_sync_starttime from gethrtime() to
getlrtime() saves ~3% of CPU time during resilver scan stage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18061
With higher throughput and lower latency of modern devices ZFS can
happily live with pretty short (fractions of a second) TXGs. But
the two decade old multi-second minimal time limits can almost stop
payload writes by extending TXGs beyond dirty data limits of ARC
ability to amortize it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18060
"zdb -r -O pool/dataset obj-id destination" will copy
the file with object-id obj-id to the named destination;
without -O it'll still be interpreted as a pathname.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes#16307
If the file already has more than one block, then the current
block size cannot change. But if the file block size is less
than the maximum block size supported by the file system, and
there are multiple blocks in the file, the current code will
almost always extend the rangelock to its maximum size.
This means that all writes become serialized and even reads
are slowed as they will more often contend with writes. This
commit adjusts the test so that we will not lock the entire
range if there is more than one block in the file already.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Maybee <mark.maybee@perforce.com>
Closes#18046Closes#18064
There were some per I/O logging into dbgmsg in RAIDZ code, that
increased CPU load and wiped useful content out of dbgmsg, for
example during routine disk replacement process. I don't think
we need it to be that verbose.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18059
The first byte of the entry after compression is used for algorithm
and byte order flag. We should decrement when calling compression/
decompression algorithm.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18055
Unlike other ZAP consumers due to compression DDT does not know
how big entry it is reading from ZAP. Due to this it called
zap_length_uint64_by_dnode() and zap_lookup_uint64_by_dnode(),
each of which does full ZAP entry lookup.
Introduction of the combined ZAP method dramatically reduces the
CPU overhead and locks contention at DBUF layer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18048
As was previously done for BRT, avoid holding/releasing DDT ZAP
dnodes for every access. Instead hold the dnodes during all their
life time, never releasing.
While at this, add _by_dnode() interfaces for zap_length_uint64()
and zap_count(), actively used by DDT code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18047
Postponing entry removal from the DDT log in case of hit till later
single-threaded sync stage allows to make ddl_tree stable during
multi-threaded ZIO processing stage. It allows to drop the DDT lock
before the search instead of after, reducing the contention a lot.
Actually ddt_log_update_entry() was already handling the case of
entry present in the active log, so we only need to remove it from
flushing log, if the entry happen to be there.
My tests with parallel 4KB block writes show throughput increase
from 480MB/s (122K blocks/s) to 827MB/s (212K blocks/s), even
though still limited by the global DDT lock contention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18044
Previous code effectively enforced that all async free ZIOs were
_issued_ within the TXG timeout. But they could take forever to
complete, especially if the required metadata were not in ARC.
This patch introduces periodic waits every 2000 ZIOs, which should
give at least somewhat reasonable TXG timings even for single HDD
pools with empty ARC. And makes them complete within half of the
TXG timeout, since we might still need time to sync DDT and BRT.
While there, change zfs_max_async_dedup_frees semantics to include
also clone and gang blocks, which are similar. Bump the default
value from set long ago to be more forgiving to block cloning
(still not having logs and benefiting from large TXGs), now that
we have better working time limits. The limit now is a possible
amount of dirty data produced by BRT updates.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18043
We've observed a number of cases when pool import stuck for many
minutes due to large async destroy trying to load DDT or BRT from
HDD pool. While proper destroy dosage is a separate problem,
lets give import process a chance to complete before that at all.
It may be not enough if there is a lot of ZIL to replay, but that
is harder to cover, since those are in separate syscalls.
Code investigation shown that we already have this mechanism used
for scrub/resilver, so this patch converts SCAN_IMPORT_WAIT_TXGS
into a tunable and applies it to async destroys also.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18033
Instead of comparing number of SLOG writes to number of normal
writes we should just make sure SLOG got the required number of
writes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18033
ddt_lookup() in zio_ddt_write() might require synchronous DDT ZAP
read. Running it from interrupt taskq might lead to deadlock.
Inclusion of ZIO_STAGE_DDT_WRITE into ZIO_BLOCKING_STAGES should
hopefully fix that, even though I am not sure how I got there.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17981
Before parallel eviction implementation zfs_arc_evict_batch_limit
caused loop exits after evicting 10 headers. The cost of it is not
big and well motivated. Now though taskq task exit after the same
10 headers is much more expensive. To cover the context switch
overhead of taskq introduce another level of batching, controlled
by zfs_arc_evict_batches_limit tunable, used only for parallel
eviction.
My tests including 36 parallel reads with 4KB recordsize that shown
1.4GB/s (~460K blocks/s) before with heavy arc_evict_lock contention,
now show 6.5GB/s (~1.6M blocks/s) without arc_evict_lock contention.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17970
There is no need to do MSEC_TO_TICK() for each evicted ARC header.
We can do it when tunables are set, since we already have separate
internal variables for those.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17965
For each block written or freed ZFS dirties ds_dbuf of the dataset.
While dbuf_dirty() has a fast path for already dirty dbufs, it still
require taking the lock and doing some things visible in profiler.
Investigation shown ds_dbuf dirtying by dsl_dataset_block_born()
and some of dsl_dataset_block_kill() are just not needed, since
by the time they are called in sync context the ds_dbuf is already
dirtied by dsl_dataset_sync().
Tests show this reducing large file deletion time by ~3% by saving
CPU time of single-threaded part of the sync thread.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18028
- Increase qemu-1-setup.sh timeout to 20min since it sometimes
fails to complete after 15min.
- Timestamp all qemu-1-setup.sh lines to look for hangs.
- Add a 'watchdog' process to print out the top running process every
30sec to help with debugging.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17714
The 'Setup QEMU' CI step updates and installs all packages necessary to
startup QEMU. Typically the step takes a little over a minute, but
we've seen cases where it can take legitimately take more than 45min
minutes. Change the timeout to 60 minutes.
In addition, change the 'Install dependencies' timeout to 60min since
we've also seen timeouts there.
Lastly, remove all timeouts from the zfs-qemu-packages workflow.
We do this so that we can always build packages from a branch, even if
the time it takes to do a CI step changes over time. It's ok to
eliminate the timeouts from the zfs-qemu-packages completely since that
workflow is only run manually.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18056
Add snapshot_019_pos to verify parallel snapshot automount operations
don't cause AVL tree panic. Regression test for commit 4ce030e025.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18035
Multiple threads racing to automount the same snapshot can both spawn
mount helper processes that successfully complete, causing both parent
threads to attempt AVL tree registration and triggering a VERIFY()
panic in avl_add(). This occurs because the fsconfig/fsmount API lacks
the serialization provided by traditional mount() via lock_mount().
The fix adds a per-entry mutex (se_mtx) to zfs_snapentry_t that
serializes mount and unmount operations on the same snapshot. The first
mount thread creates a pending entry with se_spa=NULL and holds se_mtx
during the helper execution. Concurrent mounts find the pending entry
and return success without spawning duplicate helpers. Unmount waits on
se_mtx if a mount is pending, ensuring proper serialization. This allows
different snapshots to mount in parallel while preventing the AVL panic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17943
A deadlock occurs when snapshot expiry tasks are cancelled while holding
locks. The snapshot expiry task (snapentry_expire) spawns an umount
process and waits for it to complete. Concurrently, ARC memory pressure
triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the
expiry task while holding locks. The umount process spawned by the
expiry task blocks trying to acquire locks held by arc_prune, which is
blocked waiting for the expiry task to complete. This creates a circular
dependency: expiry task waits for umount, umount waits for arc_prune,
arc_prune waits for expiry task.
Fix by adding non-blocking cancellation support to taskq_cancel_id().
The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to
reschedule the unmount, which needs to cancel any existing expiry task.
It now uses non-blocking cancellation to avoid waiting while holding
locks, breaking the deadlock by returning immediately when the task is
already running.
The per-entry se_taskqid_lock has been removed, with all taskqid
operations now protected by the global zfs_snapshot_lock held as
WRITER. Additionally, an se_in_umount flag prevents recursive waits when
zfsctl_destroy() is called during unmount. The taskqid is now only
cleared by the caller on successful cancellation; running tasks clear
their own taskqid upon completion.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17941
Remove unsafe timer_pending() check in taskq_cancel_id() that created a
race where:
- Timer expires and timer_pending() returns FALSE
- task_done() frees task with tqent_func = NULL
- Timer callback executes and queues freed task
- Worker thread crashes executing NULL function
Always call timer_delete_sync() unconditionally to ensure timer callback
completes before task is freed.
Reliably reproducible by injecting mdelay(10) after setting CANCEL flag
to widen the race window, combined with frequent task cancellations
(e.g., snapshot automount expiry).
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17942
We may not be able to avoid our code referencing the symbol, but we can
ensure that a symbol of that name is available to the linker during
build, and so not require linking the GPL-exported version.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#18009Closes#18040
For whatever reason, the single `log_note` in the `directory_diff`
function causes the function to stop executing on Ubuntu 22. This
causes most of the rsend tests to fail. Remove the line since it's only
informational.
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Compilation time bug introduced by 87df5e4 commit.
Fix for the compilation error(Linux kernel 6.18.0):
"zfs/module/os/linux/zfs/abd_os.c:920:32: error: implicit declaration
of function ‘nth_page’; did you mean ‘pte_page’?
[-Werror=implicit-function-declaration]".
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: agiUnderground <alex.dev.cv@gmail.com>
Closes#18034
Allow an author or reviewer's name and email address to exceed
the 72 character limit enforced by the commitcheck target.
Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18030
Fix another instance where ZFS assumes multiple pages can be
mapped at once via zfs_kmap_local(), resulting in crashes and
potential memory corruption on HIGHMEM-enabled (typically 32-bit)
systems.
Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes#15668Closes#18030
ZFS typically preserves proper LIFO ordering regarding map/unmap
operations that wrap the Linux kernel's kmap interfaces that
require such ordering, but one instance in abd_raidz_gen_iterate()
did not.
Similar issues have been fixed in the Linux kernel in the past,
see for instance CVE-2025-39899 for userfaultfd.
Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes#15668Closes#18030
HIGHMEM kmap interfaces operate on only a single page at a time
yet ZFS hadn't accounted for this, resulting in crashes and
potential memory corruption on HIGHMEM (typically 32-bit) systems.
This was caught by PaX's KERNSEAL feature as it makes use of
HIGHMEM functionality on x64.
On typical 64-bit systems, this issue wouldn't have been observed,
as the map interfaces simply fall back to returning an address in
lowmem where the contiguous pages can be accessed directly.
Joint work with the PaX Team, tested by Mark van Dijk
Reviewed-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com>
Closes#15668Closes#18030
In general it's possible for a vnode to not have an associated VM
object. This happens in particular with named pipes, which have
some distinct VOPs, defined in zfs_fifoops. Thus, this chunk of
zfs_freebsd_fsync() needs to check for the FIFO case, like other
vm_object_mightbedirty() callers do.
(Note that vn_flush_cached_data() calls are predicated on
zn_has_cached_data() returning true, and it checks for a NULL v_object
pointer already.)
Fixes: ef4058fcdc
Reported-by: Collin Funk <collin.funk1@gmail.com>
Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#18015
While not common the draid3 vdev type has been observed to
not always sit out a vdev when run in the CI. To prevent
continued false positives allow the test to be retried up
to three times before considering it a failure.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18003
These macros are deprecated in FreeBSD kernel for several years,
and unneeded for much longer. Instead, similar to Linux, let
kernel let compiler do the right things.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18004
It feels dirty to modify protection of a memory allocated via libc,
but at least we should try to restore it before freeing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17977
- When filling ABDs of several segments, consider offset.
- "Corrupt" ABDs with actually different data to fail something.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17977
Before this change DDT lock was taken 4 times per written block,
and as effectively a pool-wide lock it can be highly congested.
This change introduces a new per-entry dde_io_lock, protecting some
fields during I/O ready and done stages, so that we don't need the
global lock there.
According to my write tests on 64-thread system with 4KB blocks this
significantly reduce the global lock contention, reducing CPU usage
from 100% to expected ~80%, and increasing write throughput by 10%.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17960
Even though unlike gang children it is not so critical for dedup
children to inherit parent's allocator, there is still no reason
for them to have allocation policy different from normal writes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17961
Test install from our new repos: zfs-latest, zfs-legacy,
zfs-2.3, zfs-2.2, from the zfs-test-packages workflow.
This on-demand workflow is use to verify that the zfs RPMs
in the repos are correct.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17956
6.18 changes kmap_atomic() to take a const pointer. This is no problem
for the places we use it, but Clang fails the test due to a warning
about being unable to guarantee that uninitialised data will definitely
not change. Easily solved by forcibly initialising it.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17954
Using strlen() in an static array declaration is a GCC extension. Clang
calls it "gnu-folding-constant" and warns about it, which breaks the
build. If it were widespread we could just turn off the warning, but
since there's only one case, lets just change the array to an explicit
size.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17954
Linux switched from -std=gnu89 to -std=gnu11 in 5.18
(torvalds/linux@e8c07082a8). We've always overridden that with gnu99
because we use some newer features.
More recent kernels are using C11 features in headers that we include.
GCC generally doesn't seem to care, but more recent versions of Clang
seem to be enforcing our gnu99 override more strictly, which breaks the
build in some configurations.
Just bumping our "override" to match the kernel seems to be the easiest
workaround. It's an effective no-op since 5.18, while still allowing us
to build on older kernels.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#17954
On FreeBSD errno is defined as (* __error()), which means compiler
can't say whether two consecutive reads will return the same.
And without this knowledge the reported error is formally right.
Caching of the errno in local variable fixes the issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17975
Smatch is an actively maintained kernel-aware static analyzer
for C with a low false positive rate. Since the code checker
can be run relatively quickly against the entire OpenZFS code
base (15 min) it makes sense to add it as a GitHub Actions
workflow. Today smatch reports a significant numbers warnings
so the workflow is configured to always pass as long as the
analysis was run. The results are available for reference.
Long term it would ideal to resolve all of the errors/warnings
at which point the workflow can be updated to fail when new
problems are detected.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Toomas Soome <tsoome@me.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17935
add missing headers.
usage() is no-return, so anything after call to it is unreachable code.
use (void) cast where we do ignore return value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes#17885
Refresh all ABI files using the CI generated files to reflect
the library interfaces to be published for the 2.4 release.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17911
The zfs_tunable_* functions are a public interface which are
part of the internal libspl convenience library. They should
be hidden to prevent an unnecessary ABI change in installed
libraries which link against libspl (e.g. libzfs_core, libuutil).
We do already leak long standing libspl symbols. This commit is
solely intended to prevent leaking these new ones until this is
properly sorted out.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17911
The ABI of libzfs and libzpool have breaking changes since the
last major release. Bump the SONAME for the upcoming 2.4 release
branch to libzfs7 and libzpool7.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17911
The nvlist_snprintf() function was added to the ABI of libnvpair.
No other symbols were modified or removed. Bump the library-info
SONAME current and age args to reflect this is a minor library
version update.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17911
Detect container environments and set timeout to zero unless
ZFS_MODULE_TIMEOUT is already set. This avoids an unnecessary ten
second delay after running zfs/zpool commands in a container where
/dev/zfs is unavailable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com>
Closes#15165Closes#17922
Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that
allows users to disable notifications for slow devices.
This prevents ZED and/or ZFSD from degrading the pool due to slow
I/O.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Closes 17477
Implement BRT (Block Reference Table) prefetch functionality similar
to existing DDT prefetch. This allows preloading BRT metadata into
ARC to improve performance for block cloning operations and frees
of earlier cloned blocks.
Make -t parameter optional. When omitted, prefetch all supported
metadata types (both DDT and BRT now).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17890
According to my observations, BRT ZAPs are typically compressible
3:1 for data and 2:1 for indirects. With ashift=12, typical these
days, it means increasing the block sizes to 8KB we may get most
of possible compression, reducing on-disk and in-ARC BRT footprint
in half by the cost of some compression/decompression overhead,
but without real write inflation, only some dirty data increase.
Increase to 32KB similar to DDT could further increase compression
and storage efficiency, but at the cost of write inflation and
much bigger dirty data increase, which we can not properly control
now. So lets leave this for a time when BRT log gets implemented.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17916
dmu_object_info_from_dnode() takes two locks and copies plenty of
data that we don't need in zap_lockdir_impl(). Just read dn_type
directly in this hot path.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17921
This is useful as debugging support, as it lets namespace lock
operations be traced directly. It will also be useful for future work to
reduce the use of spa_namespace_lock, traditionally a source of
difficult deadlocks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17906
BRT_RANGESIZE_TO_NBLOCKS() takes number of ranges as its argument.
To get number of blocks we should multiply it by the entry size,
not divide by it, as it was due to missing parentheses.
Before #17875 this could cause small memory corruptions for vdevs
bigger than 64TB, but the change made the bug more noticeable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17886Closes#17915
Update description of zpool import --rewind-to-checkpoint in
man/man7/zpoolconcepts.7 to explain that rewinding automatically
discards a checkpoint.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com>
Closes#12646Closes#17918
Free issue threads might block waiting for synchronous DDT, BRT or
GANG header reads. So unlike other taskqs using ZTI_SCALE to scale
with number of CPUs, here we also need some amount of threads to
potentially saturate pool reads. I am not sure we always want the
96 threads we had before ZTI_SCALE introduction at #11966 on small
systems, but lets make it at least 32.
While here, make free taskqs configurable, similar to read and
write ones.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17903
FreeBSD now has a pathconf name called _PC_CASE_INSENSITIVE
used to check if a file system performs case insensitive
name lookups.
This patch adds support for this name.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#17908
Disable the aarch64 NEON SIMD intrinsics for kernel builds. Safely
using them in the kernel context requires saving/restoring the FPU
registers which is not currently done.
Additionally, remove the aarch64 optimized PREFETCH_L1 and PREFETCH_L2
instruction. Rely on the more portable compiler built ins.
This lets us remove the problematic workaround in the aarch64_compat.h
header which undefines the __aarch64__ macro.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17904Closes#17852
We have infinite loop and on certain condition, we exit this loop
and thread with pthread_exit(). But also after this loop,
we have a code to perform pthread_cleanup_pop() and return from the
thread.
The problem is that modern compilers are able to recognize that we
actually never get to the statements after loop and therefore
it is dead code there.
I think, instead of pthread_exit(), it is better to break out of loop
and let the last statements to work as intended. This is because
we do need to keep pthread_cleanup_pop() anyhow. Of course,
it is matter of taste if we want to use return or pthread_exit as very
last statement in this function.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes#17900
We've heard anecdotes that suggest some
confusion/surprise/disappointment that a changed recordsize is not
applied during rewrite. Until such time as we actually can do that, we
can at least explicitly mention it at something that doesn't work.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17898
Add an introductory sentance explaining why the reader may want to use
this command, and establishing the requirement that the jail must be
running. Move other requirements from the description of the subcommands
to follow this for flow and structure. Move the caveat that this is for
FreeBSD down to a cannonical CAVEATS section, and crossreference Linux's
equivelant functionality. Mention that this utility can not be used to
delegate the root directory of the jail to that section also.
Reported by: Jan Brankamp <crest@rlwinm.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Ziaee <ziaee@FreeBSD.org>
Closes#17883
We need to specifically use the FX_XFLAG_* macros in zpl_ioctl_*attr()
codepaths, and the FS_*_FL macros in the zpl_ioctl_*flags() codepaths.
The earlier code just assumes the FS_*_FL macros for both codepaths.
The 6.17 kernel add a bitmask check in copy_fsxattr_from_user() that
exposed this error via failing 'projectquota' ZTS tests.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17884Closes#17869
When a write comes in via dmu_sync_late_arrival, its txg is equal to the
open TXG. If that write gangs, and we have not yet activated the new
gang header feature, and the gang header we pick can store a larger gang
header, we will try to schedule the upgrade for the open TXG + 1. In
debug mode, this causes an assertion to trip. This PR sets the TXG for
activating the feature to be the larger of either the current open TXG
or the syncing TXG + 1.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17824
Update FreeBSD versions:
- add FreeBSD 15.0-STABLE
- add FreeBSD 16.0-CURRENT
So we use the latest versions of each line now:
- Freebsd 14.3 (RELEASE)
- FreeBSD 15.0 (STABLE)
- FreeBSD 16.0 (CURRENT)
In commits - you may specify which type of CI should run:
- ZFS-CI-Type: quick
- ZFS-CI-Type: linux
- ZFS-CI-Type: freebsd
- ZFS-CI-Type: full
Reviewed-by: Alexx Saver <lzsaver@users.noreply.github.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#17896
The label 'kfdok' is only used with O_TMPFILE, we need to use
the same #ifdef around this label.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes#17894
Currently this function uses L0 offsets which:
1. is hard to read since it maps offsets to blkid and back each call
2. necessitates dnode_next_block to handle edge cases at limits
3. makes it hard to tell if the traversal can loop infinitely
Instead, update this and dnode_next_offset to work in (blkid, index).
This way the blkid manipulations are clear, and it's also clear that
the traversal always terminates since blkid goes one direction.
I've also considered updating dnode_next_offset to operate on blkid.
Callers use both patterns, so maybe another PR can split the cases?
While here tidy up dnode_next_offset_level comments.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Robert Evans <evansr@google.com>
Closes#17792
In cases where all issued ZIOs must succeed, and we can't do
anything clever about the errors, we should just explicitly set
ZIO_FLAG_TRYHARD and let OS to do all the reasonable retries.
In other cases, where retries can be different from the original,
for example, some ZIOs are allowed to fail due to redundancy, or
we can disable aggregation on retrial to get at least some of
the data, we can do first pass without TRYHARD, and only if needed
retry with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics).
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17877
Both DDT log and BRT counters we read on pool import and then only
append or overwrite in full blocks. We don't need them in DMU or
ARC caches. Fortunately we have DMU_UNCACHEDIO for this now.
Even more we don't need BRT in non-evictable metadata DMU caches,
since it will likely never fit there, while block the cache from
its original users. Since DMU_OT_IS_METADATA_CACHED() has no way
to differentiate the new metadata types, mark BRT with storage
type of DMU_OT_DDT_ZAP. As side effect it will also put it on
dedup device, but that should actually be right.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17875
Since we set bv_mos_brtvdev block size, and since we keep dirty
bitmap at the same granularity, we should keep the allocations
and writes done with. Otherwise it makes the last block write
short, that will be odd once we implement writing of only dirty
blocks, but also requires read-modify-write on DMU layer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17875
Ultimately this is a revert of 779ac93, which according to
@nabijaczleweli is to paper over automake <1.14's lack of
%reldir% support.
As I understand it, EL8 is the lowest current build target.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Joseph Holsten <joseph@josephholsten.com>
Closes#17878
Remove the out of date helper scripts originally used to port
Illumos commits to the ZoL repository. Due to layout changes
made to this repository they're no longer entirely correct.
Remove them to make it clear they're no longer being used or
actively maintained.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17880
Over the time many of DMU functions got flags argument to control
prefetch, caching, etc. Few functions though left without it, even
though closer look shown that many of them do not require prefetch
due to their access pattern. This patch adds the flags argument to
dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(),
passing DMU_READ_NO_PREFETCH where applicable.
I am going to also pass DMU_UNCACHEDIO to some of them later.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17872
functional/trim tests do create pools of different types to test
trim, autotrim_config.ksh is missing the type from zpool
create command line while we are looping over different pool
types.
Sponsored-by: Edgecast Cloud LLC.
Signed-off-by: Toomas Soome <tsoome@me.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17874
In zio_crypt_key_wrap and zio_crypt_key_unwrap, the cuio_s variable was
not initialized before the calls to zfs_uio_init, leading to
uninitialized access to cuio_s.uio_offset. Initialize it to avoid gcc
warnings.
Similar issue as fixed in 2bf152021 ("Fix gcc uninitialized warning in
FreeBSD zio_crypt.c")
Signed-off-by: Ryan Libby <rlibby@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17863
zfs-tests.sh executes test-runner.py to do the actual test work. Any
exit code < 4 is interpreted as success, with the actual value
describing the outcome of the tests inside.
If a Python program crashes in some way (eg an uncaught exception), the
process exit code is 1.
Taken together, this means that test-runner.py can crash during setup,
but return a "success" error code to zfs-tests.sh, which will report and
exit 0. This in turn causes the CI runner to believe the test run
completed successfully.
This commit addresses this by making zfs-tests.sh interpret an exit code
of 255 as a failure in the runner itself. Then, in test-runner.py, the
"fail()" function defaults to a 255 return, and the main function gets
wrapped in a generic exception handler, which prints it and calls
fail().
All together, this should mean that any unexpected failure in the test
runner itself will be propagated out of zfs-tests.sh for CI or any other
calling program to deal with.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17858
Initially, `zfs_getpages()` is provided with an array of busy pages by
the vnode pager. It then tries to acquire the range lock, but if there
is a concurrent `zfs_write()` running and fails to acquire that range
lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`.
After that, it grabs the pages again and retries to acquire the range
lock, and so on.
Once it got the range lock, it filters out valid pages, then copy DMU
data to the remaining invalid pages.
The problem is that freshly allocated zero'd pages it grabbed itself are
marked as valid. Therefore they are skipped by the second part of the
function and DMU data is never copied to these pages. This causes mapped
pages to contain zeros instead of the expected file content.
This was discovered while working on RabbitMQ on FreeBSD. I could
reproduce the problem easily with the following commands:
git clone https://github.com/rabbitmq/rabbitmq-server.git
cd rabbitmq-server/deps/rabbit
gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
ct-amqp_client t=cluster_size_3:leader_transfer_stream_send
The testsuite fails because there is a sendfile(2) that can happen
concurrently to a write(2) on the same file. This leads to sendfile(2)
or read(2) (after the sendfile) sending/returning data with zeros, which
causes a function to crash.
The patch consists of not setting the `VM_ALLOC_ZERO` flag when
`zfs_getpages()` grabs pages again. Then, the last page is zero'd if it
is invalid, in case it would be partially filled with the end of the
file content. Other pages are either valid (and will be skipped) or they
will be entirely overwritten by the file content.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Jean-Sébastien Pédron <dumbbell@FreeBSD.org>
Closes#17851
Linux 6.18 has conflicting prototypes for various sha256_* and sha512_*
functions, which we get through a very long include chain. That's tough
to fix right now; easier is just to rename our internal functions.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
The namespace type has moved from the namespace ops struct to the
"common" base namespace struct. Detect this and define a macro that does
the right thing for both versions.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Linux 6.18 removed write_cache_pages() without a usable replacement.
Here we implement a minimal zpl_write_cache_pages() that find the dirty
pages within the mapping, gets them into the expected state and hands
them off to zfs_putpage(), which handles the rest.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
ida_simple_get() and ida_simple_remove() are removed in 6.18. However,
since 4.19 they have been simple wrappers around ida_alloc() and
ida_free(), so we can just use those directly.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
We must return -1 instead of ENOENT if the special zvol threading
property set function can't locate the dataset (this would typically
happen with an encypted and unmounted zvol) so that the operation
gets inserted properly into the nvlist for operations to set. This
is because we want the property to be set once the zvol is
decrypted again.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes#17836
MS-FSCC 2.6 is the governing document for
DOS attribute behavior. It specifies the following:
For a file, applications can read the file but
cannot write to it or delete it. For a directory,
applications cannot delete it, but applications can
create and delete files from the directory.
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17837
Make a minor update to the 'zpool remove' man page to clarify both
raidz and draid pools do not support removal, and change sector to
ashift which is what we actually care about.
Update the big theory comment in vdev_removal.c to accurately reflect
which types of vdevs can be removed. Furthermore, I've added some
discussion for the casual reader to briefly explain the top-level
vdev removal restrictions. This has been a common area of confusion
and it's not intuitive where they come from without understanding
the implementation details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17847
If lseek() returns an unexpected error, it's useful to know the error
code to help connect it to the trouble spot inside the module.
Since the two seek functions should be basically identical, lift them
into a single generic function.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17843
FreeBSD 15.0-ALPHA5 image fails to boot on cloud VMs due to missing
/boot/efi mount point, causing the system to drop to single user mode
where SSH cannot start. Work around this by staying on ALPHA4 and
setting IGNORE_OSVERSION=yes to bypass pkg's kernel version mismatch
prompt during bootstrap. This allows CI to proceed with ALPHA4 until we
have a stable FreeBSD 15.0 image.
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17846
A small uplift of the cmn_err() and panic() calls in userspace:
- remove the suppression on CE_NOTE. We have very few of these calls in
a standard build, it's convenient for "print debugging".
- make prefixes clear and consistent.
- add LIBZPOOL_PANIC_STOP environment variable to send SIGSTOP to the
process group on a panic, rather than abort(), so all threads remain
alive for inspection.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17834
Do not warn about vdev ashifts being smaller then physical ashifts
in a pool status if the pool ashift property set and vdev ashift
satisfies it (bigger or equal), since user explicitly requested
this. The ashift of individual vdevs are still reported.
Do not warn about vdev ashifts in zpool import, since it doesn't
matter much, and we don't even report individual vdevs ashifts
there.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17830
Before this change ashift property was applied only to a leaf
vdevs. As result, it worked only as a minimal value for parent
vdevs, since bigger physical_ashift value reported by any child
could be used instead when deciding parent's ashift, as if the
ashift property was never set.
This change explicitly passes ZPOOL_CONFIG_ASHIFT to all vdevs,
allowing override for parents only if the passed value is below
logical_ashift and so unacceptable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17826
zpool_reopen_004_pos destroys a pool with an offline disk, leaving its
label intact. In TrueNAS local repo, zpool_reopen_005_pos is skipped,
causing zpool_reopen_007_pos to fail as it doesn't use -f flag when
creating pools unlike zpool_reopen_005_pos.
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17831
The actual minimum hole size on ZFS is variable, but we always report
SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe
that they can reliably create holes at 512-byte boundaries and waste
resources trying to punch holes that ZFS ends up filling anyway.
* In the general case, if the vnode is a regular file, return its
current block size, or the record size if the file is smaller than
its own block size. If the vnode is a directory, return the dataset
record size. If it is neither a regular file nor a directory,
return EINVAL.
* In the control directory case, always return EINVAL.
Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17750
ZVOLs don't support all block layer IO request types. Add a check for
the IO types we do support. Also, remove references to
io_is_secure_erase() since they are not supported on ZVOLs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17803
Otherwise the compiler warns about it on production FreeBSD builds.
The routine proved resilient to attempts to ifdef on debug.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#17818
Previously, a bin included all blocks _starting_ from given size
(e.g., a "4K" bin would include all blocks within the [4K; 8K) region).
This is counter-intuitive and does not match the typical use-case of the
block histogram (that is, to estimate disk usage considering how ZFS'
block allocation works). In other words, if I'm looking at the "4K" row,
I'm interested in records that _fit into_ a 4K block.
Adjust the binning strategy such that a bin includes all blocks _up to_
given size, such that e.g. a "4K" bin would include all blocks within
the (2K; 4K] region.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes#16999
When counting blocks to generate block size histograms (`-bb`), accept a
`--class=` argument (as a comma-separated list of either "normal",
"special", "dedup" or "other") to only consider blocks that belong to
these metaslab classes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes#16999
When counting blocks to generate block size histograms (`-bb`), accept a
`--bin=` argument to force placing blocks into all three bins based on
*this* size.
E.g. with `--bin=lsize`, a block with lsize=512K, psize=128K, asize=256K
will be placed into the "512K" bin in all three output columns. This
way, by looking at the "512K" row the user will be able to determine
how well was ZFS able to compress blocks of this logical size.
Conversely, with `--bin=psize`, by looking at the "128K" row the user
will be able to determine how much overhead was incurred for storage
of blocks of this physical size.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes#16999
We are adding more long-only options, so use an enum for all of them
to avoid manually numbering these constants.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes#16999
In "all pools" mode, pool_iter_refresh() will call zpool_iter(), which
will call zpool_refresh_stats() before calling add_pool(). If we already
have the pool, this is a different handle, so we just release it and
return. Back in pool_iter_refresh(), we then call zpool_stats_refresh()
again for our handle on the same pool.
All together, this means we're doing two ZFS_IOC_POOL_STATS calls into
the kernel for every pool in the system. This isn't wrong, but it does
double the pressure on global locks.
Instead, we add a new function zpool_refresh_stats_from_handle() that
simply copies the pool config and state from one handle to another, and
use it to update our handle before we release it in add_pool(), so we
only have one call per pool per interval.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17807
zpool_iter() passes the callback a new instance of zpool_handle_t each
time, so the existing handle in the pool_list AVL never actually gets a
refresh. Internally, that means its zpool_config is never updated, and
the old config is never moved to zpool_old_config. As a result,
print_iostat() never sees any updated config, and so repeats the first
line forever.
This is the simplest workaround: just don't mark existing pools as
refreshed. pool_list_refresh() will see this and refresh them.
The downside is a second call to ZFS_IOC_POOL_STATS for existing pools,
because zpool_iter() just called it for the handle we threw away.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17807
When skipping the boot row (with -y), the early loop meant we weren't
updating the "last_npools" count. That means the count never advanced
past zero, so cb_iteration was always reset to 0, leading to it being
"stuck" on the boot line, printing the header and nothing else forever.
Updating the pool counter on every loop sorts that out: it advances,
cb_iteration moves properly, and normal rows are printed.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17807
If zfs_mount_and_share() fails, the error propagates to zfs create/clone
commands despite successful operation. If create/clone operations were
successful, there's no point in making zfs_mount_and_share() failures
fatal.
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17799
Originally this was created for MMP, but now new cases are emerging
where the same mechanism is required. Hence the name's generalization.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes#17793
When the default value of the xattr property was changed from 'dir' to
'sa', the code that displays the property's value was not affected. The
problem with this state of affairs is that 1) user tooling that
specifically looked for 'sa' before will be confused now that the code
displays 'on' instead. And 2) users may be confused when manually
running the commands about which specific type of xattr is in use unless
they are up to date on the latest zfs changes.
The fix here is to show the actual type always, rather than 'on' if we
happen to be using the default. This turns out to be easy to do, by
simply reordering the list of xattr values in the properties code. When
the property is displayed, we iterate down the table until we find a row
with a matching value, and use that row's name as the
display. Reordering the row fixes the display without affecting any
other code.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17801
When running zpool iostat in interval mode, it would not notice any new
pools created or imported, and would forget any destroyed or exported,
so would not notice if they came back. This leads to outputting "no
pools available" every interval until killed.
It looks like this was at least intended to work; the comment above
zpool_do_iostat() indicates that it is expected to "deal with pool
creation/destruction" and that pool_list_update() would detect new
pools. That call however was removed in 3e43edd2c5, though its unclear
if that broke this behaviour and it wasn't noticed, or if it never
worked, or if something later broke it. That said, the lack of
pool_list_update() is only part of the reason it doesn't work properly.
The fundamental problem is that the various things involved in
refreshing or updating the list of pools would aggressively ignore,
remove, skip or fail on pools that stop existing, or that already exist.
Mostly this meant that once a pool is removed from the list, it will
never be seen again. Restoring pool_list_update() to the
zpool_do_iostat() loop only partially fixes this - it would find "new"
pools again, but only in the "all pools" (no args) mode, and because its
iterator callback add_pool() would abort the iterator if it already has
a pool listed, it would only add pools if there weren't any already.
So, this commit reworks the structure somewhat. pool_list_update()
becomes pool_list_refresh(), and will ensure the state of all pools in
the list are updated. In the "all pools" mode, it will also add new
pools and remove pools that disappear, but when a fixed list of pools is
used, the list doesn't change, only the state of the pools within it.
The rest of the commit is adjusting things for this much simpler
structure. Regardless of the mode in use, pool_list_refresh() will
always do the right thing, so the driver code can just get on with the
display.
Now that pools can appear and disappear, I've made it so the header (if
enabled) is re-printed when the list changes, so that its easier to see
what's happening if the column widths change.
Since this is all rather complicated, I've included tests for the "all
pools" and "set of pools" modes.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17786
Add a -O option to zfs-test.sh to dump debug information on test
timeout. The debug info includes:
- 30 lines from 'top'
- /proc/<PID>/stack output of process with highest CPU usage
- Last lines strace-ing process with highest CPU usage
- /proc/sysrq-trigger kernel stack traces
All debug information gets dumped to /dev/kmsg (Linux only).
In addition, print out the VM console lines from the "Setup Testing
Machines" step. We have often see VMs timeout at this step and don't
know why.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17753
The zvol blk-mq codepaths would erroneously send FLUSH and TRIM
commands down the read codepath, rather than write. This fixes
the issue, and updates the zvol_misc_fua test to verify that
sync writes are actually happening.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17761Closes#17765
The Buildbot CI infrastructure has been fully replaced by GitHub
Actions. Remove any lingering references from the repository.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17794
When updating a Fedora instance to an experimental kernel make sure
to include the matching versioned perf and bpftool packages. This
helps ensure there are no unexpected conflicts which would prevent
the new packages from being installed.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17791
This change adds support for ZFS_KEYFORMAT_RAW to zdb_derive_key in
zdb.c. The implementation reads the raw key from the file specified
by the -K option which is consistent with how raw keys are handled in
the other parts of ZFS, along with a check to ensure that the keyfile
doesn't have too many bytes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Patrick Xia <patrickx@google.com>
Closes#17783
This changes the basic search algorithm from a single search up and down
the tree to a full depth-first traversal to handle conditions where the
tree matches at a higher level but not a lower level.
Normally higher level blocks always point to matching blocks, but there
are cases where this does not happen:
1. Racing block pointer updates from dbuf_write_ready.
Before f664f1ee7f (#8946), both dbuf_write_ready and
dnode_next_offset held dn_struct_rwlock which protected against
pointer writes from concurrent syncs.
This no longer applies, so sync context can f.e. clear or fill all
L1->L0 BPs before the L2->L1 BP and higher BP's are updated.
dnode_free_range in particular can reach this case and skip over L1
blocks that need to be dirtied. Later, sync will panic in
free_children when trying to clear a non-dirty indirect block.
This case was found with ztest.
2. txg > 0, non-hole case. This is #11196.
Freeing blocks/dnodes breaks the assumption that a match at a higher
level implies a match at a lower level when filtering txg > 0.
Whenever some but not all L0 blocks are freed, the parent L1 block is
rewritten. Its updated L2->L1 BP reflects a newer birth txg.
Later when searching by txg, if the L1 block matches since the txg is
newer, it is possible that none of the remaining L1->L0 BPs match if
none have been updated.
The same behavior is possible with dnode search at L0.
This is reachable from dsl_destroy_head for synchronous freeing.
When this happens open context fails to free objects leaving sync
context stuck freeing potentially many objects.
This is also reachable from traverse_pool for extreme rewind where it
is theoretically possible that datasets not dirtied after txg are
skipped if the MOS has high enough indirection to trigger this case.
In both of these cases, without backtracking the search ends prematurely
as ESRCH result implies no more matches in the entire object.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Robert Evans <evansr@google.com>
Closes#16025Closes#11196
Provide an interface to retrieve the lowest and highest minimum
allocation size for the normal allocation class. This can be used
by external consumers of the DMU to estimate potential wasted
capacity when setting the recordsize for an object.
The new "min_alloc" and "max_alloc" keys are added to the pool
configuration and used by default_volblocksize() to warn when
an ineffecient block size is requested. For older kmods which
don't yet include the new keys fallback to the previous logic.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17758
Three cases were discovered where 'zpool add' would fail to
warn when adding vdevs to a pool with a mismatched replication
level. These are:
1. When a pool contains mixed file and disk vdevs.
2. When a pool contains an active dRAID distributed spare
3. When a pool contains an active hot spare
The lack of warnings are caused by get_replication() assessing
the current pool configuration an inconsistent and disabling
the mismatched replication check for the new pool configuration
after 'zpool add'. This change updates get_replication() to
be slightly more tolerant in the non-fatal case.
The zpool_add_010_pos.ksh test case was split in to separate
tests: zpool_add_warn_create.ksh, pool_add_warn_degraded.ksh,
and zpool_add_warn_removal. These test were extended to
include coverage for dRAID pools and the three scenarios
described above.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17780
Modify the test case to use the `zfs mount` command instead
of directly calling the mount command, create a dedicated dataset,
and use the default mount point. These changes are intended to
preserve the intent of the original test case and resolve some
spurious mount failures which have been observed by the CI.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17785
Eliminates the need for the following workaround
> Add other drivers to dracut:
```
if grep mpt3sas /proc/modules; then
echo 'force_drivers+=" mpt3sas "' >> /etc/dracut.conf.d/zfs.conf
fi
if grep virtio_blk /proc/modules; then
echo 'filesystems+=" virtio_blk "' >> /etc/dracut.conf.d/fs.conf
fi
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jo Zzsi <jozzsicsataban@gmail.com>
Closes#17762
Spacemap entry might be too big to fit into a block pointer ashift.
We hit an assertion trying to run `zdb -bvy` on a large pool. But
it seems the code does not really need size there, since we only
need to search for a range of offsets, so setting it to zero should
just make btree return position just before the first entry. I
suspect the previous code could actually miss the first entry
due to this if its size was smaller.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17764
zfs-helpers.sh is a utility script that sets up udev symlinks so you
can run ZTS from a local ZFS git workspace. However, it doesn't check
that the udev symlinks point to the current workspace. They may point
to an old workspace that has been deleted. This means the udev rules
never get executed, which in turn causes the zvol tests to fail.
This commit removes old symlinks that do not point to the current
ZFS workspace.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17766
This commit fixes the issue and includes the zfs kernel
module even when dracut is used in hostonly mode.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jo Zzsi <jozzsicsataban@gmail.com>
Closes#17754
On i386, Clang complains about misaligned atomic operations. Silence
these warnings to fix the build on FreeBSD/i386.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored by: ConnectWise
Closes#17708
Traditionally, unused dentries would be cached in the dentry cache until
the associated entry is no longer on disk. The cached dentry continues
to hold an inode reference, causing the inode to be pinned (see previous
commit).
Here we implement the dentry op d_delete, which is roughly analogous to
the drop_inode superblock op, and add a zfs_delete_dentry tunable to
control its behaviour. By default it continues the traditional
behaviour, but when the tunable is enabled, we signal that an unused
dentry should be freed immediately, releasing its inode reference, and
so allowing that inode to be deleted if no longer in use.
Sponsored-by: Klara, Inc.
Sponsored-by: Fastmail Pty Ltd
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17746
Traditionally, unused inodes would be held on the superblock inode cache
until the associated on-disk file is removed or the kernel requests
reclaim. On filesystems with millions of rarely-used files, this can be
a lot of unusable memory.
Here we implement the superblock drop_inode method, and add a
zfs_delete_inode tunable to control its behaviour. By default it
continues the traditional behaviour, but when the tunable is enabled, we
signal that the inode should be deleted immediately when the last
reference is dropped, rather than cached. This releases the associated
data to the dbuf cache and ARC, allowing them to be reclaimed normally.
Sponsored-by: Klara, Inc.
Sponsored-by: Fastmail Pty Ltd
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17746
As a quality assurance measure, `typeset` is added to local variable
declarations to actually enforce their intended scope.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: buzzingwires <buzzingwires@outlook.com>
Closes#17732
This commit fixes a likely regression introduced by 64db435 where the
checksum repair functionality (`-c` or default behavior) will perform
checks and access data associated with the newer undetach (`-u`)
functionality, resulting in a failure when an uberblock's TXG is not 0
as required by `-u` but not `-c`
Additionally, code is refactored for better separation of tasks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: buzzingwires <buzzingwires@outlook.com>
Closes#17732
The current description is somewhat difficult to parse through, and in
some cases is a little unclear as to the behavior.
Split it into a paragraphs based on the three distinct behaviors you
may get: prompt, file URL, HTTP(S) URL. The descriptions of the file
and HTTP(s) behavior seems fine, but prompt is a little vague- expand
on it and make it clear that the behavior is actively based on whether
the inquisitor of key-data is provided with a tty for stdin or not.
Also clarify *why* one shouldn't "place keys which should be kept secret
on the command line" and note that you *have* to supply the key via
stdin if it's a raw key, just to be sure.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes#17742
Update the fill_fs helper function to request a random fill pattern
when the "data" argument isn't specified. This ensures the default
behavior is to perform a more realistic fill of incompressible blocks.
Additionally, update a few test cases to specify a random fill.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17739
Many IO operations are submitted to the kernel async, and so the zio can
complete and followup actions before the submission call returns. If one
of the followup actions closes the disk (eg during pool create/import),
the initiator may be left holding a lock on the disk at destruction.
Instead, take the write lock before finishing up and decoupling the disk
state from the vdev proper. The caller will hold until all IO is
submitted and locks released.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17719
The time database update math assumed that the timestamps were in
nanoseconds, but at some point in the development or review process they
changed to seconds. This PR fixes the math to use seconds instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17735
Several small changes intended to make this test reliable.
- Leave the default compression enabled for the pool and switch
to using /dev/urandom as the data source. Functionally this
shouldn't impact the test but it's preferable to test with
the pool defaults when possible.
- Verify the device is created and removed as required. Switch
to a unique volume name for a more clarity in the logs.
- Use the ZVOL_DEVDIR to specify the device path.
- Speed up the test by creating the pool with an ashift=12 and
testing 4K, 8K, 128K volblocksizes.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17725
zfs_aclset_common() might be called for newly created or not even
created vnodes, that triggers assertions on newer FreeBSD versions
with DEBUG_VFS_LOCKS included into INVARIANTS. In the first case
make sure to call vn_seqc_write_begin()/_end(), in the second just
skip the assertion.
The similar has to be done for project management IOCTL and file-
bases extended attributes, since those are not going through VFS.
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17722
A new `zfs allow` permissions that ONLY allows sending replication
streams in raw (encrypted) mode, so encrypted data will not be
decrypted as part of the replication process.
Sponsored-by: Klara, Inc.
Sponsored-by: Karakun AG
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Co-authored-by: JT Pennington <jt.pennington@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#17543
Historically, ZED has blindly spawned off zedlets in parallel and never
worried about their completion order. This means that you can
potentially have zedlets for event number 2 starting before zedlets for
event number 1 had finished. Most of the time this is fine, and it
actually helps a lot when the system is getting spammed with hundreds
of events.
However, there are times when you want your zedlets to be executed
in sequence with the event ID. That is where synchronous zedlets
come in.
ZED will wait for all previously spawned zedlets to finish before
running a synchronous zedlet. Synchronous zedlets are guaranteed to be
the only zedlet running. No other zedlets may run in parallel with a
synchronous zedlet. Users should be careful to only use synchronous
zedlets when needed, since they decrease parallelism.
To make a zedlet synchronous, simply add a "-sync-" immediately
following the event name in the zedlet's file name:
EVENT_NAME-sync-ZEDLETNAME.sh
For example, if you wanted a synchronous statechange script:
statechange-sync-myzedlet.sh
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17335
While it would be nice to be able to scrub a pool imported read-only
this will currently trip an ASSERT. Before we can support this there
are some designs challenges which need to be thought through first.
For starters, a read-only import skips reading certain information
from disk which it knows won't be needed, such as the space maps.
Furthermore, the scrub process expects to be checkpoint it's progress,
update the on disk error log, and issue repair IO. None of which
would be possible when the pool is imported read-only.
Each of these wrinkles can certainly be handled, but that will take
some signifcant work. In the meanwhile we disable the 'zpool scrub'
command when the pool is imported read-only.
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #17527Closes#17717
A single slow responding disk can affect the overall read
performance of a raidz group. When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.
Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios count` is incremented and a zevent
class `ereport.fs.zfs.delay` is posted.
The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter. Setting it to
zero disables slow outlier detection.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Contributions-by: Don Brady <don.brady@klarasystems.com>
Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17227
Print a warning if you're attempting to run a ZTS test that calls
'user_run', and the ephemeral user doesn't have permissions to
access the test binaries.
This can happen if you're running ZTS from a local git repo. In
that case the test user (say, 'testuser1') may need access to the
ZTS binaries in:
/home/<your_username>/zfs/tests/zfs-tests/bin/
... but 'testuser1' doesn't have permission to enter your home dir:
/home/<your_username>
The warning will help alert users to what is going on. This will
not be an issue when ZTS is actually installed on the system
(via 'make install' or from packages).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17721
When attempting to debug performance problems on large systems, one of
the major factors that affect performance is free space
fragmentation. This heavily affects the allocation process, which is an
area of active development in ZFS. Unfortunately, fragmenting a large
pool for testing purposes is time consuming; it usually involves filling
the pool and then repeatedly overwriting data until the free space
becomes fragmented, which can take many hours. And even if the time is
available, artificial workloads rarely generate the same fragmentation
patterns as the natural workloads they're attempting to mimic.
This patch has two parts. First, in zdb, we add the ability to export
the full allocation map of the pool. It iterates over each vdev,
printing every allocated segment in the ms_allocatable range tree. This
can be done while the pool is online, though in that case the allocation
map may actually be from several different TXGs as new ones are loaded
on demand.
The second is a new subcommand for zhack, zhack metaslab leak (and its
supporting kernel changes). This is a zhack subcommand that imports a
pool and then modified the range trees of the metaslabs, allowing the
sync process to write them out normall. It does not currently store
those allocations anywhere to make them reversible, and there is no
corresponding free subcommand (which would be extremely dangerous); this
is an irreversible process, only intended for performance testing. The
only way to reclaim the space afterwards is to destroy the pool or roll
back to a checkpoint.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#17576
This commit synchronizes the debian packaging files with the distro
version (also maintained by me) as much as possible.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Colm Buckley <colm@tuatha.org>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#17712
While rw_destroy() may do nothing on Linux, we still want to make sure
that we don't have any holders outstanding like we do for mutexes.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17718
fault_limits would often hit the 10min timeout and be killed on Fedora
41-42. Investigation showed that the 'fill_fs' portion of the test,
which would fill the pool with junk data before vdev replacement, was
writing highly compressible data (~126x), which would have taxed the
CPUs, potentially causing the timeout.
The fix is to write random data and reduce the number of writes.
This has an added benefit that more real data being is written to the
pool (~1GB) vs the old way (~300-400MB). It also speeds up the test.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#17709
FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE
which is the block size supported for block cloning for
the file system. Since ZFS's block size varies per file,
return the largest size likely to be used, or zero if block
cloning is not supported.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes#17645
If the target already exists, lt will fail. Force it to recreate the
symlinks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17702
If we call ddt_log_load() for legacy ddt, we will end up going into
ddt_log_update_stats() and filling uninitialized value into ddo_dspace.
This value will then get added to dedup_table_size during
ddt_get_dedup_object_stats().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#17019Closes#17699
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Because GitHub creates a merge commit on top of real head, so the check
on HEAD will fail regardlessly.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes#17695
If ARCH environment variable is set it can cause the failure of the
kernel modules check during the configure step. The resulting error
will be confusing, and may looks like this:
> checking for kernel config option compatibility... done
> checking whether CONFIG_MODULES is defined... no
> configure: error:
> *** This kernel does not include the required loadable module
> *** support!
Detect when ARCH is print a warning.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Maksym Shkolnyi <maksym.shkolnyi@workato.com>
Closes#17680
LLVM-21 enables -Wuninitialized-const-pointer which results in the
following compiler warning and the bdev_file_open_by_path() interface
not being detected for 6.9 and newer kernels. The blk_holder_ops
are not used by the ZFS code so we can safely use a NULL argument
for this check.
bdev_file_open_by_path/bdev_file_open_by_path.c:110:54: error:
variable 'h' is uninitialized when passed as a const pointer
argument here [-Werror,-Wuninitialized-const-pointer]
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#17682Closes#17684
glibc includes linux/stat.h for statx, but musl defines its own statx
struct and associated constants, which does not include STATX_MNT_ID
yet. Thus, including linux/stat.h directly should be avoided for
maximum libc compatibility.
Tested on:
- glibc: x86_64, i686, aarch64, armv7l, armv6l
- musl: x86_64, aarch64, armv7l, armv6l
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-By: Achill Gilgenast <achill@achill.org>
Signed-off-by: classabbyamp <dev@placeviolette.net>
Closes#17675
Add an openzfs-2.4 compatibility file for the next release.
While there are no compatibility difference between Linux and
FreeBSD for 2.4 symlinks for the -linux and -freebsd names are
created for any scripts expecting that convention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: ofthesun9 <olivier@ofthesun.net>
Closes#17672Closes#17673
The sorting logic is all in cmd/zfs/zfs_iter.c. I borrowed
where I could from the comments in the source code, but please
note that the comment to zfs_sort() is a little imprecise, or at
least incomplete, because it doesn't give any indication of the
chronological sort that will be used by default for snapshots in
zfs_compare().
While adding this description, I took the liberty to copy-edit
the rest of the file lightly.
In those edits, I've removed "If specified, you can list
property information by the absolute pathname or the relative
pathname" because, in context, it seems more confusing than
helpful.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Shawn Bayern <sbayern@law.fsu.edu>
Closes#15713Closes#15869
If $KERNEL_CC was not defined, configure status output would print an
empty string where the kernel compiler should have been. Fix this and
simplify the code generally.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes#16997
Add a test case to reproduce issue #17277:
1. Make a pool
2. Write a file to the pool
3. Mount the file as a loopback device
4. Make an XFS filesystem on the loopback device
5. Mount the XFS filesystem... <hangs>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #17277Closes#17329
The concurrent execution of feature_sync() can lead to a panic due
to an unprotected update of the feature refcount. Resolve this by
using the spa->spa_feat_stats_lock to synchronize the update of the
refcount.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#17184Closes#17632
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.