dbuf_whichblock() is not made to handle offsets beyond the block
end for single-block objects. Handle it in dmu_evict_range(),
similar to dmu_prefetch_by_dnode().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18399Closes#18489
For now make it only evict the specified data from the dbuf cache.
Even though dbuf cache is small, this may still reduce eviction of
more useful data from there, and slightly accelerate ARC evictions
by making the blocks there evictable a bit sooner.
On FreeBSD this also adds support for POSIX_FADV_NOREUSE, since the
kernel translates it into POSIX_FADV_DONTNEED after every read/write.
This is not as efficient as it could be for ZFS, but that is the only
way FreeBSD kernel allows to handle POSIX_FADV_NOREUSE now.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18399
- For multilevel gang blocks it seemed possible to fallback from
normal to special class, since they don't have proper object type,
and DMU_OT_NONE is a "metadata". They should never fallback.
- Fix possible inversion with zfs_user_indirect_is_special = 0,
when indirects written to normal vdev, while small data to special.
Make small indirect blocks also follow special_small_blocks there.
- With special_small_blocks now applying to both files and ZVOLs,
make it apply to all non-metadata without extra checks, since there
are no other non-metadata types.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18208
When the path argument to "zfs list -Ho name <path>" (or any caller of
zfs_path_to_zhandle()) is a symlink that crosses a mount boundary, the
wrong dataset is returned. Instead of returning the dataset that owns
the symlink's target, getextmntent() matches the dataset containing the
symlink itself.
For example, given two ZFS datasets "tank/ds1" and "tank/ds2", and a
symlink "/tank/ds1/link" pointing into "/tank/ds2":
$ sudo zfs list -Ho name /tank/ds1/link
tank/ds1
The expected (and previous) behavior is to return "tank/ds2", since the
symlink's target resides in that dataset.
The problem is in getextmntent(), in lib/libspl/os/linux/mnttab.c. That
function calls statx() on the caller-supplied path to obtain its mnt_id
(used to match against the mnt_id of each entry in /proc/self/mounts),
and it passes AT_SYMLINK_NOFOLLOW to that statx() call. As a result,
the mnt_id returned reflects the symlink's location rather than the
symlink target's mount, and the wrong /proc/self/mounts entry is
matched.
The same function also calls stat64() on the caller-supplied path
(used as a fallback when STATX_MNT_ID is not available, and to populate
the statbuf out-parameter). stat64() always follows symlinks, so the
statx() and stat64() calls were inconsistent: one resolved the symlink,
the other didn't. The AT_SYMLINK_NOFOLLOW behavior may be appropriate
when statx() is called on a mount entry from /proc/self/mounts (which
is always a real directory), but it is wrong for caller-supplied paths,
which may be symlinks.
This bug was introduced by 523d9d6007 ("Validate mountpoint on
path-based unmount using statx"), which added the STATX_MNT_ID code
path. However, the bug was latent: config/user-statx.m4 omitted
"#define _GNU_SOURCE" when checking for STATX_MNT_ID in <sys/stat.h>,
so HAVE_STATX_MNT_ID was never defined, and the buggy statx() path was
never compiled in. getextmntent() always fell back to the dev_t
comparison via stat64(), which correctly follows symlinks.
The fix to that autoconf check, in 2b930f63f8 ("config: fix
STATX_MNT_ID detection"), caused HAVE_STATX_MNT_ID to be properly
defined on kernels that support it, activating the broken
AT_SYMLINK_NOFOLLOW path for the first time and exposing the
regression.
The fix is to drop AT_SYMLINK_NOFOLLOW from the statx() call so that
symlinks are followed, matching the behavior of stat64() on the same
path.
Verified with a minimal reproducer: created two ZFS datasets, placed a
symlink inside the first pointing into the second, and confirmed that
"zfs list -Ho name <symlink>" returns the dataset containing the
symlink's target rather than the dataset containing the symlink.
Signed-off-by: Prakash Surya <prakash.surya@perforce.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
The spa_sync thread waits on ->spa_txg_zio and will set ZIO_WAIT_DONE
before running the sync tasks. The dmu_tx_commit() call must be done
after we add the child zio to the ->spa_txg_zio parent otherwise its
possible the child is added after txg_sync has waited.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18276
Remove redundant dsl_pool variable and duplicate spa_get_dsl()
call in vdev_rebuild_thread.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes#18263
Update freebsd15-0s builder to freebsd15-1s and point it at the
15.1-PRERELEASE tag. The previous freebsd-15.0-STABLE images are
no longer available.
Additionally, add a freebsd15-0r stanza for the RELEASE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
- Add Fedora 44 to CI tests
- Fix build issues from the newer compiler. These are mostly 'char *'
to 'const char *' conversions.
- Fix threadsappend.c test waiting for the same thread TID twice.
This caused the test to hang on F44 (but strangely not other OSs?)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18478
The d_u union introduced in 3.18 is now anonymous, so we need to detect
it and decide the right way to name d_alias.
Note that we used to have support for both names to support kernels
before 3.18, so this commit is effectively reverting the commit that
removed that support, efc293e371.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18471
Only call txg_wait_synced() when rebuild IOs were issued for this
metaslab. This is a small optimization since in practice the first
metaslab is very likely to have allocations and cause vr_last_txg
to be initialized. After this point when processing empty metaslabs
txg_wait_synced() is called but with an already committed txg so it
will not wait. Still it's better not to call txg_wait_synced() at
all when it's not needed.
Reviewed-by: Andriy Tkachuk <atkachuk@wasabi.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18482
Currently, after rebuild (aka sequential resilver), checksum
errors can be seen sometimes on the spare vdev or draid spare.
On my laptop, it happens from 2 to 4 times of running
redundancy_draid_spare1 test in a loop for 100 times.
It looks like there's a race in vdev_rebuild_thread() when the
rebuild of space map ranges is finished and we re-enable
allocations from the metaslab too soon: a new allocations may
happen from that metaslab before txg with the rebuilt ranges is
sync-ed, causing undesirable interference.
Solution: wait for the txg to be sync-ed before enabling metaslab.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com>
Closes#18307Closes#18319Closes#18473
When sequentially resilvering a dRAID pool it's possible that a few
correctable checksum errors will be reported. This is a known issue
which is occasionally observed in the CI. Until it's resolved we
want the test case to tolerate a few checksum errors in this scenario
to prevent false positives in the CI.
This change also has the additional side effect of standardizing in
one location how the dRAID pool integrity is verified.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #18307
Issue #18319Closes#18436
Automake's default tar formats (v7 pre-1.18, ustar since) impose path
length limits that drop several long test filenames from the release
tarball when `make dist` runs. Pax format has no such limit and is
read by GNU tar 1.14+ and libarchive/bsdtar.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes: #17276Closes: #18465
- We've seen occasional 'ERROR 502: Bad Gateway' from the runner trying
to download an image with axel. Axel can open multiple connections for
a faster download, so maybe that's causing problems. This commit adds
in a fallback to curl if the axel download doesn't work.
- Update merge_summary.awk to print out killed tests in the summary.
We've seen cases where the summary page was red but there were no test
failures printed. This is because one of the VMs had too may
killed tests, which caused the total test time to run too long and
caused the runner to timeout qemu-6-test.sh. When the runner kills off
qemu-6-tests.sh, it means we never generate the nice summary page
for that VM listing the killed off tests. This commit parses the
partial test logs for killed off tests and includes them in the
merge_summary.awk output.
- Print an error message in the summary page if one of the VMs
didn't complete ZTS. This helps draw attention to a VM crash.
- FreeBSD sometimes has broken links to their CI image. When that
happens, select the newest nightly snapshot image as an alternative.
This is needed right now, since the current images in the FreeBSD 16
"current/" directory are returning 404 errors.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18460
Fix a bug where an cgroup-OOM-killed process can cause a panic:
usercopy: Kernel memory exposure attempt detected from vmalloc (offset
1007584, size 217120)!
kernel BUG at mm/usercopy.c:102!
This was caused by zfs_uiomove() not correctly returning EFAULT
for short copies.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15918Closes#18408
dmu_write_direct_done() passes dmu_sync_arg_t to
dmu_sync_done(), which updates the override state and
frees the completion context. The Direct I/O error path
then still dereferences dsa->dsa_tx while rolling the
dirty record back with dbuf_undirty(), resulting in a
use-after-free.
Save dsa->dsa_tx in a local variable before calling
dmu_sync_done() and use that saved tx for the error
rollback. This preserves the existing ownership model
for dsa and does not change the Direct I/O write
semantics.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: gality369 <gality369@example.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Closes#18440
Switch to incremental range tree processing in dnode_sync() to avoid
unsafe lock dropping during zfs_range_tree_walk(). This also ensures
the free ranges remain visible to dnode_block_freed() throughout the
sync process, preventing potential stale data reads.
This patch:
- Keeps the range tree attached during processing for visibility.
- Processes segments one-by-one by restarting from the tree head.
- Uses zfs_range_tree_clear() to safely handle ranges that may have
been modified while the lock was dropped.
- adds ASSERT()s to document that we don't expect dn_free_ranges
modification outside of sync context.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Issue #18186Closes#18235
zfs_range_tree_remove_impl() used a bare panic() when a segment to be
removed was not completely overlapped by an existing tree entry. Every
other consistency check in range_tree.c uses zfs_panic_recover(), which
respects the zfs_recover tunable and allows pools with on-disk
corruption to be imported and recovered. This one call was
inconsistent, making the partial-overlap case unrecoverable regardless
of zfs_recover.
Replace panic() with zfs_panic_recover() so that operators can set
zfs_recover=1 to import a corrupted pool and reclaim data, consistent
with all other range tree error paths.
Related-to: https://github.com/openzfs/zfs/issues/13483
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes#18255
The Makefile.am files from libshare, libtpool, libunicode, and libuutil
do not have SPDX lines. This is because those Makefiles only got SPDX
lines after the big Makefile merge in commits like 309006a0c and
0d44b58d7 (which have not been ported to this branch). Add the
Makefiles to the whitelist here so spdxcheck.pl passes.
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
When copy_file_range overwrites a recent truncation, subsequent reads
can incorrectly determine that it is read hole instead of reading the
cloned blocks.
This can happen when the following conditions are met:
- Truncate adds blkid to dn_free_ranges
- A new TXG is created
- copy_file_range calls dmu_brt_clone which override the block pointer
and set DB_NOFILL
- Subsequent read, given DB_NOFILL, hits dbuf_read_impl and
dbuf_read_hole
- dbuf_read_hole calls dnode_block_freed, which returns TRUE because the
truncated blkids are still in dn_free_ranges
This will not happen if the clone and truncate are in the same TXG,
because the block clone would update the current TXG's dn_free_ranges,
which is why this bug only triggers under high IO load (such as
compilation).
Fix this by skipping the dnode_block_freed call if the block is
overridden. The fix shouldn't cause an issue when the cloned block is
subsequently freed in later TXGs, as dbuf_undirty would remove the
override.
This requires a dedicated test program as it is much harder to trigger
with scripts (this needs to generate a lot of I/O in short period of
time for the bug to trigger reliably).
Assisted-by: Gemini:gemini-3.1-pro
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gary Guo <gary@kernel.org>
Closes#18412Closes#18421
zfsctl_snapshot_mount() holds z_teardown_lock(R) across
call_usermodehelper(), which spawns a mount process that needs
namespace_sem(W) via move_mount. Reading /proc/self/mountinfo holds
namespace_sem(R) and needs z_teardown_lock(R) via zpl_show_devname.
When zfs_suspend_fs (from zfs recv or zfs rollback) queues
z_teardown_lock(W), the rrwlock blocks new readers, completing the
deadlock cycle.
Fix by releasing z_teardown_lock(R) after gathering the dataset name
and mount path, before any blocking operation. Everything after the
release operates on local string copies or uses its own
synchronization. The parent zfsvfs pointer remains valid because the
caller holds a path reference to the automount trigger dentry.
Releasing the lock allows zfs_suspend_fs to proceed concurrently
with the mount helper, so dmu_objset_hold in zpl_get_tree can
transiently fail with ENOENT during the clone swap. The mount
helper fails, EISDIR is returned, and the VFS falls back to the
ctldir stub (empty directory) until the next access retries.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18415
When getzfsvfs() succeeds (incrementing s_active via
zfs_vfs_ref()), but z_unmounted is subsequently found to
be B_TRUE, zfsvfs_hold() returns EBUSY without calling
zfs_vfs_rele(). This permanently leaks the VFS superblock
s_active reference, preventing generic_shutdown_super()
from ever firing, which blocks dmu_objset_disown() and
makes the pool permanently unexportable (EBUSY).
Add the missing zfs_vfs_rele() call, guarded by
zfs_vfs_held() to handle the zfsvfs_create() fallback
path where no VFS reference exists. This matches the
existing cleanup pattern in zfsvfs_rele().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: mischivus <1205832+mischivus@users.noreply.github.com>
Closes#18309Closes#18310
- Remove line where we disable stdout at the end of qemu-1-setup.sh
- Fix comment switching the 2x75GB -> 1x150GB cases
- Add some more debug to the end of the script
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18441
When a VM fails to launch or is unreachable the qemu-7-prepare.sh
script will fail to collect the artifacts due to the missing vm*
directories. We want to collect as much diagnostic information as
possible, when missing create the directory to allow the subsequent
steps to proceed normally. Additionally, we don't want to fail
if the /tmp/summary.txt file is missing.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18438
We've seen some qemu-1-setup failures while trying to change the
runner's block device scheduler value to 'none':
We have a single 150GB block device
Setting up swapspace version 1, size = 16 GiB (17179865088 bytes)
no label, UUID=7a790bfe-79e5-4e38-b208-9c63fe523294
tee: '/sys/block/s*/queue/scheduler': No such file or directory
Luckily, we don't need to set the scheduler anymore on modern kernels:
https://github.com/openzfs/zfs/issues/9778#issuecomment-569347505
This commit just removes the code that sets the scheduler.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18437
Update the META file to reflect compatibility with the 7.0
kernel.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18435
Replace semicolons with && so build failures are not masked by the
subsequent lockfile cleanup. Use trap to ensure the lockfile is
removed on both success and failure.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christos Longros <chris.longros@gmail.com>
Closes#18206Closes#18424
Currently, when more than nparity disks get faulted during the
rebuild, only first nparity disks would go to faulted state, and
all the remaining disks would go to degraded state. When a hot
spare is attached to that degraded disk for rebuild creating the
spare mirror, only that hot spare is getting rebuilt, but not the
degraded device. So when later during scrub some other attached
draid spare happens to map to that spare, it will end up with
cksum error.
Moreover, if the user clears the degraded disk from errors, the
data won't be resilvered to it, hot spare will be detached almost
immediately and the data that was resilvered only to it will be
lost.
Solution: write to all mirrored devices during rebuild, similar
to traditional/healing resilvering, but only if we can verify
the integrity of the data, or when it's the draid spare we are
writing to, in which case we are writing to a reserved spare
space, and there is no danger to overwrite any good data.
The argument that writing only to rebuilding draid spare vdev is
faster than writing to normal device doesn't hold since, at a
specific offset being rebuilt, draid spare will be mapped to a
normal device anyway.
redundancy_draid_degraded2 automation test is added also to
cover the scenario.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com>
Closes#18414
The GH artifacts action now lets you disable auto-zipping your
artifacts. Previously, GH would always automatically put your
artifacts in a ZIP file. This is annoying when your artifacts
are already in a tarball.
Also update the following action versions
checkout: v4 -> v6
upload-artifact: v4 -> v7
download-artifact: v4 -> v8
Lastly, fix a issue where zfs-qmeu-packages now needs to power
cycle the VM.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18411
ztest can enable and disable the multihost property when testing.
This can result in a failure when attempting to import an existing
pool when multihost=on but no /etc/hostid file exists. Update the
workflow to use zgenhostid to create /etc/hostid when not present.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18413
When sequentially resilvering allow a dRAID child to be read
as long as the DTLs indicate it should have a good copy of the
data and the leaf isn't being rebuilt. The previous check was
slightly too broad and would skip dRAID spare and replacing
vdevs if one of their children was being replaced. As long
as there exists enough additional redundancy this is fine, but
when there isn't this vdev must be read in order to correctly
reconstruct the missing data.
A new test case has been added which exhausts the available
redundancy, faults another device causing it to be degraded,
and then performs a sequential resilver for the degraded device.
In such a situation enough redundancy exists to perform the
replacement and a scrub should detect no checksum errors.
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18405
Similar to FreeBSD stop issuing prefetches on POSIX_FADV_SEQUENTIAL.
It should not have this semantics, only hint speculative prefetcher,
if access ever happen later. Instead after POSIX_FADV_WILLNEED
handling call generic_fadvise(), if available, to do all the generic
stuff, including setting f_mode in struct file, that we could later
use to control prefetcher as part of read/write operations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes#18395
Free 35GB of unused files, mostly from unused development environments.
This helps with the out of disk space problems we were seeing on
FreeBSD runners.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18400
A cleanup of opportunity. Since we already are modifying the contents of
zfs_mnt_t, we've broken any API guarantee, so we might as well go the
rest of the way and get rid of it, and just pass the osname and/or the
vfs_t directly.
It seems like zfs_mnt_t was never really needed anyway; it was added in
1c2555ef92 (March 2017) to minimise the difference to illumos, but
zfs_vfsops was made platform-specific anyway in 7b4e27232d.
We also remove setting SB_RDONLY on the caller's flags when failing a
read-write remount on a read-only snapshot or pool. Since 0f608aa6ca
the caller's flags have been a pointer back to fc->sb_flags, which are
discarded without further ceremony when the operation fails, so the
change is unnecessary and we can simplify the call further.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
Before Linux 5.8 (include RHEL8), a fixed set of "forbidden" options
would be rejected outright. For those, we work around it by providing
our own option parser to avoid the codepath in the kernel that would
trigger it.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
Adds zpl_parse_param and wires it up to the fs_context. This uses the
kernel's standard mount option parsing infrastructure to keep the work
we need to do to a minimum. We simply fill in the vfs_t we attached to
the fs_context in the previous commit, ready to go for the mount/remount
call.
Here we also document all the options we need to support, and why. It's
a lot of history but in the end the implementation is straightforward.
Finally, if we get SB_RDONLY on the proposed superblock flags, we record
that as the readonly mount option, because we haven't necessarily seen a
"ro" param and we still need to know for remount, the `readonly` dataset
property, etc.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
vfs_t is initially just parameters for the mount or remount operation,
so match them to the lifetime of the fs_context that represents that
operation.
When we actually execute the operation (calling .get_tree or .reconfigure),
transfer ownership of those options to the associated zfsvfs_t.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
We're working to replace this, and its easier to drop it outright while
we get set up.
To keep things compiling, the calls to zfsvfs_parse_options() are
replaced with zfsvfs_vfs_alloc(), though without any option parsing at
all nothing will work. That's ok, next commits are working towards it.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
In a few commits, we're going to need to allocate and free vfs_t from
zpl_super.c as well, so lets keep them uniform.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18377
Currently, it's possible that draid vdev asize would decrease
after disks replacements when the disk size is a little less than
all other disks in the pool. In such situations, import would
fail on this check in vdev_open():
/*
* Make sure the allocatable size hasn't shrunk too much.
*/
if (asize < vd->vdev_min_asize) {
vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_BAD_LABEL);
return (SET_ERROR(EINVAL));
}
Solution: fix vdev_draid_min_asize() so that it would round up
the required minimal disk capacity to the VDEV_DRAID_ROWHEIGHT.
This would refuse replacements with the disks whose size is less
than minimally required to avoid draid asize decrement.
Note: we also use VDEV_DRAID_ROWHEIGHT in vdev_draid_open() when
calculating asize, and thats why we need to round up min_size at
vdev_draid_min_asize() to avoid asize drops.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes#18380
Normally, kernel gives any LSM registering a `sb_eat_lsm_opts` hook a
first look at mount options coming in from a userspace mount request.
The LSM may process and/or remove any options. Whatever is left is
passed to the filesystem.
This is how the dataset properties `context`, `fscontext`, `defcontext`
and `rootcontext` are used to configure ZFS mounts for SELinux. libzfs
will fetch those properties from the dataset, then add them to the mount
options.
In 0f608aa6ca (#18216) we added our own mount shims to cover the loss of
the kernel-provided ones. It turns out that if a filesystem provides a
`.parse_monolithic callback`, it is expected to do _all_ mount option
parameter processing - the kernel will not get involved at all. Because
of that, LSMs are never given a chance to process mount options. The
`context` properties are never seen by SELinux, nor are any other
options targetting other LSMs.
Fix this by calling `security_sb_eat_lsm_opts()` in
`zpl_parse_monolithic()`, before we stash the remaining options for
`zfs_domount()`.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18376
This function was removed in c6442bd3b6: "Removing old code outside
of 4.18 kernsls", but fails at present on PowerPC builds due to the
recent inclusion of 6bc9c0a90522: "powerpc: fix KUAP warning in VMX
usercopy path" in the upstream kernel, which introduces a use of
cpu_feature_keys[], which is a GPL-only symbol. Removing the API
check as it doesn't appear necessary.
Signed-off-by: John Cabaj <john.cabaj@canonical.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Do a ZFS build inside of an ARM runner. This only does a simple
build, it does not run the test suite. The build runs on the
runner itself rather than in a VM, since nesting is not supported on
Github ARM runners.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#18343
Allow restricting ZTS OS targets by setting the vars.ZTS_OS_OVERRIDE
repository variable (e.g. '["debian13"]') to reduce shared runner
contention when running the full OS matrix is unnecessary. When unset,
the existing ci_type-based OS selection is used unchanged.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18342
Target of opportunity; with no other callers, there's no need for it to
be a static function.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
Target of opportunity; with no other callers, there's no need for it to
be a static function.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
With the old API gone, there's no need to massage new-style calls into
its shape and call another function; we can just make those handlers
work directly.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
Removing the HAVE_FS_CONTEXT gates and anything that would be used if it
wasn't set.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18339
It turns out the kernel can also take directory leases, most notably in
the NFS server. Without a setlease handler on the directory file ops,
attempts to open a directory over NFS can fail with EINVAL.
Adding a directory setlease handler was missed in 168023b603. This fixes
that, allowing directories to be properly accessed over NFS.
Sponsored-by: TrueNAS
Reported-by: Satadru Pramanik <satadru@gmail.com>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Observed again in the CI. Put the maybe exception back in place
and reference a newly created issue for this sporadic failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18320
Update the redundancy_draid_spare1 exception to reference an issue
which describes the failure.
Remove the exception for the redundancy_draid_spare3 test. I have
not observed it in local testing. If it reproduces in the CI we
can create a new issue for it and put back the exception.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#18308
statx(2) requires _GNU_SOURCE to be defined in order for sys/stat.h to
produce a definition for struct statx and the STATX_* defines. We get
that at compile time because we pass -D_GNU_SOURCE through to
everything, but in the configure check we aren't setting _GNU_SOURCE, so
we don't find STATX_MNT_ID, and so don't set HAVE_STATX_MNT_ID.
(This was fine before ccf5a8a6fc, because linux/stat.h does not require
_GNU_SOURCE).
Simple fix: in the check, define _GNU_SOURCE before including
sys/stat.h.
Sponsored-by: TrueNAS
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18312
Currently, when there there are several faulted disks with attached
dRAID spares, and one of those disks is cleared from errors (zpool
clear), followed by its spare being detached, the data in all the
remaining spares that were attached while the cleared disk was in
FAULTED state might get corrupted (which can be seen by running scrub).
In some cases, when too many disks get cleared at a time, this can
result in data corruption/loss.
dRAID spare is a virtual device whose blocks are distributed among
other disks. Those disks can be also in FAULTED state with attached
spares on their own. When a disk gets sequentially resilvered (rebuilt),
the changes made by that resilvering won't get captured in the DTL
(Dirty Time Log) of other FAULTED disks with the attached spares to
which the data is written during the resilvering (as it would normally
be done for the changes made by the user if a new file is written or
some existing one is deleted). It is because sequential resilvering
works on the block level, without touching or looking into metadata,
so it doesn't know anything about the old BPs or transactions groups
that it is resilvering. So later on, when that disk gets cleared
from errors and healing resilvering is trying to sync all the data
from its spare onto it, all the changes made on its spare during the
resilvering of other disks will be missed because they won't be
captured in its DTL. That's why other dRAID spares may get corrupted.
Here's another way to explain it that might be helpful. Imagine a
scenario:
1. d1 fails and gets resilvered to some spare s1 - OK.
2. d2 fails and gets sequentially resilvered on draid spare s2. Now,
in some slices, s2 would map to d1, which is failed. But d1 has s1
spare attached, so the data from that resilvering goes to s1, but
not recorded in d1's DTL.
3. Now, d1 gets cleared and its s1 gets detached. All the changes
done by the user (writes or deletions) have their txgs captured
in d1's DTL, so they will be resilvered by the healing resilver
from its spare (s1) - that part works fine. But the data which
was written during resilvering of d2 and went to s1 - that one
will be missed from d1's DTL and won't get resilvered to it. So
here we are:
4. s2 under d2 is corrupted in the slices which map to d1, because
d1 doesn't have that data resilvered from s1.
Now, if there are more failed disks with draid spares attached which
were sequentially resilvered while d1 was failed, d3+s3, d4+s4 and
so on - all their spares will be corrupted. Because, in some slices,
each of them will map to d1 which will miss their data.
Solution: add all known txgs starting from TXG_INITIAL to DTLs of
non-writable devices during sequential resilvering so when healing
resilver starts on disk clear, it would be able to check and heal
blocks from all txgs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes#18286Closes#18294
vdev_rebuild() is always called with spa_config_lock held in
RW_WRITER mode. However, when it tries to call dmu_tx_assign()
the latter may hang on dmu_tx_wait() waiting for available txg.
But that available txg may not happen because txg_sync takes
spa_config_lock in order to process the current txg. So we have
a deadlock case here:
- dmu_tx_assign() waits for txg holding spa_config_lock;
- txg_sync waits for spa_config_lock not progressing with txg.
Here are the stacks:
__schedule+0x24e/0x590
schedule+0x69/0x110
cv_wait_common+0xf8/0x130 [spl]
__cv_wait+0x15/0x20 [spl]
dmu_tx_wait+0x8e/0x1e0 [zfs]
dmu_tx_assign+0x49/0x80 [zfs]
vdev_rebuild_initiate+0x39/0xc0 [zfs]
vdev_rebuild+0x84/0x90 [zfs]
spa_vdev_attach+0x305/0x680 [zfs]
zfs_ioc_vdev_attach+0xc7/0xe0 [zfs]
cv_wait_common+0xf8/0x130 [spl]
__cv_wait+0x15/0x20 [spl]
spa_config_enter+0xf9/0x120 [zfs]
spa_sync+0x6d/0x5b0 [zfs]
txg_sync_thread+0x266/0x2f0 [zfs]
The solution is to pass txg returned by spa_vdev_enter(spa)
at the top of spa_vdev_attach() to vdev_rebuild() and call
dmu_tx_create_assigned(txg) which doesn't wait for txg.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes#18210Closes#18258
The autoconf checks are more than enough to decide whether or not we can
work with this kernel or not.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18295
When a namespace property is changed via zfs set, libzfs remounts the
filesystem to propagate the new VFS mount flags. The current approach
uses mount(2) with MS_REMOUNT, which reads all namespace properties
from ZFS and applies them together. This has two problems:
1. Linux VFS resets unspecified per-mount flags on remount. If an
administrator sets a temporary flag (e.g. mount -o remount,noatime),
a subsequent zfs set on any namespace property clobbers it.
2. Two concurrent zfs set operations on different namespace properties
can overwrite each other's mount flags.
Additionally, legacy datasets (mountpoint=legacy) were never remounted
on namespace property changes since zfs_is_mountable() returns false
for them.
Add zfs_mount_setattr() which uses mount_setattr(2) to selectively
update only the mount flags that correspond to the changed property.
For legacy datasets, /proc/mounts is iterated to update all
mountpoints. On kernels without mount_setattr (ENOSYS), non-legacy
datasets fall back to a full remount; legacy mounts are skipped to
avoid clobbering temporary flags.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#18257
Checking for LD_VERSION in unreliable as not all distros define it on
the compiler's preprocessor.
Explicitly check it via autoconf.
This fixes support for Ubuntu 18.04 on arm64.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Closes#18262
This API has been available since kernel 5.2, and having it available
(almost) everywhere should give us a lot more flexibility for mount
management in the future.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18260
The traditional mount API has been removed, so detect when its not
available and instead use a small adapter to allow our existing mount
functions to keep working.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18216
Kernel devs noted that almost all callers to posix_acl_to_xattr() would
check the ACL value size and allocate a buffer before make the call. To
reduce the repetition, they've changed it to allocate this buffer
internally and return it.
Unfortunately that's not true for us; most of our calls are from
xattr_handler->get() to convert a stored ACL to an xattr, and that call
provides a buffer. For now we have no other option, so this commit
detects the new version and wraps to copy the value back into the
provided buffer and then free it.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18216
It does exactly the same thing, just inverts the return. Detect its
presence or absence and call the right one.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18216
On systems where `$kernelsrc` is different than `$kernelbuild`, the
objtool binary will be located in `$kernelbuild` as it's the result of
running `make prepare` during kernel build.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Louis Leseur <louis.leseur@gmail.com>
Closes#18248Closes#18249
The upcoming 7.0 kernel will no longer fall back to generic_setlease(),
instead returning EINVAL if .setlease is NULL. So, we set it explicitly.
To ensure that we catch any future kernel change, adds a sanity test for
F_SETLEASE and F_GETLEASE too. Since this is a Linux-specific test,
also a small adjustment to the test runner to allow OS-specific helper
programs.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@truenas.com>
Closes#18215
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.