Commit Graph

2547 Commits

Author SHA1 Message Date
Rob Norris
3b64a9619f FreeBSD: zfs_putpages: don't undirty pages until after write completes
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.

So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.

The original version of this (238eab7dc1) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.

For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17533
2025-08-12 22:41:17 -04:00
Mark Johnston
a072611eef Revert "FreeBSD: zfs_putpages: don't undirty pages until after write completes"
This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency.  Revert for now until we have a better
approach.

This reverts commit 238eab7dc1.

Reported by:    Ihor Antonov <ngor@hugpoint.tech>
Discussed with: Rob Norris <rob.norris@klarasystems.com>

References: freebsd/freebsd-src@738a9a7
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Ported-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17533
2025-08-12 22:41:17 -04:00
Attila Fülöp
57b614e025 SIMD: Don't require definition of HAVE_XSAVE
Currently we fail the compilation via the #error directive if
`HAVE_XSAVE` isn't defined. This breaks i586 builds since we check
the toolchains SIMD support only on i686 and onward.

Remove the requirement to fix the build on i586.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #13303
Closes #17590
2025-08-12 17:23:51 -07:00
Rob Norris
f72226a75c Linux: sync: remove async/sync accounting
All this machinery is there to try to understand when there an async
writeback waiting to complete because the intent log callbacks are still
outstanding, and force them with a timely zil_commit(). The next commit
fixes this properly, so there's no need for all this extra housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17584
2025-08-12 17:23:39 -07:00
Olivier Certner
5289f6f961 spa: ZIO_TASKQ_ISSUE: Use symbolic priority
This allows to change the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.

This upstreams commit fd141584cf89d7d2 from FreeBSD src.

Sponsored-by: The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes #17489
2025-08-12 17:16:00 -07:00
Alexander Motin
809b553940 Introduce zfs rewrite subcommand (#17246)
This allows to rewrite content of specified file(s) as-is without
modifications, but at a different location, compression, checksum,
dedup, copies and other parameter values.  It is faster than read
plus write, since it does not require data copying to user-space.
It is also faster for sync=always datasets, since without data
modification it does not require ZIL writing.  Also since it is
protected by normal range range locks, it can be done under any
other load.  Also it does not affect file's modification time or
other properties.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-08-07 12:34:28 -04:00
Rob Norris
abb6211e7a Linux 6.16: remove writepage and readahead_page
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17443
2025-08-07 12:29:45 -04:00
khoang98
c405a7a35c Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Kaitlin Hoang <kthoang@amazon.com>
Closes #17561
2025-08-07 12:15:14 -04:00
shodanshok
4808641e71 enforce arc_dnode_limit
Linux kernel shrinker in the context of null/root memcg does not scan
dentry and inode caches added by a task running in non-root memcg. For
ZFS this means that dnode cache routinely overflows, evicting valuable
meta/data and putting additional memory pressure on the system.

This patch restores zfs_prune_aliases as fallback when the kernel
shrinker does nothing, enabling zfs to actually free dnodes. Moreover,
it (indirectly) calls arc_evict when dnode_size > dnode_limit.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17487
Closes #17542
2025-08-07 12:11:34 -04:00
Alexander Motin
30fa92bff3 Increase meta-dnode redundancy in "some" mode
Loss of one indirect block of the meta dnode likely means loss of
the whole dataset.  It is worse than one file that the man page
promises, and in my opinion is not much better than "none" mode.

This change restores redundancy of the meta-dnode indirect blocks,
while same time still corrects expectations in the man page.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17339
2025-08-05 13:15:44 -04:00
Paul Dagnelie
a46ce73ca8 Make ganging redundancy respect redundant_metadata property (#17073)
The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings. This applies to all data and metadata
blocks in zfs, with one exception: gang blocks. Gang blocks currently
just take the copies property of the IO being ganged and, if it's 1,
sets it to 2. This means that we always make at least two copies of a
gang header, which is good for resilience. However, if the users care
more about performance than resilience, their gang blocks will be even
more of a penalty than usual.

We add logic to calculate the number of gang headers copies directly,
and store it as a separate IO property. This is stored in the IO
properties and not calculated when we decide to gang because by that
point we may not have easy access to the relevant information about what
kind of block is being stored. We also check the redundant_metadata
property when doing so, and use that to decide whether to store an extra
copy of the gang headers, compared to the underlying blocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-08-05 13:10:40 -04:00
Igor Ostapenko
95abbc71c3 range_tree: Provide more debug details upon unexpected add/remove
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17581
2025-08-05 12:34:54 -04:00
Chunwei Chen
c79d5e4f33 Define sops->free_inode() to prevent use-after-free during lookup
On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can
be dereferenced without refcounts and locks. For this reason, dentry and
inode must only be freed after RCU grace period.

However, zfs currently frees inode in zfs_inode_destroy synchronously
and we can't use GPL-only call_rcu() in zfs directly. Fortunately, on
Linux 5.2 and after, if we define sops->free_inode(), the kernel will do
call_rcu() for us.

This issue may be triggered more easily with init_on_free=1 boot
parameter:

BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:selinux_inode_permission+0x10e/0x1c0
Call Trace:
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? security_inode_permission+0x37/0x60
 ? __die_body.cold+0x8/0xd
 ? no_context+0x113/0x220
 ? exc_page_fault+0x6d/0x130
 ? asm_exc_page_fault+0x1e/0x30
 ? selinux_inode_permission+0x10e/0x1c0
 security_inode_permission+0x37/0x60
 link_path_walk.part.0.constprop.0+0xb5/0x360
 ? path_init+0x27d/0x3c0
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xc0/0x1d0
 ? __check_object_size.part.0+0x123/0x150
 ? strncpy_from_user+0x4e/0x130
 ? getname_flags.part.0+0x4b/0x1c0
 vfs_statx+0x72/0x120
 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
 __do_sys_newlstat+0x39/0x70
 ? __x64_sys_ioctl+0x8d/0xd0
 do_syscall_64+0x30/0x40
 entry_SYSCALL_64_after_hwframe+0x62/0xc7

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes #17546
2025-08-05 12:30:23 -04:00
Paul Dagnelie
661310ff5c FDT dedup log sync -- remove incremental
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Don Brady <don.brady@klarasystems.com>
Authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17038
2025-08-05 12:15:21 -04:00
Rob Norris
b00bc81b05 ioctl: remove FICLONE/FICLONERANGE/FIDEDUPERANGE compat
These are only required to support these ioctls on Linux <4.5. Since
4.18 is our cutoff, we don't need this code anymore.

Also removing related test things that will never match again.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17308
2025-06-17 10:50:27 -07:00
Don Brady
6b67a5bdd3 During pool export flush the ARC asynchronously
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16215
2025-06-17 10:50:26 -07:00
Rob Norris
a65225ec7e FreeBSD: zfs_putpages: don't undirty pages until after write completes
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17445
2025-06-17 10:50:26 -07:00
Alexander Motin
0c9cdd1606 Improve block cloning transactions accounting
Previous dmu_tx_count_clone() was broken, stating that cloning is
similar to free.  While they might be from some points, cloning
is not net-free.  It will likely consume space and memory, and
unlike free it will do it no matter whether the destination has
the blocks or not (usually not, so previous code did nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17431
2025-06-17 10:50:26 -07:00
Attila Fülöp
e9c1e08e07 Linux build: silence objtool warnings
After #17401 the Linux build produces some stack related warnings.

Silence them with the `STACK_FRAME_NON_STANDARD` macro.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17410
2025-06-17 10:50:26 -07:00
Rob Norris
b8f80812a3 tunables: remove __check_old_set_param workaround
This was fully removed from Linux in 4.15, so we won't be seeing it
again.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
97696962b5 tunables: remove unused param get/set aliases
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
06fd6dc6f7 tunables: use Linux ullong param ops for u64
Since 3.17 Linux has provided param ops for 64-bit ints, so we don't
need to use our own anymore.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
28ff5ff1c6 tunables: remove support for s64 tunables
Nothing uses them now.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
840b070ec7 tunables: remove FreeBSD compat macros for Linux module params
Nothing in any FreeBSD code uses them.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Ameer Hamza
1215c3b609 Expose dataset encryption status via fast stat path
In truenas_pylibzfs, we query list of encrypted datasets several times,
which is expensive. This commit exposes a public API zfs_is_encrypted()
to get encryption status from fast stat path without having to refresh
the properties.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17368
2025-06-17 10:49:40 -07:00
Alexander Motin
97c1fb6ad5 ARC: Notify dbuf cache about target size reduction
ARC target size might drop significantly under memory pressure,
especially if current ARC size was much smaller than the target.
Since dbuf cache size is a fraction of the target ARC size, it
might need eviction too.  Aside of memory from the dbuf eviction
itself, it might help ARC by making more buffers evictable.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17314
(cherry picked from commit 89a8a91582)
2025-05-28 16:00:28 -07:00
Alexander Motin
db290fd48b Linux: Stop using NR_FILE_PAGES for ARC scaling
I've found that QEMU/KVM guest memory accounted as shared also
included into NR_FILE_PAGES.  But it is actually a non-evictable
anonymous memory.  Using it as a base for zfs_arc_pc_percent
parameter makes ARC to ignore shrinker requests while page cache
does not really have anything to evict, ending up in OOM killer
killing the QEMU process.

Instead use of NR_ACTIVE_FILE + NR_INACTIVE_FILE should represent
the part of a page cache that is actually evictable, which should
be safer to use as a reference for ARC scaling.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17334
(cherry picked from commit 0aa83dce99)
2025-05-28 16:00:28 -07:00
Rob Norris
db988fabfb linux/uio: remove "skip" offset for UIO_ITER
For UIO_ITER, we are just wrapping a kernel iterator. It will take care
of its own offsets if necessary. We don't need to do anything, and if we
do try to do anything with it (like advancing the iterator by the skip
in zfs_uio_advance) we're just confusing the kernel iterator, ending up
at the wrong position or worse, off the end of the memory region.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17298
(cherry picked from commit 2ee5b51a57)
2025-05-28 16:00:28 -07:00
Rob Norris
ce9cd12c97 txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.

This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.

Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.

The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit a7de203c86)
2025-05-28 16:00:28 -07:00
Rob Norris
c85f2fd531 cred: properly pass and test creds on other threads (#17273)
### Background

Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.

Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.

However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
instead, we capture the permissions by saving CRED() in the user task,
and then using it for the check through the secpolicy_* functions.

### Current implementation

The current code calls CRED() to get the credential, which gets a
pointer to the cred_t inside the current task and passes it to the
worker task. However, it doesn't take a reference to the cred_t, and so
expects that it won't change, and that the task continues to exist. In
practice that is always the case, because we don't let the calling task
return from the kernel until the work is done.

For Linux, we also take a reference to the current task, because the
Linux credential APIs for the most part do not check an arbitrary
credential, but rather, query what a task can do. See
secpolicy_zfs_proc(). Again, we don't take a reference on the task, just
a pointer to it.

### Changes

We change to calling crhold() on the task credential, and crfree() when
we're done with it. This ensures it stays alive and unchanged for the
duration of the call.

On the Linux side, we change the main policy checking function
priv_policy_ns() to use override_creds()/revert_creds() if necessary to
make the provided credential active in the current task, allowing the
standard task-permission APIs to do the needed check. Since the task
pointer is no longer required, this lets us entirely remove
secpolicy_zfs_proc() and the need to carry a task pointer around as
well.

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
(cherry picked from commit c8fa39b46c)
2025-05-28 16:00:28 -07:00
Alexander Motin
243a46f28d Cleanup VERIFY() macros (#17163)
- Fix VERIFY3B() when given non-boolean values.
 - Map EQUIV() into VERIFY3B(,==,) as equivalent.
 - Tune messages for better readability and to closer match source
code for easier search.  Unify user-space messages with kernel.
 - Tune printed types and remove %px outside of Linux kernel.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: @ImAwsumm
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
(cherry picked from commit 4866c2fabf)
2025-05-28 16:00:28 -07:00
Paul Dagnelie
fbac52e1e9 Fix FDT rollback to not overwrite unnecessary fields (#17205)
When a dedup write fails, we try to roll the DDT entry back to a known
good state. However, this also rolls the refcounts and the last-update
time back to the state they were at when we started this write. This
doesn't appear to be able to cause any refcount leaks (after the fix in
17123). This PR prevents that from happening by only rolling back the
parts of the DDT entry that have been updated by the write so far.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-16 09:59:45 -07:00
Paul Dagnelie
9f0be8fca0 Fix dspace underflow bug
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same.  Otherwise we may get
negative overflow or respective assertion spa_update_dspace() if
removed special/dedup vdev is bigger than all normal class space.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17183
2025-04-16 09:59:45 -07:00
Piotr Kubaj
12657df52a simd_powerpc.h: enable FPU on FreeBSD
FreeBSD nowadays supports FPU in the kernel on powerpc*, so enable it.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Piotr Kubaj <pkubaj@FreeBSD.org>
Closes #17191
2025-04-16 09:59:45 -07:00
Ameer Hamza
ab455c7b80 zed: Ensure spare activation after kernel-initiated device removal
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17187
2025-04-16 09:59:45 -07:00
Rob Norris
9e009acbdc dmu_tx: rename dmu_tx_assign() flags from TXG_* to DMU_TX_* (#17143)
This helps to avoids confusion with the similarly-named
txg_wait_synced().

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-16 09:59:45 -07:00
Rob Norris
76d0c74c35 SPDX: license tags: LicenseRef-OpenZFS-ThirdParty-PublicDomain
SPDX have repeatedly rejected the creation of a tag for a public domain
dedication, as not all dedications are clear and unambiguious in their
meaning and not all jurisdictions permit relinquishing a copyright
anyway.

A reasonably common workaround appears to be to create a local
(project-specific) identifier to convey whatever meaning the project
wishes it to. To cover OpenZFS' use of third-party code with a public
domain dedication, we use this custom tag.

Further reading:
- https://github.com/spdx/old-wiki/blob/main/Pages/Legal%20Team/Decisions/Dealing%20with%20Public%20Domain%20within%20SPDX%20Files.md
- https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/
- https://cr.yp.to/spdx.html

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:45 -07:00
Rob Norris
6b2c046d18 SPDX: license tags: GPL-2.0-or-later
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
091da72c66 SPDX: license tags: MIT
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
8cacac7ed4 SPDX: license tags: BSD-3-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
865ca576ab SPDX: license tags: BSD-2-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
9530eb64e0 SPDX: license tags: CDDL-1.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
shodanshok
52f3f92bbf Add receive:append permission for limited receive
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see zfs-receive man
page). This means an user having the receive permission can effectively
delete data on receiving side, even if such user does not have explicit
rollback or destroy permissions.

This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.

Fixes #16943
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17015
2025-04-02 17:06:40 -07:00
Alexander Motin
53cbf06d68 Fix deduplication of overridden blocks
Implementation of DDT pruning introduced verification of DVAs in
a block pointer during ddt_lookup() to not by mistake free previous
pruned incarnation of the entry.  But when writing a new block in
zio_ddt_write() we might have the DVAs only from override pointer,
which may never have "D" flag to be confused with pruned DDT entry,
and we'll abandon those DVAs if we find a matching entry in DDT.

This fixes deduplication for blocks written via dmu_sync() for
purposes of indirect ZIL write records, that I have tested.  And
I suspect it might actually allow deduplication for Direct I/O,
even though in an odd way -- first write block directly and then
delete it later during TXG commit if found duplicate, which part
I haven't tested.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17120
2025-04-02 17:05:24 -07:00
Rob Norris
6503f8c6f0 Linux/vnops: implement STATX_DIOALIGN
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16972
2025-04-02 17:04:14 -07:00
Alan Somers
ad07b09cc3 Verify every block pointer is either embedded, hole, or has a valid DVA
Now instead of crashing when attempting to read the corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:

```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```

This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.

And in zdb, don't crash when printing a block pointer with no valid
DVAs.  If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug.  zdb should be able to handle the
situation gracefully.

Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR.  This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why.  It doesn't appear to do anything useful, so remove it.

[^1]: b24ab67627

Fixes		#17077
Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17078
2025-04-02 17:03:01 -07:00
Fabian-Gruenbichler
112dca3483 linux: zvols: correctly detect flush requests (#17131)
since 4.10, bio->bi_opf needs to be checked to determine all kinds of
flush requests. this was the case prior to the commit referenced below,
but the order of ifdefs was not the usual one (newest up top), which
might have caused this to slip through.

this fixes a regression when using zvols as Qemu block devices, but
might have broken other use cases as well. the symptoms are that all
sync writes from within a VM configured to use such a virtual block
devices are ignored and treated as async writes by the host ZFS layer.

this can be verified using fio in sync mode inside the VM, for example
with

 fio \
 --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G \
 --time_based --runtime=60 --group_reporting --stonewall --name=cc1 \
 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 \
 --numjobs=1 --sync=1

which shows an IOPS number way above what the physical device underneath
supports, with "zpool iostat -r 1" on the hypervisor side showing no
sync IO occuring during the benchmark.

with the regression fixed, both fio inside the VM and the IO stats on
the host show the expected numbers.

Fixes: 846b598519
"config: remove HAVE_REQ_OP_* and HAVE_REQ_*"

Signed-off-by: Fabian-Gruenbichler <f.gruenbichler@proxmox.com>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-02 17:03:01 -07:00
Rob Norris
5f7037067e
Revert "zinject: count matches and injections for each handler" (#17137)
Adding fields to zinject_record_t unexpectedly extended zfs_cmd_t,
preventing some things working properly with 2.3.1 userspace tools
against 2.3.0 kernel module.

This reverts commit fabdd502f4.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-24 13:49:10 -07:00
Ameer Hamza
637f918211 arc: avoid possible deadlock in arc_read
In l2arc_evict(), the config lock may be acquired in reverse order
(e.g., first the config lock (writer), then a hash lock) unlike in
arc_read() during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17071
2025-02-28 00:42:29 +05:00
aokblast
383256c329 spa: fix signature mismatch for spa_boot_init as eventhandler required
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes #17088
2025-02-28 00:42:29 +05:00