Commit Graph

2530 Commits

Author SHA1 Message Date
Alexander Motin
0c9cdd1606 Improve block cloning transactions accounting
The previous dmu_tx_count_clone() was broken, treating cloning as
similar to free.  While the two might be similar in some respects,
cloning is not net-free.  It will likely consume space and memory, and
unlike free it will do so whether or not the destination has
the blocks (usually not, so the previous code did nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17431
2025-06-17 10:50:26 -07:00
Attila Fülöp
e9c1e08e07 Linux build: silence objtool warnings
After #17401 the Linux build produces some stack related warnings.

Silence them with the `STACK_FRAME_NON_STANDARD` macro.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17410
2025-06-17 10:50:26 -07:00
Rob Norris
b8f80812a3 tunables: remove __check_old_set_param workaround
This was fully removed from Linux in 4.15, so we won't be seeing it
again.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
97696962b5 tunables: remove unused param get/set aliases
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
06fd6dc6f7 tunables: use Linux ullong param ops for u64
Since 3.17 Linux has provided param ops for 64-bit ints, so we don't
need to use our own anymore.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
28ff5ff1c6 tunables: remove support for s64 tunables
Nothing uses them now.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Rob Norris
840b070ec7 tunables: remove FreeBSD compat macros for Linux module params
Nothing in any FreeBSD code uses them.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-06-17 10:50:26 -07:00
Ameer Hamza
1215c3b609 Expose dataset encryption status via fast stat path
In truenas_pylibzfs, we query the list of encrypted datasets several times,
which is expensive. This commit exposes a public API zfs_is_encrypted()
to get encryption status from fast stat path without having to refresh
the properties.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17368
2025-06-17 10:49:40 -07:00
Alexander Motin
97c1fb6ad5 ARC: Notify dbuf cache about target size reduction
ARC target size might drop significantly under memory pressure,
especially if current ARC size was much smaller than the target.
Since dbuf cache size is a fraction of the target ARC size, it
might need eviction too.  Aside from the memory freed by the dbuf
eviction itself, it might help ARC by making more buffers evictable.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17314
(cherry picked from commit 89a8a91582)
2025-05-28 16:00:28 -07:00
Alexander Motin
db290fd48b Linux: Stop using NR_FILE_PAGES for ARC scaling
I've found that QEMU/KVM guest memory accounted as shared is also
included in NR_FILE_PAGES.  But it is actually non-evictable
anonymous memory.  Using it as a base for the zfs_arc_pc_percent
parameter makes ARC ignore shrinker requests while the page cache
does not really have anything to evict, ending up with the OOM killer
killing the QEMU process.

Instead, NR_ACTIVE_FILE + NR_INACTIVE_FILE should represent
the part of the page cache that is actually evictable, which should
be safer to use as a reference for ARC scaling.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17334
(cherry picked from commit 0aa83dce99)
2025-05-28 16:00:28 -07:00
Rob Norris
db988fabfb linux/uio: remove "skip" offset for UIO_ITER
For UIO_ITER, we are just wrapping a kernel iterator. It will take care
of its own offsets if necessary. We don't need to do anything, and if we
do try to do anything with it (like advancing the iterator by the skip
in zfs_uio_advance) we're just confusing the kernel iterator, ending up
at the wrong position or worse, off the end of the memory region.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17298
(cherry picked from commit 2ee5b51a57)
2025-05-28 16:00:28 -07:00
Rob Norris
ce9cd12c97 txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.

This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.

Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.

The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit a7de203c86)
2025-05-28 16:00:28 -07:00
Rob Norris
c85f2fd531 cred: properly pass and test creds on other threads (#17273)
### Background

Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.

Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.

However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
instead, we capture the permissions by saving CRED() in the user task,
and then using it for the check through the secpolicy_* functions.

### Current implementation

The current code calls CRED() to get the credential, which gets a
pointer to the cred_t inside the current task and passes it to the
worker task. However, it doesn't take a reference to the cred_t, and so
expects that it won't change, and that the task continues to exist. In
practice that is always the case, because we don't let the calling task
return from the kernel until the work is done.

For Linux, we also take a reference to the current task, because the
Linux credential APIs for the most part do not check an arbitrary
credential, but rather, query what a task can do. See
secpolicy_zfs_proc(). Again, we don't take a reference on the task, just
a pointer to it.

### Changes

We change to calling crhold() on the task credential, and crfree() when
we're done with it. This ensures it stays alive and unchanged for the
duration of the call.

On the Linux side, we change the main policy checking function
priv_policy_ns() to use override_creds()/revert_creds() if necessary to
make the provided credential active in the current task, allowing the
standard task-permission APIs to do the needed check. Since the task
pointer is no longer required, this lets us entirely remove
secpolicy_zfs_proc() and the need to carry a task pointer around as
well.
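
A rough sketch of the hold/free pattern described above, for illustration only. The `toy_cred_t` type and the `toy_*` functions are hypothetical stand-ins, not the real `cred_t`, `crhold()` or `crfree()`; the point is only that the worker takes its own reference, so the credential outlives the caller dropping its reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted credential. */
typedef struct toy_cred {
    unsigned int refs;  /* number of holders */
    unsigned int uid;   /* stand-in for the captured permissions */
} toy_cred_t;

/* Create a credential with one reference owned by the caller. */
toy_cred_t *
toy_cred_create(unsigned int uid)
{
    toy_cred_t *cr = malloc(sizeof (*cr));
    if (cr == NULL)
        return (NULL);
    cr->refs = 1;
    cr->uid = uid;
    return (cr);
}

/* Take an additional reference, e.g. before handing cr to a worker. */
void
toy_crhold(toy_cred_t *cr)
{
    assert(cr != NULL && cr->refs > 0);
    cr->refs++;
}

/* Drop a reference; returns 1 if this was the last one and cr was freed. */
int
toy_crfree(toy_cred_t *cr)
{
    assert(cr != NULL && cr->refs > 0);
    if (--cr->refs == 0) {
        free(cr);
        return (1);
    }
    return (0);
}
```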

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
(cherry picked from commit c8fa39b46c)
2025-05-28 16:00:28 -07:00
Alexander Motin
243a46f28d Cleanup VERIFY() macros (#17163)
- Fix VERIFY3B() when given non-boolean values.
- Map EQUIV() into VERIFY3B(,==,) as equivalent.
- Tune messages for better readability and to more closely match source
  code for easier search.  Unify user-space messages with kernel.
- Tune printed types and remove %px outside of Linux kernel.
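
To show why non-boolean operands matter for a boolean comparison, here is a small sketch. It is not the actual macro change; `toy_bool_eq` and `toy_verify_equiv` are hypothetical helpers, and the normalisation shown is only one plausible way to compare two values by truthiness.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: compare two values by truthiness only, so that
 * 2 and 1 count as equal (both true) even though 2 != 1 as integers.
 */
static inline bool
toy_bool_eq(long left, long right)
{
    return ((left != 0) == (right != 0));
}

/* Hypothetical EQUIV-style check built on the boolean comparison. */
void
toy_verify_equiv(long a, long b)
{
    assert(toy_bool_eq(a, b));
}
```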

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: @ImAwsumm
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
(cherry picked from commit 4866c2fabf)
2025-05-28 16:00:28 -07:00
Paul Dagnelie
fbac52e1e9 Fix DDT rollback to not overwrite unnecessary fields (#17205)
When a dedup write fails, we try to roll the DDT entry back to a known
good state. However, this also rolls the refcounts and the last-update
time back to the state they were at when we started this write. This
doesn't appear to be able to cause any refcount leaks (after the fix in
17123). This PR prevents that from happening by only rolling back the
parts of the DDT entry that have been updated by the write so far.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-16 09:59:45 -07:00
Paul Dagnelie
9f0be8fca0 Fix dspace underflow bug
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same.  Otherwise we may get
negative overflow or the respective assertion in spa_update_dspace() if
a removed special/dedup vdev is bigger than all normal class space.
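
The underflow can be reproduced with a toy model, shown here purely as an illustration. The `toy_vdev_t` layout and both functions are hypothetical; only the idea that both sums should cover the same allocation class comes from the commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified vdev description. */
typedef struct toy_vdev {
    uint64_t space;     /* space contributed by this vdev */
    bool normal_class;  /* true for the normal allocation class */
    bool removing;      /* true if the vdev is being removed */
} toy_vdev_t;

/* Sum the space of normal-class vdevs only. */
uint64_t
toy_dspace(const toy_vdev_t *vd, size_t n)
{
    uint64_t total = 0;
    assert(vd != NULL);
    for (size_t i = 0; i < n; i++) {
        if (vd[i].normal_class)
            total += vd[i].space;
    }
    return (total);
}

/*
 * Sum the space of vdevs being removed.  With normal_only == false every
 * class is counted, which is the mismatched accounting.
 */
uint64_t
toy_nonallocating(const toy_vdev_t *vd, size_t n, bool normal_only)
{
    uint64_t total = 0;
    assert(vd != NULL);
    for (size_t i = 0; i < n; i++) {
        if (vd[i].removing && (!normal_only || vd[i].normal_class))
            total += vd[i].space;
    }
    return (total);
}
```

With one normal vdev of 100 units and a removed special vdev of 500 units, subtracting the all-class sum from the normal-class total wraps around as an unsigned value, while the normal-only sum leaves the total unchanged.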

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17183
2025-04-16 09:59:45 -07:00
Piotr Kubaj
12657df52a simd_powerpc.h: enable FPU on FreeBSD
FreeBSD nowadays supports FPU in the kernel on powerpc*, so enable it.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Piotr Kubaj <pkubaj@FreeBSD.org>
Closes #17191
2025-04-16 09:59:45 -07:00
Ameer Hamza
ab455c7b80 zed: Ensure spare activation after kernel-initiated device removal
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17187
2025-04-16 09:59:45 -07:00
Rob Norris
9e009acbdc dmu_tx: rename dmu_tx_assign() flags from TXG_* to DMU_TX_* (#17143)
This helps to avoid confusion with the similarly-named
txg_wait_synced().

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-16 09:59:45 -07:00
Rob Norris
76d0c74c35 SPDX: license tags: LicenseRef-OpenZFS-ThirdParty-PublicDomain
SPDX have repeatedly rejected the creation of a tag for a public domain
dedication, as not all dedications are clear and unambiguous in their
meaning and not all jurisdictions permit relinquishing a copyright
anyway.

A reasonably common workaround appears to be to create a local
(project-specific) identifier to convey whatever meaning the project
wishes it to. To cover OpenZFS' use of third-party code with a public
domain dedication, we use this custom tag.

Further reading:
- https://github.com/spdx/old-wiki/blob/main/Pages/Legal%20Team/Decisions/Dealing%20with%20Public%20Domain%20within%20SPDX%20Files.md
- https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/
- https://cr.yp.to/spdx.html

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:45 -07:00
Rob Norris
6b2c046d18 SPDX: license tags: GPL-2.0-or-later
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
091da72c66 SPDX: license tags: MIT
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
8cacac7ed4 SPDX: license tags: BSD-3-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
865ca576ab SPDX: license tags: BSD-2-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
9530eb64e0 SPDX: license tags: CDDL-1.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
shodanshok
52f3f92bbf Add receive:append permission for limited receive
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see zfs-receive man
page). This means a user having the receive permission can effectively
delete data on the receiving side, even if such user does not have explicit
rollback or destroy permissions.

This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.
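
The intended split between the two permissions can be pictured with a small decision function. This is an illustrative sketch only: the `TOY_PERM_*` bits and `toy_receive_allowed()` are hypothetical, and it models just the forced versus non-forced distinction, not any other limits the real permission may apply.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical permission bits for the example. */
enum {
    TOY_PERM_RECEIVE        = 1 << 0,   /* full receive */
    TOY_PERM_RECEIVE_APPEND = 1 << 1    /* limited, non-forced receive */
};

/*
 * Decide whether a receive may proceed.  Full receive allows everything;
 * receive:append allows only non-forced receives.
 */
bool
toy_receive_allowed(unsigned int perms, bool force)
{
    assert((perms & ~(unsigned int)(TOY_PERM_RECEIVE |
        TOY_PERM_RECEIVE_APPEND)) == 0);
    if (perms & TOY_PERM_RECEIVE)
        return (true);
    if (perms & TOY_PERM_RECEIVE_APPEND)
        return (!force);
    return (false);
}
```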

Fixes #16943
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17015
2025-04-02 17:06:40 -07:00
Alexander Motin
53cbf06d68 Fix deduplication of overridden blocks
Implementation of DDT pruning introduced verification of DVAs in
a block pointer during ddt_lookup(), so as not to mistakenly free a
previously pruned incarnation of the entry.  But when writing a new
block in zio_ddt_write() we might have the DVAs only from the override
pointer, which may never have the "D" flag to be confused with a pruned
DDT entry, and we'll abandon those DVAs if we find a matching entry in DDT.

This fixes deduplication for blocks written via dmu_sync() for
purposes of indirect ZIL write records, that I have tested.  And
I suspect it might actually allow deduplication for Direct I/O,
even though in an odd way -- first write block directly and then
delete it later during TXG commit if found duplicate, which part
I haven't tested.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17120
2025-04-02 17:05:24 -07:00
Rob Norris
6503f8c6f0 Linux/vnops: implement STATX_DIOALIGN
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16972
2025-04-02 17:04:14 -07:00
Alan Somers
ad07b09cc3 Verify every block pointer is either embedded, hole, or has a valid DVA
Now instead of crashing when attempting to read the corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:

```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```

This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.

And in zdb, don't crash when printing a block pointer with no valid
DVAs.  If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug.  zdb should be able to handle the
situation gracefully.

Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR.  This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why.  It doesn't appear to do anything useful, so remove it.

[^1]: b24ab67627

Fixes		#17077
Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17078
2025-04-02 17:03:01 -07:00
Fabian-Gruenbichler
112dca3483 linux: zvols: correctly detect flush requests (#17131)
since 4.10, bio->bi_opf needs to be checked to determine all kinds of
flush requests. this was the case prior to the commit referenced below,
but the order of ifdefs was not the usual one (newest up top), which
might have caused this to slip through.

this fixes a regression when using zvols as Qemu block devices, but
might have broken other use cases as well. the symptoms are that all
sync writes from within a VM configured to use such a virtual block
device are ignored and treated as async writes by the host ZFS layer.

this can be verified using fio in sync mode inside the VM, for example
with

 fio \
 --filename=/dev/sda --ioengine=libaio --loops=1 --size=10G \
 --time_based --runtime=60 --group_reporting --stonewall --name=cc1 \
 --description="CC1" --rw=write --bs=4k --direct=1 --iodepth=1 \
 --numjobs=1 --sync=1

which shows an IOPS number way above what the physical device underneath
supports, with "zpool iostat -r 1" on the hypervisor side showing no
sync IO occurring during the benchmark.

with the regression fixed, both fio inside the VM and the IO stats on
the host show the expected numbers.

Fixes: 846b598519
"config: remove HAVE_REQ_OP_* and HAVE_REQ_*"

Signed-off-by: Fabian-Gruenbichler <f.gruenbichler@proxmox.com>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-02 17:03:01 -07:00
Rob Norris
5f7037067e Revert "zinject: count matches and injections for each handler" (#17137)
Adding fields to zinject_record_t unexpectedly extended zfs_cmd_t,
preventing some things working properly with 2.3.1 userspace tools
against 2.3.0 kernel module.

This reverts commit fabdd502f4.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-24 13:49:10 -07:00
Ameer Hamza
637f918211 arc: avoid possible deadlock in arc_read
In l2arc_evict(), the config lock may be acquired in reverse order
(e.g., first the config lock (writer), then a hash lock) unlike in
arc_read() during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17071
2025-02-28 00:42:29 +05:00
aokblast
383256c329 spa: fix signature mismatch for spa_boot_init as eventhandler required
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes #17088
2025-02-28 00:42:29 +05:00
Rob Norris
92d1686a2a include: move zio_priority_t into zfs.h
It's included so it's effectively already part of it, but it's not
always installed as a userspace header, making zfs.h effectively
useless. Might as well just combine it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17066
2025-02-28 00:42:29 +05:00
Rob Norris
1bdce0410c range_tree: convert remaining range_* defs to zfs_range_*
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-02-28 00:42:29 +05:00
Ivan Volosyuk
55b21552d3 Linux 6.12 compat: Rename range_tree_* to zfs_range_tree_*
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.

Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-02-28 00:42:29 +05:00
Brian Atkinson
0e21e473a7 Update pin_user_pages() calls for Direct I/O
Originally #16856 updated Linux Direct I/O requests to use the new
pin_user_pages API. However, it was an oversight that this PR only
handled iov_iter's of type ITER_IOVEC and ITER_UBUF. Other iov_iter
types may try and use the pin_user_pages API if it is available. This
can lead to panics as the iov_iter is not being iterated over correctly
in zfs_uio_pin_user_pages().

Unfortunately, generic iov_iter APIs that call pin_user_page_fast() are
protected as GPL only. Rather than update zfs_uio_pin_user_pages() to
account for all iov_iter types, we can simply call
zfs_uio_get_dio_page_iov_iter() if the iov_iter type is not ITER_IOVEC
or ITER_UBUF. zfs_uio_get_dio_page_iov_iter() calls
iov_iter_get_pages(), which can handle any iov_iter type.

In the future it might be worth using the exposed iov_iter iterator
functions that are included in the header iov_iter.h since v6.7. These
functions allow for any iov_iter type to be iterated over and advanced
while applying a step function during iteration. This could possibly be
leveraged in zfs_uio_pin_user_pages().

A new ZFS test case was added to test that an ITER_BVEC is handled
correctly using this new code path. This test case was provided through
issue #16956.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16956 
Closes #17006
2025-02-25 22:33:25 +05:00
Alan Somers
6e9911212e Make the vfs.zfs.vdev.raidz_impl sysctl cross-platform
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Sponsored by:	ConnectWise
Closes #16980
2025-02-25 22:32:11 +05:00
Rob Norris
a28f5a94f4 zinject: add "probe" device injection type
Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which is explicitly excluded
from injection. Even if it were possible, it would be awkward to do,
because a probe is a sequence of reads and writes.

This commit adds a new IO "type" to match for injection, which looks for
the ZIO_FLAG_PROBE flag instead. Any probe IO will match the
injection record and receive the wanted error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16947
2025-02-25 22:29:33 +05:00
Rob Norris
0dfcfe023e zinject: make iotype extendable
I'm about to add a new "type", and I need somewhere to put it!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16947
2025-02-25 22:29:02 +05:00
Peng Liu
404254bacb style: remove unnecessary spaces in sa.h
Removed three unnecessary spaces in the definition of the
sa_attr_reg_t structure to improve code style consistency
and adhere to OpenZFS coding standards.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Peng Liu <littlenewton6@gmail.com>
Closes #16955
2025-02-25 22:26:45 +05:00
Rob Norris
fabdd502f4 zinject: count matches and injections for each handler
When building tests with zinject, it can be quite difficult to work out
if you're producing the right kind of IO to match the rules you've set
up.

So, here we extend injection records to count the number of times a
handler matched the operation, and how often an error was actually
injected (ie after frequency and other exclusions are applied).

Then, display those counts in the `zinject` output.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16938
2025-02-25 22:25:24 +05:00
Richard Kojedzinszky
5ba50c8135 fix: make zfs_strerror really thread-safe and portable
#15793 wanted to make zfs_strerror() thread-safe. Unfortunately, it
turned out that the strerror_l() usage was wrong, and some libc
implementations don't have strerror_l().

zfs_strerror() now simply calls original strerror() and copies the 
result to a thread-local buffer, then returns that.
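
A minimal sketch of the approach described above, not the actual implementation. The name `toy_strerror`, the 128-byte buffer size, and the silent truncation are assumptions made for the example; it relies on C11 `_Thread_local`.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* One buffer per thread, so returned pointers are never shared. */
static _Thread_local char toy_errbuf[128];

/*
 * Copy the strerror() message into the calling thread's buffer and
 * return that buffer.  Overlong messages are truncated.
 */
const char *
toy_strerror(int errnum)
{
    const char *msg = strerror(errnum);
    size_t len;

    assert(msg != NULL);
    len = strlen(msg);
    if (len >= sizeof (toy_errbuf))
        len = sizeof (toy_errbuf) - 1;
    memcpy(toy_errbuf, msg, len);
    toy_errbuf[len] = '\0';
    return (toy_errbuf);
}
```

Each call overwrites the calling thread's previous result, so a caller that needs two messages at once has to copy the first before requesting the second.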

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Kojedzinszky <richard@kojedz.in>
Closes #15793
Closes #16640
Closes #16923
2025-01-04 11:58:15 -08:00
Ameer Hamza
b952e061df zvol: implement platform-independent part of block cloning
In Linux, block devices currently lack support for `copy_file_range`
API because the kernel does not provide the necessary functionality.
However, there is an ongoing upstream effort to address this
limitation: https://patchwork.kernel.org/project/dm-devel/cover/20240520102033.9361-1-nj.shetty@samsung.com/.
We have adopted this upstream kernel patch into the TrueNAS kernel and
made some additional modifications to enable block cloning specifically
for the zvol block device. This patch implements the platform-
independent portions of these changes for inclusion in OpenZFS.
This patch does not introduce any new functionality directly into
OpenZFS. The `TX_CLONE_RANGE` replay capability is only relevant when
zvols are migrated to non-TrueNAS systems that support Clone Range
replay in the ZIL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16901
2024-12-29 11:53:45 -08:00
Brian Atkinson
1862c1c0a8 Removing old code outside of 4.18 kernels
There were checks still in place to verify we could completely use
iov_iter's on the Linux side. All interfaces are available as of kernel
4.18, so there is no reason to check whether we should use that
interface at this point. This PR completely removes the UIO_USERSPACE
type. It also removes the direct_IO interface checks.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16856
2024-12-16 10:26:49 -08:00
Rob Norris
0d51852ec7 Remove unnecessary CSTYLED escapes on top-level macro invocations
cstyle can handle these cases now, so we don't need to disable it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16840
2024-12-06 09:05:02 -08:00
Alexander Motin
d90042dedb Allow dsl_deadlist_open() return errors
In some cases like dsl_dataset_hold_obj() it is possible to handle
those errors, so failure to hold dataset should be better than
kernel panic.  Some other places where these errors are still not
handled but asserted should be less dangerous just as unreachable.

We have a user report about pool corruption leading to assertions
on these errors.  Hopefully this will make behavior a bit nicer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16836
2024-12-05 09:33:21 -08:00
Mariusz Zaborski
3b0c1131ef Add ability to scrub from last scrubbed txg
Some users might want to scrub only new data because they would like
to know if the new write wasn't corrupted.  This PR adds the possibility
to scrub only newly written data.

This introduces a new `last_scrubbed_txg` property, indicating the
transaction group (TXG) up to which the most recent scrub operation
has checked and repaired the dataset, so users can run scrub only
from the last saved point. We use scn_max_txg and scn_min_txg,
which are already built into scrub, to accomplish that.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Closes #16301
2024-12-05 09:33:21 -08:00
Alexander Motin
00debc1361 FreeBSD: Remove some illumos compat from vnode.h
Should make no difference, just some dead code cleanup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Martin Matuska <mm@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16808
2024-12-05 09:33:21 -08:00
Alexander Motin
af10714e42 FreeBSD: Return ifndef IN_BASE back to fix the build
FreeBSD's libprocstat seems to build kernel code in user space,
which does not work here due to undefined vnode_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Martin Matuska <mm@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16808
2024-12-05 09:33:21 -08:00