Commit Graph

1427 Commits

Author SHA1 Message Date
Rob Norris
c39e076f23 Prefer VERIFY0(n) over VERIFY(n == 0)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:40:59 -07:00
Alexander Motin
60f714e6e2 Implement physical rewrites
Based on previous commit this implements `zfs rewrite -P` flag,
making ZFS to keep blocks logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  Snapshots space usage same time will
reflect the additional space usage from newly allocated blocks.

Since this begins to use new "rewrite" flag in the block pointers,
this commit introduces a new read-compatible per-dataset feature
physical_rewrite.  It must be enabled for the command to not fail,
it is activated on first use and deactivated on deletion of the
last affected dataset.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:56 -07:00
Alexander Motin
4ae8bf406b Allow physical rewrite without logical
During regular block writes ZFS sets both logical and physical
birth times equal to the current TXG.  During dedup and block
cloning logical birth time is still set to the current TXG, but
physical may be copied from the original block that was used.
This represents the fact that logically user data has changed,
but the physically it is the same old block.

But block rewrite introduces a new situation, when block is not
changed logically, but stored in a different place of the pool.
From ARC, scrub and some other perspectives this is a new block,
but for example for user applications or incremental replication
it is not.  Somewhat similar thing happen during remap phase of
device removal, but in that case space blocks are still acounted
as allocated at their logical birth times.

This patch introduces a new "rewrite" flag in the block pointer
structure, allowing to differentiate physical rewrite (when the
block is actually reallocated at the physical birth time) from
the device reval case (when the logical birth time is used).

The new functionality is not used at this point, and the only
expected change is that error log is now kept in terms of physical
physical birth times, rather than logical, since if a block with
logged error was somehow rewritten, then the previous error does
not matter any more.

This change also introduces a new TRAVERSE_LOGICAL flag to the
traverse code, allowing zfs send, redact and diff to work in
context of logical birth times, ignoring physical-only rewrites.
It also changes nothing at this point due to lack of those writes,
but they will come in a following patch.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:07 -07:00
Mariusz Zaborski
894edd084e
Add TXG timestamp database
This feature enables tracking of when TXGs are committed to disk,
providing an estimated timestamp for each TXG.

With this information, it becomes possible to perform scrubs based
on specific date ranges, improving the granularity of data
management and recovery operations.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16853
2025-08-06 10:31:21 -07:00
Rob Norris
a18c9edda6 Linux: sync: remove async/sync accounting
All this machinery is there to try to understand when there an async
writeback waiting to complete because the intent log callbacks are still
outstanding, and force them with a timely zil_commit(). The next commit
fixes this properly, so there's no need for all this extra housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17584
2025-08-06 09:54:30 -07:00
Paul Dagnelie
31c4fa93bb Fix dynamic gang block headers on raidz and mirror devices
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17587
2025-08-06 09:50:58 -07:00
Fedor Uporov
0b6fd024a7
ZVOL: Unify zvol minors operations and improve error handling
Now zvol minors creation logic is passed thru spa_zvol_taskq, like it
is doing for remove/rename zvol minors functions. Appropriate
zvol minors creation functions are refactored:
- The zvol_create_minor()/zvol_minors_create_recursive() were removed.
- The single zvol_create_minors() is added instead.

Also, it become possible to collect zvol minors subtasks status, to
detect, if some zvol minor subtask is failed in the subtasks chain.
The appropriate message is reported to zfs_dbgmsg buffer in this case.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17575
2025-08-06 10:10:52 -04:00
khoang98
0f8a1105ee
Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Kaitlin Hoang <kthoang@amazon.com>
Closes #17561
2025-08-01 16:47:41 -07:00
Igor Ostapenko
cb5e7e097d
range_tree: Provide more debug details upon unexpected add/remove
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17581
2025-07-31 10:44:42 -04:00
rmacklem
2957eabbef
Add support for FreeBSD's Solaris style extended attribute interface
FreeBSD commit 2ec2ba7e232d added the Solaris style syscall interface
for extended attributes.  This patch wires this interface into the
FreeBSD ZFS port, since this style of extended attributes is supported
by OpenZFS internally when the "xattr" property is set to "dir".

Some specific changes:
LOOKUP_NAMED_ATTR is defined to indicate the need to set V_NAMEDATTR
for calls to zfs_zaccess().
V_NAMEDATTR indicates that the access checking does need to be done
for FreeBSD.

The access checking code for extended attributes was copy/pasted from
the Linux port into zfs_zaccess() in the FreeBSD port.

Most of the changes are in zfs_freebsd_lookup() and
zfs_freebsd_create().
The semantics of these functions should remain unchanged unless named
attributes are being manipulated.

All the code changes are enabled for __FreeBSD_version 1500040 and
newer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes #17540
2025-07-30 09:49:43 -07:00
Rob Norris
00ce064d8f
spa: update blkptr diagram to include vdev padding on encrypted blocks
Probably just an oversight in 4d044c4c1d. SPA_VDEVBITS is always 24,
regardless of whether or not the bp is for an encrypted block, and it
wouldn't make sense for it to be different anyway.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17564
2025-07-24 09:50:23 -04:00
Rob Norris
96d20d7d59 linux/kmem: remove PF_FSTRANS and PF_MEMALLOC_NOIO compat
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:07:36 -07:00
shodanshok
a7a144e655
enforce arc_dnode_limit
Linux kernel shrinker in the context of null/root memcg does not scan
dentry and inode caches added by a task running in non-root memcg. For
ZFS this means that dnode cache routinely overflows, evicting valuable
meta/data and putting additional memory pressure on the system.

This patch restores zfs_prune_aliases as fallback when the kernel
shrinker does nothing, enabling zfs to actually free dnodes. Moreover,
it (indirectly) calls arc_evict when dnode_size > dnode_limit.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17487
Closes #17542
2025-07-21 10:32:01 -07:00
Alexander Motin
be1e991a1a
Allow and prefer special vdevs as ZIL
Before this change ZIL blocks were allocated only from normal or
SLOG vdevs.  In typical situation when special vdevs are SSDs and
normal are HDDs it could cause weird inversions when data blocks
are written to SSDs, but ZIL referencing them to HDDs.

This change assumes that special vdevs typically have much better
(or at least not worse) latency than normal, and so in absence of
SLOGs should store ZIL blocks.  It means similar to normal vdevs
introduction of special embedded log allocation class and updating
the allocation fallback order to: SLOG -> special embedded log ->
special -> normal embedded log -> normal.

The code tries to guess whether data block is going to be written
to normal or special vdev (it can not be done precisely before
compression) and prefer indirect writes for blocks written to a
special vdev to avoid double-write.  For blocks that are going to
be written to normal vdev, special vdev by default plays as SLOG,
reducing write latency by the cost of higher special vdev wear,
but it is tunable via module parameter.

This should allow HDD pools with decent SSD as special vdev to
work under synchronous workloads without requiring additional
SLOG SSD, impractical in many scenarios.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17505
2025-07-18 18:44:14 -07:00
Rob Norris
fce18e04d5 libzpool: tunable-based option interface for zdb/ztest
Removes the old dlsym() based option setter and adds a new
function handle_tunable_option() that can set, get and list all the
tunables in the system. And then wire it up to zdb and ztest.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17537
2025-07-15 15:47:03 -07:00
Rob Norris
3a494c6d2a mod.h: make consistent across all three platforms
mod.h only exists to include the platform-specific mod_os.h, so we can
get rid of it and just call the platform header mod.h.

Then, create a libspl mod.h, and move the relevant items to it so we can
start building on it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17537
2025-07-15 15:46:14 -07:00
Paul Dagnelie
a981cb69e4 Implement dynamic gang header sizes
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17004
2025-07-09 14:02:53 -07:00
Rob Norris
6af8db61b1
metaslab: don't pass whole zio to throttle reserve APIs
They only need a couple of fields, and passing the whole thing just
invites fiddling around inside it, like modifying flags, which then
makes it much harder to understand the zio state from inside zio.c.

We move the flag update to just after a successful throttle in zio.c.

Rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED
Better describes what it means, and makes it look less like
IO_IS_ALLOCATING, which means something different.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17508
2025-07-04 23:22:22 -04:00
Rob Norris
92d3b4ee2c
zio: rename io_reexecute as io_post; use it for the direct IO checksum error flag
We're not supposed to modify someone else's io_flags, so we need another
way to propagate DIO_CHKSUM_ERR.

If we squint, we can see that io_reexecute is really just recording
exceptional events that a parent (or its parents) will need to do
something about. It just happens that the only things we've had
historically are two forms of reexecution: now or later (suspend).

So, rename it to io_post, as in, post-IO info/events/actions. And now we
have a few spare bits for other conditions.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17507
2025-07-04 23:16:14 -04:00
Alexander Motin
4e92aee233
Relax special_small_blocks restrictions
special_small_blocks is applied to blocks after compression, so it
makes no sense to demand its values to be power of 2.  At most
they could be multiple of 512, but that would still buy us nothing,
so lets allow them be any within SPA_MAXBLOCKSIZE.

Also special_small_blocks does not really need to depend on the
set recordsize, enabled pool features or presence of special vdev.
At worst in any of those cases it will just do nothing, so we
should not complicate users lives by artificial limitations.

While there, polish comments for recordsize and volblocksize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17497
2025-07-02 11:11:37 -07:00
Olivier Certner
dee62e074a
spa: ZIO_TASKQ_ISSUE: Use symbolic priority
This allows to change the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.

This upstreams commit fd141584cf89d7d2 from FreeBSD src.

Sponsored-by: The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes #17489
2025-06-30 10:24:23 -04:00
Alexander Motin
e0ef4d2768
Improve block cloning transactions accounting
Previous dmu_tx_count_clone() was broken, stating that cloning is
similar to free.  While they might be from some points, cloning
is not net-free.  It will likely consume space and memory, and
unlike free it will do it no matter whether the destination has
the blocks or not (usually not, so previous code did nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17431
2025-06-11 11:59:16 -07:00
Rob Norris
560e3170ef dsl_dataset: rename dmu_objset_clone* to dsl_dataset_clone*
And make its check and sync functions visible, so I can hook them up to
zcp_synctask. Rename not strictly necessary, but it definitely looks
more like a dsl_dataset thing than a dmu_objset thing, to the extent
that those things even have a meaningful distinction.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17426
2025-06-10 14:52:43 -07:00
Attila Fülöp
b96f1a4b1f
Linux build: silence objtool warnings
After #17401 the Linux build produces some stack related warnings.

Silence them with the `STACK_FRAME_NON_STANDARD` macro.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17410
2025-06-04 17:40:09 -07:00
Alexander Motin
68817d28c5
Include class name into struct metaslab_class
With increasing number of metaslab classes it can be helpful for
debugging to know what we are looking at.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17409
2025-06-03 11:12:59 -04:00
Alexander Motin
108562344c
Improve allocation fallback handling
Before this change in case of any allocation error ZFS always fallen
back to normal class.  But with more of different classes available
we migth want more sophisticated logic.  For example, it makes sense
to fall back from dedup first to special class (if it is allowed to
put DDT there) and only then to normal, since in a pool with dedup
and special classes populated normal class likely has performance
characteristics unsuitable for dedup.

This change implements general mechanism where fallback order is
controlled by the same spa_preferred_class() as the initial class
selection.  And as first application it implements the mentioned
dedup->special->normal fallbacks.  I have more plans for it later.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17391
2025-05-31 19:12:16 -04:00
Fedor Uporov
e1677d9ee1
ZVOL: Make zvol_prefetch_bytes module parameter platform-independent
The module parameter now is represented in FreeBSD sysctls list
with name: 'vfs.zfs.vol.prefetch_bytes'. The default value is 131072,
same as on Linux side.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17385
2025-05-31 09:58:54 -04:00
Rob Norris
e8e602d987
zio_add_child: collapse into a single function
The child locking difference is simple enough to handle with a boolean.
The actual work is more involved, and it's easy to forget to change
things in both places when experimenting. Just collapse them.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17382
2025-05-30 21:18:10 -04:00
Rob Norris
44e3266894
events: include zio type in IO error reports
Usually the IO type can be inferred from the other fields (in
particular, priority and flags) sometimes it's not easy to see. This is
just another little debug helper.

    May 27 2025 00:54:54.024110493 ereport.fs.zfs.data
            class = "ereport.fs.zfs.data"
            ena = 0x1f5ecfae600801
            ...
            zio_delta = 0x0
            zio_type = 0x2 [WRITE]
            zio_priority = 0x3 [ASYNC_WRITE]
            zio_objset = 0x0

Document zio_type and zio_priority.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17381
2025-05-30 10:29:29 -04:00
Rob Norris
4653e2f7d3 dmu_tx: break tx assign/wait when pool suspends
This adjusts dmu_tx_assign/dmu_tx_wait to be interruptable if the pool
suspends while they're waiting, rather than just on the initial check
before falling back into a wait.

Since that's not always wanted, add a DMU_TX_SUSPEND flag to ignore
suspend entirely, effectively returning to the previous behaviour.

With that, it shouldn't be possible for anything with a standard
dmu_tx_assign/wait/abort loop to block under failmode=continue.

Also should be a bit tighter than the old behaviour, where a
VERIFY0(dmu_tx_assign(DMU_TX_WAIT)) could technically fail if the pool
is already suspended and failmode=continue.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:28:51 -07:00
Rob Norris
ac2e579521 dmu_tx: make DMU_TX_* flags an enum
Mostly for a little more type checking and debugging visibility.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:28:46 -07:00
Rob Norris
468d22d60c txg_wait_synced_flags: add TXG_WAIT_SUSPEND flag to not wait if pool suspended
This allows a caller to request a wait for txg sync, with an appropriate
error return if the pool is suspended or becomes suspended during the
wait.

To support this, txg_wait_kick() is added to signal the sync condvar,
which wakes up the waiters, causing them to loop and reconsider their
wait conditions again. zio_suspend() now calls this to trigger the break
if the pool suspends while waiting.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:27:46 -07:00
Rob Norris
06fa8f3f69
zfs_cmd: reorganise zfs_cmd_t to match original size
2aa3fbe761 extended zinject_record_t, and in doing so inadvertently
extended zfs_cmd_t, which broke compatibility with userspace tools
without the change.

This fixes that by using some of the unused space in zfs_cmd_t for the
extra fields. We also add an assert to trigger a compile error if the
size ever changes.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17367
2025-05-27 20:01:06 -04:00
Ameer Hamza
2a91d577b1
Expose dataset encryption status via fast stat path
In truenas_pylibzfs, we query list of encrypted datasets several times,
which is expensive. This commit exposes a public API zfs_is_encrypted()
to get encryption status from fast stat path without having to refresh
the properties.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17368
2025-05-26 22:11:03 -04:00
Alexander Motin
d5616ad34a
Increase meta-dnode redundancy in "some" mode
Loss of one indirect block of the meta dnode likely means loss of
the whole dataset.  It is worse than one file that the man page
promises, and in my opinion is not much better than "none" mode.

This change restores redundancy of the meta-dnode indirect blocks,
while same time still corrects expectations in the man page.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17339
2025-05-16 13:23:32 -04:00
Alexander Motin
89a8a91582
ARC: Notify dbuf cache about target size reduction
ARC target size might drop significantly under memory pressure,
especially if current ARC size was much smaller than the target.
Since dbuf cache size is a fraction of the target ARC size, it
might need eviction too.  Aside of memory from the dbuf eviction
itself, it might help ARC by making more buffers evictable.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17314
2025-05-14 10:34:14 -04:00
Alexander Motin
734eba251d
Wire O_DIRECT also to Uncached I/O (#17218)
Before Direct I/O was implemented, I've implemented lighter version
I called Uncached I/O.  It uses normal DMU/ARC data path with some
optimizations, but evicts data from caches as soon as possible and
reasonable.  Originally I wired it only to a primarycache property,
but now completing the integration all the way up to the VFS.

While Direct I/O has the lowest possible memory bandwidth usage,
it also has a significant number of limitations.  It require I/Os
to be page aligned, does not allow speculative prefetch, etc.  The
Uncached I/O does not have those limitations, but instead require
additional memory copy, though still one less than regular cached
I/O.  As such it should fill the gap in between.  Considering this
I've disabled annoying EINVAL errors on misaligned requests, adding
a tunable for those who wants to test their applications.

To pass the information between the layers I had to change a number
of APIs.  But as side effect upper layers can now control not only
the caching, but also speculative prefetch.  I haven't wired it to
VFS yet, since it require looking on some OS specifics.  But while
there I've implemented speculative prefetch of indirect blocks for
Direct I/O, controllable via all the same mechanisms.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Fixes #17027
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-05-13 14:26:55 -07:00
Rob Norris
b2284aedab
metaslab_alloc: make hint BP and DVA const (#17324)
Nothing modifies them, and nothing should, so lets try to enforce that.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
2025-05-12 10:52:46 -07:00
Alexander Motin
49fbdd4533
Introduce zfs rewrite subcommand (#17246)
This allows to rewrite content of specified file(s) as-is without
modifications, but at a different location, compression, checksum,
dedup, copies and other parameter values.  It is faster than read
plus write, since it does not require data copying to user-space.
It is also faster for sync=always datasets, since without data
modification it does not require ZIL writing.  Also since it is
protected by normal range range locks, it can be done under any
other load.  Also it does not affect file's modification time or
other properties.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-05-12 10:22:17 -07:00
Fedor Uporov
1a8f5ad3b0
zvol: Enable zvol threading functionality on FreeBSD
Make zvol I/O requests processing asynchronous on FreeBSD side in some
cases. Clone zvol threading logic and required module parameters from
Linux side. Make zvol threadpool creation/destruction logic shared for
both Linux and FreeBSD.
The IO requests are processed asynchronously in next cases:
- volmode=geom: if IO request thread is geom thread or cannot sleep.
- volmode=cdev: if IO request passed thru struct cdevsw .d_strategy
routine, mean is AIO request.
In all other cases the IO requests are processed synchronously. The
volthreading zvol property is ignored on FreeBSD side.

Sponsored-by: vStack, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: @ImAwsumm
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17169
2025-05-08 15:25:40 -04:00
Alan Somers
f13d760aa8
Delete dead code: dbuf_loan_arcbuf
It's been dead ever since 5fa356ea44

Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17119
2025-05-08 10:34:11 -04:00
Paul Dagnelie
246e5883bb
Implement allocation size ranges and use for gang leaves (#17111)
When forced to resort to ganging, ZFS currently allocates three child
blocks, each one third of the size of the original. This is true
regardless of whether larger allocations could be made, which would
allow us to have fewer gang leaves. This improves performance when
fragmentation is high enough to require ganging, but not so high that
all the free ranges are only just big enough to hold a third of the
recordsize. This is also useful for improving the behavior of a future
change to allow larger gang headers.

We add the ability for the allocation codepath to allocate a range of
sizes instead of a single fixed size. We then use this to pre-allocate
the DVAs for the gang children. If those allocations fail, we fall back
to the normal write path, which will likely re-gang.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-05-02 15:32:18 -07:00
Rob Norris
a7de203c86
txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.

This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.

Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.

The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
2025-05-02 15:29:50 -07:00
Rob Norris
c8fa39b46c
cred: properly pass and test creds on other threads (#17273)
### Background

Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.

Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.

However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
instead, we capture the permissions by saving CRED() in the user task,
and then using it for the check through the secpolicy_* functions.

### Current implementation

The current code calls CRED() to get the credential, which gets a
pointer to the cred_t inside the current task and passes it to the
worker task. However, it doesn't take a reference to the cred_t, and so
expects that it won't change, and that the task continues to exist. In
practice that is always the case, because we don't let the calling task
return from the kernel until the work is done.

For Linux, we also take a reference to the current task, because the
Linux credential APIs for the most part do not check an arbitrary
credential, but rather, query what a task can do. See
secpolicy_zfs_proc(). Again, we don't take a reference on the task, just
a pointer to it.

### Changes

We change to calling crhold() on the task credential, and crfree() when
we're done with it. This ensures it stays alive and unchanged for the
duration of the call.

On the Linux side, we change the main policy checking function
priv_policy_ns() to use override_creds()/revert_creds() if necessary to
make the provided credential active in the current task, allowing the
standard task-permission APIs to do the needed check. Since the task
pointer is no longer required, this lets us entirely remove
secpolicy_zfs_proc() and the need to carry a task pointer around as
well.

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-29 16:27:48 -07:00
Paul Dagnelie
5f5321effa
Handle interaction between gang blocks, copies, and FDT.
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes: #17123
2025-04-21 11:26:30 -04:00
Tony Hutter
8d1489735b
nvlist: Add nvlist_snprintf() and zfs_dbgmsg_nvlist()
Add nvlist_snprintf() to print a nvlist to a buffer.  This is basically
the snprintf() version of dump_nvlist().  Along with that, add a
zfs_dbgmsg_nvlist() to print out an nvlist to dbgmsg.  This will aid in
debugging.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #17215
2025-04-18 09:22:16 -04:00
Paul Dagnelie
b14b3e3985
Fix FDT rollback to not overwrite unnecessary fields (#17205)
When a dedup write fails, we try to roll the DDT entry back to a known
good state. However, this also rolls the refcounts and the last-update
time back to the state they were at when we started this write. This
doesn't appear to be able to cause any refcount leaks (after the fix in
17123). This PR prevents that from happening by only rolling back the
parts of the DDT entry that have been updated by the write so far.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-04 11:10:44 -07:00
Ameer Hamza
6f6c504700 Show default quotas in zfs userspace tools
Update zfs userspace, groupspace, and projectspace to display the
default quotas when no per-ID specific quota is configured. This
ensures tool outputs align with enforced limits.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:36:45 -07:00
Ameer Hamza
2a8d9d9607 Add default user/group/project quota properties
This adds default userquota, groupquota, and projectquota properties to
MASTER_NODE_OBJ to make them accessible during zfsvfs_init() (regular
DSL properties require dsl_config_lock, which cannot be safely acquired
in this context). The zfs_fill_zplprops_impl() logic is updated to read
these default properties directly from MASTER_NODE_OBJ.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:35:22 -07:00
Paul Dagnelie
367d34b3aa
Fix dspace underflow bug
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same.  Otherwise we may get
negative overflow or respective assertion spa_update_dspace() if
removed special/dedup vdev is bigger than all normal class space.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17183
2025-04-01 09:23:43 -04:00