It is not right, but there are a few places where a TX is aborted
after being assigned when an error occurs. To handle this better on
production systems, add extra cleanup steps. While here, replace a
couple of dmu_tx_abort() calls in simple cases.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17438
Having high-refcount dedup entries for zero blocks is inefficient
when they could be recorded as holes instead. Normally, zero
compression is not done if compression is disabled, so as not to
confuse naive benchmarks. But with dedup enabled, the write is
expected to be skipped anyway, so we are just optimizing the way it
is skipped.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17435
The `scn_min_txg` can now be used not only with resilver. Instead
of checking `scn_min_txg` to determine whether it's a resilver or
a scrub, simply check which function is defined. Thanks to this
change, a scrub_finish event is generated when performing a scrub
from the saved txg.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #17432
* zfs_link: allow tempfile sync to fail if pool suspends
4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_link() can fall back to
txg_wait_synced() if it has to wait for a tempfile to be fully created
before continuing, which will block if the pool suspends.
Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.
* zfs_clone_range: allow dirty wait to fail if pool suspends
4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_clone_range() can fall back to
txg_wait_synced() if it has to wait for a dirty block to be written out,
which will block if the pool suspends.
Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.
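A minimal sketch of the pattern, assuming the txg_wait_synced_flags()
interface from #17355 with a hypothetical TXG_WAIT_SUSPEND flag; the
exact names and the error value returned on suspension may differ:

    /* Wait for the txg to sync, but return an error instead of
     * blocking forever if the pool suspends meanwhile. */
    error = txg_wait_synced_flags(dmu_objset_pool(zfsvfs->z_os),
        txg, TXG_WAIT_SUSPEND);
    if (error != 0) {
        /* Pool suspended under failmode=continue. */
        return (SET_ERROR(EIO));
    }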
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17413
It makes no sense to limit the read size below the block size, since
the DMU will consume resources for the whole block anyway, while the
current zfs_vnops_read_chunk_size is only 1MB, which is smaller than
the maximum block size of 16MB. Plus, in case of misaligned Uncached
I/O, the buffer may get evicted between the chunks, requiring
repeated I/Os.
On 64-bit platforms, increase zfs_vnops_read_chunk_size to 32MB. This
makes us less dependent on the speculative prefetcher when an
application requests a specific size: first by not waiting for the
prefetcher to start, and later by not prefetching more than needed.
Also, while there, we don't need to align reads to the chunk size,
only to the block size, which is smaller and so more forgiving.
My profiles show ~4% CPU time savings when reading 16MB blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17415
With an increasing number of metaslab classes, it can be helpful for
debugging to know what we are looking at.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17409
Before this change, in case of any allocation error ZFS always fell
back to the normal class. But with more different classes available
we might want more sophisticated logic. For example, it makes sense
to fall back from dedup first to the special class (if it is allowed
to put DDT there) and only then to normal, since in a pool with both
dedup and special classes populated, the normal class likely has
performance characteristics unsuitable for dedup.
This change implements a general mechanism where the fallback order is
controlled by the same spa_preferred_class() as the initial class
selection. As a first application, it implements the mentioned
dedup->special->normal fallback. I have more plans for it later.
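As an illustrative sketch (not the actual allocator code: the
metaslab_alloc() arguments are abbreviated and spa_next_class() is a
hypothetical helper), the retry loop walks the fallback chain until an
allocation succeeds or the normal class is reached:

    metaslab_class_t *mc = spa_preferred_class(spa, zio);
    for (;;) {
        error = metaslab_alloc(spa, mc, /* ... */);
        if (error == 0 || mc == spa_normal_class(spa))
            break;
        /* Fall back, e.g. dedup -> special -> normal. */
        mc = spa_next_class(spa, mc);   /* hypothetical */
    }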
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17391
The module parameter name was not changed in the FreeBSD sysctl
list: 'vfs.zfs.vol.mode'. Also, on the Linux side the name is:
/sys/module/zfs/parameters/zvol_volmode.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17386
The module parameter is now represented in the FreeBSD sysctl list
with the name 'vfs.zfs.vol.prefetch_bytes'. The default value is
131072, the same as on the Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17385
The child locking difference is simple enough to handle with a boolean.
The actual work is more involved, and it's easy to forget to change
things in both places when experimenting. Just collapse them.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17382
Rewrite is a one-time/rare bulk administrative operation, which
should minimally affect payload caching. Plus, avoiding some memory
copies in its data path significantly increases its speed. My tests
show the time to rewrite 28GB of uncompressed files on an NVMe pool
dropping from 17 to 9 seconds, with minimal ARC usage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17407
Make the 'zvol_threads', 'zvol_num_taskqs' and 'zvol_request_sync'
names compatible with the FreeBSD sysctl naming convention. Now the
sysctls have the following form:
$ sysctl vfs.zfs.vol.threads
vfs.zfs.vol.threads: 0
$ sysctl vfs.zfs.vol.num_taskqs
vfs.zfs.vol.num_taskqs: 0
$ sysctl vfs.zfs.vol.request_sync
vfs.zfs.vol.request_sync: 0
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17406
I've noticed that after some dedup tests, a system reboot ends up in
an assertion about the ms_defer tree not being free. It seems to be
caused by DDT flushing still freeing some blocks while ZFS is trying
to reach a final steady state, because spa_final_txg, while set by
spa_export_common() on pool export, is not set when spa_unload() is
called by spa_evict_all() on system reboot/shutdown. Setting
spa_final_txg in spa_unload() fixes this issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17395
This patch fixes a race where vdev_remove_wanted may be set after
probe initiation, which could otherwise trigger a redundant fault and
removal.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17400
Usually the I/O type can be inferred from the other fields (in
particular, priority and flags), but sometimes it's not easy to see.
This is just another little debug helper.
May 27 2025 00:54:54.024110493 ereport.fs.zfs.data
class = "ereport.fs.zfs.data"
ena = 0x1f5ecfae600801
...
zio_delta = 0x0
zio_type = 0x2 [WRITE]
zio_priority = 0x3 [ASYNC_WRITE]
zio_objset = 0x0
Document zio_type and zio_priority.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17381
The module parameter is now represented in the FreeBSD sysctl list
with the name 'vfs.zfs.vol.inhibit_dev'. The default value is '0',
the same as on the Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17384
As commit 320f0c6 did for Linux, connect POSIX_FADV_WILLNEED
up to dmu_prefetch() on FreeBSD.
While there, fix portability problems in tests/functional/fadvise.
1. Instead of relying on the numerical values of POSIX_FADV_XXX macros,
accept macro names as arguments to the file_fadvise program; see the
sketch after this list. (The numbers happen to match on Linux and
FreeBSD, but future systems may vary, and it seems a little
strange/raw to count on that.)
2. For implementation reasons, SEQUENTIAL doesn't reach ZFS via FreeBSD
VFS currently (perhaps something that should be investigated in
FreeBSD). Since on Linux we're treating SEQUENTIAL and WILLNEED the
same, it doesn't really matter which one we use, so switch the test
over to WILLNEED to exercise the new prefetch code on both OSes the
same way.
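For point 1, the helper can map macro names to local values at compile
time, along the lines of this sketch (argument plumbing omitted):

    #include <fcntl.h>
    #include <string.h>

    /* Translate a POSIX_FADV_* macro name into this system's
     * value, so the test never depends on the raw numbers
     * matching across OSes. */
    static int
    advice_from_name(const char *name)
    {
        if (strcmp(name, "POSIX_FADV_WILLNEED") == 0)
            return (POSIX_FADV_WILLNEED);
        if (strcmp(name, "POSIX_FADV_SEQUENTIAL") == 0)
            return (POSIX_FADV_SEQUENTIAL);
        if (strcmp(name, "POSIX_FADV_DONTNEED") == 0)
            return (POSIX_FADV_DONTNEED);
        return (-1);
    }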
Reviewed-by: Mateusz Guzik <mjg@FreeBSD.org>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Thomas Munro <tmunro@FreeBSD.org>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes #17379
Three occurrences with an 'e', and all of them mine. Maybe it's a
British thing?
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
If a variable is only available in the kernel, then the tunable should
also only be available there.
This matters very little so long as we don't have userspace tunables,
but it's still good hygiene.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
It actually doesn't matter if it's not initialised when we first
query the current value; it just returns an empty string. A crash is
quite obnoxious even if it is a rare case.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
Likely it's only int64 for comparison with ssize_t, which is signed.
However, it would make no sense for it to be less than 0 or greater than
4G, so making it a regular uint will make it safe for comparison and
remove the only S64 tunable in core.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
failmode=continue is in a sorry state. Originally designed to fix a very
specific problem, it causes crashes and panics for most people who end
up trying to use it. At this point, we should either remove it entirely,
or try to make it more usable.
With this patch, I choose the latter. While the feature is fundamentally
unpredictable and prone to race conditions, it should be possible to get
it to the point where it can at least sometimes be useful for some
users. This patch fixes one of the major issues with failmode=continue:
it interrupts even ZIOs that are patiently waiting in line behind stuck
IOs.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17372
This is the cheap way to keep non-user functions working after
break-on-suspend becomes default.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
This adjusts dmu_tx_assign/dmu_tx_wait to be interruptable if the pool
suspends while they're waiting, rather than just on the initial check
before falling back into a wait.
Since that's not always wanted, add a DMU_TX_SUSPEND flag to ignore
suspend entirely, effectively returning to the previous behaviour.
With that, it shouldn't be possible for anything with a standard
dmu_tx_assign/wait/abort loop to block under failmode=continue.
This should also be a bit tighter than the old behaviour, where a
VERIFY0(dmu_tx_assign(DMU_TX_WAIT)) could technically fail if the pool
was already suspended with failmode=continue.
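For reference, a sketch of that standard loop (holds and error
handling abbreviated; DMU_TX_NOWAIT stands in for the usual
non-blocking first attempt):

    top:
        tx = dmu_tx_create(os);
        dmu_tx_hold_write(tx, object, off, len);
        error = dmu_tx_assign(tx, DMU_TX_NOWAIT);
        if (error != 0) {
            if (error == ERESTART) {
                /* Throttled: wait for the next txg and retry.
                 * With this change the wait also breaks out,
                 * with an error, if the pool suspends. */
                dmu_tx_wait(tx);
                dmu_tx_abort(tx);
                goto top;
            }
            dmu_tx_abort(tx);
            return (error);
        }
        /* ... dirty the held buffers ... */
        dmu_tx_commit(tx);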
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
Mostly for a little more type checking and debugging visibility.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
This allows a caller to request a wait for txg sync, with an appropriate
error return if the pool is suspended or becomes suspended during the
wait.
To support this, txg_wait_kick() is added to signal the sync condvar,
which wakes up the waiters, causing them to loop and reconsider their
wait conditions again. zio_suspend() now calls this to trigger the break
if the pool suspends while waiting.
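A sketch of the kick itself, using the existing sync condvar fields in
tx_state_t (details may differ from the final code):

    /* Wake all txg sync waiters so they loop around and re-check
     * their wait conditions, e.g. notice the pool suspended. */
    void
    txg_wait_kick(dsl_pool_t *dp)
    {
        tx_state_t *tx = &dp->dp_tx;
        mutex_enter(&tx->tx_sync_lock);
        cv_broadcast(&tx->tx_sync_done_cv);
        mutex_exit(&tx->tx_sync_lock);
    }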
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
It was reported that channel programs' zfs.get_prop doesn't work for
the dataset properties encryption and encryptionroot.
They are handled in get_special_prop due to the need to call
dsl_dataset_crypt_stats to load those dsl props.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Graham Christensen <graham@grahamc.com>
Closes #17280
It's been many years, we can probably do without.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17376
In truenas_pylibzfs, we query the list of encrypted datasets several
times, which is expensive. This commit exposes a public API,
zfs_is_encrypted(), to get the encryption status from the fast stat
path without having to refresh the properties.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17368
Before this change, the write log size TXG throttling mechanism
accounted only for user payload bytes. But the actual ZIL, both on
disk and especially in memory, includes headers of hundred(s) of
bytes. Not accounting for those may allow applications like bonnie++,
in their wisdom writing one byte at a time, to consume an excessive
amount of memory and ZIL/SLOG in one TXG.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17373
Without this fix, zfs_range_tree_find_in could return an overlap when
the found range starts immediately after the searched range, with no
actual overlap.
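The fix is the classic half-open interval test; a self-contained
sketch of the correct predicate:

    /*
     * Half-open ranges [s1, e1) and [s2, e2) overlap only if each
     * starts strictly before the other ends.  A found range that
     * starts exactly at the searched range's end is adjacent, not
     * overlapping -- the case the old code got wrong.
     */
    static boolean_t
    ranges_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
    {
        return (s1 < e2 && s2 < e1);
    }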
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17363
We don't really need to access the space map to know where the
metaslab ends, while msp->ms_sm might be NULL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Fixes #17164
Fixes #17359
Closes #17361
This was caught when doing a manual check to see if #17352 needed to be
improved to catch mismatches across stack frames of the kind that were
first found in #17340.
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard@ryao.dev>
Closes #17353
Bisecting identified redacted send/receive as the source of the bug
in issue #12014. Specifically, the call to
dsl_dataset_hold_obj(&fromds) was replaced by
dsl_dataset_hold_obj_flags(), which passes a DECRYPT flag and creates
a key mapping. A subsequent dsl_dataset_rele_flags(&fromds) is
missing, so the key mapping is not cleared. It may then be
inadvertently used, which results in arc_untransform failing with
ECKSUM in:
arc_untransform+0x96/0xb0 [zfs]
dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
dbuf_read+0x56/0x770 [zfs]
dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
zap_lockdir+0x87/0xf0 [zfs]
zap_lookup_norm+0x5c/0xd0 [zfs]
zap_lookup+0x16/0x20 [zfs]
zfs_get_zplprop+0x8d/0x1d0 [zfs]
setup_featureflags+0x267/0x2e0 [zfs]
dmu_send_impl+0xe7/0xcb0 [zfs]
dmu_send_obj+0x265/0x360 [zfs]
zfs_ioc_send+0x10c/0x280 [zfs]
Fix this by restoring the call to dsl_dataset_hold_obj().
The same applies to to_ds: here, replace dsl_dataset_rele(&to_ds)
with dsl_dataset_rele_flags().
Both leaked key mappings will cause a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.
Contributions-by: Hank Barta <hbarta@gmail.com>
Contributions-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12014
Closes #17340
UIO_DIRECT means we can do Direct I/O, while DMU_DIRECTIO means we
want to do it. The first does not automatically imply the second.
Add a few checks to avoid using Direct I/O in the few cases where we
don't want it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17342
Loss of one indirect block of the meta dnode likely means loss of the
whole dataset. That is worse than the one file the man page promises,
and in my opinion is not much better than the "none" mode. This
change restores the redundancy of the meta-dnode indirect blocks,
while at the same time correcting the expectations set in the man
page.
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17339
Currently, commands that resume a scrub/errorscrub from a paused state
don't get logged in the pool history. This is because resumes actually
return ECANCELED, instead of 0. This causes the tsd code in the common
ioctl logic to not think the ioctl succeeded, which causes the
log_history ioctl to fail with EPERM. However, for resuming a scrub from
a paused state, ECANCELED is success.
There are two options for how to deal with this. The first is the one
that I implemented here; I can't find a good reason for dsl_scan to
return ECANCELED on resume instead of 0, so let's just not. The only
place we check for the ECANCELED value is in zpool_scan, where we just
convert it back to zero. However, I am aware that this is changing an
ioctl interface, which I believe is a breaking change. I don't think
it's an important change, but maybe there is someone who relies on it.
The other option that could be implemented is to either allow ECANCELED
specifically from dsl_scan in the common ioctl code, or add a generic
facility to the common ioctl code that allows each command to specify
whether or not success happened, regardless of the return values. I am
open to feedback on which option people think would be better.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17301
On systems with enormous amounts of memory, the single arc_evict thread
can become a bottleneck if reads and writes are stuck behind it, waiting
for old data to be evicted before new data can take its place.
This commit adds support for evicting from multiple ARC lists in
parallel, by farming the evict work out to some number of threads and
then accumulating their results.
A new tuneable, zfs_arc_evict_threads, sets the number of threads. By
default, it will scale based on the number of CPUs.
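Conceptually it looks like this sketch (illustrative only: evict_cb()
and evict_arg_t are hypothetical stand-ins, not the actual arc_evict
code):

    /* Farm the eviction work out to the taskq threads, then
     * wait and accumulate their per-thread results. */
    uint64_t total = 0;
    for (int i = 0; i < nthreads; i++)
        taskq_dispatch(arc_evict_taskq, evict_cb, &args[i],
            TQ_SLEEP);
    taskq_wait(arc_evict_taskq);
    for (int i = 0; i < nthreads; i++)
        total += args[i].bytes_evicted;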
Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Youzhong Yang <youzhong@gmail.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Closes #16486
ARC target size might drop significantly under memory pressure,
especially if the current ARC size was much smaller than the target.
Since the dbuf cache size is a fraction of the target ARC size, it
might need eviction too. Aside from the memory freed by the dbuf
eviction itself, it might help the ARC by making more buffers
evictable.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17314
Before Direct I/O was implemented, I had implemented a lighter version
I called Uncached I/O. It uses the normal DMU/ARC data path with some
optimizations, but evicts data from the caches as soon as possible and
reasonable. Originally I wired it only to the primarycache property,
but this now completes the integration all the way up to the VFS.
While Direct I/O has the lowest possible memory bandwidth usage, it
also has a significant number of limitations. It requires I/Os to be
page aligned, does not allow speculative prefetch, etc. Uncached I/O
does not have those limitations, but instead requires an additional
memory copy, though still one less than regular cached I/O. As such,
it should fill the gap between the two. Considering this, I've
disabled the annoying EINVAL errors on misaligned requests, adding a
tunable for those who want to test their applications.
To pass the information between the layers I had to change a number of
APIs. But as a side effect, upper layers can now control not only the
caching, but also speculative prefetch. I haven't wired it to the VFS
yet, since that requires looking at some OS specifics. But while
there, I've implemented speculative prefetch of indirect blocks for
Direct I/O, controllable via all the same mechanisms.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Fixes #17027
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Nothing modifies them, and nothing should, so let's try to enforce that.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
This allows rewriting the content of specified file(s) as-is, without
modification, but at a different location, and with different
compression, checksum, dedup, copies and other parameter values. It
is faster than read plus write, since it does not require copying data
to user-space. It is also faster for sync=always datasets, since
without data modification it does not require ZIL writing. Also,
since it is protected by normal range locks, it can be done under any
other load. It also does not affect the file's modification time or
other properties.
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
db.db_mtx must be held any time that db.db_data is accessed. All of
these functions do have the lock held by a parent; add assertions to
ensure that it stays that way.
See https://github.com/openzfs/zfs/discussions/17118
* Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't
required.
* Refactor dbuf_hold_copy to eliminate the db_rwlock
Copy data into the newly allocated buffer before assigning it to the db.
That way, there will be no need to take db->db_rwlock.
* Refactor dbuf_read_hole
In the case of an indirect hole, initialize the newly allocated buffer
before assigning it to the dmu_buf_impl_t.
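The ordering that makes the lock unnecessary, sketched (simplified;
buffer allocation details differ in the real code):

    /* Fill the new buffer completely before attaching it to the
     * dbuf, so no other thread can observe partial contents and
     * db_rwlock is not needed for the copy. */
    arc_buf_t *buf = arc_alloc_buf(spa, db, type, size);
    memcpy(buf->b_data, src, size);
    dbuf_set_data(db, buf);     /* publish only when complete */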
Sponsored by: ConnectWise
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #17209
Make zvol I/O request processing asynchronous on the FreeBSD side in
some cases. Clone the zvol threading logic and required module
parameters from the Linux side. Make the zvol threadpool
creation/destruction logic shared between Linux and FreeBSD.
The I/O requests are processed asynchronously in the following cases:
- volmode=geom: if the I/O request thread is the geom thread or cannot
sleep.
- volmode=cdev: if the I/O request was passed through the struct
cdevsw .d_strategy routine, meaning it is an AIO request.
In all other cases the I/O requests are processed synchronously. The
volthreading zvol property is ignored on the FreeBSD side.
Sponsored-by: vStack, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: @ImAwsumm
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17169
It's been dead ever since 5fa356ea44
Sponsored by: ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #17119
With certain combinations of target ARC state balance and ghost hit
rates, it was possible for the fractions to go outside the allowed
range. This patch limits the maximum balance adjustment speed, which
should make that impossible, and also asserts it.
Fixes #17210
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
When forced to resort to ganging, ZFS currently allocates three child
blocks, each one third of the size of the original. This is true
regardless of whether larger allocations could be made, which would
allow us to have fewer gang leaves. This improves performance when
fragmentation is high enough to require ganging, but not so high that
all the free ranges are only just big enough to hold a third of the
recordsize. This is also useful for improving the behavior of a future
change to allow larger gang headers.
We add the ability for the allocation codepath to allocate a range of
sizes instead of a single fixed size. We then use this to pre-allocate
the DVAs for the gang children. If those allocations fail, we fall back
to the normal write path, which will likely re-gang.
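The payoff is fewer leaves whenever the free segments allow it; a
self-contained sketch of the arithmetic (illustrative, not the
allocator itself):

    /* With fixed thirds, a block always gangs into 3 leaves.
     * Allocating a range of sizes needs only as many leaves as
     * the largest allocatable segment forces. */
    static uint64_t
    gang_leaves_needed(uint64_t asize, uint64_t max_seg)
    {
        return ((asize + max_seg - 1) / max_seg);   /* ceiling */
    }
    /* e.g. 128K with 96K segments available: 2 leaves, not 3. */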
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.
This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.
Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.
The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).
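Callers then branch on the returned code; a sketch (assuming EINTR is
returned for the signal case):

    error = txg_wait_synced_flags(dp, txg, TXG_WAIT_SIGNAL);
    if (error == EINTR) {
        /* Interrupted by a signal before the txg synced. */
        return (error);
    }
    /* error == 0: the txg is synced. */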
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>