mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-23 19:04:45 +03:00

Author	SHA1	Message	Date
Rob Norris	65b4a5c551	Linux 7.1: access dentry d_alias directly The d_u union introduced in 3.18 is now anonymous, so we need to detect it and decide the right way to name d_alias. Note that we used to have support for both names to support kernels before 3.18, so this commit is effectively reverting the commit that removed that support, `efc293e371`. Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18471	2026-05-04 10:38:46 -07:00
Andriy Tkachuk	76fd64ac9f	Fix rare cksum errors after rebuild Currently, after rebuild (aka sequential resilver), checksum errors can be seen sometimes on the spare vdev or draid spare. On my laptop, it happens from 2 to 4 times of running redundancy_draid_spare1 test in a loop for 100 times. It looks like there's a race in vdev_rebuild_thread() when the rebuild of space map ranges is finished and we re-enable allocations from the metaslab too soon: a new allocations may happen from that metaslab before txg with the rebuilt ranges is sync-ed, causing undesirable interference. Solution: wait for the txg to be sync-ed before enabling metaslab. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com> Closes #18307 Closes #18319 Closes #18473	2026-05-04 10:38:46 -07:00
Alek P	7590972f76	Prevent range tree corruption race by updating dnode_sync() Switch to incremental range tree processing in dnode_sync() to avoid unsafe lock dropping during zfs_range_tree_walk(). This also ensures the free ranges remain visible to dnode_block_freed() throughout the sync process, preventing potential stale data reads. This patch: - Keeps the range tree attached during processing for visibility. - Processes segments one-by-one by restarting from the tree head. - Uses zfs_range_tree_clear() to safely handle ranges that may have been modified while the lock was dropped. - adds ASSERT()s to document that we don't expect dn_free_ranges modification outside of sync context. Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Issue #18186 Closes #18235	2026-04-27 10:57:52 -07:00
Andriy Tkachuk	da44040bbb	draid: fix cksum errors after rebuild with degraded disks Currently, when more than nparity disks get faulted during the rebuild, only first nparity disks would go to faulted state, and all the remaining disks would go to degraded state. When a hot spare is attached to that degraded disk for rebuild creating the spare mirror, only that hot spare is getting rebuilt, but not the degraded device. So when later during scrub some other attached draid spare happens to map to that spare, it will end up with cksum error. Moreover, if the user clears the degraded disk from errors, the data won't be resilvered to it, hot spare will be detached almost immediately and the data that was resilvered only to it will be lost. Solution: write to all mirrored devices during rebuild, similar to traditional/healing resilvering, but only if we can verify the integrity of the data, or when it's the draid spare we are writing to, in which case we are writing to a reserved spare space, and there is no danger to overwrite any good data. The argument that writing only to rebuilding draid spare vdev is faster than writing to normal device doesn't hold since, at a specific offset being rebuilt, draid spare will be mapped to a normal device anyway. redundancy_draid_degraded2 automation test is added also to cover the scenario. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andriy Tkachuk <atkachuk@wasabi.com> Closes #18414	2026-04-23 15:00:46 -07:00
Rob Norris	fc285caa84	linux/vfsops: remove zfs_mnt_t, pass directly A cleanup of opportunity. Since we already are modifying the contents of zfs_mnt_t, we've broken any API guarantee, so we might as well go the rest of the way and get rid of it, and just pass the osname and/or the vfs_t directly. It seems like zfs_mnt_t was never really needed anyway; it was added in `1c2555ef92` (March 2017) to minimise the difference to illumos, but zfs_vfsops was made platform-specific anyway in `7b4e27232d`. We also remove setting SB_RDONLY on the caller's flags when failing a read-write remount on a read-only snapshot or pool. Since `0f608aa6ca` the caller's flags have been a pointer back to fc->sb_flags, which are discarded without further ceremony when the operation fails, so the change is unnecessary and we can simplify the call further. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-23 14:59:31 -07:00
Rob Norris	43eed9ee41	linux/super: match vfs_t lifetime to fs_context vfs_t is initially just parameters for the mount or remount operation, so match them to the lifetime of the fs_context that represents that operation. When we actually execute the operation (calling .get_tree or .reconfigure), transfer ownership of those options to the associated zfsvfs_t. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-23 14:59:18 -07:00
Rob Norris	7843c42b27	linux/vfsops: add vfs_t allocator, make public In a few commits, we're going to need to allocate and free vfs_t from zpl_super.c as well, so lets keep them uniform. Sponsored-by: TrueNAS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18377	2026-04-23 14:59:02 -07:00
Andriy Tkachuk	938c8c98b1	draid: fix data corruption after disk clear Currently, when there there are several faulted disks with attached dRAID spares, and one of those disks is cleared from errors (zpool clear), followed by its spare being detached, the data in all the remaining spares that were attached while the cleared disk was in FAULTED state might get corrupted (which can be seen by running scrub). In some cases, when too many disks get cleared at a time, this can result in data corruption/loss. dRAID spare is a virtual device whose blocks are distributed among other disks. Those disks can be also in FAULTED state with attached spares on their own. When a disk gets sequentially resilvered (rebuilt), the changes made by that resilvering won't get captured in the DTL (Dirty Time Log) of other FAULTED disks with the attached spares to which the data is written during the resilvering (as it would normally be done for the changes made by the user if a new file is written or some existing one is deleted). It is because sequential resilvering works on the block level, without touching or looking into metadata, so it doesn't know anything about the old BPs or transactions groups that it is resilvering. So later on, when that disk gets cleared from errors and healing resilvering is trying to sync all the data from its spare onto it, all the changes made on its spare during the resilvering of other disks will be missed because they won't be captured in its DTL. That's why other dRAID spares may get corrupted. Here's another way to explain it that might be helpful. Imagine a scenario: 1. d1 fails and gets resilvered to some spare s1 - OK. 2. d2 fails and gets sequentially resilvered on draid spare s2. Now, in some slices, s2 would map to d1, which is failed. But d1 has s1 spare attached, so the data from that resilvering goes to s1, but not recorded in d1's DTL. 3. Now, d1 gets cleared and its s1 gets detached. All the changes done by the user (writes or deletions) have their txgs captured in d1's DTL, so they will be resilvered by the healing resilver from its spare (s1) - that part works fine. But the data which was written during resilvering of d2 and went to s1 - that one will be missed from d1's DTL and won't get resilvered to it. So here we are: 4. s2 under d2 is corrupted in the slices which map to d1, because d1 doesn't have that data resilvered from s1. Now, if there are more failed disks with draid spares attached which were sequentially resilvered while d1 was failed, d3+s3, d4+s4 and so on - all their spares will be corrupted. Because, in some slices, each of them will map to d1 which will miss their data. Solution: add all known txgs starting from TXG_INITIAL to DTLs of non-writable devices during sequential resilvering so when healing resilver starts on disk clear, it would be able to check and heal blocks from all txgs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #18286 Closes #18294	2026-04-23 14:54:23 -07:00
Andriy Tkachuk	33961142a2	Fix deadlock on dmu_tx_assign() from vdev_rebuild() vdev_rebuild() is always called with spa_config_lock held in RW_WRITER mode. However, when it tries to call dmu_tx_assign() the latter may hang on dmu_tx_wait() waiting for available txg. But that available txg may not happen because txg_sync takes spa_config_lock in order to process the current txg. So we have a deadlock case here: - dmu_tx_assign() waits for txg holding spa_config_lock; - txg_sync waits for spa_config_lock not progressing with txg. Here are the stacks: __schedule+0x24e/0x590 schedule+0x69/0x110 cv_wait_common+0xf8/0x130 [spl] __cv_wait+0x15/0x20 [spl] dmu_tx_wait+0x8e/0x1e0 [zfs] dmu_tx_assign+0x49/0x80 [zfs] vdev_rebuild_initiate+0x39/0xc0 [zfs] vdev_rebuild+0x84/0x90 [zfs] spa_vdev_attach+0x305/0x680 [zfs] zfs_ioc_vdev_attach+0xc7/0xe0 [zfs] cv_wait_common+0xf8/0x130 [spl] __cv_wait+0x15/0x20 [spl] spa_config_enter+0xf9/0x120 [zfs] spa_sync+0x6d/0x5b0 [zfs] txg_sync_thread+0x266/0x2f0 [zfs] The solution is to pass txg returned by spa_vdev_enter(spa) at the top of spa_vdev_attach() to vdev_rebuild() and call dmu_tx_create_assigned(txg) which doesn't wait for txg. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #18210 Closes #18258	2026-04-23 14:54:14 -07:00
Rob Norris	ffa0a5af30	Linux 7.0: posix_acl_to_xattr() now allocates memory Kernel devs noted that almost all callers to posix_acl_to_xattr() would check the ACL value size and allocate a buffer before make the call. To reduce the repetition, they've changed it to allocate this buffer internally and return it. Unfortunately that's not true for us; most of our calls are from xattr_handler->get() to convert a stored ACL to an xattr, and that call provides a buffer. For now we have no other option, so this commit detects the new version and wraps to copy the value back into the provided buffer and then free it. Sponsored-by: TrueNAS Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@truenas.com> Closes #18216	2026-04-23 14:31:09 -07:00
Rob Norris	fc44c73021	build: add SPDX license tags to build system files Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18077	2026-04-23 14:29:46 -07:00
Alexander Motin	3dcd071b51	Fix available space accounting for special/dedup (#18222 ) Currently, spa_dspace (base to calculate dataset AVAIL) only includes the normal allocation class capacity, but dd_used_bytes tracks space allocated across all classes. Since we don't want to report free space of other classes as available (we can't promise new allocations will be able to use it), report only allocated space, similar to how we report space saved by dedup and block cloning. Since we need deflated space here, make allocation classes track deflated allocated space also. While here, make mc_deferred also deflated, matching its use contexts. Also while there, use atomic_load() to read the allocation class stats. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18190 Closes #18222	2026-02-19 11:14:37 -08:00
Alexander Motin	25327ed7ce	Improve caching for dbuf prefetches To avoid read errors with transaction open dmu_tx_check_ioerr() is used to read everything required in advance. But there seems to be a chance for the buffer to evicted from dbuf cache in between, which result in immediate eviction from ARC, which may require additional disk read later in a place where error handling is problematic. To partially workaround this introduce a new flag DMU_IS_PREFETCH, relayed to ARC as ARC_FLAG_PREFETCH \| ARC_FLAG_PRESCIENT_PREFETCH, making ARC delay eviction by at least several seconds, or till the actual read inside the transaction, that will promote it to demand access. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18160	2026-02-17 11:54:58 -08:00
Brian Behlendorf	c710f87923	mmp: claim sequence id before final import As part of SPA_LOAD_IMPORT add an additional activity check to detect simultaneous imports from different hosts. This check is only required when the timing is such that there's no activity for the the read-only tryimport check to detect. This extra safety chceck operates as follows: 1. Repeats the following MMP check 10 times: a. Write out an MMP uberblock with the best txg and a random sequence id to all primary pool vdevs. b. Verify a minimum number of good writes such that even if the pool appears degraded on the remote host it will see at least one of the updated MMP uberblocks. c. Wait for the MMP interval this leaves a window for other racing hosts to make similar modifications which can be detected. d. Call vdev_uberblock_load() to determine the best uberblock to use, this should be the MMP uberblock just written. e. Verify the txg and random sequeunce number match the MMP uberblock written in 1a. 2. Restore the original MMP uberblocks. This allows the check to be performed again if the pool fails to import for an unrelated reason. This change also includes some refactoring and minor improvements. - Never try loading earlier txgs during import when the import fails with EREMOTEIO or EINTER. These errors don't indicate the txg is damaged but instead that its either in use on a remote host or the import was interactively cancelled. No rewind is also performed for EBADD which can result from a stale trusted config when doing a verbatim import. - Refactor the code for consistent logging of the multihost activity check using spa_load_note() and console messages indicating when the activity check was trigger and the result. - Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier modification of the sequence number in an uberblock. - Added ZFS_LOAD_INFO_DEBUG environment variable which can be set to log to dump to stdout the spa_load_info nvlist returned during import. This is used by the updated mmp test cases to determine if an activity check was run and its result. - Standardize the mmp messages similarly to make it easier to find all the relevent mmp lines in the debug log. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-10 17:01:29 -08:00
Brian Behlendorf	96ffe51004	mmp: add spa_load_name() for tryimport Tryimport adds a unique prefix to the pool name to avoid name collisions. This makes it awkward to log user-friendly info during a tryimport. Add a spa_load_name() function which can be used to report the unmodified pool name. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com>	2026-02-10 17:01:29 -08:00
Erik Larsson	8a9bbaa7cf	Fix build for Linux 6.18 with PowerPC/RISC-V kernels. (#18145 ) The macro 'flush_dcache_page(...)' modifies the page flags, but in Linux 6.18 the type of the page flags changed from 'unsigned long' to the struct type 'memdesc_flags_t' with a single member 'f' which is the page flags field. Signed-off-by: Erik Larsson <catacombae@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2026-02-10 17:00:04 -08:00
Alexander Motin	2c9fec38d0	DDT: Add locking for table ZAP destruction Similar to BRT, DDT ZAP can be destroyed by sync context when it becomes empty. Respectively similar to BRT introduce RW-lock to protect open context methods from the destruction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18115	2026-02-05 13:48:31 -08:00
Dimitry Andric	6a9d7820e6	Rename several printf attributes declarations to __printf__ For kernel builds on FreeBSD, we redefine `__printf__` to `__freebsd_kprintf__`, to support FreeBSD kernel printf(9) extensions with clang. In OpenZFS various printf related functions are declared with `__attribute__((format(printf, X, Y)))`, so these won't work with the above redefinition. With clang 21 and higher, this leads to errors similar to: sys/contrib/openzfs/module/zfs/spa_misc.c:414:38: error: passing 'printf' format string where 'freebsd_kprintf' format string is expected [-Werror,-Wformat] 414 \| (void) vsnprintf(buf, sizeof (buf), fmt, adx); \| ^ Since attribute names can always be spelled with leading and trailing double underscores, rename these instances. Note that in the FreeBSD base system we usually use `__printflike` from `<sys/cdefs.h>`, but that does not apply to OpenZFS. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #18095	2026-02-05 13:48:31 -08:00
Rob Norris	cb1833023f	kmem: don't add __GFP_COMP for KM_VMEM allocations It hasn't been necessary since Linux 3.13 (torvalds/linux@a57a49887e), and since 6.19 the kernel warns if you use it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2026-02-05 13:48:31 -08:00
Rob Norris	ccf956c2b3	Linux 6.19: replace i_state access with inode_state_read_once() Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18053	2026-02-05 13:48:31 -08:00
Alexander Motin	4ab2027f59	DDT: Add/use zap_lookup_length_uint64_by_dnode() Unlike other ZAP consumers due to compression DDT does not know how big entry it is reading from ZAP. Due to this it called zap_length_uint64_by_dnode() and zap_lookup_uint64_by_dnode(), each of which does full ZAP entry lookup. Introduction of the combined ZAP method dramatically reduces the CPU overhead and locks contention at DBUF layer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18048	2026-02-05 13:48:31 -08:00
Alexander Motin	4905686e67	DDT: Switch to using ZAP _by_dnode() interfaces As was previously done for BRT, avoid holding/releasing DDT ZAP dnodes for every access. Instead hold the dnodes during all their life time, never releasing. While at this, add _by_dnode() interfaces for zap_length_uint64() and zap_count(), actively used by DDT code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18047	2026-02-05 13:48:31 -08:00
Alexander Motin	fa857113a3	DDT: Move logs searches out of the lock Postponing entry removal from the DDT log in case of hit till later single-threaded sync stage allows to make ddl_tree stable during multi-threaded ZIO processing stage. It allows to drop the DDT lock before the search instead of after, reducing the contention a lot. Actually ddt_log_update_entry() was already handling the case of entry present in the active log, so we only need to remove it from flushing log, if the entry happen to be there. My tests with parallel 4KB block writes show throughput increase from 480MB/s (122K blocks/s) to 827MB/s (212K blocks/s), even though still limited by the global DDT lock contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18044	2026-02-05 13:48:31 -08:00
Alexander Motin	2428043709	Improve async destroy processing timing Previous code effectively enforced that all async free ZIOs were _issued_ within the TXG timeout. But they could take forever to complete, especially if the required metadata were not in ARC. This patch introduces periodic waits every 2000 ZIOs, which should give at least somewhat reasonable TXG timings even for single HDD pools with empty ARC. And makes them complete within half of the TXG timeout, since we might still need time to sync DDT and BRT. While there, change zfs_max_async_dedup_frees semantics to include also clone and gang blocks, which are similar. Bump the default value from set long ago to be more forgiving to block cloning (still not having logs and benefiting from large TXGs), now that we have better working time limits. The limit now is a possible amount of dirty data produced by BRT updates. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18043	2026-02-05 13:48:31 -08:00
Alexander Motin	e865ddad5c	ZIO: ZIO_STAGE_DDT_WRITE is a blocking stage ddt_lookup() in zio_ddt_write() might require synchronous DDT ZAP read. Running it from interrupt taskq might lead to deadlock. Inclusion of ZIO_STAGE_DDT_WRITE into ZIO_BLOCKING_STAGES should hopefully fix that, even though I am not sure how I got there. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17981	2026-02-05 13:48:30 -08:00
Ameer Hamza	74bbdda1ef	Fix snapshot automount expiry cancellation deadlock A deadlock occurs when snapshot expiry tasks are cancelled while holding locks. The snapshot expiry task (snapentry_expire) spawns an umount process and waits for it to complete. Concurrently, ARC memory pressure triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the expiry task while holding locks. The umount process spawned by the expiry task blocks trying to acquire locks held by arc_prune, which is blocked waiting for the expiry task to complete. This creates a circular dependency: expiry task waits for umount, umount waits for arc_prune, arc_prune waits for expiry task. Fix by adding non-blocking cancellation support to taskq_cancel_id(). The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to reschedule the unmount, which needs to cancel any existing expiry task. It now uses non-blocking cancellation to avoid waiting while holding locks, breaking the deadlock by returning immediately when the task is already running. The per-entry se_taskqid_lock has been removed, with all taskqid operations now protected by the global zfs_snapshot_lock held as WRITER. Additionally, an se_in_umount flag prevents recursive waits when zfsctl_destroy() is called during unmount. The taskqid is now only cleared by the caller on successful cancellation; running tasks clear their own taskqid upon completion. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17941	2025-12-10 10:21:29 -08:00
Alexander Motin	e1f0baa546	FreeBSD: Remove HAVE_INLINE_FLSL use These macros are deprecated in FreeBSD kernel for several years, and unneeded for much longer. Instead, similar to Linux, let kernel let compiler do the right things. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #18004	2025-12-10 10:21:29 -08:00
Alexander Motin	a41ef36858	DDT: Reduce global DDT lock scope during writes Before this change DDT lock was taken 4 times per written block, and as effectively a pool-wide lock it can be highly congested. This change introduces a new per-entry dde_io_lock, protecting some fields during I/O ready and done stages, so that we don't need the global lock there. According to my write tests on 64-thread system with 4KB blocks this significantly reduce the global lock contention, reducing CPU usage from 100% to expected ~80%, and increasing write throughput by 10%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17960	2025-12-10 10:21:29 -08:00
Alexander Motin	a785ddc5f3	DDT: Switch to using wmsums for lookup stats ddt_lookup() is a very busy code under a highly congested global lock. Anything we can save here is very important. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17980	2025-12-10 10:21:29 -08:00
Mariusz Zaborski	1e8c96d7d5	Add knob to disable slow io notifications Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that allows users to disable notifications for slow devices. This prevents ZED and/or ZFSD from degrading the pool due to slow I/O. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org> Closes 17477	2025-11-12 13:07:14 -08:00
Alexander Motin	41878d57ea	Add BRT support to zpool prefetch command Implement BRT (Block Reference Table) prefetch functionality similar to existing DDT prefetch. This allows preloading BRT metadata into ARC to improve performance for block cloning operations and frees of earlier cloned blocks. Make -t parameter optional. When omitted, prefetch all supported metadata types (both DDT and BRT now). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17890	2025-11-12 13:07:09 -08:00
Rob Norris	ac0bc4cc00	spa_misc: add an API for spa_namespace_lock This is useful as debugging support, as it lets namespace lock operations be traced directly. It will also be useful for future work to reduce the use of spa_namespace_lock, traditionally a source of difficult deadlocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17906	2025-11-12 13:06:54 -08:00
Alexander Motin	e305c7d596	BRT: Fix ranges to blocks conversion math BRT_RANGESIZE_TO_NBLOCKS() takes number of ranges as its argument. To get number of blocks we should multiply it by the entry size, not divide by it, as it was due to missing parentheses. Before #17875 this could cause small memory corruptions for vdevs bigger than 64TB, but the change made the bug more noticeable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17886 Closes #17915	2025-11-12 13:06:48 -08:00
Alexander Motin	aaf374bd40	ZIO: Set minimum number of free issue threads to 32 Free issue threads might block waiting for synchronous DDT, BRT or GANG header reads. So unlike other taskqs using ZTI_SCALE to scale with number of CPUs, here we also need some amount of threads to potentially saturate pool reads. I am not sure we always want the 96 threads we had before ZTI_SCALE introduction at #11966 on small systems, but lets make it at least 32. While here, make free taskqs configurable, similar to read and write ones. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17903	2025-11-12 13:06:39 -08:00
Tony Hutter	a2a34d9212	Linux 6.17 compat: Fix broken projectquota on 6.17 We need to specifically use the FX_XFLAG_* macros in zpl_ioctl_attr() codepaths, and the FS__FL macros in the zpl_ioctl_flags() codepaths. The earlier code just assumes the FS__FL macros for both codepaths. The 6.17 kernel add a bitmask check in copy_fsxattr_from_user() that exposed this error via failing 'projectquota' ZTS tests. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17884 Closes #17869	2025-11-12 13:06:01 -08:00
Alexander Motin	e3acd0a728	Fix caching of DDT log and BRT Both DDT log and BRT counters we read on pool import and then only append or overwrite in full blocks. We don't need them in DMU or ARC caches. Fortunately we have DMU_UNCACHEDIO for this now. Even more we don't need BRT in non-evictable metadata DMU caches, since it will likely never fit there, while block the cache from its original users. Since DMU_OT_IS_METADATA_CACHED() has no way to differentiate the new metadata types, mark BRT with storage type of DMU_OT_DDT_ZAP. As side effect it will also put it on dedup device, but that should actually be right. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17875	2025-11-12 13:05:25 -08:00
Alexander Motin	5847626175	Pass flags to more DMU write/hold functions Over the time many of DMU functions got flags argument to control prefetch, caching, etc. Few functions though left without it, even though closer look shown that many of them do not require prefetch due to their access pattern. This patch adds the flags argument to dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(), passing DMU_READ_NO_PREFETCH where applicable. I am going to also pass DMU_UNCACHEDIO to some of them later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17872	2025-11-12 13:04:58 -08:00
Rob Norris	aeff23939a	Linux 6.18: generic_drop_inode() and generic_delete_inode() renamed Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-21 09:50:43 -07:00
Rob Norris	3e7e19e028	pool_iter_refresh: don't refresh pools twice In "all pools" mode, pool_iter_refresh() will call zpool_iter(), which will call zpool_refresh_stats() before calling add_pool(). If we already have the pool, this is a different handle, so we just release it and return. Back in pool_iter_refresh(), we then call zpool_stats_refresh() again for our handle on the same pool. All together, this means we're doing two ZFS_IOC_POOL_STATS calls into the kernel for every pool in the system. This isn't wrong, but it does double the pressure on global locks. Instead, we add a new function zpool_refresh_stats_from_handle() that simply copies the pool config and state from one handle to another, and use it to update our handle before we release it in add_pool(), so we only have one call per pool per interval. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17807	2025-10-21 09:50:43 -07:00
Igor Ostapenko	01180a63bd	spa_config: Rename spa_config_enter_mmp() to spa_config_enter_priority() Originally this was created for MMP, but now new cases are emerging where the same mechanism is required. Hence the name's generalization. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com> Closes #17793	2025-10-21 09:50:43 -07:00
Robert Evans	ead0fb736d	zinject: Introduce ready delay fault injection This adds a pause to the ZIO pipeline in the ready stage for matching I/O (data, dnode, or raw bookmark). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Robert Evans <evansr@google.com> Closes #17787	2025-10-21 09:50:43 -07:00
hoshinomori	f3295ec763	range_tree: drop duplicate zfs_ prefix from rs_set_fill_raw Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: hoshinomori <hoshinomori@owarisekai.moe> Closes #17800	2025-09-29 16:50:53 -07:00
Tony Hutter	9079f986ae	zvol: Fix blk-mq sync The zvol blk-mq codepaths would erroneously send FLUSH and TRIM commands down the read codepath, rather than write. This fixes the issue, and updates the zvol_misc_fua test to verify that sync writes are actually happening. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17761 Closes #17765	2025-09-29 16:50:43 -07:00
Brian Behlendorf	954fe5e1be	Add interface to interface spa_get_worst_case_min_alloc() function Provide an interface to retrieve the lowest and highest minimum allocation size for the normal allocation class. This can be used by external consumers of the DMU to estimate potential wasted capacity when setting the recordsize for an object. The new "min_alloc" and "max_alloc" keys are added to the pool configuration and used by default_volblocksize() to warn when an ineffecient block size is requested. For older kmods which don't yet include the new keys fallback to the previous logic. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17758	2025-09-25 12:08:14 -07:00
Rob Norris	15a6b982c5	linux/super: add tunable to request immediate reclaim of unused dentries Traditionally, unused dentries would be cached in the dentry cache until the associated entry is no longer on disk. The cached dentry continues to hold an inode reference, causing the inode to be pinned (see previous commit). Here we implement the dentry op d_delete, which is roughly analogous to the drop_inode superblock op, and add a zfs_delete_dentry tunable to control its behaviour. By default it continues the traditional behaviour, but when the tunable is enabled, we signal that an unused dentry should be freed immediately, releasing its inode reference, and so allowing that inode to be deleted if no longer in use. Sponsored-by: Klara, Inc. Sponsored-by: Fastmail Pty Ltd Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17746	2025-09-17 16:34:14 -07:00
Igor Ostapenko	1ca4cd8a33	Fix txg_log_time ZAP key typo Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com> Sponsored-by: Klara, Inc. Closes #17748	2025-09-15 12:44:01 -07:00
Allan Jude	6c4ede4026	ZFS allow send:encrypted A new `zfs allow` permissions that ONLY allows sending replication streams in raw (encrypted) mode, so encrypted data will not be decrypted as part of the replication process. Sponsored-by: Klara, Inc. Sponsored-by: Karakun AG Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Co-authored-by: JT Pennington <jt.pennington@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #17543	2025-09-12 15:05:02 -07:00
Paul Dagnelie	df55ba7c49	Detect a slow raidz child during reads A single slow responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, then have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped. Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios count` is incremented and a zevent class `ereport.fs.zfs.delay` is posted. The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Contributions-by: Don Brady <don.brady@klarasystems.com> Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17227	2025-09-10 15:31:30 -07:00
Paul Dagnelie	e2e708241a	Enable zhack to work properly with 4k sector size disks Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17576	2025-09-10 15:01:32 -07:00
Rob Norris	dc53e5c484	linux/rw_destroy: assert no holders before destroying While rw_destroy() may do nothing on Linux, we still want to make sure that we don't have any holders outstanding like we do for mutexes. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17718	2025-09-10 15:01:02 -07:00

1 2 3 4 5 ...

2669 Commits