mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Alexander Motin	a5b665df39	DDT: Switch to using wmsums for lookup stats ddt_lookup() is a very busy code under a highly congested global lock. Anything we can save here is very important. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17980	2025-12-01 10:36:31 -08:00
Alexander Motin	48f33c1ef2	DDT: Make children writes inherit allocator Even though unlike gang children it is not so critical for dedup children to inherit parent's allocator, there is still no reason for them to have allocation policy different from normal writes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17961	2025-12-01 10:30:27 -08:00
Rob Norris	c631f5e6c2	Linux: bump -std to gnu11 Linux switched from -std=gnu89 to -std=gnu11 in 5.18 (torvalds/linux@e8c07082a8). We've always overridden that with gnu99 because we use some newer features. More recent kernels are using C11 features in headers that we include. GCC generally doesn't seem to care, but more recent versions of Clang seem to be enforcing our gnu99 override more strictly, which breaks the build in some configurations. Just bumping our "override" to match the kernel seems to be the easiest workaround. It's an effective no-op since 5.18, while still allowing us to build on older kernels. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17954	2025-12-01 10:19:11 -08:00
Alexx Saver	39303febac	chksum: run 256K benchmark on demand, preserve chksum_stat_data Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexx Saver <lzsaver.eth@ethermail.io> Co-authored-by: Adam Moss <c@yotes.com> Closes #17945 Closes #17946	2025-12-01 10:14:52 -08:00
Ameer Hamza	36e4f18883	Fix taskq NULL pointer dereference on timer race Remove unsafe timer_pending() check in taskq_cancel_id() that created a race where: - Timer expires and timer_pending() returns FALSE - task_done() frees task with tqent_func = NULL - Timer callback executes and queues freed task - Worker thread crashes executing NULL function Always call timer_delete_sync() unconditionally to ensure timer callback completes before task is freed. Reliably reproducible by injecting mdelay(10) after setting CANCEL flag to widen the race window, combined with frequent task cancellations (e.g., snapshot automount expiry). Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17942	2025-11-19 08:21:10 -08:00
Brian Behlendorf	a49158c064	icp: remove global icp includes Only include the required icp headers. There's no need to include sys/zfs_context.h and pull in all of the zfs headers. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17861	2025-11-12 10:03:51 -08:00
Brian Behlendorf	801d9b4f96	debug: move all of the debug bits out of the spl Pull all of the internal debug infrastructure up in to the zfs code to clean up the layering. Remove all the dodgy usage of SET_ERROR and DTRACE_PROBE from the spl. Luckily it was lightly used in the spl layer so we're not losing much. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17861	2025-11-12 10:02:51 -08:00
Rob Norris	faa295b9a6	libspl: move SID definitions from zfs_context.h; remove kernel gate Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17861	2025-11-12 10:01:48 -08:00
Mariusz Zaborski	02fdd26e51	Add knob to disable slow io notifications Introduce a new vdev property `VDEV_PROP_SLOW_IO_REPORTING` that allows users to disable notifications for slow devices. This prevents ZED and/or ZFSD from degrading the pool due to slow I/O. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org> Closes #17477	2025-11-11 10:42:17 -08:00
Alexander Motin	b4f073b5a6	Add BRT support to zpool prefetch command Implement BRT (Block Reference Table) prefetch functionality similar to existing DDT prefetch. This allows preloading BRT metadata into ARC to improve performance for block cloning operations and frees of earlier cloned blocks. Make -t parameter optional. When omitted, prefetch all supported metadata types (both DDT and BRT now). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17890	2025-11-10 16:16:22 -08:00
Alexander Motin	cc5cae5475	BRT: Increase block size from 4KB to 8KB According to my observations, BRT ZAPs are typically compressible 3:1 for data and 2:1 for indirects. With ashift=12, typical these days, it means that by increasing the block size to 8KB we may get most of the possible compression, cutting the on-disk and in-ARC BRT footprint in half at the cost of some compression/decompression overhead, but without real write inflation, only some dirty data increase. Increasing to 32KB, similar to DDT, could further increase compression and storage efficiency, but at the cost of write inflation and a much bigger dirty data increase, which we cannot properly control now. So let's leave this for a time when the BRT log gets implemented. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17916	2025-11-10 15:44:46 -08:00
Alexander Motin	72b2a9571a	ZAP: Remove dmu_object_info_from_dnode() call dmu_object_info_from_dnode() takes two locks and copies plenty of data that we don't need in zap_lockdir_impl(). Just read dn_type directly in this hot path. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17921	2025-11-10 14:26:15 -08:00
Rob Norris	6e12f0bd77	spa_misc: add an API for spa_namespace_lock This is useful as debugging support, as it lets namespace lock operations be traced directly. It will also be useful for future work to reduce the use of spa_namespace_lock, traditionally a source of difficult deadlocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17906	2025-11-10 14:23:39 -08:00
Alexander Motin	baefe098ee	ZIO: Set minimum number of free issue threads to 32 Free issue threads might block waiting for synchronous DDT, BRT or GANG header reads. So unlike other taskqs using ZTI_SCALE to scale with the number of CPUs, here we also need some number of threads to potentially saturate pool reads. I am not sure we always want the 96 threads we had before the ZTI_SCALE introduction at #11966 on small systems, but let's make it at least 32. While here, make free taskqs configurable, similar to read and write ones. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17903	2025-11-08 14:41:53 -05:00
rmacklem	e26b9fc871	FreeBSD: Add support for _PC_CASE_INSENSITIVE FreeBSD now has a pathconf name called _PC_CASE_INSENSITIVE used to check if a file system performs case insensitive name lookups. This patch adds support for this name. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #17908	2025-11-08 13:20:23 -05:00
Brian Behlendorf	962474d1a2	zstd: disable intrinsics Disable the aarch64 NEON SIMD intrinsics for kernel builds. Safely using them in the kernel context requires saving/restoring the FPU registers which is not currently done. Additionally, remove the aarch64 optimized PREFETCH_L1 and PREFETCH_L2 instruction. Rely on the more portable compiler built ins. This lets us remove the problematic workaround in the aarch64_compat.h header which undefines the __aarch64__ macro. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17904 Closes #17852	2025-11-07 10:01:12 -08:00
Adi-Goll	54876ee85e	Fix typo in vdev_raidz.c Change the spelling of "begining" on line 4875 to "beginning". Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Adi Gollamudi <adigollamudi@gmail.com> Closes #17905	2025-11-07 09:55:03 -08:00
Tony Hutter	f93506d1df	Linux 6.17 compat: Fix broken projectquota on 6.17 We need to specifically use the FS_XFLAG_* macros in zpl_ioctl_attr() codepaths, and the FS_*_FL macros in the zpl_ioctl_flags() codepaths. The earlier code just assumes the FS_*_FL macros for both codepaths. The 6.17 kernel adds a bitmask check in copy_fsxattr_from_user() that exposed this error via failing 'projectquota' ZTS tests. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17884 Closes #17869	2025-11-05 16:22:03 -08:00
Paul Dagnelie	8c225ff1b4	Fix gang write late_arrival bug When a write comes in via dmu_sync_late_arrival, its txg is equal to the open TXG. If that write gangs, and we have not yet activated the new gang header feature, and the gang header we pick can store a larger gang header, we will try to schedule the upgrade for the open TXG + 1. In debug mode, this causes an assertion to trip. This PR sets the TXG for activating the feature to be the larger of either the current open TXG or the syncing TXG + 1. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17824	2025-11-05 11:40:22 -08:00
Robert Evans	d0294aa758	Update dnode_next_offset_level to accept blkid instead of offset Currently this function uses L0 offsets which: 1. is hard to read since it maps offsets to blkid and back each call 2. necessitates dnode_next_block to handle edge cases at limits 3. makes it hard to tell if the traversal can loop infinitely Instead, update this and dnode_next_offset to work in (blkid, index). This way the blkid manipulations are clear, and it's also clear that the traversal always terminates since blkid goes one direction. I've also considered updating dnode_next_offset to operate on blkid. Callers use both patterns, so maybe another PR can split the cases? While here tidy up dnode_next_offset_level comments. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Robert Evans <evansr@google.com> Closes #17792	2025-11-04 13:12:17 -08:00
Alexander Motin	6cfc3dba9c	Cleanup ZIO_FLAG_IO_RETRY vs TRYHARD usage In cases where all issued ZIOs must succeed, and we can't do anything clever about the errors, we should just explicitly set ZIO_FLAG_TRYHARD and let the OS do all the reasonable retries. In other cases, where retries can be different from the original, for example, some ZIOs are allowed to fail due to redundancy, or we can disable aggregation on retry to get at least some of the data, we can do the first pass without TRYHARD, and only if needed retry with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics). Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17877	2025-10-30 16:29:48 -07:00
Alexander Motin	ec268cdf97	Fix caching of DDT log and BRT We read both DDT log and BRT counters on pool import and then only append or overwrite them in full blocks. We don't need them in DMU or ARC caches. Fortunately we have DMU_UNCACHEDIO for this now. Even more, we don't need BRT in non-evictable metadata DMU caches, since it will likely never fit there, while blocking the cache from its original users. Since DMU_OT_IS_METADATA_CACHED() has no way to differentiate the new metadata types, mark BRT with storage type of DMU_OT_DDT_ZAP. As a side effect it will also put it on the dedup device, but that should actually be right. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17875	2025-10-30 16:28:28 -07:00
Alexander Motin	ea125eeb5d	BRT: Round bv_entcount up to BRT_BLOCKSIZE Since we set bv_mos_brtvdev block size, and since we keep dirty bitmap at the same granularity, we should keep the allocations and writes done with. Otherwise it makes the last block write short, that will be odd once we implement writing of only dirty blocks, but also requires read-modify-write on DMU layer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17875	2025-10-30 16:28:05 -07:00
Alexander Motin	dcada084b9	Pass flags to more DMU write/hold functions Over time, many DMU functions gained a flags argument to control prefetch, caching, etc. A few functions were left without it, though, even though a closer look showed that many of them do not require prefetch due to their access pattern. This patch adds the flags argument to dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(), passing DMU_READ_NO_PREFETCH where applicable. I am going to also pass DMU_UNCACHEDIO to some of them later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17872	2025-10-29 11:17:51 -07:00
Ryan Libby	0455150f11	FreeBSD zio_crypt.c: initialize uio variables before access In zio_crypt_key_wrap and zio_crypt_key_unwrap, the cuio_s variable was not initialized before the calls to zfs_uio_init, leading to uninitialized access to cuio_s.uio_offset. Initialize it to avoid gcc warnings. Similar issue as fixed in `2bf152021` ("Fix gcc uninitialized warning in FreeBSD zio_crypt.c") Signed-off-by: Ryan Libby <rlibby@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17863	2025-10-23 21:23:25 -04:00
Jean-Sébastien Pédron	3a55e76b84	FreeBSD: zfs_getpages: Don't zero freshly allocated pages Initially, `zfs_getpages()` is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock, but if there is a concurrent `zfs_write()` running and it fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`. After that, it grabs the pages again and retries to acquire the range lock, and so on. Once it has the range lock, it filters out valid pages, then copies DMU data to the remaining invalid pages. The problem is that freshly allocated zero'd pages it grabbed itself are marked as valid. Therefore they are skipped by the second part of the function and DMU data is never copied to these pages. This causes mapped pages to contain zeros instead of the expected file content. This was discovered while working on RabbitMQ on FreeBSD. I could reproduce the problem easily with the following commands: git clone https://github.com/rabbitmq/rabbitmq-server.git cd rabbitmq-server/deps/rabbit gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \ ct-amqp_client t=cluster_size_3:leader_transfer_stream_send The testsuite fails because there is a sendfile(2) that can happen concurrently to a write(2) on the same file. This leads to sendfile(2) or read(2) (after the sendfile) sending/returning data with zeros, which causes a function to crash. The patch consists of not setting the `VM_ALLOC_ZERO` flag when `zfs_getpages()` grabs pages again. Then, the last page is zero'd if it is invalid, in case it would be partially filled with the end of the file content. Other pages are either valid (and will be skipped) or they will be entirely overwritten by the file content. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Jean-Sébastien Pédron <dumbbell@FreeBSD.org> Closes #17851	2025-10-20 17:04:21 -07:00
Rob Norris	fe8b50f09f	Linux 6.18: generic_drop_inode() and generic_delete_inode() renamed Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	3651888182	sha256_generic: make internal functions a little more private Linux 6.18 has conflicting prototypes for various sha256_* and sha512_* functions, which we get through a very long include chain. That's tough to fix right now; easier is just to rename our internal functions. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	8911360a41	Linux 6.18: namespace type moved to ns_common The namespace type has moved from the namespace ops struct to the "common" base namespace struct. Detect this and define a macro that does the right thing for both versions. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	76c238f1ba	Linux 6.18: replace write_cache_pages() Linux 6.18 removed write_cache_pages() without a usable replacement. Here we implement a minimal zpl_write_cache_pages() that finds the dirty pages within the mapping, gets them into the expected state and hands them off to zfs_putpage(), which handles the rest. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	39db4bda80	Linux 6.18: block_device_operations->getgeo takes struct gendisk* Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	5de4a297e7	Linux 6.18: convert ida_simple_* calls ida_simple_get() and ida_simple_remove() are removed in 6.18. However, since 4.19 they have been simple wrappers around ida_alloc() and ida_free(), so we can just use those directly. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Rob Norris	9d50ee59dc	Linux 6.18: replace nth_page() Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-20 16:01:04 -07:00
Andrew Walker	adacf020ce	Fix return value for setting zvol threading We must return -1 instead of ENOENT if the special zvol threading property set function can't locate the dataset (this would typically happen with an encrypted and unmounted zvol) so that the operation gets inserted properly into the nvlist for operations to set. This is because we want the property to be set once the zvol is decrypted again. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #17836	2025-10-20 15:21:40 -07:00
Andrew Walker	783a02b5d3	Fix ZFS_READONLY implementation on Linux MS-FSCC 2.6 is the governing document for DOS attribute behavior. It specifies the following: For a file, applications can read the file but cannot write to it or delete it. For a directory, applications cannot delete it, but applications can create and delete files from the directory. Signed-off-by: Andrew Walker <awalker@ixsystems.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Closes #17837	2025-10-20 09:28:57 -04:00
Brian Behlendorf	5a03e358fc	Update device removal documentation Make a minor update to the 'zpool remove' man page to clarify both raidz and draid pools do not support removal, and change sector to ashift which is what we actually care about. Update the big theory comment in vdev_removal.c to accurately reflect which types of vdevs can be removed. Furthermore, I've added some discussion for the casual reader to briefly explain the top-level vdev removal restrictions. This has been a common area of confusion and it's not intuitive where they come from without understanding the implementation details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17847	2025-10-20 09:26:51 -04:00
Shreshth3	a5af3f2db7	arc: fix small typos Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Shreshth Srivastava <shreshthsrivastava2@gmail.com> Closes #17840	2025-10-13 11:23:55 -07:00
Mark Johnston	a9f2a1f361	Fix the type of the raidz_outlier_check_interval_ms parameter It's an hrtime_t, which is an unsigned long long. In practice this is just a U64. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #17833	2025-10-13 10:47:09 -07:00
Dag-Erling Smørgrav	6e5b836e9f	FreeBSD: Correct _PC_MIN_HOLE_SIZE The actual minimum hole size on ZFS is variable, but we always report SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe that they can reliably create holes at 512-byte boundaries and waste resources trying to punch holes that ZFS ends up filling anyway. * In the general case, if the vnode is a regular file, return its current block size, or the record size if the file is smaller than its own block size. If the vnode is a directory, return the dataset record size. If it is neither a regular file nor a directory, return EINVAL. * In the control directory case, always return EINVAL. Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17750	2025-10-08 09:13:22 -04:00
Tony Hutter	1861a329fb	zvol: verify IO type is supported ZVOLs don't support all block layer IO request types. Add a check for the IO types we do support. Also, remove references to io_is_secure_erase() since they are not supported on ZVOLs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17803	2025-10-06 16:54:09 -07:00
Mateusz Guzik	346ecac61b	Annotate arc_buf_is_shared as __maybe_unused Otherwise the compiler warns about it on production FreeBSD builds. The routine proved resilient to attempts to ifdef on debug. Sponsored by: Rubicon Communications, LLC ("Netgate") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #17818	2025-10-06 16:43:20 -07:00
Igor Ostapenko	cb3c18a9a9	ddt prune: Add SCL_ZIO deadlock workaround Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com> Closes #17793	2025-10-01 15:17:09 -07:00
Igor Ostapenko	e829e2fd04	spa_config: Rename spa_config_enter_mmp() to spa_config_enter_priority() Originally this was created for MMP, but now new cases are emerging where the same mechanism is required. Hence the name's generalization. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com> Closes #17793	2025-10-01 15:16:04 -07:00
Robert Evans	8869caae5f	zinject: Introduce ready delay fault injection This adds a pause to the ZIO pipeline in the ready stage for matching I/O (data, dnode, or raw bookmark). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Robert Evans <evansr@google.com> Closes #17787	2025-10-01 12:17:13 -07:00
Paul Dagnelie	fa4d4b1f80	Fix display of default xattr to show 'sa' When the default value of the xattr property was changed from 'dir' to 'sa', the code that displays the property's value was not affected. The problem with this state of affairs is that 1) user tooling that specifically looked for 'sa' before will be confused now that the code displays 'on' instead. And 2) users may be confused when manually running the commands about which specific type of xattr is in use unless they are up to date on the latest zfs changes. The fix here is to show the actual type always, rather than 'on' if we happen to be using the default. This turns out to be easy to do, by simply reordering the list of xattr values in the properties code. When the property is displayed, we iterate down the table until we find a row with a matching value, and use that row's name as the display. Reordering the row fixes the display without affecting any other code. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17801	2025-10-01 12:14:56 -07:00
hoshinomori	e4a407f29f	range_tree: drop duplicate zfs_ prefix from rs_set_fill_raw Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: hoshinomori <hoshinomori@owarisekai.moe> Closes #17800	2025-09-29 16:38:52 -07:00
Tony Hutter	8d4c3ee9e6	zvol: Fix blk-mq sync The zvol blk-mq codepaths would erroneously send FLUSH and TRIM commands down the read codepath, rather than write. This fixes the issue, and updates the zvol_misc_fua test to verify that sync writes are actually happening. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17761 Closes #17765	2025-09-29 16:29:20 -07:00
Robert Evans	26b0f561be	dnode_next_offset: backtrack if lower level does not match This changes the basic search algorithm from a single search up and down the tree to a full depth-first traversal to handle conditions where the tree matches at a higher level but not a lower level. Normally higher level blocks always point to matching blocks, but there are cases where this does not happen: 1. Racing block pointer updates from dbuf_write_ready. Before `f664f1ee7f` (#8946), both dbuf_write_ready and dnode_next_offset held dn_struct_rwlock which protected against pointer writes from concurrent syncs. This no longer applies, so sync context can f.e. clear or fill all L1->L0 BPs before the L2->L1 BP and higher BP's are updated. dnode_free_range in particular can reach this case and skip over L1 blocks that need to be dirtied. Later, sync will panic in free_children when trying to clear a non-dirty indirect block. This case was found with ztest. 2. txg > 0, non-hole case. This is #11196. Freeing blocks/dnodes breaks the assumption that a match at a higher level implies a match at a lower level when filtering txg > 0. Whenever some but not all L0 blocks are freed, the parent L1 block is rewritten. Its updated L2->L1 BP reflects a newer birth txg. Later when searching by txg, if the L1 block matches since the txg is newer, it is possible that none of the remaining L1->L0 BPs match if none have been updated. The same behavior is possible with dnode search at L0. This is reachable from dsl_destroy_head for synchronous freeing. When this happens open context fails to free objects leaving sync context stuck freeing potentially many objects. This is also reachable from traverse_pool for extreme rewind where it is theoretically possible that datasets not dirtied after txg are skipped if the MOS has high enough indirection to trigger this case. In both of these cases, without backtracking the search ends prematurely as ESRCH result implies no more matches in the entire object. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Robert Evans <evansr@google.com> Closes #16025 Closes #11196	2025-09-25 11:06:28 -07:00
Brian Behlendorf	c722bf8812	Add interface to spa_get_worst_case_min_alloc() function Provide an interface to retrieve the lowest and highest minimum allocation size for the normal allocation class. This can be used by external consumers of the DMU to estimate potential wasted capacity when setting the recordsize for an object. The new "min_alloc" and "max_alloc" keys are added to the pool configuration and used by default_volblocksize() to warn when an inefficient block size is requested. For older kmods which don't yet include the new keys, fall back to the previous logic. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17758	2025-09-25 09:35:35 -07:00
Rob Norris	ab8cc63c77	linux/super: add tunable to request immediate reclaim of unused dentries Traditionally, unused dentries would be cached in the dentry cache until the associated entry is no longer on disk. The cached dentry continues to hold an inode reference, causing the inode to be pinned (see previous commit). Here we implement the dentry op d_delete, which is roughly analogous to the drop_inode superblock op, and add a zfs_delete_dentry tunable to control its behaviour. By default it continues the traditional behaviour, but when the tunable is enabled, we signal that an unused dentry should be freed immediately, releasing its inode reference, and so allowing that inode to be deleted if no longer in use. Sponsored-by: Klara, Inc. Sponsored-by: Fastmail Pty Ltd Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17746	2025-09-17 08:16:32 -07:00
Rob Norris	ab93b4b70e	linux/super: add tunable to request immediate reclaim of unused inodes Traditionally, unused inodes would be held on the superblock inode cache until the associated on-disk file is removed or the kernel requests reclaim. On filesystems with millions of rarely-used files, this can be a lot of unusable memory. Here we implement the superblock drop_inode method, and add a zfs_delete_inode tunable to control its behaviour. By default it continues the traditional behaviour, but when the tunable is enabled, we signal that the inode should be deleted immediately when the last reference is dropped, rather than cached. This releases the associated data to the dbuf cache and ARC, allowing them to be reclaimed normally. Sponsored-by: Klara, Inc. Sponsored-by: Fastmail Pty Ltd Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17746	2025-09-17 08:15:56 -07:00
Rob Norris	f319ff3570	vdev_disk_close: take disk write lock before destroying it Many IO operations are submitted to the kernel async, and so the zio can complete and followup actions before the submission call returns. If one of the followup actions closes the disk (eg during pool create/import), the initiator may be left holding a lock on the disk at destruction. Instead, take the write lock before finishing up and decoupling the disk state from the vdev proper. The caller will hold until all IO is submitted and locks released. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17719	2025-09-15 09:12:24 -07:00
Alexander Motin	3f4312a0a4	Fix two infinite loops if dmu_prefetch_max set to zero Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17692 Closes #17729	2025-09-13 12:58:48 -04:00
Paul Dagnelie	9b772f328b	Fix time database update calculations The time database update math assumed that the timestamps were in nanoseconds, but at some point in the development or review process they changed to seconds. This PR fixes the math to use seconds instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17735	2025-09-12 16:33:36 -07:00
Alexander Motin	bc8bcfc71a	Fix type in dbrrd_closest() For ABS() to work, the argument must be signed, but rrdd_time is uint64_t. Clang noticed it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Fixes #16853 Closes #17733	2025-09-12 11:05:38 -07:00
Alexander Motin	cb5f9aa582	FreeBSD: Satisfy ASSERT_VOP_IN_SEQC() zfs_aclset_common() might be called for newly created or not even created vnodes, which triggers assertions on newer FreeBSD versions with DEBUG_VFS_LOCKS included into INVARIANTS. In the first case make sure to call vn_seqc_write_begin()/_end(), in the second just skip the assertion. Something similar has to be done for the project management IOCTL and file-based extended attributes, since those are not going through VFS. Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17722	2025-09-12 13:29:27 -04:00
Chunwei Chen	37cd30f714	Fix ddle memleak in ddt_log_load In ddt_log_load(), when removing dup entry from flushing tree, it doesn't free the entry causing memleak. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com> Closes #17657 Closes #17730	2025-09-12 10:05:06 -07:00
Allan Jude	7b1cc9eb61	ZFS allow send:encrypted A new `zfs allow` permission that ONLY allows sending replication streams in raw (encrypted) mode, so encrypted data will not be decrypted as part of the replication process. Sponsored-by: Klara, Inc. Sponsored-by: Karakun AG Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Co-authored-by: JT Pennington <jt.pennington@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #17543	2025-09-12 09:53:31 -07:00
Brian Behlendorf	bc0b5318aa	Prevent scrubbing a read-only pool While it would be nice to be able to scrub a pool imported read-only this will currently trip an ASSERT. Before we can support this there are some design challenges which need to be thought through first. For starters, a read-only import skips reading certain information from disk which it knows won't be needed, such as the space maps. Furthermore, the scrub process expects to checkpoint its progress, update the on disk error log, and issue repair IO. None of which would be possible when the pool is imported read-only. Each of these wrinkles can certainly be handled, but that will take some significant work. In the meanwhile we disable the 'zpool scrub' command when the pool is imported read-only. Reviewed-by: Alan Somers <asomers@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #17527 Closes #17717	2025-09-11 10:58:46 -07:00
Paul Dagnelie	d64711c202	Detect a slow raidz child during reads A single slow responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, then have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped. Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios` count is incremented and a zevent class `ereport.fs.zfs.delay` is posted. The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Contributions-by: Don Brady <don.brady@klarasystems.com> Contributions-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17227	2025-09-10 15:25:03 -07:00
Paul Dagnelie	0620c979a5	Remove RAIDZ reconstruct flags from debug defaults Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17227	2025-09-10 15:24:50 -07:00
Paul Dagnelie	bc4aac0395	Enable zhack to work properly with 4k sector size disks Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17576	2025-09-10 11:13:55 -07:00
Rob Norris	7939bad5e7	Linux 6.17: d_set_d_op() is no longer available We only have extremely narrow uses, so move it all into a single function that does only what we need, with and without d_set_d_op(). Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17621	2025-09-09 13:44:43 -07:00
Alan Somers	e29bfa5bd0	Fix warnings about sha2_is_supported on FreeBSD/i386 This is one problem currently preventing OpenZFS from building on FreeBSD/i386. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored by: ConnectWise Closes #17704	2025-09-09 09:56:38 -07:00
rmacklem	59f8f5dfe1	zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE which is the block size supported for block cloning for the file system. Since ZFS's block size varies per file, return the largest size likely to be used, or zero if block cloning is not supported. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #17645	2025-09-09 08:52:40 -07:00
Chunwei Chen	e3c3e86c04	Fix wrong dedup_table_size for legacy dedup If we call ddt_log_load() for legacy ddt, we will end up going into ddt_log_update_stats() and filling uninitialized value into ddo_dspace. This value will then get added to dedup_table_size during ddt_get_dedup_object_stats(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17019 Closes #17699 Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com>	2025-09-08 14:02:51 -07:00
Rob Norris	ced72fdd69	tunables: remove legacy FreeBSD aliases These are old pre-OpenZFS tunable names that have long been available via either conventional ZFS_MODULE_PARAM tunables or through kstats. There's no point doubling up anymore, so delete them. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17375	2025-09-08 10:03:01 -07:00
Rob Norris	64d3143e82	zvol: reject suspend attempts when zvol is shutting down Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17690	2025-09-03 11:13:09 -07:00
Ivan Shapovalov	14bad10f96	config: add and use KERNEL_CC check for `-Wno-format-zero-length` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #16997	2025-08-25 11:26:13 -07:00
youzhongyang	b6bd3228bb	Synchronize the update of feature refcount The concurrent execution of feature_sync() can lead to a panic due to an unprotected update of the feature refcount. Resolve this by using the spa->spa_feat_stats_lock to synchronize the update of the refcount. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #17184 Closes #17632	2025-08-22 16:35:58 -07:00
Konstantin Belousov	28ff57505b	FreeBSD: satisfy VFS requirements for readdir() zfsctl_root_readdir(): properly set eof. readdir(): set *eofp to 1 on eof. If there were no dirents to copy out, return EINVAL same as UFS. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Konstantin Belousov <kib@FreeBSD.org> Closes #17655	2025-08-22 09:14:36 -04:00
Rob Norris	574eec2964	dnode: remove dn_dirtyctx and dnode_dirtycontext Only used for a couple of debug assertions which had very little value. Setting it required taking certain locks, so we can remove all that too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Robert Evans <evansr@google.com> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16297 Closes #17652 Closes #17658	2025-08-21 06:05:38 -07:00
Rob Norris	aa6f0f878b	dnode: remove dn_dirtyctx_firstset Old debug param, not used for anything. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Robert Evans <evansr@google.com> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16297 Closes #17652 Closes #17658	2025-08-21 06:05:36 -07:00
Rob Norris	eecff1b4a9	dnode: remove dn_dirty_txg and DNODE_IS_DIRTY dn_dirty_txg only existed for DNODE_IS_DIRTY(). In turn, that only existed to ensure that a dnode was clean before making it eligible for removal from the array of cached dnodes attached to the object 0 L0 dbuf. dn_dirtycnt is enough to check that now, so use it directly and remove the rest. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Robert Evans <evansr@google.com> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16297 Closes #17652 Closes #17658	2025-08-21 06:05:35 -07:00
Rob Norris	f3e49b0cf5	dnode_is_dirty: reimplement in terms of dn_dirtycnt Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Robert Evans <evansr@google.com> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16297 Closes #17652 Closes #17658	2025-08-21 06:05:33 -07:00
Rob Norris	3abf72b251	dnode: add dn_dirtycnt, count of number of txgs this dnode is dirty on Bumped when we take the dirty hold in dnode_setdirty(), dropped when the dnode is finally cleaned up after sync in dnode_rele_task() or userquota_updates_task(). This gives us a way to check if the dnode is dirty on any txg without having to rely on outside information (eg presence on a dirty list), which has been a rich source of bugs in the past. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Suggested-by: Robert Evans <evansr@google.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Robert Evans <evansr@google.com> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16297 Closes #17652 Closes #17658	2025-08-21 06:05:29 -07:00
Dag-Erling Smørgrav	2c877e8453	FreeBSD: Set st_rdev to NODEV, not 0, when not a device Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org> Closes #17649	2025-08-19 12:42:21 -07:00
Rob Norris	dcd73069f0	zvol_remove_minors_impl: remove all async fallbacks Since both ZFS- and OS-sides of a zvol now take care of their own locking and don't get in each other's way, there's no need for the very complicated removal code to fall back to async tasks if the locks needed at each stage can't be obtained right now. Here we change it to be a linear three-step process: select zvols of interest and flag them for removal, then wait for them to shed activity and then remove them, and finally, free them. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17625	2025-08-19 10:06:47 -07:00
Rob Norris	8a0e5e8b54	zvol: stop using zvol_state_lock to protect OS-side private data zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc). This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shutdown and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this. The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep to wait for them while holding zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets even worse again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load. However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() _before_ the zvol_state_t is made invalid, but it won't happen until all users are ejected.
So, if we make access to the private object pointer atomic, we remove the need to take a global lockout to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side. While here, I've rewritten much of the locking theory comment at the top of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've tried to describe the purpose of each lock in a little more detail, and in particular describe where it should and shouldn't be used. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17625	2025-08-19 10:06:34 -07:00
Rob Norris	96f9d271ea	zvol: remove the OS-side minor before freeing the zvol When destroying a zvol, it is not "unpublished" from the system (that is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the time del_gendisk() and put_disk() are called, the device node may still have an active hold, from a userspace program or something inside the kernel (a partition probe). As it is currently, this can lead to calls to zvol_open() or zvol_release() while the zvol_state_t is partially or fully freed. zvol_open() has some protection against this by checking that private_data is NULL, but zvol_release does not. This implements a better ordering for all of this by adding a new OS-side method, zvol_os_remove_minor(), which is responsible for fully decoupling the "private" (OS-side) objects from the zvol_state_t. For Linux, that means calling put_disk(), nulling private_data, and freeing zv_zso. This takes the place of zvol_os_clear_private(), which was a nod in that direction but did not do enough, and did not do it early enough. Equivalent changes are made on the FreeBSD side to follow the API change. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17625	2025-08-19 10:06:21 -07:00
Rob Norris	b2c792778c	zvol: generalise zvol_remove_minors_impl() for single zvol case zvol_remove_minor_impl() and zvol_remove_minors_impl() should be identical except for how they select zvols to remove, so let's just use the same function with a flag to indicate if we should include children and snapshots or not. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17625	2025-08-19 10:06:11 -07:00
Brian Behlendorf	5061f959d1	Retire zfs_autoimport_disable kmod option Back in 2014 the zfs_autoimport_disable module option was added to control whether the kmods should load the pool configs from the cache file on module load. The default value since that time has been for the kernel to not process the cache file. Detecting and importing pools during boot is now controlled outside of the kmod on both Linux and FreeBSD. By all accounts this has been working well and we can remove this dormant code on the kernel side. The spa_config_load() function has been moved to userspace, it is now only used by libzpool. Additionally, the spa_boot_init() hook which was used by FreeBSD now looks to be unused and was removed. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17618	2025-08-14 14:58:58 -07:00
Alexander Motin	d151432073	ZIL: Make allocations more flexible When ZIL allocates space for new LWBs without knowing how much it will require, it can use new metaslab_alloc_range() function to allocate slightly more or less than it predicted. It allows to improve space efficiency by allocating bigger LWBs on RAIDZ/dRAID instead of padding and possibly packing more ZIL records there. It may also allow to reduce ganging in some cases by allowing to allocate smaller LWBs when we are not sure we'll need bigger. On the opposite side, when we allocate space for already closed LWBs, when we precisely know how much space we need, we may just allocate what we need instead of relying on writing less than allocated, that does not work for RAIDZ. Space for LWBs in open state (still being filled) is allocated same as before. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17613	2025-08-14 08:50:17 -07:00
Rob Norris	28433c4547	simd_stat: expose availability of VAES and VPCLMULQDQ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Joel Low <joel@joelsplace.sg> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17058	2025-08-13 14:53:24 -07:00
Joel Low	bb9225ea86	Backport AVX2 AES-GCM implementation from BoringSSL This uses the AVX2 versions of the AESENC and PCLMULQDQ instructions; on Zen 3 this provides an up to 80% performance improvement. Original source: `d5440dd2c2/gen/bcm/aes-gcm-avx2-x86_64-linux.S` See the original BoringSSL commit at `3b6e1be439`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Joel Low <joel@joelsplace.sg> Closes #17058	2025-08-13 14:51:20 -07:00
Alexander Motin	885d929cf8	Fix missed assertion update in physical rewrite patch Physical rewrite patch changed the meaning of BP_GET_BIRTH(), but I missed updating one of its occurrences, ending up asserting equal logical birth times instead of equal physical birth times. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Fixes #17565 Closes #17631	2025-08-13 15:56:25 -04:00
Jitendra Patidar	077269bfed	Fix Assert in dbuf_undirty, which triggers during usage zap shrink Usage zap's (DMU_USED_OBJECT) are updated in syncing context via do_userquota_cacheflush(). zap shrink triggers, ASSERT(db->db_objset == dmu_objset_pool(db->db_objset)->dp_meta_objset || txg != spa_syncing_txg(dmu_objset_spa(db->db_objset))); DMU_USED_OBJECT are special object (DMU_OBJECT_IS_SPECIAL), gets updated in syncing context only. So, relax assert for it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #17602	2025-08-12 14:19:05 -07:00
Brian Behlendorf	152e34822b	Silence zstd large allocation warning Allow zstd_mempool_init() to allocate using vmem_alloc() instead of kmem_alloc() to silence the large allocation warning on Linux during module load when the system has a large number of CPUs. It's not at all clear to me that scaling the allocation size with the number of CPUs is beneficial and that should be evaluated. But for the moment this should resolve the warning without introducing any unexpected side effects. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17620 Closes #11557	2025-08-12 13:38:08 -07:00
Brian Behlendorf	1ccae433e9	Allow vmem_alloc backed multilists Systems with a large number of CPU cores (192+) may trigger the large allocation warning in multilist_create() on Linux. Silence the warning by converting the allocation to vmem_alloc(). On Linux this results in a call to kvmalloc() which will alloc vmem for large allocations and kmem for small allocations. On FreeBSD both vmem_alloc and kmem_alloc internally use the same allocator so there is no functional change. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17616	2025-08-12 13:36:03 -07:00
Rob Norris	531568f438	zil_suspend: fix cookie leak if ZIL crashes during wait Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622	2025-08-12 13:24:32 -07:00
Rob Norris	7c9adc6858	zil_process_commit_list: fail better if the pool suspends in stall Make sure we properly inform the nolwb waiters of the error, and don't keep trying. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622	2025-08-12 13:24:27 -07:00
Rob Norris	f562e0f691	ZIL: single zil_commit_waiter_done() function to complete a waiter Just making it easier to not get the locking and broadcast wrong. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622	2025-08-12 13:24:22 -07:00
Rob Norris	92da3e18c8	ZIL: flag crashed LWBs so we know not to process them If the ZIL crashed, any outstanding LWBs are no longer interesting, so if they return, we need to just clean them up and return, not try to do any work on them. This is true even if they return success, as that may be long after the pool suspended and resumed, depending on when/if the kernel decides to return the IO to us. In particular, we must not try to get the "next" LWB from zl_lwb_list, since they're no longer on that list. So, we put a flag on in-flight LWBs in zil_crash() when we move them from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when they return. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622	2025-08-12 13:24:16 -07:00
Rob Norris	508c546975	ZIL: use a bitfield for LWB "slog" and "slim" state flags I'm soon about to need another LWB flag, and boolean_t is just so big for only storing a single bit. Changing to a bitfield is far less wasteful. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17622	2025-08-12 13:23:59 -07:00
Rob Norris	2fd145b578	zvol: cleanup error handling and passthrough This is trying to get all the uses and non-uses of SET_ERROR correct (being: only call it if we're the originator of an error _within ZFS_), and correctly negating errors going to/from the kernel. And/or both. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17605	2025-08-08 17:04:01 -07:00
Rob Norris	90a1e13df2	Linux: zfs_sync: remove explicit suspend check Since zil_commit_flags(NOW) will always return error if the pool is suspended, there's no need for a separate suspend check here. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:50 -07:00
Rob Norris	ef4058fcdc	FreeBSD: zfs_putpage: handle page writeback errors Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or to zfs_putpage() itself for a syncing (VM_PAGER_PUT_SYNC) writeback. The only thing we need to do when a page writeback fails is to skip marking the page clean ("undirty"), since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an async writeback on the vnode cache first, so any dirty data has been recorded in the ZIL, ready for the followup zfs_sync()->zil_commit() to find. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:44 -07:00
Rob Norris	3d6ee9a68c	Linux: zfs_putpage: handle page writeback errors Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or to zfs_putpage() itself for a syncing (WB_SYNC_ALL) writeback (kernel housekeeping or sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER)). The only thing we need to do when a page writeback fails is to re-mark the page dirty, since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an _async_ (WB_SYNC_NONE) writeback on the file mapping at the start of the sync op (fsync(), msync(), etc). An async op will get an async itx created and logged, ready for the followup zfs_fsync()->zil_commit() to find, while avoiding a zil_commit() call for every page in the range. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:38 -07:00
Rob Norris	391e85f519	ZIL: add zil_commit_flags() to make honouring failmode= optional The vast majority of calls to zil_commit() follow VFS ops, and should honour the failmode= setting - either wait for sync, or return error. Some calls however are part of a larger syncing op, and shouldn't ever block if something goes wrong. To allow this, we introduce zil_commit_flags(), with a flag ZIL_COMMIT_FAILMODE to indicate whether or not the pool failmode should be honoured. zil_commit() is now a wrapper that always sets this flag, but any caller wanting a different behaviour can request ZIL_COMMIT_NOW instead to have the call return failure if the pool suspends, regardless of the failmode= setting. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:33 -07:00
Rob Norris	72602f6ad9	ZIL: "crash" the ZIL if the pool suspends during fallback If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks on suspend. We want it to not block on suspend, instead returning an error. On the surface, this is simple: change all calls to txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error return back to the zil_commit() caller. Handling suspension means returning an error to all commit waiters. This is relatively straightforward, as zil_commit_waiter_t already has zcw_zio_error to hold the write IO error, which signals a fallback to txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the waiter can now return an error from zil_commit(). However, commit waiters are normally signalled when their associated write (LWB) completes. If the pool has suspended, those IOs may not return for some time, or maybe not at all. We still want to signal those waiters so they can return from zil_commit(). We have a list of those in-flight LWBs on zl_lwb_list, so we can run through those, detach them and signal them. The LWB itself is still in-flight, but no longer has attached waiters, so when it returns there will be nothing to do. (As an aside, ITXs can also supply completion callbacks, which are called when they are destroyed. These are directly connected to LWBs though, so are passed the error code and destroyed there too). At this point, all ZIL waiters have been ejected, so we only have to consider the internal state. We potentially still have ITXs that have not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL is in an unknown state; some writes may have been written but not returned to us. We really can't rely on any of it; the best thing to do is abandon it entirely and start over when the pool returns to service. But, since we may have IO out that won't return until the pool resumes, we need something for it to return to. 
The simplest solution I could find, implemented here, is to "crash" the ZIL: accept no new ITXs, make no further updates, and let it empty out on its normal schedule, that is, as txgs complete and zil_sync() and zil_clean() are called. We set a "restart txg" to three txgs in the future (syncing + TXG_CONCURRENT_STATES), at which point all the internal state will have been cleared out, and the ZIL can resume operation (handled at the top of zil_clean()). This commit adds zil_crash(), which handles all of the above: - sets the restart txg - capture and signal all waiters - zero the header zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND) returns because the pool suspended (ESHUTDOWN). The rest of the commit is just threading the errors through, and related housekeeping. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:26 -07:00


5145 Commits