mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-11 04:46:18 +03:00

Author	SHA1	Message	Date
Ameer Hamza	b8610c3d93	L2ARC: Reorder header destruction for in-flight L2 writes With multiple L2ARC devices, headers can be destroyed asynchronously (e.g., during zpool sync) while L2_WRITING is set. The original code destroyed L2HDR before L1HDR, causing ABDs to lose their device association (b_l2hdr.b_dev) when arc_hdr_free_abd() is called. This caused ABDs to be added to the global free-on-write list without device information. When any L2ARC device completed its write and attempted to free these orphaned ABDs, it would panic on ASSERT(!list_link_active(&abd->abd_gang_link)) because the ABD was still part of another device's vdev_queue I/O aggregation gang. Fix by extending l2ad_mtx lock scope to cover L1HDR destruction and reordering to destroy L1HDR before L2HDR when L2_WRITING is set. This ensures arc_hdr_free_abd() can access b_l2hdr.b_dev to properly tag ABDs with their device for deferred cleanup. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:51 -08:00
Ameer Hamza	2f41b9d865	L2ARC: Implement persistent markers with consistent tail scanning This commit introduces per-sublist persistent markers that eliminate redundant tail scanning between L2ARC iterations, providing significant CPU efficiency improvements. Markers are pre-allocated during device initialization and properly cleaned up during device removal. The implementation uses conditional behavior based on device capacity: small devices (capacity < arc_c) retain original HEAD/TAIL scanning based on ARC warmup state, while large devices (capacity >= arc_c) use the persistent marker approach for optimal CPU efficiency. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:06:47 -08:00
Ameer Hamza	3523b5f3f9	L2ARC: Implement even-depth multi-sublist scanning The introduction of ARC multilists made L2ARC writing quite random, depending on whether it found something to write in a randomly selected sublist. This created inconsistent write patterns and poor utilization of available sublists leading to uneven cache population. This commit replaces random selection with systematic scanning across all sublists within each burst. Fair headroom distribution ensures even-depth traversal across all sublists until the target write size is reached. Round-robin processing with random starting points eliminates sequential bias while maintaining predictable write behavior. The systematic approach provides consistent L2ARC filling patterns and better utilization of available ARC data across all sublists. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #18093	2026-02-04 10:05:53 -08:00
Alexander Motin	d393166c54	ARC: Increase parallel eviction batching Before parallel eviction implementation zfs_arc_evict_batch_limit caused loop exits after evicting 10 headers. The cost of it is not big and well motivated. Now though taskq task exit after the same 10 headers is much more expensive. To cover the context switch overhead of taskq introduce another level of batching, controlled by zfs_arc_evict_batches_limit tunable, used only for parallel eviction. My tests including 36 parallel reads with 4KB recordsize that shown 1.4GB/s (~460K blocks/s) before with heavy arc_evict_lock contention, now show 6.5GB/s (~1.6M blocks/s) without arc_evict_lock contention. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17970	2025-12-10 13:03:01 -08:00
Alexander Motin	a62c62120e	ARC: Pre-convert zfs_arc_min_prefetch_ms There is no need to do MSEC_TO_TICK() for each evicted ARC header. We can do it when tunables are set, since we already have separate internal variables for those. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17965	2025-12-09 12:07:10 -08:00
Rob Norris	6e12f0bd77	spa_misc: add an API for spa_namespace_lock This is useful as debugging support, as it lets namespace lock operations be traced directly. It will also be useful for future work to reduce the use of spa_namespace_lock, traditionally a source of difficult deadlocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17906	2025-11-10 14:23:39 -08:00
Shreshth3	a5af3f2db7	arc: fix small typos Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Shreshth Srivastava <shreshthsrivastava2@gmail.com> Closes #17840	2025-10-13 11:23:55 -07:00
Mateusz Guzik	346ecac61b	Annotate arc_buf_is_shared as __maybe_unused Otherwise the compiler warns about it on production FreeBSD builds. The routine proved resilient to attempts to ifdef on debug. Sponsored by: Rubicon Communications, LLC ("Netgate") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #17818	2025-10-06 16:43:20 -07:00
Rob Norris	ced72fdd69	tunables: remove legacy FreeBSD aliases These are old pre-OpenZFS tunable names that have long been available via either conventional ZFS_MODULE_PARAM tunables or through kstats. There's no point doubling up anymore, so delete them. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17375	2025-09-08 10:03:01 -07:00
Rob Norris	82d6f7b047	Prefer VERIFY0P(n) over VERIFY3P(n, ==, NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:42 -07:00
Rob Norris	f7bdd84328	Prefer VERIFY0P(n) over VERIFY(n == NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:37 -07:00
Rob Norris	5c7df3bcac	Prefer VERIFY0(n) over VERIFY3U(n, ==, 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:25 -07:00
Rob Norris	c39e076f23	Prefer VERIFY0(n) over VERIFY(n == 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:40:59 -07:00
Alexander Motin	4ae8bf406b	Allow physical rewrite without logical During regular block writes ZFS sets both logical and physical birth times equal to the current TXG. During dedup and block cloning logical birth time is still set to the current TXG, but physical may be copied from the original block that was used. This represents the fact that logically user data has changed, but physically it is the same old block. But block rewrite introduces a new situation, when block is not changed logically, but stored in a different place of the pool. From ARC, scrub and some other perspectives this is a new block, but for example for user applications or incremental replication it is not. A somewhat similar thing happens during the remap phase of device removal, but in that case space blocks are still accounted as allocated at their logical birth times. This patch introduces a new "rewrite" flag in the block pointer structure, allowing to differentiate physical rewrite (when the block is actually reallocated at the physical birth time) from the device removal case (when the logical birth time is used). The new functionality is not used at this point, and the only expected change is that error log is now kept in terms of physical birth times, rather than logical, since if a block with logged error was somehow rewritten, then the previous error does not matter any more. This change also introduces a new TRAVERSE_LOGICAL flag to the traverse code, allowing zfs send, redact and diff to work in context of logical birth times, ignoring physical-only rewrites. It also changes nothing at this point due to lack of those writes, but they will come in a following patch. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:07 -07:00
shodanshok	a7a144e655	enforce arc_dnode_limit Linux kernel shrinker in the context of null/root memcg does not scan dentry and inode caches added by a task running in non-root memcg. For ZFS this means that dnode cache routinely overflows, evicting valuable meta/data and putting additional memory pressure on the system. This patch restores zfs_prune_aliases as fallback when the kernel shrinker does nothing, enabling zfs to actually free dnodes. Moreover, it (indirectly) calls arc_evict when dnode_size > dnode_limit. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #17487 Closes #17542	2025-07-21 10:32:01 -07:00
Rob Norris	b0e053a10d	tunables: ensure tunable and variable have same define gate If a variable is only available in the kernel, then the tunable should also only be available there. This matters very little so long as we don't have userspace tunables, but it's still good hygiene. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Allan Jude	b6916f995e	ARC: parallel eviction On systems with enormous amounts of memory, the single arc_evict thread can become a bottleneck if reads and writes are stuck behind it, waiting for old data to be evicted before new data can take its place. This commit adds support for evicting from multiple ARC lists in parallel, by farming the evict work out to some number of threads and then accumulating their results. A new tuneable, zfs_arc_evict_threads, sets the number of threads. By default, it will scale based on the number of CPUs. Sponsored-by: Expensify, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Youzhong Yang <youzhong@gmail.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Closes #16486	2025-05-14 10:38:32 -04:00
Alexander Motin	89a8a91582	ARC: Notify dbuf cache about target size reduction ARC target size might drop significantly under memory pressure, especially if current ARC size was much smaller than the target. Since dbuf cache size is a fraction of the target ARC size, it might need eviction too. Aside of memory from the dbuf eviction itself, it might help ARC by making more buffers evictable. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17314	2025-05-14 10:34:14 -04:00
Alexander Motin	734eba251d	Wire O_DIRECT also to Uncached I/O (#17218 ) Before Direct I/O was implemented, I've implemented lighter version I called Uncached I/O. It uses normal DMU/ARC data path with some optimizations, but evicts data from caches as soon as possible and reasonable. Originally I wired it only to a primarycache property, but now completing the integration all the way up to the VFS. While Direct I/O has the lowest possible memory bandwidth usage, it also has a significant number of limitations. It require I/Os to be page aligned, does not allow speculative prefetch, etc. The Uncached I/O does not have those limitations, but instead require additional memory copy, though still one less than regular cached I/O. As such it should fill the gap in between. Considering this I've disabled annoying EINVAL errors on misaligned requests, adding a tunable for those who wants to test their applications. To pass the information between the layers I had to change a number of APIs. But as side effect upper layers can now control not only the caching, but also speculative prefetch. I haven't wired it to VFS yet, since it require looking on some OS specifics. But while there I've implemented speculative prefetch of indirect blocks for Direct I/O, controllable via all the same mechanisms. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Fixes #17027 Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-05-13 14:26:55 -07:00
Alexander Motin	b1ccab1721	ARC: Avoid overflows in arc_evict_adj() (#17255 ) With certain combinations of target ARC states balance and ghost hit rates it was possible to get the fractions outside of allowed range. This patch limits maximum balance adjustment speed, which should make it impossible, and also asserts it. Fixes #17210 Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-05-06 09:31:38 -07:00
Paul Dagnelie	9250403ba6	Make ganging redundancy respect redundant_metadata property (#17073 ) The redundant_metadata setting in ZFS allows users to trade resilience for performance and space savings. This applies to all data and metadata blocks in zfs, with one exception: gang blocks. Gang blocks currently just take the copies property of the IO being ganged and, if it's 1, sets it to 2. This means that we always make at least two copies of a gang header, which is good for resilience. However, if the users care more about performance than resilience, their gang blocks will be even more of a penalty than usual. We add logic to calculate the number of gang headers copies directly, and store it as a separate IO property. This is stored in the IO properties and not calculated when we decide to gang because by that point we may not have easy access to the relevant information about what kind of block is being stored. We also check the redundant_metadata property when doing so, and use that to decide whether to store an extra copy of the gang headers, compared to the underlying blocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-03-19 15:58:29 -07:00
Alexander Motin	3cd9934a48	Some arc_release() cleanup - Don't drop L2ARC header if we have more buffers in this header. Since we leave them the header, leave them the L2ARC header also. Honestly we are not required to drop it even if there are no other buffers, but then we'd need to allocate it a separate header, which we might drop soon if the old block is really deleted. Multiple buffers in a header likely mean active snapshots or dedup, so we know that the block in L2ARC will remain valid. It might be rare, but why not? - Remove some impossible assertions and conditions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17126	2025-03-18 21:25:50 -04:00
Rob Norris	eb9098ed47	SPDX: license tags: CDDL-1.0 Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:27 -07:00
Ameer Hamza	ab3db6d15d	arc: avoid possible deadlock in arc_read In l2arc_evict(), the config lock may be acquired in reverse order (e.g., first the config lock (writer), then a hash lock) unlike in arc_read() during scenarios like L2ARC device removal. To avoid deadlocks, if the attempt to acquire the config lock (reader) fails in arc_read(), release the hash lock, wait for the config lock, and retry from the beginning. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17071	2025-02-25 14:32:12 -05:00
Chunwei Chen	682c5f6a0a	Fix wrong free function in arc_hdr_decrypt Need to use arc_free_data_abd to free abd type buffer. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #17079	2025-02-22 13:50:33 -05:00
Jaydeep Kshirsagar	21205f6488	Avoid ARC buffer transform operations in prefetch This change will prevent prefetch from performing an unnecessary ARC buffer fill when reading from disk. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jaydeep Kshirsagar <jkshirsagar@maxlinear.com> Co-authored-by: Alexander Motin <mav@FreeBSD.org> Closes #17013	2025-02-01 11:15:24 -05:00
Don Brady	44446dccdb	During pool export flush the ARC asynchronously This also includes removing L2 vdevs asynchronously. This commit also guarantees that spa_load_guid is unique. The zpool reguid feature introduced the spa_load_guid, which is a transient value used for runtime identification purposes in the ARC. This value is not the same as the spa's persistent pool guid. However, the value is seeded from spa_generate_load_guid() which does not check for uniqueness against the spa_load_guid from other pools. Although extremely rare, you can end up with two different pools sharing the same spa_load_guid value! So we guarantee that the value is always unique and additionally not still in use by an async arc flush task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16215	2024-12-05 08:58:20 -08:00
Alexander Motin	0ca82c5680	L2ARC: Stop rebuild before setting spa_final_txg Without doing that there is a race window on export when history log write by completed rebuild dirties transaction beyond final, triggering assertion. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16714 Closes #16782	2024-11-20 06:11:51 -08:00
Alexander Motin	534688948c	Remove hash_elements_max accounting from DBUF and ARC Those values require global atomics to get current hash_elements values in few of the hottest code paths, while in all the years I never cared about it. If somebody wants, it should be easy to get it by periodic sampling, since neither ARC header nor DBUF counts change so fast that it would be difficult to catch. For now I've left hash_elements_max kstat for ARC, since it was used/reported by arc_summary and it would break older versions, but now it just reports the current value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16759	2024-11-19 07:00:16 -08:00
Alexander Motin	a60ed3822b	L2ARC: Move different stats updates earlier ..., before we make the header or the log block visible to others. It should fix assertion on allocated space going negative if the header is freed once the lock is dropped, while the write is still going. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16040 Closes #16743	2024-11-13 07:31:50 -08:00
Alexander Motin	4ebe674d91	ARC: Cache arc_c value during arc_evict() Since arc_evict() run can take some time, arc_c change during it may result in undesired shift in ARC states balance. Primarily in case of arc_c reduction it may cause eviction from MFU data state despite its being below the target already. Instead we should evict as originally planned and if needed do another round after. Reviewed-by: Theera K. <tkittich@hotmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16576 Closes #16605	2024-10-04 10:56:43 -07:00
Theera K.	d40d40913d	Evicting too many bytes from MFU metadata Without updating 'm' we evict from MFU metadata all that we wanted to evict from all metadata, including already evicted MRU metadata ('m' is the total amount of metadata we had at the beginning, and 'w' is the total amount of metadata we want to have). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Theera K. <tkittich@hotmail.com> Closes #16521 Closes #16546	2024-09-23 22:12:56 -07:00
George Melikov	01852ffbf8	arc_hdr_authenticate: make explicit error On compression we could be more explicit here for cases where we can not recompress the data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #9416	2024-09-19 17:25:02 -07:00
George Melikov	b32d48a625	ZLE compression: don't use BPE_PAYLOAD_SIZE ZLE compressor needs additional bytes to process d_len argument efficiently. Don't use BPE_PAYLOAD_SIZE as d_len with it before we rework zle compressor somehow. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #9416	2024-09-19 17:24:51 -07:00
George Melikov	522f2629c8	zio_compress: introduce max size threshold Now default compression is lz4, which can stop compression process by itself on incompressible data. If there are additional size checks - we will only make our compressratio worse. New usable compression thresholds are: - less than BPE_PAYLOAD_SIZE (embedded_data feature); - at least one saved sector. Old 12.5% threshold is left to minimize the effect on existing user expectations of CPU utilization. If data wasn't compressed - it will be saved as ZIO_COMPRESS_OFF, so if we really need to recompress data without ashift info and check anything - we can just compress it with zero threshold. So, we don't need a new feature flag here! Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: George Melikov <mail@gmelikov.ru> Closes #9416	2024-09-19 17:23:58 -07:00
Brian Atkinson	a10e552b99	Adding Direct IO Support Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other FS's, O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write requests the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that the request is block aligned and a cached copy of the buffer exists in the ARC, then it will be discarded from the ARC forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always regardless if the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. 
FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls whether O_DIRECT writes to a top-level VDEV are checksum-verified before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issued as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup are not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. Setting this module parameter to 0 behaves as if the direct dataset property were set to disabled. 
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Matt Macy <mmacy@FreeBSD.org> Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov> Closes #10018	2024-09-14 13:47:59 -07:00
Rob Norris	f62e6e1f98	compress: change zio_compress API to use ABDs This commit changes the frontend zio_compress_data and zio_decompress_data APIs to take ABD pointers instead of buffer pointers. All callers are updated to match. Any that already have an appropriate ABD nearby now use it directly, while at the rest we create one. Internally, the ABDs are passed through to the provider directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
shodanshok	77a797a382	Enable L2 cache of all (MRU+MFU) metadata but MFU data only `l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU data and metadata. However it can be useful to cache as much metadata as possible while, at the same time, restricting data cache to MFU buffers only. This patch allow for such behavior by setting `l2arc_mfuonly` to 2 (or higher). The list of possible values is the following: 0: cache both MRU and MFU for both data and metadata; 1: cache only MFU for both data and metadata; 2: cache both MRU and MFU for metadata, but only MFU for data. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #16343 Closes #16402	2024-08-16 13:34:07 -07:00
Alexander Motin	5b9f3b7664	Soften pruning threshold on not evictable metadata Previous code pruned 10% of dnodes once 3/4 of metadata appeared unevictable. On workloads with many millions of dnodes and little other metadata it creates significant load spikes for many seconds straight. This change instead gradually increases pruning as unevictable metadata grow above the 3/4, which may allow it to stabilize at some level. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16401	2024-08-08 15:26:35 -07:00
Alexander Motin	aef452f108	Improve zfs_blkptr_verify() - Skip config lock enter/exit for embedded blocks. They have no DVAs, so there is nothing to check under the lock. - Skip CHECKSUM check and properly check PSIZE for embedded blocks. - Add static branch predictions for unlikely conditions. - Do not verify DVAs for blocks already in ARC. ARC hit already "verified" the first (often the only) DVA, and it is not worth entering/exiting the config lock for nothing. Some profiles show me up to 3% of CPU saving from this change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16387	2024-08-08 15:25:10 -07:00
Alexander Motin	d4b5517ef9	Linux: Report reclaimable memory to kernel as such (#16385 ) Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to mark memory allocations that can be freed via shrinker calls. It should allow kernel to tune and group such allocations for lower memory fragmentation and better reclamation under pressure. This patch marks as reclaimable most of ARC memory, directly evictable via ZFS shrinker, plus also dnode/znode/sa memory, indirectly evictable via kernel's superblock shrinker. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com>	2024-07-30 11:40:47 -07:00
Allan Jude	62e7d3c89e	ddt: add support for prefetching tables into the ARC This change adds a new `zpool prefetch -t ddt $pool` command which causes a pool's DDT to be loaded into the ARC. The primary goal is to remove the need to "warm" a pool's cache before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT if they have been flushed due to inactivity. Sponsored-by: iXsystems, Inc. Sponsored-by: Catalogics, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Will Andrews <will.andrews@klarasystems.com> Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Will Andrews <will.andrews@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Closes #15890	2024-07-26 09:16:18 -07:00
Alexander Motin	55427add3c	Several improvements to ARC shrinking (#16197) - When receiving memory pressure signal from OS be more strict trying to free some memory. Otherwise kernel may come again and request much more. Return as result how much arc_c was actually reduced due to this request, that may be less than requested. - On Linux when receiving direct reclaim from some file system (that may be ZFS) instead of ignoring request completely, just shrink the ARC, but do not wait for eviction. Waiting there may cause deadlock. Ignoring it as before may put extra pressure on other caches and/or swap, and cause OOM if nothing help. While not waiting may result in more ARC evicted later, and may be too late if OOM killer activate right now, but I hope it to be better than doing nothing at all. - On Linux set arc_no_grow before waiting for reclaim, not after, or it may grow back while we are waiting. - On Linux add new parameter zfs_arc_shrinker_seeks to balance ARC eviction cost, relative to page cache and other subsystems. - Slightly update Linux arc_set_sys_free() math for new kernels. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-25 10:31:14 -07:00
George Amanakis	8865dfbcaa	Fix assertion in Persistent L2ARC At the end of l2arc_evict() fix an assertion in the case that l2ad_hand + distance == l2ad_end. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #16202 Closes #16207	2024-05-24 19:02:58 -07:00
Don Brady	975a13259b	Add support for parallel pool exports Changed spa_export_common() such that it no longer holds the spa_namespace_lock for the entire duration and instead sets spa_export_thread to indicate an import is in progress on the spa. This allows for an export to a different pool to proceed in parallel while an export is still processing potentially long operations like spa_unload_log_sm_flush_all(). Calls like spa_lookup() and spa_vdev_enter() that rely on the spa_namespace_lock to serialize them against a concurrent export, now wait for any in-progress export thread to complete before proceeding. The 'zpool import -a' sub-command also provides multi-threaded support, using a thread pool to submit the exports in parallel. Sponsored-By: Klara Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16153	2024-05-14 08:57:41 -07:00
Alexander Motin	1f940de072	L2ARC: Cleanup buffer re-compression When compressed ARC is disabled, we may have to re-compress when writing into L2ARC. If doing so we can't fit it into the original physical size, we should just fail immediately, since even if it may still fit into allocation size, its checksum will never match. While there, refactor the code similar to other compression places without using abd_return_buf_copy(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16038	2024-04-23 09:06:00 -07:00
Rich Ercolani	e5e2a5a3b8	Add custom debug printing for your asserts Being able to print custom debug information on assert trip seems useful. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #15792	2024-04-10 13:30:25 -07:00
Alexander Motin	997f85b4d3	L2ARC: Relax locking during write Previous code held ARC state sublist lock throughout all L2ARC write process, which included number of allocations and even ZIO issues. Being blocked in any of those places the code could also block ARC eviction, that could cause OOM activation or even deadlock if system is low on memory or one is too fragmented. Fix it by dropping the lock as soon as we see a block eligible for L2ARC writing and pick it up later using earlier inserted marker. While there, also reduce scope of hash lock, moving ZIO allocation and other operations not requiring header access out of it. All operations requiring header access move under hash lock, since L2_WRITING flag does not prevent header eviction only transition to arc_l2c_only state with L1 header. To be able to manipulate sublist lock and marker as needed add few more multilist functions and modify one. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16040	2024-04-09 16:23:19 -07:00
George Wilson	493fcce9be	Provide macros for setting and getting blkptr birth times There exist a couple of macros that are used to update the blkptr birth times but they can often be confusing. For example, the BP_PHYSICAL_BIRTH() macro will provide either the physical birth time if it is set or else return back the logical birth time. The complement to this macro is BP_SET_BIRTH() which will set the logical birth time and set the physical birth time if they are not the same. Consumers may get confused when they are trying to get the physical birth time and use the BP_PHYSICAL_BIRTH() macro only to find out that the logical birth time is what is actually returned. This change cleans up these macros and makes them symmetrical. The same functionality is preserved but the name is changed. Instead of calling BP_PHYSICAL_BIRTH(), consumer can now call BP_GET_BIRTH(). In addition to cleaning up this naming conventions, two new sets of macros are introduced -- BP_[SET|GET]_LOGICAL_BIRTH() and BP_[SET|GET]_PHYSICAL_BIRTH. These new macros allow the consumer to get and set the specific birth time. As part of the cleanup, the unused GRID macros have been removed and that portion of the blkptr are currently unused. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #15962	2024-03-25 15:01:54 -07:00
Rob N	4836d293c0	zfs_refcount_remove: explicitly ignore returns Coverity noticed that sometimes we ignore the return, and sometimes we don't. It's not wrong, and I like consistent style, so here we are. Reported-by: Coverity (CID-1564584) Reported-by: Coverity (CID-1564585) Reported-by: Coverity (CID-1564586) Reported-by: Coverity (CID-1564587) Reported-by: Coverity (CID-1564588) Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15647	2023-12-07 08:21:38 -08:00


473 Commits