mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Tony Hutter	786641dcf9	Workaround UBSAN errors for variable arrays This gets around UBSAN errors when using arrays at the end of structs. It converts some zero-length arrays to variable length arrays and disables UBSAN checking on certain modules. It is based off of the patch from #15460. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Co-authored-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #15145 Closes #15510	2023-11-12 16:26:07 -08:00
Alexander Motin	3a8d9b8487	Linux: Reclaim unused spl_kmem_cache_reclaim It is unused for 3 years since #10576. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15507	2023-11-10 10:34:46 -08:00
shodanshok	887a3c533b	Increase L2ARC write rate and headroom Current L2ARC write rate and headroom parameters are very conservative: l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at 8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs at 2x these rates). These values were selected 15+ years ago based on then-current SSDs size, performance and endurance. Today we have multi-TB, fast and cheap SSDs which can sustain much higher read/write rates. For this reason, this patch increases l2arc_write_max to 32M and l2arc_headroom to 8 (4x increase for both). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #15457	2023-11-08 16:30:47 -08:00
Low-power	a160c153e2	Linux: reject read/write mapping to immutable file only on VM_SHARED Private read/write mapping can't be used to modify the mapped files, so they will remain be immutable. Private read/write mappings are usually used to load the data segment of executable files, rejecting them will rendering immutable executable files to stop working. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #15344	2023-11-08 12:19:38 -08:00
Don Brady	5caeef02fa	RAID-Z expansion feature This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks). == Initiating expansion == A new device (disk) can be attached to an existing RAIDZ vdev, by running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes. The `feature@raidz_expansion` on-disk feature flag must be `enabled` to initiate an expansion, and it remains `active` for the life of the pool. In other words, pools with expanded RAIDZ vdevs can not be imported by older releases of the ZFS software. == During expansion == The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device). The expansion progress can be monitored with `zpool status`. Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete). The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off. == After expansion == When the expansion completes, the additional space is available for use, and is reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc). Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion). A RAIDZ vdev can be expanded multiple times. After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to `zfs list`, `df`, `ls -s`, and similar tools. Sponsored-by: The FreeBSD Foundation Sponsored-by: iXsystems, Inc. Sponsored-by: vStack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Authored-by: Matthew Ahrens <mahrens@delphix.com> Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com> Contributions-by: Stuart Maybee <stuart.maybee@comcast.net> Contributions-by: Thorsten Behrens <tbehrens@outlook.com> Contributions-by: Fmstrat <nospam@nowsci.com> Contributions-by: Don Brady <dev.fs.zfs@gmail.com> Signed-off-by: Don Brady <dev.fs.zfs@gmail.com> Closes #15022	2023-11-08 10:19:41 -08:00
Umer Saleem	9198de8f10	Linux 6.6 compat: fix implicit conversion error with debug build With Linux v6.6.0 and GCC 12, when debug build is configured, implicit conversion error is raised while converting 'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of 'true' to fix the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #15489	2023-11-07 13:24:16 -08:00
Gordon Tetlow	dc45a00eac	Add kern.features.zfs Add a ZFS feature flag to indicate OpenZFS availability. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gordon Tetlow <gordon@freebsd.org> Closes #15484	2023-11-07 13:21:56 -08:00
Jason King	3d86999c75	sa_lookup() ignores buffer size. When retrieving a system attribute, the size of the supplied buffer is ignored. If the buffer is too small to hold the attribute, sa_attr_op() will write past the end of the buffer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason King <jking@racktopsystems.com> Closes #15476	2023-11-07 12:11:48 -08:00
Alexander Motin	020f6fd093	FreeBSD: Implement taskq_init_ent() Previously taskq_init_ent() was an empty macro, while actual init was done by taskq_dispatch_ent(). It could be slightly faster in case taskq never enqueued. But without it taskq_empty_ent() relied on the structure being zeroed by somebody else, that is not good. As a side effect this allows the same task to be queued several times, that is normal on FreeBSD, that may or may not get useful here also one day. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15455	2023-11-07 11:37:18 -08:00
Alexander Motin	58398cbd03	FreeBSD: Optimize large kstat outputs - Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl output. - Use much faster sbuf_cat() instead of sbuf_printf("%s"). Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes to seconds, making dbufstat almost usable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15495	2023-11-07 11:35:40 -08:00
Alan Somers	e36ff84c33	Update the kstat dataset_name when renaming a zvol Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15482 Closes #15486	2023-11-07 11:34:50 -08:00
Mark Johnston	f4cd1bac72	Make abd_raidz_gen_iterate() pass an initialized pointer to the callback Otherwise callbacks may trigger KMSAN violations in the dlen == 0 case. For example, raidz_syn_pq_abd() will compare an uninitialized pointer with itself before returning. This seems harmless, but let's maintain good hygiene and avoid passing uninitialized variables, if only to placate KMSAN. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15491	2023-11-07 10:24:15 -08:00
MigeljanImeri	2a154b8484	Fix accounting error for pending sync IO ops in zpool iostat Currently vdev_queue_class_length is responsible for checking how long the queue length is, however, it doesn't check the length when a list is used, rather it just returns whether it is empty or not. To fix this I added a counter variable to vdev_queue_class to keep track of the sync IO ops, and changed vdev_queue_class_length to reference this variable instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: MigeljanImeri <ImeriMigel@gmail.com> Closes #15478	2023-11-07 09:06:14 -08:00
ednadolski-ix	3bd4df3841	Improve ZFS objset sync parallelism As part of transaction group commit, dsl_pool_sync() sequentially calls dsl_dataset_sync() for each dirty dataset, which subsequently calls dmu_objset_sync(). dmu_objset_sync() in turn uses up to 75% of CPU cores to run sync_dnodes_task() in taskq threads to sync the dirty dnodes (files). There are two problems: 1. Each ZVOL in a pool is a separate dataset/objset having a single dnode. This means the objsets are synchronized serially, which leads to a bottleneck of ~330K blocks written per second per pool. 2. In the case of multiple dirty dnodes/files on a dataset/objset on a big system they will be sync'd in parallel taskq threads. However, it is inefficient to to use 75% of CPU cores of a big system to do that, because of (a) bottlenecks on a single write issue taskq, and (b) allocation throttling. In addition, if not for the allocation throttling sorting write requests by bookmarks (logical address), writes for different files may reach space allocators interleaved, leading to unwanted fragmentation. The solution to both problems is to always sync no more and (if possible) no fewer dnodes at the same time than there are allocators the pool. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Closes #15197	2023-11-06 10:38:42 -08:00
Martin Matuška	763ca47fa8	Fix block cloning between unencrypted and encrypted datasets Block cloning from an encrypted dataset into an unencrypted dataset and vice versa is not possible. The current code did allow cloning unencrypted files into an encrypted dataset causing a panic when these were accessed. Block cloning between encrypted and encrypted is currently supported on the same filesystem only. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Rob N <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #15464 Closes #15465	2023-10-31 13:49:41 -07:00
Ameer Hamza	9ccdb8becd	zvol: fix delayed update to block device ro entry The change in the zvol readonly property does not update the block device readonly entry until the first IO to the ZVOL. This patch addresses the issue by updating the block device readonly property from the set property IOCTL call. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15409	2023-10-31 09:50:38 -07:00
Ameer Hamza	60387facd2	zvol: Implement zvol threading as a Property Currently, zvol threading can be switched through the zvol_request_sync module parameter system-wide. By making it a zvol property, zvol threading can be switched per zvol. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15409	2023-10-31 09:50:32 -07:00
Ameer Hamza	dbe839a9ca	zvol: Cleanup set property zvol_set_volmode() and zvol_set_snapdev() share a common code path. Merge this shared code path into zvol_set_common(). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15409	2023-10-31 09:49:32 -07:00
Alexander Motin	799e09f75a	Unify arc_prune_async() code There is no sense to have separate implementations for FreeBSD and Linux. Make Linux code shared as more functional and just register FreeBSD-specific prune callback with arc_add_prune_callback() API. Aside of code cleanup this should fix excessive pruning on FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15456	2023-10-30 16:56:04 -07:00
Alexander Motin	514d661ca1	Tune zio buffer caches and their alignments We should not always use PAGESIZE alignment for caches bigger than it and SPA_MINBLOCKSIZE otherwise. Doing that caches for 5, 6, 7, 10 and 14KB rounded up to 8, 12 and 16KB respectively make no sense. Instead specify as alignment the biggest power-of-2 divisor. This way 2KB and 6KB caches are both aligned to 2KB, while 4KB and 8KB are aligned to 4KB. Reduce number of caches to half-power of 2 instead of quarter-power of 2. This removes caches difficult for underlying allocators to fit into page-granular slabs, such as: 2.5, 3.5, 5, 7, 10KB, etc. Since these caches are mostly used for transient allocations like ZIOs and small DBUF cache it does not worth being too aggressive. Due to the above alignment issue some of those caches were not working properly any way. 6KB cache now finally has a chance to work right, placing 2 buffers into 3 pages, that makes sense. Remove explicit alignment in Linux user-space case. I don't think it should be needed any more with the above fixes. As result on FreeBSD instead of such numbers of pages per slab: vm.uma.zio_buf_comb_16384.keg.ppera: 4 vm.uma.zio_buf_comb_14336.keg.ppera: 4 vm.uma.zio_buf_comb_12288.keg.ppera: 3 vm.uma.zio_buf_comb_10240.keg.ppera: 3 vm.uma.zio_buf_comb_8192.keg.ppera: 2 vm.uma.zio_buf_comb_7168.keg.ppera: 2 vm.uma.zio_buf_comb_6144.keg.ppera: 2 <= Broken vm.uma.zio_buf_comb_5120.keg.ppera: 2 vm.uma.zio_buf_comb_4096.keg.ppera: 1 vm.uma.zio_buf_comb_3584.keg.ppera: 7 <= Hard to free vm.uma.zio_buf_comb_3072.keg.ppera: 3 vm.uma.zio_buf_comb_2560.keg.ppera: 2 vm.uma.zio_buf_comb_2048.keg.ppera: 1 vm.uma.zio_buf_comb_1536.keg.ppera: 2 vm.uma.zio_buf_comb_1024.keg.ppera: 1 vm.uma.zio_buf_comb_512.keg.ppera: 1 I am now getting such: vm.uma.zio_buf_comb_16384.keg.ppera: 4 vm.uma.zio_buf_comb_12288.keg.ppera: 3 vm.uma.zio_buf_comb_8192.keg.ppera: 2 vm.uma.zio_buf_comb_6144.keg.ppera: 3 <= Fixed, 2 in 3 pages vm.uma.zio_buf_comb_4096.keg.ppera: 1 vm.uma.zio_buf_comb_3072.keg.ppera: 3 vm.uma.zio_buf_comb_2048.keg.ppera: 1 vm.uma.zio_buf_comb_1536.keg.ppera: 2 vm.uma.zio_buf_comb_1024.keg.ppera: 1 vm.uma.zio_buf_comb_512.keg.ppera: 1 Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15452	2023-10-30 14:55:32 -07:00
Alexander Motin	05a7348a7e	RAIDZ: Use cache blocking during parity math RAIDZ parity is calculated by adding data one column at a time. It works OK for small blocks, but for large blocks results of previous addition may already be evicted from CPU caches to main memory, and in addition to extra memory write require extra read to get it back. This patch splits large parity operations into 64KB chunks, that should in most cases fit into CPU L2 caches from the last decade. I haven't touched more complicated cases of data reconstruction to not over complicate the code. Those should be relatively rare. My tests on Xeon Gold 6242R CPU with 1MB of L2 cache per core show up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/ RAIDZ2 blocks of ~4MB and up. Older CPUs with 256KB of L2 cache should see the effect even on smaller blocks. Wider vdevs may need bigger blocks to be affected. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15448	2023-10-30 14:54:27 -07:00
Alexander Motin	c3773de168	ZIL: Cleanup sync and commit handling ZVOL: - Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only one object, it makes no sense to maintain async queue and on each commit merge it into sync. Single sync queue is just cheaper, while it changes nothing until actual commit request arrives. - Remove zsd_sync_cnt and the zil_async_to_sync() calls since we are no longer switching between sync and async queues. ZFS: - Mark write transactions as sync based only on number of sync opens (z_sync_cnt). We can not randomly jump between sync and async unless we want data corruptions due to writes reordering. - When file first opened with O_SYNC (z_sync_cnt incremented to 1) call zil_async_to_sync() for it to preserve correct ordering between past and future writes. - Drop zfs_fsyncer_key logic. Looks like it was an optimization for workloads heavily intermixing async writes with tons of fsyncs. But first it was broken 8 years ago due to Linux tsd implementation not allowing data storage between syscalls, and second, I doubt it is safe to switch from async to sync so often and without calling zil_async_to_sync(). - Rename sync argument of *_log_write() into commit, now only signalling caller's intent to call zil_commit() soon after. It allows WR_COPIED optimizations without extra other meanings. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15366	2023-10-30 14:51:56 -07:00
shodanshok	043c6ee3b6	Read prefetched buffers from L2ARC Prefetched buffers are currently read from L2ARC if, and only if, l2arc_noprefetch is set to non-default value of 0. This means that a streaming read which can be served from L2ARC will instead engage the main pool. For example, consider what happens when a file is sequentially read: - application requests contiguous data, engaging the prefetcher; - ARC buffers are initially marked as prefetched but, as the calling application consumes data, the prefetch tag is cleared; - these "normal" buffers become eligible for L2ARC and are copied to it; - re-reading the same file will not engage L2ARC even if it contains the required buffers; - main pool has to suffer another sequential read load, which (due to most NCQ-enabled HDDs preferring sequential loads) can dramatically increase latency for uncached random reads. In other words, current behavior is to write data to L2ARC (wearing it) without using the very same cache when reading back the same data. This was probably useful many years ago to preserve L2ARC read bandwidth but, with current SSD speed/size/price, it is vastly sub-optimal. Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads, means that even prefetched but unused buffers will be copied into L2ARC, further increasing wear and load for potentially not-useful data. This patch enable prefetched buffer to be read from L2ARC even when l2arc_noprefetch=1 (default), increasing sequential read speed and reducing load on the main pool without polluting L2ARC with not-useful (ie: unused) prefetched data. Moreover, it clear users confusion about L2ARC size increasing but not serving any IO when doing sequential reads. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #15451	2023-10-26 09:40:21 -07:00
Thomas Bertschinger	97a0b5be50	Add mutex_enter_interruptible() for interruptible sleeping IOCTLs Many long-running ZFS ioctls lock the spa_namespace_lock, forcing concurrent ioctls to sleep for the mutex. Previously, the only option is to call mutex_enter() which sleeps uninterruptibly. This is a usability issue for sysadmins, for example, if the admin runs `zpool status` while a slow `zpool import` is ongoing, the admin's shell will be locked in uninterruptible sleep for a long time. This patch resolves this admin usability issue by introducing mutex_enter_interruptible() which sleeps interruptibly while waiting to acquire a lock. It is implemented for both Linux and FreeBSD. The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed to use this new macro so that the command can be interrupted if it is issued during a concurrent `zpool import` (or other long-running operation). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Thomas Bertschinger <bertschinger@lanl.gov> Closes #15360	2023-10-26 09:17:40 -07:00
ednadolski-ix	6a629f3234	arc_default_max on Linux should match FreeBSD Commits `518b487` and `23bdb07` changed the default ARC size limit on Linux systems to 1/2 of physical memory, which has become too strict for modern systems with large amounts of RAM. This patch changes the default limit to match that of FreeBSD, so ZFS may have a unified value on both platforms. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Closes #15437	2023-10-26 09:13:01 -07:00
Alexander Motin	3afdc97d91	ZIO: Remove READY pipeline stage from root ZIOs zio_root() has no arguments for ready callback or parent ZIO. Except one recent case in ZIL code if root ZIOs ever have a parent it is also a root ZIO. It means we do not need READY pipeline stage for them, which takes some time to process, but even more time to wait for the children and be woken by them, and both for no good reason. The most visible effect of this change is that it avoids one taskq wakeup per ZIL block written, previously used to run zio_ready() for lwb_root_zio and skipped now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15398	2023-10-25 15:22:25 -07:00
Tony Hutter	05c4710e89	Revert "zvol: Temporally disable blk-mq" This reverts commit `aefb6a2bd6`. `aefb6a2bd` temporally disabled blk-mq until we could fix a fix for Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15439	2023-10-24 14:41:25 -07:00
Tony Hutter	7c9b6fed16	zvol: Remove broken blk-mq optimization This fix removes a dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state". This optimization results in an unbootable system after installing an OS on a zvol with blk-mq enabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15351	2023-10-24 14:37:52 -07:00
Alexander Motin	252f46be7d	ZIL: Detect single-threaded workloads ... by checking that previous block is fully written and flushed. It allows to skip commit delays since we can give up on aggregation in that case. This removes zil_min_commit_timeout parameter, since for single-threaded workloads it is not needed at all, while on very fast devices even some multi-threaded workloads may get detected as single-threaded and still bypass the wait. To give multi-threaded workloads more aggregation chances increase zfs_commit_timeout_pct from 5 to 10%, as they should suffer less from additional latency. Also single-threaded workloads detection allows in perspective better prediction of the next block size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15381	2023-10-24 14:35:25 -07:00
Alexander Motin	e007908a16	ABD: Be more assertive in iterators Once we verified the ABDs and asserted the sizes we should never see premature ABDs ends. Assert that and remove extra branches from production builds. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15428	2023-10-24 14:33:58 -07:00
Brian Behlendorf	07345ac252	Add prefetch property ZFS prefetch is currently governed by the zfs_prefetch_disable tunable. However, this is a module-wide settings - if a specific dataset benefits from prefetch, while others have issue with it, an optimal solution does not exists. This commit introduce the "prefetch" tri-state property, which enable granular control (at dataset/volume level) for prefetching. This patch does not remove the zfs_prefetch_disable, which remains a system-wide switch for enable/disable prefetch. However, to avoid duplication, it would be preferable to deprecate and then remove the module tunable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Co-authored-by: Gionatan Danti <g.danti@assyoma.it> Closes #15237 Closes #15436	2023-10-24 11:00:07 -07:00
Brian Behlendorf	e9725abd83	Revert "Do not persist user/group/project quota zap objects when unneeded" This reverts commit `797f55ef12` which was causing a VERIFY failure when running the project quota tests. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15438	2023-10-23 09:55:36 -07:00
Rob N	b5e6091885	spa: document spa_thread() and SDC feature gates spa_thread() and the "System Duty Cycle" scheduling class are from Illumos and have not yet been adapted to Linux or FreeBSD. HAVE_SPA_THREAD has long been explicitly undefined and used to mark spa_thread(), but there's some related taskq code that can never be invoked without it, which makes some already-tricky code harder to read. HAVE_SYSDC is introduced in this commit to mark the SDC parts. SDC requires spa_thread(), but the inverse is not true, so they are separate. I don't want to make the call to just remove it because I still harbour hopes that OpenZFS could become a first-class citizen on Illumos someday. But hopefully this will at least make the reason it exists a bit clearer for people without long memories and/or an interest in history. For those that are interested in the history, the original FreeBSD port of ZFS (before ZFS-on-Linux was adopted there) did have a spa_thread(), but not SDC. The last version of that before it was removed can be read here: `22df1ffd81/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c` Meanwhile, more information on the SDC scheduling class is here: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/disp/sysdc.c Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #15406	2023-10-23 08:50:55 -07:00
Sam Atkinson	797f55ef12	Do not persist user/group/project quota zap objects when unneeded In the zfs_id_over*quota functions, there is a short-circuit to skip the zap_lookup when the quota zap does not exist. If quotas are never used in a zpool, then the quota zap will never exist. But if user/group/project quotas are ever used, the zap objects will be created and will persist even if the quotas are deleted. The quota zap_lookup in the write path can become a bottleneck for write-heavy small I/O workloads. Before this commit, it was not possible to remove this lookup without creating a new zpool. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sam Atkinson <samatk@amazon.com> Closes #14721	2023-10-20 14:22:04 -07:00
Alexander Motin	57b4098562	Trust ARC_BUF_SHARED() more In my understanding ARC_BUF_SHARED() and arc_buf_is_shared() should return identical results, except the second also asserts it deeper. The first is much cheaper though, saving few pointer dereferences. Replace production arc_buf_is_shared() calls with ARC_BUF_SHARED(), and call arc_buf_is_shared() in random assertions, while making it even more strict. On my tests this in half reduces arc_buf_destroy_impl() time, that noticeably reduces hash_lock congestion under heavy dbuf eviction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15397	2023-10-20 12:38:37 -07:00
Alexander Motin	4fbc524955	Remove lock from dsl_pool_need_dirty_delay() Torn reads/writes of dp_dirty_total are unlikely: on 64-bit systems due to register size, while on 32-bit due to memory constraints. And even if we hit some race, the code implementing the delay takes the lock any way. Removal of the poll-wide lock acquisition saves ~1% of CPU time on 8-thread 8KB write workload. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15390	2023-10-20 12:37:16 -07:00
Olivier Certner	b9384b9498	FreeBSD: taskq: Remove unused declaration Variable 'uma_align_cache' has not been used since commit "FreeBSD: Use a hash table for taskqid lookups" (`3933305ea`). Moreover, it is soon going to become private to FreeBSD's UMA in 15.0-CURRENT (main), 14.0-STABLE (stable/14) and 13.2-STABLE (stable/13). Should accessing this information become necessary again, one will have to use the new accessors for recent versions. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olivier Certner <olce.freebsd@certner.fr> Closes #15416	2023-10-20 11:49:56 -07:00
Colin Percival	ea30b5a9e0	Set spa_ccw_fail_time=0 when expanding a vdev. When a vdev is to be expanded -- either via `zpool online -e` or via the autoexpand option -- a SPA_ASYNC_CONFIG_UPDATE request is queued to be handled via an asynchronous worker thread (spa_async_thread). This normally happens almost immediately; but will be delayed up to zfs_ccw_retry_interval seconds (default 5 minutes) if an attempt to write the zpool configuration cache failed. When FreeBSD boots ZFS-root VM images generated using `makefs -t zfs`, the zpoolupgrade rc.d script runs `zpool upgrade`, which modifies the pool configuration and triggers an attempt to write to the cache file. This attempted write fails because the filesystem is still mounted read-only at this point in the boot process, triggering a 5-minute cooldown before SPA_ASYNC_CONFIG_UPDATE requests will be handled by the asynchronous worker thread. When expanding a vdev, reset the "when did a configuration cache write last fail" value so that the SPA_ASYNC_CONFIG_UPDATE request will be handled promptly. A cleaner but more intrusive option would be to use separate SPA_ASYNC_ flags for "configuration changed" and "try writing the configuration cache again", but with FreeBSD 14.0 coming very soon I'd prefer to leave such refactoring for a later date. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colin Percival <cperciva@FreeBSD.org> Closes #15405	2023-10-20 10:30:32 -07:00
John Wren Kennedy	c0e58995e3	Large sync writes perform worse with slog For synchronous write workloads with large IO sizes, a pool configured with a slog performs worse than one with an embedded zil: sequential_writes 1m sync ios, 16 threads Write IOPS: 1292 438 -66.10% Write Bandwidth: 1323570 448910 -66.08% Write Latency: 12128400 36330970 3.0x sequential_writes 1m sync ios, 32 threads Write IOPS: 1293 430 -66.74% Write Bandwidth: 1324184 441188 -66.68% Write Latency: 24486278 74028536 3.0x The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`, if a zil block is greater than 768K, the priority of the write is downgraded from sync to async. Increasing the value allows greater throughput. To select a value for this PR, I ran an fio workload with the following values for `zil_slog_bulk`: zil_slog_bulk KiB/s 1048576 422132 2097152 478935 4194304 533645 8388608 623031 12582912 827158 16777216 1038359 25165824 1142210 33554432 1211472 50331648 1292847 67108864 1308506 100663296 1306821 134217728 1304998 At 64M, the results with a slog are now improved to parity with an embedded zil: sequential_writes 1m sync ios, 16 threads Write IOPS: 438 1288 2.9x Write Bandwidth: 448910 1319062 2.9x Write Latency: 36330970 12163408 -66.52% sequential_writes 1m sync ios, 32 threads Write IOPS: 430 1290 3.0x Write Bandwidth: 441188 1321693 3.0x Write Latency: 74028536 24519698 -66.88% None of the other tests in the performance suite (run with a zil or slog) had a significant change, including the random_write_zil tests, which use multiple datasets. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com> Closes #14378	2023-10-13 11:15:09 -07:00
Alexander Motin	380c25f640	FreeBSD: Improve taskq wrapper - Group tqent_task and tqent_timeout_task into a union. They are never used same time. This shrinks taskq_ent_t from 192 to 160 bytes. - Remove tqent_registered. Use tqent_id != 0 instead. - Remove tqent_cancelled. Use taskqueue pending counter instead. - Change tqent_type into uint_t. We don't need to pack it any more. - Change tqent_rc into uint_t, matching refcount(9). - Take shared locks in taskq_lookup(). - Call proper taskqueue_drain_timeout() for TIMEOUT_TASK in taskq_cancel_id() and taskq_wait_id(). - Switch from CK_LIST to regular LIST. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15356	2023-10-13 10:41:11 -07:00
Jason King	8a74070128	Zpool can start allocating from metaslab before TRIMs have completed When doing a manual TRIM on a zpool, the metaslab being TRIMmed is potentially re-enabled before all queued TRIM zios for that metaslab have completed. Since TRIM zios have the lowest priority, it is possible to get into a situation where allocations occur from the just re-enabled metaslab and cut ahead of queued TRIMs to the same metaslab. If the ranges overlap, this will cause corruption. We were able to trigger this pretty consistently with a small single top-level vdev zpool (i.e. small number of metaslabs) with heavy parallel write activity while performing a manual TRIM against a somewhat 'slow' device (so TRIMs took a bit of time to complete). With the patch, we've not been able to recreate it since. It was on illumos, but inspection of the OpenZFS trim code looks like the relevant pieces are largely unchanged and so it appears it would be vulnerable to the same issue. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jason King <jking@racktopsystems.com> Illumos-issue: https://www.illumos.org/issues/15939 Closes #15395	2023-10-12 11:01:54 -07:00
Alexander Motin	1b310dfb1d	DMU: Do not pre-read holes during write dmu_tx_check_ioerr() pre-reads blocks that are going to be dirtied as part of transaction to both prefetch them and check for errors. But it makes no sense to do it for holes, since there are no disk reads to prefetch and there can be no errors. On the other side those blocks are anonymous, and they are freed immediately by the dbuf_rele() without even being put into dbuf cache, so we just burn CPU time on decompression and overheads and get absolutely no result at the end. Use of dbuf_hold_impl() with fail_sparse parameter allows to skip the extra work, and on my tests with sequential 8KB writes to empty ZVOL with 32KB blocks shows throughput increase from 1.7 to 2GB/s. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15371	2023-10-11 16:37:21 -07:00
Daniel Berlin	bc29124b1b	Ensure we call fput when cloning fails due to different devices. Right now, zpl_ioctl_ficlone and zpl_ioctl_ficlonerange do not call put on the src fd if the source and destination are on two different devices. This leaves the source file held open in this case. Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Daniel Berlin <dberlin@dberlin.org> Closes #15386	2023-10-10 11:04:32 -07:00
Tony Hutter	aefb6a2bd6	zvol: Temporally disable blk-mq There was a report of zvol data loss (#15351) after enabling blk-mq on a zvol backed with 16k physical block sized disks. Out of an abundance of caution, do not allow the user to enable blk-mq until we can look into the issue. Note that blk-mq was not enabled by default on zvols. It was always opt-in via the zvol_use_blk_mq module parameter. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Addresses: #15351 Closes #15378	2023-10-10 08:57:48 -07:00
Alexander Motin	008baa091f	FreeBSD: Reduce divergence from in-tree sources This includes random small tweaks, primarily a build fixes, required when ZFS is built as part of FreeBSD base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15368	2023-10-09 13:27:18 -07:00
Alexander Motin	66b81b3497	ZIL: Reduce maximum size of WR_COPIED to 7.5K Benchmarks show that at certain write sizes range lock/unlock take not so much time as extra memory copy. The exact threshold is not obvious due to other overheads, but it is definitely lower than ~63KB used before. Make it configurable, defaulting at 7.5KB, that is 8KB of nearest malloc() size minus itx and lr structs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15353	2023-10-06 10:09:27 -07:00
Rob Norris	54b1b1d893	import: require force when cachefile hostid doesn't match on-disk Previously, if a cachefile is passed to zpool import, the cached config is mostly offered as-is to ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and the results are taken as the canonical pool config and handed back to ZFS_IOC_POOL_IMPORT. In the course of its operation, spa_load() will inspect the pool and build a new config from what it finds on disk. However, it then regenerates a new config ready to import, and so rightly sets the hostid and hostname for the local host in the config it returns. Because of this, the "require force" checks always decide the pool is exported and last touched by the local host, even if this is not true, which is possible in a HA environment when MMP is not enabled. The pool may be imported on another head, but the import checks still pass here, so the pool ends up imported on both. (This doesn't happen when a cachefile isn't used, because the pool config is discovered in userspace in zpool_find_import(), and that does find the on-disk hostid and hostname correctly). Since the systemd zfs-import-cache.service unit uses cachefile imports, this can lead to a system returning after a crash with a "valid" cachefile on disk and automatically, quietly, importing a pool that has already been taken up by a secondary head. This commit causes the on-disk hostid and hostname to be included in the ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then changes the "force" checks for zpool import to use them if present. This method should give no change in behaviour for old userspace on new kernels (they won't know to look for the new config items) and for new userspace on old kernels (the won't find the new config items). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15290	2023-10-06 09:24:44 -07:00
Alexander Motin	342357cd9e	Reduce number of metaslab preload taskq threads. Before this change ZFS created threads for 50% of CPUs for each top- level vdev. Plus it created the same number of threads for embedded log groups (that have only one metaslab and don't need any preload). As result, on system with 80 CPUs and pool of 60 vdevs this resulted in 4800 metaslab preload threads, that is absolutely insane. This patch changes the preload threads to 50% of CPUs in one taskq per pool, so on the mentioned system it will be only 40 threads. Among other things this fixes zdb on the mentioned system and pool on FreeBSD, that failed to create so many threads in one process. Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15319	2023-10-06 09:04:00 -07:00
Alexander Motin	75a2eb7fac	ARC: Drop different size headers for crypto To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC headers only when specific block was encrypted on disk. It was a nice optimization, except in some cases the code reallocated them on fly, that invalidated header pointers from the buffers. Since the buffers use different locking, it created number of races, that were originally covered (at least partially) by b_evict_lock, used also to protection evictions. But it has gone as part of #14340. As result, as was found in #15293, arc_hdr_realloc_crypt() ended up unprotected and causing use-after-free. Instead of introducing some even more elaborate locking, this patch just drops the difference between normal and protected headers. It cost us additional 56 bytes per header, but with couple patches saving 24 bytes, the net growth is only 32 bytes with total header size of 232 bytes on FreeBSD, that IMHO is acceptable price for simplicity. Additional locking would also end up consuming space, time or both. Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15293 Closes #15347	2023-10-06 09:01:00 -07:00
Alexander Motin	96b9cf42e0	ARC: Remove b_bufcnt/b_ebufcnt from ARC headers In most cases we do not care about exact number of buffers linked to the header, we just need to know if it is zero, non-zero or one. That can easily be checked just looking on b_buf pointer or in some cases derefencing it. b_ebufcnt is read only once, and in that case we already traverse the list as part of arc_buf_remove(), so second traverse should not be expensive. This reduces L1 ARC header size by 8 bytes and full crypto header by 16 bytes, down to 176 and 232 bytes on FreeBSD respectively. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15350	2023-10-06 08:56:17 -07:00

1 2 3 4 5 ...

4320 Commits