In l2arc_evict(), the config lock may be acquired in reverse order
relative to arc_read() (e.g., first the config lock as writer, then a
hash lock) during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (as reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.
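As a rough illustration of that retry pattern (a minimal userspace
sketch using pthread primitives, not the actual arc_read() code; the
lock names are stand-ins for the real hash and config locks):

    #include <pthread.h>

    static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_rwlock_t config_lock = PTHREAD_RWLOCK_INITIALIZER;

    static void
    read_with_retry(void)
    {
    top:
        pthread_mutex_lock(&hash_lock);
        /* ... lookups done under the hash lock ... */
        if (pthread_rwlock_tryrdlock(&config_lock) != 0) {
            /*
             * The removal path may hold config_lock as writer while
             * waiting on hash_lock, so blocking here could deadlock.
             * Drop the hash lock, wait for the config lock to become
             * available, then retry from the top.
             */
            pthread_mutex_unlock(&hash_lock);
            pthread_rwlock_rdlock(&config_lock);
            pthread_rwlock_unlock(&config_lock);
            goto top;
        }
        /* ... proceed with both locks held ... */
        pthread_rwlock_unlock(&config_lock);
        pthread_mutex_unlock(&hash_lock);
    }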
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#17071
Don't try to get mg of hole vdev in removal
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17080
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes#17088
Before this change the zfs_metaslab_switch_threshold tunable switched
metaslabs each time their index was reduced by two (which means the
biggest contiguous chunk was reduced to 1/4). It is a good idea to
balance metaslab fragmentation. But for empty metaslabs (having
power-of-2 sizes) this means switching when they get just below half
of their capacity. Inspection with zdb after filling a new pool to
half capacity showed most of its metaslabs filled to half capacity.
I consider this sub-optimal for pool fragmentation in the long run.
This change blocks metaslab switching if most of the metaslab free
space (15/16) is represented by a single contiguous range. Such a
metaslab should not be considered fragmented until it actually fails
some big allocation. More contiguous filling should improve data
locality and increase the time before a previously filled and
partially freed metaslab is touched again, giving it more time to
free more contiguous chunks for lower fragmentation. It should also
slightly reduce spacemap traffic.
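As a minimal sketch of the new check (stand-in code with hypothetical
names, not the actual metaslab.c change):

    #include <stdbool.h>
    #include <stdint.h>

    static bool
    metaslab_should_switch(uint64_t free_space, uint64_t largest_free_seg)
    {
        /*
         * If most (15/16) of the metaslab's free space is one contiguous
         * range, keep filling it instead of switching away; it is not
         * really fragmented until a big allocation actually fails.
         */
        if (largest_free_seg >= free_space - free_space / 16)
            return (false);
        return (true);  /* fall back to the index-based heuristic */
    }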
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#17081
zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the
zio_taskq thread is active and blocked both while waiting for the IO
call and then while calling zio_execute() for the next stage. This is a
particular issue for FLUSH, as the z_flush_iss queue typically only has
one thread; multiple flushes arriving at once can cause long delays if
the underlying fsync() response is particularly slow.
To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq,
just as we do for reads and writes. Further, we return all results
through zio_interrupt(), so neither the issue nor the file taskqs are
blocked.
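As a simplified stand-in sketch of that flow (not the actual
vdev_file.c code): the blocking call runs on a worker queue and the
result is delivered through an asynchronous completion, so the worker
never runs the next pipeline stage inline:

    #include <errno.h>
    #include <unistd.h>

    struct fake_zio {
        int fd;
        int error;
        void (*done)(struct fake_zio *);  /* plays the role of zio_interrupt() */
    };

    /* Runs on the file taskq worker thread. */
    static void
    flush_task(void *arg)
    {
        struct fake_zio *zio = arg;

        zio->error = fsync(zio->fd) ? errno : 0;  /* the blocking call */
        zio->done(zio);  /* complete asynchronously instead of running
                            the next stage on this thread */
    }

    /* Issue path: queue the work and return immediately. */
    static void
    flush_start(struct fake_zio *zio,
        void (*dispatch)(void (*func)(void *), void *arg))
    {
        dispatch(flush_task, zio);
    }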
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17064
Need to use arc_free_data_abd() to free an ABD-type buffer.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#17079
Kernel & userspace specifics are in zfs_file_os.c, so there's no
particular reason these have to be separate.
The one platform-specific part is in the Linux kernel part, to offload
flushes to a taskq if we're already inside a filesystem transaction.
This would normally be an unsatisfying wart, but I'm intending to
remove this shortly, so I'm content to leave it gated for the moment.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Since we are calculating free space fragmentation, we should weight
metaslabs by the amount of their free space, not their full size.
Fragmentation of full metaslabs may not matter in the presence of
empty ones. The old algorithm did not differentiate metaslabs having
only one free 4KB block from metaslabs having 50% of space free in
4KB blocks, reporting higher fragmentation.
While there, move the metaslab_group_alloc_update() call to after
setting mg_fragmentation, otherwise the effect may be delayed by one
TXG.
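For illustration, the weighting amounts to the following (stand-in
code, not the actual metaslab.c implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* Weight each metaslab's fragmentation by its free space. */
    static uint64_t
    group_fragmentation(const uint64_t *ms_frag, const uint64_t *ms_free,
        size_t count)
    {
        uint64_t frag_sum = 0, free_sum = 0;

        for (size_t i = 0; i < count; i++) {
            frag_sum += ms_frag[i] * ms_free[i];
            free_sum += ms_free[i];
        }
        /* A group with no free space contributes no fragmentation. */
        return (free_sum == 0 ? 0 : frag_sum / free_sum);
    }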
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.
Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
The purpose of no-op injection is to simulate a failure between a
device cache and its permanent store. We still want it to go through
the queue and respond in the same way to everything else.
So, inject "success" as the very last thing, and then move on to
VDEV_IO_DONE so the zio can be dequeued and any follow-up work can
occur.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17029
Gang blocks have a significant impact on the long and short term
performance of a zpool, but there is not a lot of observability into
whether they're being used. This change adds gang-specific kstats to
ZFS, to better allow users to see whether ganging is happening.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#17003
When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.
We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.
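A sketch of the size-from-table idiom (the bucket values here are
illustrative only, not the actual zfs_frag_table contents):

    #include <stddef.h>

    /* Fragmentation contribution (percent) per free-segment size bucket,
     * now extended with buckets up to 512MB segments. */
    static const int frag_table[] = {
        100, 99, 97, 93, 88, 83, 77, 71, 64, 56,
        47, 37, 26, 14, 7, 2, 0,
    };

    /* Derive the size from the table itself instead of a static define. */
    #define FRAG_TABLE_SIZE (sizeof (frag_table) / sizeof (frag_table[0]))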
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes#16986
As zios are reexecuted after resume from suspension, their ready and
wait states need to be propagated to wait counts on all their parents.
It's possible for those parents to have active children passing through
READY or DONE, which then end up in zio_notify_parent(), take their
parent's lock, and decrement the wait count. Without also taking a lock
here, it's possible for an increment race to occur, which leads to
either there being no references left (tripping the assert in
zio_notify_parent()), or a parent waiting forever for a nonexistent
child to complete.
To protect against this, we simply take the appropriate zio locks in
zio_reexecute() before updating the wait counts.
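In stand-in form, the idea is simply that the increment and decrement
paths hold the same per-zio lock (simplified, not the actual zio.c
code):

    #include <pthread.h>
    #include <stdint.h>

    struct parent {
        pthread_mutex_t lock;       /* plays the role of the zio lock */
        uint64_t        wait_count;
    };

    /* Reexecute path: account for a child while holding the lock. */
    static void
    reexecute_add_wait(struct parent *p)
    {
        pthread_mutex_lock(&p->lock);
        p->wait_count++;
        pthread_mutex_unlock(&p->lock);
    }

    /* Notify path: a child passing READY/DONE drops the count. */
    static void
    notify_parent(struct parent *p)
    {
        pthread_mutex_lock(&p->lock);
        p->wait_count--;
        pthread_mutex_unlock(&p->lock);
    }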
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#17016
This change prevents prefetch from performing an unnecessary ARC
buffer fill when reading from disk.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jaydeep Kshirsagar <jkshirsagar@maxlinear.com>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes#17013
Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which are explicitly
excluded from injection. Even if it were possible, it would be awkward
to do, because a probe is a sequence of reads and writes.
This commit adds a new IO "type" to match for injection, which looks
for the ZIO_FLAG_PROBE flag instead. Any probe IO will match the
injection record and receive the wanted error.
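In stand-in form, the match amounts to checking a flag rather than the
IO type (simplified; the flag value below is illustrative, not the real
ZIO_FLAG_PROBE definition):

    #include <stdbool.h>
    #include <stdint.h>

    #define IO_FLAG_PROBE  (1u << 0)   /* stands in for ZIO_FLAG_PROBE */

    struct io { uint32_t flags; };

    /* A probe is a sequence of reads and writes to the labels, so match
     * on the probe flag instead of any particular IO type. */
    static bool
    injection_matches_probe(const struct io *io)
    {
        return ((io->flags & IO_FLAG_PROBE) != 0);
    }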
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16947
I'm about to add a new "type", and I need somewhere to put it!
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16947
When building tests with zinject, it can be quite difficult to work out
if you're producing the right kind of IO to match the rules you've set
up.
So, here we extend injection records to count the number of times a
handler matched the operation, and how often an error was actually
injected (i.e. after frequency and other exclusions are applied).
Then, display those counts in the `zinject` output.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes#16938
Similar to what we saw in #16569, a replacing vdev should not be
considered as fully contributing to the redundancy of a raidz vdev,
even though the current IO has enough redundancy.
When a failed vdev_probe() is faulting a disk, it now checks
if that disk is required, and if so it suspends the pool until
the admin can return the missing disks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16864
It's possible for a vdev to be flagged for async remove after the pool
has suspended. If the removed device has been returned when the pool is
resumed, the ASYNC_REMOVE task will still run at the end of txg, and
remove the device from the pool again.
To fix, we clear the async remove flag at reopen, just as we did for the
async fault flag in 5de3ac223.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16921
In Linux, block devices currently lack support for the
`copy_file_range` API because the kernel does not provide the
necessary functionality.
However, there is an ongoing upstream effort to address this
limitation: https://patchwork.kernel.org/project/dm-devel/cover/20240520102033.9361-1-nj.shetty@samsung.com/.
We have adopted this upstream kernel patch into the TrueNAS kernel and
made some additional modifications to enable block cloning specifically
for the zvol block device. This patch implements the platform-
independent portions of these changes for inclusion in OpenZFS.
This patch does not introduce any new functionality directly into
OpenZFS. The `TX_CLONE_RANGE` replay capability is only relevant when
zvols are migrated to non-TrueNAS systems that support Clone Range
replay in the ZIL.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#16901
The count of chunks in a microzap block is stored as a uint16_t
(mze_chunkid). Each chunk is 64 bytes, and the first is used to store a
header, so there are 32767 usable chunks, which is just under 2M. 1M is
the largest power-of-2-rounded block size under 2M, so we must set the
limit there.
If it goes higher, the loop in mzap_addent can overflow and fall into
the PANIC case.
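A quick arithmetic check of that limit:

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint64_t chunk_size = 64;   /* bytes per microzap chunk */
        uint64_t usable = 32767;    /* chunks left after the header */

        /* 32767 * 64 = 2097088 bytes, just under 2M (2097152) */
        printf("usable bytes: %llu\n",
            (unsigned long long)(usable * chunk_size));

        /* the largest power-of-2 block size under that is 1M */
        printf("limit: %u\n", 1u << 20);
        return (0);
    }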
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16888
VDEV_PROP_USERPROP is equal to VDEV_PROP_INVAL and so is not a real
property. That's why vdev_prop_readonly() does not work right for
it. In particular it may declare all vdev user properties readonly
on FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16890
We should not dereference rra after the last zio_nowait() is called.
It seems very unlikely, but ASAN in ztest managed to catch it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16868
It seems there's no good reason for vdev_disk & vdev_geom to explicitly
detect no support for flush and set vdev_nowritecache. Instead, just
signal it by setting the error to ENOTSUP, and let zio_vdev_io_assess()
take care of it in one place.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16855
The first time a device returns ENOTSUP in response to a flush request,
we set vdev_nowritecache so we don't issue flushes in the future and
instead just pretend it succeeded. However, we still return an error
for the initial flush, even though we just decided such errors are
meaningless!
So, when setting vdev_nowritecache in response to a flush error, also
reset the error code to assume success.
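A sketch of the behavior, with simplified stand-in types rather than
the real zio_t/vdev_t definitions:

    #include <errno.h>
    #include <stdbool.h>

    struct vdev { bool nowritecache; };
    struct io   { int error; struct vdev *vd; };

    static void
    flush_done(struct io *io)
    {
        if (io->error == ENOTSUP) {
            /*
             * The device doesn't support cache flushes: remember that so
             * we stop issuing them, and treat this flush as a success
             * rather than returning a meaningless error.
             */
            io->vd->nowritecache = true;
            io->error = 0;
        }
    }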
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16855
In dbuf_sync_leaf, we clone the arc_buf in dr if we share it with db,
except in the overridden case. However, this exception causes a race
where dbuf_new_size could free the arc_buf after the last dereference
of *datap, causing a use-after-free. We fix this by cloning the buf
regardless of whether it's overridden.
The race (P0 and P1 are two threads):
--
P0: dbuf_hold_impl()
    // dbuf_hold_copy passed
    // because db_data_pending NULL
P1: dbuf_sync_leaf()
    // doesn't clone *datap
    // *datap derefed to db_buf
P1: dbuf_write(*datap)
P0: dbuf_new_size()
      dmu_buf_will_dirty()
        dbuf_fix_old_data()
        // alloc new buf for P0 dr
        // but can't change *datap
      arc_alloc_buf()
      arc_buf_destroy()
      // alloc new buf for db_buf
      // and destroy old buf
P1: dbuf_write() // continue
      abd_get_from_buf(data->b_data,
          arc_buf_size(data))
      // use-after-free
--
Here's an example when it happens:
BUG: kernel NULL pointer dereference, address: 000000000000002e
RIP: 0010:arc_buf_size+0x1c/0x30 [zfs]
Call Trace:
dbuf_write+0x3ff/0x580 [zfs]
dbuf_sync_leaf+0x13c/0x530 [zfs]
dbuf_sync_list+0xbf/0x120 [zfs]
dnode_sync+0x3ea/0x7a0 [zfs]
sync_dnodes_task+0x71/0xa0 [zfs]
taskq_thread+0x2b8/0x4e0 [spl]
kthread+0x112/0x130
ret_from_fork+0x1f/0x30
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes#16854
When a vdev first sees some block cloning, there is a window when
brt_maybe_exists() might already return true since something was
cloned, but bv_mos_entries is still 0 since the BRT ZAP was not yet
created. In such a case we should not try to look into the ZAP and
dereference the NULL bv_mos_entries_dnode.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16851
cstyle can handle these cases now, so we don't need to disable it.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#16840
- Instead of copying one ashift-sized block per ZIO, copy as much
as we have contiguous data up to 16MB per old vdev. To avoid data
moves use gang ABDs, so that read ZIOs can directly fill buffers
for write ZIOs. ABDs have much smaller overhead than ZIOs in both
memory usage and processing time, plus big I/Os do not depend on
I/O aggregation and scheduling to reach decent performance on HDDs.
- Reduce raidz_expand_max_copy_bytes to 16MB on 32bit platforms.
- Use 32bit range tree when possible (practically always now) to
slightly reduce memory usage.
- Use ZIO_PRIORITY_REMOVAL for early stages of expansion, same as
for main ones.
- Fix rate overflows in `zpool status` reporting.
With these changes, expanding RAIDZ1 from 4 to 5 children I am able
to reach a 6-12GB/s rate on SSDs and ~500MB/s on HDDs, both limited
by the devices instead of the CPU.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15680
Closes #16819
Same as writes, block cloning can increase the block size and the
number of indirection levels. That means it can dirty block 0 at level
0 or at a new top indirection level without explicitly holding them.
A block cloning test case for large offsets has been added.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16825
- Issue prescient prefetches for demand indirect blocks after the
first one. It should be quite rare for reads/writes, but much more
useful for cloning due to much bigger (up to 1022 blocks) accesses.
It covers the gap during the first couple of accesses, when we cannot
speculate yet but we know what is needed right now. It reduces
dbuf_hold() sync read delays in dmu_buf_hold_array_by_dnode().
- Increase maximum prefetch distance for indirect blocks from 64
to 128MB. It should cover the maximum 1022 blocks of block cloning
access size in case of default 128KB recordsize used. In case of
bigger recordsize the above prescient prefetch should also help.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16814
In some cases, like dsl_dataset_hold_obj(), it is possible to handle
these errors, so failing to hold a dataset is better than a kernel
panic. Some other places where these errors are still not handled but
asserted should be less dangerous, as they are effectively unreachable.
We have a user report about pool corruption leading to assertions on
these errors. Hopefully this will make the behavior a bit nicer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16836
Some users might want to scrub only new data because they would like
to know whether their new writes were corrupted. This PR adds the
ability to scrub only newly written data.
This introduces a new `last_scrubbed_txg` property, indicating the
transaction group (TXG) up to which the most recent scrub operation
has checked and repaired the dataset, so users can run a scrub only
from the last saved point. We use scn_max_txg and scn_min_txg, which
are already built into scrub, to accomplish that.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Closes#16301
The Direct I/O implementation added a condition to call dbuf_undirty()
only in the case of block cloning. But the condition is not right if
the block is no longer dirty in this TXG but still in DB_NOFILL state.
It resulted in the block not reverting to DB_UNCACHED and a following
NULL dereference on an attempt to access the absent db_data.
While there, add assertions for db_data to make debugging easier.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16829
This allowed debugging #16714, fixed in #16782. Without the assertions
added here it is difficult to figure out which logs cause the problem,
since the assertion happens in sync thread context.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes#16795
This fixes an assertion in brt_sync_table() on debug builds when the
last cloned block on the vdev is freed and bv_meta_dirty is cleared
while bv_entcount_dirty is not. Should not matter in production.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16791
- With both pending and current AVL-trees being per-vdev and having
effectively identical comparison functions (the pending tree also
compared birth time, but I don't believe it is possible for them to
differ for the same offset within one transaction group), it makes no
sense to move entries from one to the other. Instead, inline a
dramatically simplified brt_entry_addref() into brt_pending_apply().
It no longer requires bv_lock, since nothing is concurrent with it at
that time. And it does not need to search the tree for previous
entries: it is the same tree, we already have the entry, and we know
it is unique.
- Put brt_vdev_lookup() and brt_vdev_addref() into different tree
traversals to avoid false positives in the first due to the second's
entcount modifications. This saves a dramatic amount of time when a
file is cloned for the first time by not looking for non-existent ZAP
entries.
- Remove the avl_is_empty(bv_tree) check from brt_maybe_exists(). I
don't think it is needed, since by that time all added entries are
already accounted in bv_entcount. The extra check must have been
producing too many false positives for no reason. Also we don't need
bv_lock there, since the bv_entcount pointer must be stable at this
point, and we don't care about false positive races here, while false
negatives should be impossible, since all brt_vdev_addref() calls have
already completed by this point. This dramatically reduces lock
contention on massive deletes of cloned blocks. The only remaining
contention is between multiple parallel free threads calling
brt_entry_decref().
- Do not update the ZAP if the net change for a block over the TXG
was 0. In combination with the above, this makes moving a file between
datasets as cheap an operation as originally intended if it fits into
one TXG.
- Do not allocate the per-vdev structures on pool creation or import
if the pool did not have active block cloning. This saves a bit of
memory in a few cases.
- While here, add proper error handling in brt_load() on pool import
instead of assertions.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16773
While the block cloning operation was made per-vdev from the
beginning, before this change most of its data were protected by two
pool-wide locks. This created a lot of lock contention in many
workloads.
This change makes most of the block cloning data structures per-vdev,
which allows locking them separately. The only pool-wide lock now is
spa_brt_lock, protecting the array of per-vdev pointers and in most
cases taken as reader. This also splits the per-vdev locks into three
different ones: bv_pending_lock protects the AVL-tree of pending
operations in open context, bv_mos_entries_lock protects the BRT ZAP
object while it is being prefetched, and bv_lock protects the rest of
the per-vdev context during the TXG commit process. There should be
no functional difference aside from some optimizations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16740
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16740
We are doing exactly the same checks around all brt_pending_add()
calls.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16740
zio_delay_interrupt(), apparently used for fault injection, is executed
in the I/O pipeline. It can cause the calling thread to go to sleep,
which is not allowed on FreeBSD. This happens only for small delays,
though, and there's no apparent reason to avoid deferring to a
taskqueue in that case, as it already does otherwise.
Simply defer to the taskqueue unconditionally. This fixes an occasional
panic I see when running the ZTS on FreeBSD. Also remove an unhelpful
comment referencing the non-existent timeout_generic().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#16785
Without doing that there is a race window on export when a history
log write by a completed rebuild dirties a transaction beyond the
final one, triggering an assertion.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16714
Closes #16782
Those values require global atomics to get current hash_elements
values in a few of the hottest code paths, while in all these years I
never cared about them. If somebody wants, it should be easy to get
them by periodic sampling, since neither ARC header nor DBUF counts
change so fast that it would be difficult to catch.
For now I've left the hash_elements_max kstat for ARC, since it was
used/reported by arc_summary and removing it would break older
versions, but now it just reports the current value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#16759
Compression names actually aren't used in dedup table names, but
checksum names are.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16776
dsl_free() calls zio_free() to free the block. For most blocks, this
simply calls metaslab_free() without doing any IO or putting anything on
the IO pipeline.
Some blocks however require additional IO to free. This at least
includes gang, dedup and cloned blocks. For those, zio_free() will issue
a ZIO_TYPE_FREE IO and return.
If a huge number of blocks are being freed all at once, it's possible
for dsl_dataset_block_kill() to be called millions of times on a single
transaction (e.g. a 2T object of 128K blocks is 16M blocks). If those
are all IO-inducing frees, that then becomes 16M FREE IOs placed on the
pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T
object that requires a 20G allocation of resident memory from the
zio_cache. If that can't be satisfied by the kernel, an out-of-memory
condition is raised.
This would be better handled by improving the cases that the
dmu_tx_assign() throttle will handle, or by reducing the overheads
required by the IO pipeline, or with a better central facility for
freeing blocks.
For now, we simply check for the cases that would cause zio_free() to
create a FREE IO, and instead put the block on the pool's freelist. This
is the same place that blocks from destroyed datasets go, and the async
destroy machinery will automatically see them and trickle them out as
normal.
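As a stand-in sketch of the routing decision (simplified; not the
actual dsl_free()/zio_free() code):

    #include <stdbool.h>

    /* Simplified stand-in for the real blkptr checks. */
    struct blkptr { bool is_gang, is_dedup, maybe_cloned; };

    static bool
    free_needs_io(const struct blkptr *bp)
    {
        /* Gang, dedup and (possibly) cloned blocks need pipeline work
         * to free; everything else goes straight to metaslab_free(). */
        return (bp->is_gang || bp->is_dedup || bp->maybe_cloned);
    }

    /*
     * If free_needs_io() is true, append the block to the pool's
     * freelist (the same place blocks from destroyed datasets go)
     * instead of issuing a ZIO_TYPE_FREE IO, and let the async destroy
     * machinery trickle it out.
     */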
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #6783
Closes #16708
Closes #16722
Closes #16697
..., before we make the header or the log block visible to others.
It should fix an assertion about allocated space going negative if
the header is freed once the lock is dropped while the write is still
in progress.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16040
Closes #16743
Currently, even though send_reader_thread prefetches spill blocks,
do_dump() will not use them and instead issues its own blocking
arc_read. This causes significant performance degradation when sending
datasets with lots of spill blocks.
For unmodified spill blocks, we now also create a send_range struct in
send_reader_thread and issue prefetches for them. We piggyback them on
the dnode send_range instead of enqueueing them so we don't break the
send_range_after check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: david.chen <david.chen@nutanix.com>
Closes#16701