mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Alexander Motin	602b5dca7b	Fix read errors race after block cloning Investigating read errors triggering panic fixed in #16042 I've found that we have a race in a sync process between the moment dirty record for cloned block is removed and the moment dbuf is destroyed. If dmu_buf_hold_array_by_dnode() take a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. Such case is not an error, but equivalent to DB_UNCACHED, since the dbuf block pointer is already updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_UNCACHED there, since there may already be another cloning in progress, that dropped dbuf lock before creating a new dirty record, protected only by the range lock. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Robert Evans <evansr@google.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16052	2024-04-19 10:13:38 -07:00
Alexander Motin	d5fb6abd36	Improve dbuf_read() error reporting Previous code reported non-ZIO errors only via return value, but not via parent ZIO. It could cause NULL-dereference panics due to dmu_buf_hold_array_by_dnode() ignoring the return value, relying solely on parent ZIO status. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reported by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16042	2024-04-19 10:13:38 -07:00
Alexander Motin	39993c3dfe	BRT: Check pool clone stats in more tests This should allow to catch some leaks, if those happen. While there fix some cosmetic issues. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16007	2024-04-19 10:13:38 -07:00
Alexander Motin	e3c1c9153f	BRT: Fix tests to work on non-empty pools It should not normally happen, but if it does, better to not fail everything for no good reason, or it may be hard to debug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16007	2024-04-19 10:13:38 -07:00
Alexander Motin	2ea370a4e3	BRT: Fix holes cloning. - When reading L0 block pointers handle buffers without ones and without dirty records as a holes. Those appear when dnode size was increased, but the end was never written, so there are no new indirection levels to store the pointers. It makes no sense to return EAGAIN here, since sync won't create new indirection levels until there will be actual writes. - When cloning blocks set destination hole logical birth time to the current TXG. Otherwise if we are cloning over existing data, newly created holes may not be properly replicated later. Use BP_SET_BIRTH() when possible to not replicate its logic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15994 Closes #16007	2024-04-19 10:13:38 -07:00
Alexander Motin	3e91a9c525	BRT: Skip getting length in brt_entry_lookup() Unlike DDT, where ZAP values may have different lengths due to compression, all BRT entries are identical 8-byte counters. It does not make sense to first fetch the length only to assert it. zap_lookup_uint64() is specifically designed to work with counters of different size and should return error if something odd found. Calling it straight allows to save some measurable CPU time. Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15950	2024-04-19 10:13:38 -07:00
Alexander Motin	c94f730078	BRT: Make BRT block sizes configurable Similar to DDT make BRT data and indirect block sizes configurable via module parameters. I am not sure what would be the best yet, but similar to DDT 4KB blocks kill all chances of compression on vdev with ashift=12 or more, that on my tests reaches 3x. While here, fix documentation for respective DDT parameters. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15967	2024-04-19 10:13:38 -07:00
Alexander Motin	457e62d7ca	BRT: Relax brt_pending_apply() locking Since brt_pending_apply() is running in syncing context, no other brt_pending_tree accesses are possible for the TXG. We don't need to acquire brt_pending_lock here. Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15955	2024-04-19 10:13:38 -07:00
Alexander Motin	19bf54b764	ZAP: Massively switch to _by_dnode() interfaces Before this change ZAP called dnode_hold() for almost every block access, that was clearly visible in profiler under heavy load, such as BRT. This patch makes it always hold the dnode reference between zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode operations between those. It also adds several new _by_dnode() APIs to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode() variant and uses it in the ZAP code. After this there remains only one call to dmu_buf_dnode_enter(), which seems to be unneeded. So remove the call and the functions. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15951	2024-04-19 10:13:38 -07:00
Alexander Motin	fdd8c0aea1	BRT: Skip duplicate BRT prefetches If there is a pending entry for this block, then we've already issued BRT prefetch for it within this TXG, so don't do it again. BRT vdev lookup and following zap_prefetch_uint64() call can be pretty expensive and should be avoided when not necessary. Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15941	2024-04-19 10:13:38 -07:00
Alexander Motin	dced953b62	ZAP: Some cleanups/micro-optimizations - Remove custom zap_memset(), use regular memset(). - Use PANIC() instead of opaque cmn_err(CE_PANIC). - Provide entry parameter to zap_leaf_rehash_entry(). - Reduce branching in zap_leaf_array_create() inner loop. - Remove signedness where it should not be. Should be no function changes. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15976	2024-04-19 10:13:38 -07:00
Alexander Motin	f7c1db6366	BRT: Change brt_pending_tree sorting order It does not look important how exactly brt_pending_tree is sorted. When cloning large file, it is quite likely that all of its blocks have identical physical birth times, so comparing them first does not provide useful entropy, while accesses additional cache line. In most cases combination of vdev and offset provides unique result and physical birth time comparison is not even needed. Meanwhile, when traversing the tree inside brt_pending_apply(), it can be beneficial for dbuf cache and CPU cache hits to group processing by vdev and so by the per-VDEV BRT ZAPs. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15954	2024-04-19 10:13:38 -07:00
Alexander Motin	fa5de0c5cd	Update resume token at object receive. Before this change resume token was updated only on data receive. Usually it is enough to resume replication without much overlap. But we've got a report of a curios case, where replication source was traversed with recursive grep, which through enabled atime modified every object without modifying any data. It produced several gigabytes of replication traffic without a single data write and so without a single resume point. While the resume token was not designed to resume from an object, I've found that the send implementation always sends object before any data. So by requesting resume from offset 0 we are effectively resuming from the object, followed (or not) by the data at offset 0, just as we need it. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15927	2024-04-19 10:13:38 -07:00
Alexander Motin	793a2cff2a	Linux: Cleanup taskq threads spawn/exit This changes taskq_thread_should_stop() to limit maximum exit rate for idle threads to one per 5 seconds. I believe the previous one was broken, not allowing any thread exits for tasks arriving more than one at a time and so completing while others are running. Also while there: - Remove taskq_thread_spawn() calls on task allocation errors. - Remove extra taskq_thread_should_stop() call. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15873	2024-04-19 10:13:38 -07:00
Alexander Motin	fdd97e0093	Refactor dmu_prefetch(). - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15076	2024-04-19 10:13:38 -07:00
Alexander Motin	3b8817db96	ZIL: Update Linux tracing after #15635 While picking parts from #14909 I've missed Linux tracing specific ones, that went unnoticed in default configurations, but breaks the build in some. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15730	2024-04-19 10:13:38 -07:00
Alexander Motin	25ea8ce94b	ZIL: Improve next log block size prediction Track history in context of bursts, not individual log blocks. It allows to not blow away all the history by single large burst of many block, and same time allows optimizations covering multiple blocks in a burst and even predicted following burst. For each burst account its optimal block size and minimal first block size. Use that statistics from the last 8 bursts to predict first block size of the next burst. Remove predefined set of block sizes. Allocate any size we see fit, multiple of 4KB, as required by ZIL now. With compression enabled by default, ZFS already writes pretty random block sizes, so this should not surprise space allocator any more. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15635	2024-04-19 10:13:38 -07:00
Alexander Motin	8b1a132de7	ZIO: Optimize zio_flush() - Generalize vdev_nowritecache handling by traversing through the VDEV tree and skipping children ZIOs where not supported. - Remove intermediate zio_null() in case of several VDEV children. - Remove children handling from zio_ioctl(). There are no other use cases for this code beside DKIOCFLUSHWRITECACHED, and would there be, I doubt they would so straightforward apply to all VDEV children. Comparing to removed previous optimization this should improve cases of redundant ZILs/SLOGs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15515	2024-04-19 10:13:38 -07:00
Alexander Motin	7ea8331009	ZIL: Detect single-threaded workloads ... by checking that previous block is fully written and flushed. It allows to skip commit delays since we can give up on aggregation in that case. This removes zil_min_commit_timeout parameter, since for single-threaded workloads it is not needed at all, while on very fast devices even some multi-threaded workloads may get detected as single-threaded and still bypass the wait. To give multi-threaded workloads more aggregation chances increase zfs_commit_timeout_pct from 5 to 10%, as they should suffer less from additional latency. Also single-threaded workloads detection allows in perspective better prediction of the next block size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15381	2024-04-19 10:13:38 -07:00
Rob N	3c5f354a8c	zvol_os: fix compile with blk-mq on Linux 4.x `99741bde5` accesses a cached blk-mq hardware context through the mq_hctx field of struct request. However, this field did not exist until 5.0. Before that, the private function blk_mq_map_queue() was used to dig it out of broader queue context. This commit detects this situation, and handles it with a poor-man's simulation of that function. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16069	2024-04-17 10:10:24 -07:00
Rob N	5c0fe099ec	zvol_os: fix build on Linux <3.13 `99741bde5` introduced zvol_num_taskqs, but put it behind the HAVE_BLK_MQ define, preventing builds on versions of Linux that don't have it (<3.13, incl EL7). Nothing about it seems dependent on blk-mq, so this just moves it out from behind that define and so fixes the build. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16062	2024-04-17 10:10:24 -07:00
Ameer Hamza	5fc134ff2f	zvol: use multiple taskq Currently, zvol uses a single taskq, resulting in throughput bottleneck under heavy load due to lock contention on the single taskq. This patch addresses the performance bottleneck under heavy load conditions by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scale dynamically based on the available CPUs in the system, as illustrated below: taskq total cpus taskqs threads threads ------- ------- ------- ------- 1 1 32 32 2 1 32 32 4 1 32 32 8 2 16 32 16 3 11 33 32 5 7 35 64 8 8 64 128 11 12 132 256 16 16 256 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15992	2024-04-17 10:10:24 -07:00
Rob Norris	7ad2616d37	vdev_disk: fix alignment check when buffer has non-zero starting offset If a linear buffer spans multiple pages, and the first page has a non-zero starting offset, the checker would not include the offset, and so would think there was an alignment gap at the end of the first page, rather than at the start. That is, for a 16K buffer spread across five pages with an initial 512B offset: [.XXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXX.] It would be interpreted as: [XXXXXXX.][XXXXXXXX]... And be rejected as misaligned. Since it's already a linear ABD, the "linearising" copy would just reuse the buffer as-is, and the second check would failing, tripping the VERIFY in vdev_disk_io_rw(). This commit fixes all this by including the offset in the check for end-of-page alignment. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> (cherry picked from commit `1bf649cb0a`)	2024-04-12 08:53:48 -07:00
Rob N	d0d9dccc61	vdev_disk: ensure trim errors are returned immediately After `08fd5ccc3`, the discard issuing code was organised such that if requesting an async discard or secure erase failed before the IO was issued (that is, calling __blkdev_issue_discard() returned an error), the failed zio would never be executed, resulting in txg_sync hanging forever waiting for IO to finish. This commit fixes that by immediately executing a failed zio on error. To handle the successful synchronous op case, we fake an async op by, when not using an asynchronous submission method, queuing the successful result zio as part of the discard handler. Since it was hard to understand the differences between discard and secure erase, and sync and async, across different kernel versions, I've commented and reorganised the code a bit to try and make everything more contained and linear. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> (cherry picked from commit `ba9f587a77`)	2024-04-11 12:25:40 -07:00
Rob Norris	28520cad25	vdev_disk: don't touch vbio after its handed off to the kernel After IO is unplugged, it may complete immediately and vbio_completion be called on interrupt context. That may interrupt or deschedule our task. If its the last bio, the vbio will be freed. Then, we get rescheduled, and try to write to freed memory through vbio->. This patch just removes the the cleanup, and the corresponding assert. These were leftovers from a previous iteration of vbio_submit() and were always "belt and suspenders" ops anyway, never strictly required. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc Reported-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> (cherry picked from commit `917ff75e95`)	2024-04-08 10:13:55 -07:00
Robert Evans	deb7a84231	Fix corruption caused by mmap flushing problems 1) Make mmap flushes synchronous. Linux may skip flushing dirty pages already in writeback unless data-integrity sync is requested. 2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be skipped due to DMU pushing back on TX assign. 3) Add missing mmap flush when doing block cloning. 4) While here, pass errors from putpage to writepage/writepages. This change fixes corruption edge cases, but unfortunately adds synchronous ZIL flushes for dirty mmap pages to llseek and bclone operations. It may be possible to avoid these sync writes later but would need more tricky refactoring of the writeback code. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Robert Evans <evansr@google.com> Closes #15933 Closes #16019	2024-03-29 17:10:04 -07:00
Rob Norris	eebf00bee9	vdev_disk: default to classic submission for 2.2.x We don't want to change to brand-new code in the middle of a stable series, but we want it available to test for people running into page splitting issues. This commits make zfs_vdev_disk_classic=1 the default, and updates the documentation to better explain what's going on. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.	2024-03-28 13:29:46 -07:00
Rob Norris	d0b3be763f	abd_iter_page: don't use compound heads on Linux <4.5 Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages in a compound page were refcounted separately. This means that using the head page without taking a reference to it could see it cleaned up later before we're finished with it. Specifically, bio_add_page() would take a reference, and drop its reference after the bio completion callback returns. If the zio is executed immediately from the completion callback, this is usually ok, as any data is referenced through the tail page referenced by the ABD, and so becomes "live" that way. If there's a delay in zio execution (high load, error injection), then the head page can be freed, along with any dirty flags or other indicators that the underlying memory is used. Later, when the zio completes and that memory is accessed, its either unmapped and an unhandled fault takes down the entire system, or it is mapped and we end up messing around in someone else's memory. Both of these are very bad. The solution on these older kernels is to take a reference to the head page when we use it, and release it when we're done. There's not really a sensible way under our current structure to do this; the "best" would be to keep a list of head page references in the ABD, and release them when the ABD is freed. Since this additional overhead is totally unnecessary on 4.5+, where head and tail pages share refcounts, I've opted to simply not use the compound head in ABD page iteration there. This is theoretically less efficient (though cleaning up head page references would add overhead), but its safe, and we still get the other benefits of not mapping pages before adding them to a bio and not mis-splitting pages. There doesn't appear to be an obvious symbol name or config option we can match on to discover this behaviour in configure (and the mm/page APIs have changed a lot since then anyway), so I've gone with a simple version check. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `c6be6ce175`)	2024-03-28 13:29:46 -07:00
Rob Norris	cb599d27ed	vdev_disk: use bio_chain() to submit multiple BIOs Simplifies our code a lot, so we don't have to wait for each and reassemble them. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `72fd834c47`)	2024-03-28 13:29:46 -07:00
Rob Norris	af3a5bb40d	vdev_disk: add module parameter to select BIO submission method This makes the submission method selectable at module load time via the `zfs_vdev_disk_classic` parameter, allowing this change to be backported to 2.2 safely, and disabled in favour of the "classic" submission method if new problems come up. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `df2169d141`)	2024-03-28 13:29:46 -07:00
Rob Norris	51c2bd0def	vdev_disk: rewrite BIO filling machinery to avoid split pages This commit tackles a number of issues in the way BIOs (`struct bio`) are constructed for submission to the Linux block layer. The kernel has a hard upper limit on the number of pages/segments that can be added to a BIO, as well as a separate limit for each device (related to its queue depth and other scheduling characteristics). ZFS counts the number of memory pages in the request ABD (`abd_nr_pages_off()`, and then uses that as the number of segments to put into the BIO, up to the hard upper limit. If it requires more than the limit, it will create multiple BIOs. Leaving aside the fact that page count method is wrong (see below), not limiting to the device segment max means that the device driver will need to split the BIO in half. This is alone is not necessarily a problem, but it interacts with another issue to cause a much larger problem. The kernel function to add a segment to a BIO (`bio_add_page()`) takes a `struct page` pointer, and offset+len within it. `struct page` can represent a run of contiguous memory pages (known as a "compound page"). In can be of arbitrary length. The ZFS functions that count ABD pages and load them into the BIO (`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never consider a page to be more than `PAGE_SIZE` (4K), even if the `struct page` is for multiple pages. In this case, it will load the same `struct page` into the BIO multiple times, with the offset adjusted each time. With a sufficiently large ABD, this can easily lead to the BIO being entirely filled much earlier than it could have been. This is also further contributes to the problem caused by the incorrect segment limit calculation, as its much easier to go past the device limit, and so require a split. Again, this is not a problem on its own. The logic for "never submit more than `PAGE_SIZE`" is actually a little more subtle. It will actually never submit a buffer that crosses a 4K page boundary. In practice, this is fine, as most ABDs are scattered, that is a list of complete 4K pages, and so are loaded in as such. Linear ABDs are typically allocated from slabs, and for small sizes they are frequently not aligned to page boundaries. For example, a 12K allocation can span four pages, eg: -- 4K -- -- 4K -- -- 4K -- -- 4K -- \| \| \| \| \| :## ######## ######## ######: [1K, 4K, 4K, 3K] Such an allocation would be loaded into a BIO as you see: [1K, 4K, 4K, 3K] This tends not to be a problem in practice, because even if the BIO were filled and needed to be split, each half would still have either a start or end aligned to the logical block size of the device (assuming 4K at least). --- In ideal circumstances, these shortcomings don't cause any particular problems. Its when they start to interact with other ZFS features that things get interesting. Aggregation will create a "gang" ABD, which is simply a list of other ABDs. Iterating over a gang ABD is just iterating over each ABD within it in turn. Because the segments are simply loaded in order, we can end up with uneven segments either side of the "gap" between the two ABDs. For example, two 12K ABDs might be aggregated and then loaded as: [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K] Should a split occur, each individual BIO can end up either having an start or end offset that is not aligned to the logical block size, which some drivers (eg SCSI) will reject. However, this tends not to happen because the default aggregation limit usually keeps the BIO small enough to not require more than one split, and most pages are actually full 4K pages, so hitting an uneven gap is very rare anyway. If the pool is under particular memory pressure, then an IO can be broken down into a "gang block", a 512-byte block composed of a header and up to three block pointers. Each points to a fragment of the original write, or in turn, another gang block, breaking the original data up over and over until space can be found in the pool for each of them. Each gang header is a separate 512-byte memory allocation from a slab, that needs to be written down to disk. When the gang header is added to the BIO, its a single 512-byte segment. Pulling all this together, consider a large aggregated write of gang blocks. This results a BIO containing lots of 512-byte segments. Given our tendency to overfill the BIO, a split is likely, and most possible split points will yield a pair of BIOs that are misaligned. Drivers that care, like the SCSI driver, will reject them. --- This commit is a substantial refactor and rewrite of much of `vdev_disk` to sort all this out. `vdev_bio_max_segs()` now returns the ideal maximum size for the device, if available. There's also a tuneable `zfs_vdev_disk_max_segs` to override this, to assist with testing. We scan the ABD up front to count the number of pages within it, and to confirm that if we submitted all those pages to one or more BIOs, it could be split at any point with creating a misaligned BIO. If the pages in the BIO are not usable (as in any of the above situations), the ABD is linearised, and then checked again. This is the same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size and allocator quirks. `vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The idea is simply that it can hold all the state needed to create, submit and return multiple BIOs, including all the refcounts, the ABD copy if it was needed, and so on. Apart from what I hope is a clearer interface, the major difference is that because we know how many BIOs we'll need up front, we don't need the old overflow logic that would grow the BIO array, throw away all the old work and restart. We can get it right from the start. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `06a196020e`)	2024-03-28 13:29:46 -07:00
Rob Norris	03ff875e09	vdev_disk: make read/write IO function configurable This is just setting up for the next couple of commits, which will add a new IO function and a parameter to select it. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `c4a13ba483`)	2024-03-28 13:29:46 -07:00
Rob Norris	13b5348848	vdev_disk: reorganise vdev_disk_io_start Light reshuffle to make it a bit more linear to read and get rid of a bunch of args that aren't needed in all cases. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `867178ae1d`)	2024-03-28 13:29:46 -07:00
Rob Norris	4820185031	vdev_disk: rename existing functions to vdev_classic_* This is just renaming the existing functions we're about to replace and grouping them together to make the next commits easier to follow. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `f3b85d706b`)	2024-03-28 13:29:46 -07:00
Rob Norris	52a2af6fd1	abd: add page iterator The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds adb_iterate_page_func, which yields unmapped struct page instead. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `390b448726`)	2024-03-28 13:29:46 -07:00
Rob Norris	220bb7341e	linux 5.4 compat: page_size() Before 5.4 we have to do a little math. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `df04efe321`)	2024-03-28 13:29:46 -07:00
Rob N	58211157bf	Linux 6.8 compat: use splice_copy_file_range() for fallback Linux 6.8 removes generic_copy_file_range(), which had been reduced to a simple wrapper around splice_copy_file_range(). Detect that function directly and use it if generic_ is not available. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15930 Closes #15931 (cherry picked from commit `ef08a4d406`)	2024-03-21 09:35:17 -07:00
Tony Hutter	c883088df8	Tag zfs-2.2.3 META file and changelog updated. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2024-02-21 09:26:51 -08:00
Alexander Motin	c0c4866f8a	dmu: Allow buffer fills to fail When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one. On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance. This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not not before it, that would happen if we just call dbuf_undirty(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665	2024-02-20 15:53:02 -08:00
Tony Hutter	b62fd2cef9	ZTS: Skip cross-fs bclone tests if FreeBSD < 14.0 Skip cross filesystem block cloning tests on FreeBSD if running less than version 14.0. Cross filesystem copy_file_range() was added in FreeBSD 14. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15901	2024-02-16 09:33:26 -08:00
Tony Hutter	d92fbe2150	[zfs-2.2.3] ZTS: Use correct bclone module param name on FreeBSD The bclone module names are not prefixed with 'zfs' on FreeBSD. This was causing test failues. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2024-02-16 09:33:05 -08:00
Bi11	a4978d2605	zdb: Fix false leak report for BRT objects Fix a misreport in 'zdb -d' where it falsely marked BRT objects as leaked. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Yuxin Wang <yuxinwang9999@gmail.com> Closes #15882	2024-02-12 17:03:17 -08:00
Dex Wood	a6f6c881ff	Add Ntfy notification support to ZED This commit adds the zed_notify_ntfy() function and hooks it into zed_notify(). This will allow ZED to send notifications to ntfy.sh or a self-hosted Ntfy service, which can be received on a desktop or mobile device. It is configured with ZED_NTFY_TOPIC, ZED_NTFY_URL, and ZED_NTFY_ACCESS_TOKEN variables in zed.rc. Reviewed-by: @classabbyamp Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dex Wood <slash2314@gmail.com> Closes #15584	2024-02-12 14:32:11 -08:00
Bi11	fc3d34bd08	BRT: Fix slop space calculation with block cloning Similar to deduplication, the size of data duplicated by block cloning should not be included in the slop space calculation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Yuxin Wang <yuxinwang9999@gmail.com> Closes #15874	2024-02-12 14:04:27 -08:00
Rob N	36116b4612	zfs list: add '-t fs' and '-t vol' options (#15883 ) Because "filesystem" and "volume" are just too long! Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15864 (cherry picked from commit `a5a725440b`)	2024-02-12 14:04:27 -08:00
Tony Hutter	b699dacb4a	[zfs-2.2.3] Enable zfs_bclone_enabled on cp_files tests cp_files_002_pos uses BRT, so enable block cloning in setup/cleanup. This is only something we need to do in zfs-2.2.3, since 2.2.x ships with block cloning disabled by default. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2024-02-12 14:04:21 -08:00
the-Chain-Warden-thresh	d22bf6a9bd	LUA: Backport CVE-2020-24370's patch CVE-2020-24370 is a security vulnerability in lua. Although the CVE description in CVE-2020-24370 said that this CVE only affected lua 5.4.0, according to lua this CVE actually existed since lua 5.2. The root cause of this CVE is the negation overflow that occurs when you try to take the negative of 0x80000000. Thus, this CVE also exists in openzfs. Try to backport the fix to the lua in openzfs since the original fix is for 5.4 and several functions have been changed. https://github.com/advisories/GHSA-gfr4-c37g-mm3v https://nvd.nist.gov/vuln/detail/CVE-2020-24370 https://www.lua.org/bugs.html#5.4.0-11 https://github.com/lua/lua/commit/a585eae6e7ada1ca9271607a4f48dfb1786 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: ChenHao Lu <18302010006@fudan.edu.cn> Closes #15847	2024-02-08 15:22:16 -08:00
Cameron Harr	40e20d808c	Add 'zpool status -e' flag to see unhealthy vdevs When very large pools are present, it can be laborious to find reasons for why a pool is degraded and/or where an unhealthy vdev is. This option filters out vdevs that are ONLINE and with no errors to make it easier to see where the issues are. Root and parents of unhealthy vdevs will always be printed. Testing: ZFS errors and drive failures for multiple vdevs were simulated with zinject. Sample vdev listings with '-e' option - All vdevs healthy NAME STATE READ WRITE CKSUM iron5 ONLINE 0 0 0 - ZFS errors NAME STATE READ WRITE CKSUM iron5 ONLINE 0 0 0 raidz2-5 ONLINE 1 0 0 L23 ONLINE 1 0 0 L24 ONLINE 1 0 0 L37 ONLINE 1 0 0 - Vdev faulted NAME STATE READ WRITE CKSUM iron5 DEGRADED 0 0 0 raidz2-6 DEGRADED 0 0 0 L67 FAULTED 0 0 0 too many errors - Vdev faults and data errors NAME STATE READ WRITE CKSUM iron5 DEGRADED 0 0 0 raidz2-1 DEGRADED 0 0 0 L2 FAULTED 0 0 0 too many errors raidz2-5 ONLINE 1 0 0 L23 ONLINE 1 0 0 L24 ONLINE 1 0 0 L37 ONLINE 1 0 0 raidz2-6 DEGRADED 0 0 0 L67 FAULTED 0 0 0 too many errors - Vdev missing NAME STATE READ WRITE CKSUM iron5 DEGRADED 0 0 0 raidz2-6 DEGRADED 0 0 0 L67 UNAVAIL 3 1 0 - Slow devices when -s provided with -e NAME STATE READ WRITE CKSUM SLOW iron5 DEGRADED 0 0 0 - raidz2-5 DEGRADED 0 0 0 - L10 FAULTED 0 0 0 0 external device fault L51 ONLINE 0 0 0 14 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Cameron Harr <harr1@llnl.gov> Closes #15769	2024-02-08 15:22:16 -08:00
Mauricio Faria de Oliveira	9bb8d26bd5	zed: fix typo in variable ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT Replace ENCLO_US_RE with ENCLO_SU_RE in the name of the variable. Note this changes the user-visible string in zed.rc, thus might break current users with the wrong string, but it's ~2 months since zfs-2.2.0 tag is out, thus should not be widespread yet. Mechanical change: $ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT cmd/zed/zed.d/zed.rc cmd/zed/zed.d/statechange-slot_off.sh $ sed -i 's/ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT/<linebreak> ZED_POWER_OFF_ENCLOSURE_SLOT_ON_FAULT/g' \ cmd/zed/zed.d/zed.rc \ cmd/zed/zed.d/statechange-slot_off.sh $ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT $ Fixes `11fbcacf37` ("zed: Add zedlet to power off slot when drive is faulted") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Closes #15651	2024-02-08 15:22:16 -08:00
Umer Saleem	08fd5ccc38	Improve performance for zpool trim on linux On Linux, ZFS uses blkdev_issue_discard in vdev_disk_io_trim to issue trim command which is synchronous. This commit updates vdev_disk_io_trim to use __blkdev_issue_discard, which is asynchronous. Unfortunately there isn't any asynchronous version for blkdev_issue_secure_erase, so performance of secure trim will still suffer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #15843	2024-02-06 12:58:55 -08:00

1 2 3 4 5 ...

8995 Commits