mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2025-07-03 06:27:35 +03:00

Author	SHA1	Message	Date
c1ick	ec580bc520	zfs: add bounds checking to zil_parse (#16308 ) Make sure log record don't stray beyond valid memory region. There is a lack of verification of the space occupied by fixed members of lr_t in the zil_parse. We can create a crafted image to trigger an out of bounds read by following these steps: 1) Do some file operations and reboot to simulate abnormal exit without umount 2) zil_chain.zc_nused: 0x1000 3) First lr_t lr_t.lrc_txtype: 0x0 lr_t.lrc_reclen: 0x1000-0xb8-0x1 lr_t.lrc_txg: 0x0 lr_t.lrc_seq: 0x1 4) Update checksum in zil_chain.zc_eck Fix: Add some checks to make sure the remaining bytes are large enough to hold an log record. Signed-off-by: XDTG <click1799@163.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-31 17:17:04 -07:00
Rob Norris	7ddc1f737f	zil: add stats for commit failure/fallback (#16315 ) There's no good way to tell when a ZIL commit fails and falls back to a transaction sync, other than perhaps a throughput drop. This adds counters so we can see when it happens and why. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-25 16:53:59 -07:00
Rob Norris	c9c838aa1f	zio: remove io_cmd and DKIOCFLUSHWRITECACHE There's no other options, so we can just always assume its a flush. Includes some light refactoring where a switch statement was doing control flow that no longer works. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16064	2024-04-11 17:17:11 -07:00
George Wilson	493fcce9be	Provide macros for setting and getting blkptr birth times There exist a couple of macros that are used to update the blkptr birth times but they can often be confusing. For example, the BP_PHYSICAL_BIRTH() macro will provide either the physical birth time if it is set or else return back the logical birth time. The complement to this macro is BP_SET_BIRTH() which will set the logical birth time and set the physical birth time if they are not the same. Consumers may get confused when they are trying to get the physical birth time and use the BP_PHYSICAL_BIRTH() macro only to find out that the logical birth time is what is actually returned. This change cleans up these macros and makes them symmetrical. The same functionally is preserved but the name is changed. Instead of calling BP_PHYSICAL_BIRTH(), consumer can now call BP_GET_BIRTH(). In additional to cleaning up this naming conventions, two new sets of macros are introduced -- BP_[SET\|GET]_LOGICAL_BIRTH() and BP_[SET\|GET]_PHYSICAL_BIRTH. These new macros allow the consumer to get and set the specific birth time. As part of the cleanup, the unused GRID macros have been removed and that portion of the blkptr are currently unused. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #15962	2024-03-25 15:01:54 -07:00
Alexander Motin	eff77a802d	ZIL: Improve next log block size prediction Track history in context of bursts, not individual log blocks. It allows to not blow away all the history by single large burst of many block, and same time allows optimizations covering multiple blocks in a burst and even predicted following burst. For each burst account its optimal block size and minimal first block size. Use that statistics from the last 8 bursts to predict first block size of the next burst. Remove predefined set of block sizes. Allocate any size we see fit, multiple of 4KB, as required by ZIL now. With compression enabled by default, ZFS already writes pretty random block sizes, so this should not surprise space allocator any more. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15635	2023-12-21 10:54:44 -08:00
Alexander Motin	2aa3a482ab	ZIL: Remove 128K into 2x68K LWB split optimization To improve 128KB block write performance in case of multiple VDEVs ZIL used to spit those writes into two 64KB ones. Unfortunately it was found to cause LWB buffer overflow, trying to write maximum- sizes 128KB TX_CLONE_RANGE record with 1022 block pointers into 68KB buffer, since unlike TX_WRITE ZIL code can't split it. This is a minimally-invasive temporary block cloning fix until the following more invasive prediction code refactoring. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15634	2023-12-06 15:02:05 -08:00
Alexander Motin	55b764e062	ZIL: Do not clone blocks from the future ZIL claim can not handle block pointers cloned from the future, since they are not yet allocated at that point. It may happen either if the block was just written when it was cloned, or if the pool was frozen or somehow else rewound on import. Handle it from two sides: prevent cloning of blocks with physical birth time from not yet synced or frozen TXG, and abort ZIL claim if we still detect such blocks due to rewind or something else. While there, assert that any cloned blocks we claim are really allocated by calling metaslab_check_free(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15617	2023-12-05 10:58:11 -08:00
Alexander Motin	2a27fd4111	ZIL: Assert record sizes in different places This should make sure we have log written without overflows. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15517	2023-11-28 13:35:14 -08:00
Alexander Motin	5a3bffab10	ZIO: Optimize zio_flush() - Generalize vdev_nowritecache handling by traversing through the VDEV tree and skipping children ZIOs where not supported. - Remove intermediate zio_null() in case of several VDEV children. - Remove children handling from zio_ioctl(). There are no other use cases for this code beside DKIOCFLUSHWRITECACHED, and would there be, I doubt they would so straightforward apply to all VDEV children. Comparing to removed previous optimization this should improve cases of redundant ZILs/SLOGs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15515	2023-11-17 14:00:59 -08:00
Alexander Motin	3afdc97d91	ZIO: Remove READY pipeline stage from root ZIOs zio_root() has no arguments for ready callback or parent ZIO. Except one recent case in ZIL code if root ZIOs ever have a parent it is also a root ZIO. It means we do not need READY pipeline stage for them, which takes some time to process, but even more time to wait for the children and be woken by them, and both for no good reason. The most visible effect of this change is that it avoids one taskq wakeup per ZIL block written, previously used to run zio_ready() for lwb_root_zio and skipped now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15398	2023-10-25 15:22:25 -07:00
Alexander Motin	252f46be7d	ZIL: Detect single-threaded workloads ... by checking that previous block is fully written and flushed. It allows to skip commit delays since we can give up on aggregation in that case. This removes zil_min_commit_timeout parameter, since for single-threaded workloads it is not needed at all, while on very fast devices even some multi-threaded workloads may get detected as single-threaded and still bypass the wait. To give multi-threaded workloads more aggregation chances increase zfs_commit_timeout_pct from 5 to 10%, as they should suffer less from additional latency. Also single-threaded workloads detection allows in perspective better prediction of the next block size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15381	2023-10-24 14:35:25 -07:00
John Wren Kennedy	c0e58995e3	Large sync writes perform worse with slog For synchronous write workloads with large IO sizes, a pool configured with a slog performs worse than one with an embedded zil: sequential_writes 1m sync ios, 16 threads Write IOPS: 1292 438 -66.10% Write Bandwidth: 1323570 448910 -66.08% Write Latency: 12128400 36330970 3.0x sequential_writes 1m sync ios, 32 threads Write IOPS: 1293 430 -66.74% Write Bandwidth: 1324184 441188 -66.68% Write Latency: 24486278 74028536 3.0x The reason is the `zil_slog_bulk` variable. In `zil_lwb_write_open`, if a zil block is greater than 768K, the priority of the write is downgraded from sync to async. Increasing the value allows greater throughput. To select a value for this PR, I ran an fio workload with the following values for `zil_slog_bulk`: zil_slog_bulk KiB/s 1048576 422132 2097152 478935 4194304 533645 8388608 623031 12582912 827158 16777216 1038359 25165824 1142210 33554432 1211472 50331648 1292847 67108864 1308506 100663296 1306821 134217728 1304998 At 64M, the results with a slog are now improved to parity with an embedded zil: sequential_writes 1m sync ios, 16 threads Write IOPS: 438 1288 2.9x Write Bandwidth: 448910 1319062 2.9x Write Latency: 36330970 12163408 -66.52% sequential_writes 1m sync ios, 32 threads Write IOPS: 430 1290 3.0x Write Bandwidth: 441188 1321693 3.0x Write Latency: 74028536 24519698 -66.88% None of the other tests in the performance suite (run with a zil or slog) had a significant change, including the random_write_zil tests, which use multiple datasets. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com> Closes #14378	2023-10-13 11:15:09 -07:00
Alexander Motin	66b81b3497	ZIL: Reduce maximum size of WR_COPIED to 7.5K Benchmarks show that at certain write sizes range lock/unlock take not so much time as extra memory copy. The exact threshold is not obvious due to other overheads, but it is definitely lower than ~63KB used before. Make it configurable, defaulting at 7.5KB, that is 8KB of nearest malloc() size minus itx and lr structs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15353	2023-10-06 10:09:27 -07:00
Alexander Motin	90149552b1	ZIL: Fix potential race on flush deferring. zil_lwb_set_zio_dependency() can not set write ZIO dependency on previous LWB's write ZIO if one is already in done handler and set state to LWB_STATE_WRITE_DONE. So theoretically done handler of next LWB's write ZIO may run before done handler of previous LWB write ZIO completes. In such case we can not defer flushes, since the flush issue process is not locked. This may fix some reported assertions of lwb_vdev_tree not being empty inside zil_free_lwb(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15278	2023-09-20 11:17:11 -07:00
Alexander Motin	9da6b60417	ZIL: Change ZIOs issue order. In zil_lwb_write_issue(), after issuing lwb_root_zio/lwb_write_zio, we have no right to access lwb->lwb_child_zio. If it was not there, the first two ZIOs may have already completed and freed the lwb. ZIOs issue in opposite order from children to parent should keep the lwb valid till the end, since the lwb can be freed only after lwb_root_zio completion callback. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15233	2023-09-01 17:14:50 -07:00
Alexander Motin	b1b99e10a6	ZIL: Revert zl_lock scope reduction. While I have no reports of it, I suspect possible use-after-free scenario when zil_commit_waiter() tries to dereference zcw_lwb for lwb already freed by zil_sync(), while zcw_done is not set. Extension of zl_lock scope as it was originally should block zil_sync() from freeing the lwb, closing this race. This reverts #14959 and couple chunks of #14841. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15228	2023-09-01 17:13:52 -07:00
Alexander Motin	bbcf18c293	ZIL: Tune some assertions. In zil_free_lwb() we should first assert lwb_state or the rest of assertions can be misleading if it is false. Add lwb_state assertions in zil_lwb_add_block() to make sure we are not trying to add elements to lwb_vdev_tree after it was processed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15227	2023-09-01 17:13:22 -07:00
Alexander Motin	eda3fcd56f	ZIL: Second attempt to reduce scope of zl_issuer_lock. The previous patch #14841 appeared to have significant flaw, causing deadlocks if zl_get_data callback got blocked waiting for TXG sync. I already handled some of such cases in the original patch, but issue #14982 shown cases that were impossible to solve in that design. This patch fixes the problem by postponing log blocks allocation till the very end, just before the zios issue, leaving nothing blocking after that point to cause deadlocks. Before that point though any sleeps are now allowed, not causing sync thread blockage. This require slightly more complicated lwb state machine to allocate blocks and issue zios in proper order. But with removal of special early issue workarounds the new code is much cleaner now, and should even be more efficient. Since this patch uses null zios between write, I've found that null zios do not wait for logical children ready status in zio_ready(), that makes parent write to proceed prematurely, producing incorrect log blocks. Added ZIO_CHILD_LOGICAL_BIT to zio_wait_for_children() fixes it. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15122	2023-08-24 17:08:49 -07:00
Alexander Motin	8e20e0ff39	ZIL: Replay blocks without next block pointer. If we get next block allocation error during log write, we trigger transaction commit. But the block we have just completed is still written and transactions it covers will be acknowledged normally. If after that we ignore the block during replay just because it is the last in the chain, we may not replay some transactions that we have acknowledged as synced, that is not right. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15132	2023-08-11 09:04:44 -07:00
Alexander Motin	b22bab2547	Remove fastwrite mechanism. Fastwrite was introduced many years ago to improve ZIL writes spread between multiple top-level vdevs by tracking number of allocated but not written blocks and choosing vdev with smaller count. It suposed to reduce ZIL knowledge about allocation, but actually made ZIL to even more actively report allocation code about the allocations, complicating both ZIL and metaslabs code. On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync() was lost many years ago, that was one of the declared benefits. Plus introduction of embedded log metaslab class solved another problem with allocation rotor accounting both normal and log allocations, since in most cases those are now in different metaslab classes. After all that, I'd prefer to simplify already too complicated ZIL, ZIO and metaslab code if the benefit of complexity is not obvious. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15107	2023-07-28 13:30:33 -07:00
Alexander Motin	2848de11e5	Remove zl_issuer_lock from zil_suspend(). This locking was recently added as part of #14979. But appears it is illegal to take zl_issuer_lock while holding dp_config_rwlock, taken by dsl_pool_hold(). It causes deadlock with sync thread in spa_sync_upgrades(). On a second thought, we should not need this locking, since zil_commit_impl() we call below takes zl_issuer_lock, that should sufficiently protect zl_suspend reads, combined with other logic from #14979. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15103	2023-07-25 09:08:36 -07:00
Alexander Motin	2cb992a99c	ZIL: Fix config lock deadlock. When we have some LWBs closed and their ZIOs ready to be issued, we can not afford sleeping on config lock if somebody else try to lock it as writer, or it will cause a deadlock. To solve it, move spa_config_enter() from zil_lwb_write_issue() to zil_lwb_write_close() under zl_issuer_lock to enforce lock ordering with other threads. Now if we can't immediately lock config, issue all previously closed LWBs so that they could drop their config locks after completion, and only then allow sleeping on our lock. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15078 Closes #15080	2023-07-24 13:41:11 -07:00
Alexander Motin	233425a153	Again fix race between zil_commit() and zil_suspend(). With zl_suspend read in zil_commit() not protected by any locks it is possible for new ZIL writes to be in progress while zil_destroy() called by zil_suspend() freeing them. This patch closes the race by taking zl_issuer_lock in zil_suspend() and adding the second zl_suspend check to zil_get_commit_list(), protected by the lock. It allows all already queued transactions to be logged normally, while blocks any new ones, calling txg_wait_synced() for the TXGs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14979	2023-06-30 08:59:39 -07:00
Alexander Motin	a9d6b0690b	ZIL: Fix another use-after-free. lwb->lwb_issued_txg can not be accessed after lwb_state is set to LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be freed by zil_sync(). We must save the txg number before that. This is similar to the `55b1842f92`, but as I see the bug is not new. It existed for quite a while, just was not triggered due to smaller race window. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14988 Closes #14999	2023-06-27 17:03:37 -07:00
Alexander Motin	8e8acabdca	Fix memory leak in zil_parse(). `482da24e2` missed arc_buf_destroy() calls on log parse errors, possibly leaking up to 128KB of memory per dataset during ZIL replay. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14987	2023-06-17 19:51:37 -07:00
Alexander Motin	55b1842f92	ZIL: Fix race introduced by `f63811f072`. We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE state and dropping zl_lock, since it may be freed by zil_sync(). To free itxs and waiters after dropping the lock we need to move lwb_itxs and lwb_waiters lists elements to local storage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14957 Closes #14959	2023-06-09 10:08:05 -07:00
Alexander Motin	482da24e20	ZIL: Allow to replay blocks of any size. There seems to be no reason for ZIL blocks to be limited by 128KB other than replay code is written in such a way. This change does not increase the limit yet, just removes the artificial limitation. Avoided extra memcpy() may save us a second during replay. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14910	2023-06-02 11:01:58 -07:00
Alexander Motin	b6fbe61fa6	zil: Add some more statistics. In addition to a number of actual log bytes written, account also a total written bytes including padding and total allocated bytes (bytes <= write <= alloc). It should allow to monitor zil traffic and space efficiency. Add dtrace probe for zil block size selection. Make zilstat report more information and fit it into less width. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14863	2023-05-25 13:51:53 -07:00
Alexander Motin	f63811f072	ZIL: Reduce scope of per-dataset zl_issuer_lock. Before this change ZIL copied all log data while holding the lock. It caused huge lock contention on workloads with many big parallel writes. This change splits the process into two parts: first, zil_lwb_assign() estimates the log space needed for all transactions, and zil_lwb_write_close() allocates blocks and zios while holding the lock, then, after the lock in dropped, zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues the I/Os. Also while there slightly reduce scope of zl_lock. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14841	2023-05-25 09:48:43 -07:00
Alexander Motin	7381ddf1ab	zil: Free lwb_buf after write completion. There is no sense to keep that memory allocated during the flush. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14855	2023-05-12 09:49:26 -07:00
Alexander Motin	895e03135e	zil: Some micro-optimizations. Should not cause functional changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14854	2023-05-12 09:14:29 -07:00
Alexander Motin	469019fb0b	zil: Don't expect zio_shrink() to succeed. At least for RAIDZ zio_shrink() does not reduce zio size, but reduced wsz in that case likely results in writing uninitialized memory. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14853	2023-05-11 14:27:12 -07:00
Alexander Motin	2fd1c30423	Mark TX_COMMIT transaction with TXG_NOTHROTTLE. TX_COMMIT has no on-disk representation and does not produce any more dirty data. It should not wait for anything, and even just skipping the checks if not waiting gives improvement noticeable in profiler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14798	2023-04-27 12:32:58 -07:00
Brian Behlendorf	0e8a42bbee	Revert "Fix data race between zil_commit() and zil_suspend()" This reverts commit `4c856fb333` to resolve a newly introduced deadlock which in practice in more disruptive that the issue this commit intended to address. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14775 Closes #14790	2023-04-25 16:40:55 -07:00
George Wilson	a604d3243b	Revert "Do not hold spa_config in ZIL while blocked on IO" This reverts commit `7d638df09b`. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #14678	2023-03-28 08:13:32 -07:00
Pawel Jakub Dawidek	67a1b03791	Implementation of block cloning for ZFS Block Cloning allows to manually clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13392	2023-03-10 11:59:53 -08:00
Richard Yao	7d638df09b	Do not hold spa_config in ZIL while blocked on IO Otherwise, we can get a deadlock that looks like this: 1. fsync() grabs spa_config_enter(zilog->zl_spa, SCL_STATE, lwb, RW_READER) as part of zil_lwb_write_issue() . It then blocks on the txg_sync when a flush fails from a drive power cycling. 2. The txg_sync then blocks on the pool suspending due to the loss of too many disks. 3. zpool clear then blocks on spa_config_enter(spa, SCL_STATE \| SCL_L2ARC \| SCL_ZIO, spa, RW_WRITER) because it is a writer. The disks cannot be brought online due to fsync() holding that lock and the user gets upset since fsync() is uninterruptibly blocked inside the kernel. We need to grab the lock for vdev_lookup_top(), but we do not need to hold it while there is outstanding IO. This fixes a regression introduced by `1ce23dcaff`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Closes #14519	2023-03-07 16:12:28 -08:00
Richard Yao	4c856fb333	Fix data race between zil_commit() and zil_suspend() openzfsonwindows/openzfs#206 found that it is possible to trip `VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed by the scheduler long enough for a parallel `zil_suspend()` operation to exit `zil_commit_impl()`. This is a data race. To prevent this, we introduce a `zilog->zl_suspend_lock` rwlock to ensure that all outstanding `zil_commit()` operations finish before `zil_suspend()` begins and that subsequent operations fallback to `txg_wait_synced()` after `zil_suspend()` has begun. On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers from writer starvation. This means that a ZIL intensive system can delay `zil_suspend()` indefinitely. This is a pre-existing problem that affects everything that uses rw locks, so it needs to be addressed in the SPL. However, builds against `PREEMPT_RT` Linux kernels are currently broken due to a GPL symbol issue (#11097), so we can safely disregard that issue for now. Reported-by: Arun KV <arun.kv@datacore.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14514	2023-03-01 13:23:09 -08:00
Richard Yao	3a7c35119e	Handle unexpected errors in zil_lwb_commit() without ASSERT() We tripped `ASSERT(error == ENOENT \|\| error == EEXIST \|\| error == EALREADY)` in `zil_lwb_commit()` at Klara when doing robustness testing of ZIL against drive power cycles. That assertion presumably exists because when this code was written, the only errors expected from here were EIO, ENOENT, EEXIST and EALREADY, with EIO having its own handling before the assertion. However, upon doing a manual depth first search traversal of the source tree, it turns out that a large number of unexpected errors are possible here. In theory, EINVAL and ENOSPC can come from dnode_hold_impl(). However, most unexpected errors originate in the block layer and come to us from zio_wait() in various ways. One way is ->zl_get_data() -> dmu_buf_hold() -> dbuf_read() -> zio_wait(). From vdev_disk.c on Linux alone, zio_wait() can return the unexpected errors ENXIO, ENOTSUP, EOPNOTSUPP, ETIMEDOUT, ENOSPC, ENOLINK, EREMOTEIO, EBADE, ENODATA, EILSEQ and ENOMEM This was only observed after what have been likely over 1000 test iterations, so we do not expect to reproduce this again to find out what the error code was. However, circumstantial evidence suggests that the error was ENXIO. When ENXIO or any other unexpected error occurs, the `fsync()` or equivalent operation that called zil_commit() will return success, when in fact, dirty data has not been committed to stable storage. This is a violation of the Single UNIX Specification. The code should be able to handle this and any other unknown error by calling `txg_wait_synced()`. In addition to changing the code to call txg_wait_synced() on unexpected errors instead of returning, we modify it to print information about unexpected errors to dmesg. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Closes #14532	2023-03-01 09:39:41 -08:00
Alexander Motin	0f740a4f1d	Introduce minimal ZIL block commit delay Despite all optimizations, tests on actual hardware show that FreeBSD kernel can't sleep for less then ~2us. Similar tests on Linux show ~50us delay at least from nanosleep() (haven't tested inside kernel). It means that on very fast log device ZIL may not be able to satisfy zfs_commit_timeout_pct block commit timeout, increasing log latency more than desired. Handle that by introduction of zil_min_commit_timeout parameter, specifying minimal timeout value where additional delays to aggregate writes may be skipped. Also skip delays if the LWB is more than 7/8 full, that often happens if I/O sizes are constant and match one of LWB sizes. Both things are applied only if there were no already outstanding log blocks, that may indicate single-threaded workload, that by definition can not benefit from the commit delays. While there, add short time moving average to zl_last_lwb_latency to make it more stable. Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS increase by 9% instead of expected 5%. For zfs_commit_timeout_pct of 1 there IOPS increase by 5.5% instead of expected 1%. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14418	2023-01-24 09:20:32 -08:00
Alan Somers	e197bb24f1	Optionally skip zil_close during zvol_create_minor_impl If there were no zil entries to replay, skip zil_close. zil_close waits for a transaction to sync. That can take several seconds, for example during pool import of a resilvering pool. Skipping zil_close can cut the time for "zpool import" from 2 hours to 45 seconds on a resilvering pool with a thousand zvols. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Sponsored-by: Axcient Closes #13999 Closes #14015	2022-11-08 12:38:08 -08:00
Ryan Moeller	748b9d5bda	zil: Relax assertion in zil_parse Rather than panic debug builds when we fail to parse a whole ZIL, let's instead improve the logging of errors and continue like in a release build. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #14116	2022-11-01 12:19:32 -07:00
Aleksa Sarai	dbf6108b4d	zfs_rename: support RENAME_* flags Implement support for Linux's RENAME_* flags (for renameat2). Aside from being quite useful for userspace (providing race-free ways to exchange paths and implement mv --no-clobber), they are used by overlayfs and are thus required in order to use overlayfs-on-ZFS. In order for us to represent the new renameat2(2) flags in the ZIL, we create two new transaction types for the two flags which need transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT). RENAME_NOREPLACE does not need any ZIL support because we know that if the operation succeeded before creating the ZIL entry, there was no file to be clobbered and thus it can be treated as a regular TX_RENAME. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Closes #12209 Closes #14070	2022-10-28 09:49:20 -07:00
Richard Yao	4938d01db7	Convert enum zio_flag to uint64_t We ran out of space in enum zio_flag for additional flags. Rather than introduce enum zio_flag2 and then modify a bunch of functions to take a second flags variable, we expand the type to 64 bits via `typedef uint64_t zio_flag_t`. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Richard Yao <richard.yao@klarasystems.com> Closes #14086	2022-10-27 09:54:54 -07:00
Richard Yao	ab8d9c1783	Cleanup: 64-bit kernel module parameters should use fixed width types Various module parameters such as `zfs_arc_max` were originally `uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long` for Linux compatibility because Linux's kernel default module parameter implementation did not support 64-bit types on 32-bit platforms. This caused problems when porting OpenZFS to Windows because its LLP64 memory model made `unsigned long` a 32-bit type on 64-bit, which created the undesireable situation that parameters that should accept 64-bit values could not on 64-bit Windows. Upon inspection, it turns out that the Linux kernel module parameter interface is extensible, such that we are allowed to define our own types. Rather than maintaining the original type change via hacks to to continue shrinking module parameters on 32-bit Linux, we implement support for 64-bit module parameters on Linux. After doing a review of all 64-bit kernel parameters (found via the man page and also proposed changes by Andrew Innes), the kernel module parameters fell into a few groups: Parameters that were originally 64-bit on Illumos: * dbuf_cache_max_bytes * dbuf_metadata_cache_max_bytes * l2arc_feed_min_ms * l2arc_feed_secs * l2arc_headroom * l2arc_headroom_boost * l2arc_write_boost * l2arc_write_max * metaslab_aliquot * metaslab_force_ganging * zfetch_array_rd_sz * zfs_arc_max * zfs_arc_meta_limit * zfs_arc_meta_min * zfs_arc_min * zfs_async_block_max_blocks * zfs_condense_max_obsolete_bytes * zfs_condense_min_mapping_bytes * zfs_deadman_checktime_ms * zfs_deadman_synctime_ms * zfs_initialize_chunk_size * zfs_initialize_value * zfs_lua_max_instrlimit * zfs_lua_max_memlimit * zil_slog_bulk Parameters that were originally 32-bit on Illumos: * zfs_per_txg_dirty_frees_percent Parameters that were originally `ssize_t` on Illumos: * zfs_immediate_write_sz Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It has been upgraded to 64-bit. Parameters that were `long`/`unsigned long` because of Linux/FreeBSD influence: * l2arc_rebuild_blocks_min_l2size * zfs_key_max_salt_uses * zfs_max_log_walking * zfs_max_logsm_summary_length * zfs_metaslab_max_size_cache_sec * zfs_min_metaslabs_to_flush * zfs_multihost_interval * zfs_unflushed_log_block_max * zfs_unflushed_log_block_min * zfs_unflushed_log_block_pct * zfs_unflushed_max_mem_amt * zfs_unflushed_max_mem_ppm New parameters that do not exist in Illumos: * l2arc_trim_ahead * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_arc_sys_free * zfs_deadman_ziotime_ms * zfs_delete_blocks * zfs_history_output_max * zfs_livelist_max_entries * zfs_max_async_dedup_frees * zfs_max_nvlist_src_size * zfs_rebuild_max_segment * zfs_rebuild_vdev_limit * zfs_unflushed_log_txg_max * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift * zfs_vnops_read_chunk_size * zvol_max_discard_blocks Rather than clutter the lists with commentary, the module parameters that need comments are repeated below. A few parameters were defined in Linux/FreeBSD specific code, where the use of ulong/long is not an issue for portability, so we leave them alone: * zfs_delete_blocks * zfs_key_max_salt_uses * zvol_max_discard_blocks The documentation for a few parameters was found to be incorrect: * zfs_deadman_checktime_ms - incorrectly documented as int * zfs_delete_blocks - not documented as Linux only * zfs_history_output_max - incorrectly documented as int * zfs_vnops_read_chunk_size - incorrectly documented as long * zvol_max_discard_blocks - incorrectly documented as ulong The documentation for these has been fixed, alongside the changes to document the switch to fixed width types. In addition, several kernel module parameters were percentages or held ashift values, so being 64-bit never made sense for them. They have been downgraded to 32-bit: * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_per_txg_dirty_frees_percent * zfs_unflushed_log_block_pct * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift Of special note are `zfs_vdev_max_auto_ashift` and `zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`, and passed to the kernel as `ulong`. This is inherently buggy on big endian 32-bit Linux, since the values would not be written to the correct locations. 32-bit FreeBSD was unaffected because its sysctl code correctly treated this as a `uint64_t`. Lastly, a code comment suggests that `zfs_arc_sys_free` is Linux-specific, but there is nothing to indicate to me that it is Linux-specific. Nothing was done about that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Original-patch-by: Andrew Innes <andrew.c12@gmail.com> Original-patch-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13984 Closes #14004	2022-10-13 10:03:29 -07:00
Richard Yao	a6ccb36b94	Add defensive assertions Coverity complains about possible bugs involving referencing NULL return values and division by zero. The division by zero bugs require that a block pointer be corrupt, either from in-memory corruption, or on-disk corruption. The NULL return value complaints are only bugs if assumptions that we make about the state of data structures are wrong. Some seem impossible to be wrong and thus are false positives, while others are hard to analyze. Rather than dismiss these as false positives by assuming we know better, we add defensive assertions to let us know when our assumptions are wrong. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13972	2022-10-12 11:25:18 -07:00
Richard Yao	fdc2d30371	Cleanup: Specify unsignedness on things that should not be signed In #13871, zfs_vdev_aggregation_limit_non_rotating and zfs_vdev_aggregation_limit being signed was pointed out as a possible reason not to eliminate an unnecessary MAX(unsigned, 0) since the unsigned value was assigned from them. There is no reason for these module parameters to be signed and upon inspection, it was found that there are a number of other module parameters that are signed, but should not be, so we make them unsigned. Making them unsigned made it clear that some other variables in the code should also be unsigned, so we also make those unsigned. This prevents users from setting negative values that could potentially cause bad behaviors. It also makes the code slightly easier to understand. Mostly module parameters that deal with timeouts, limits, bitshifts and percentages are made unsigned by this. Any that are boolean are left signed, since whether booleans should be considered signed or unsigned does not matter. Making zfs_arc_lotsfree_percent unsigned caused a `zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was removed. Removing the check was also necessary to prevent a compiler error from -Werror=type-limits. Several end of line comments had to be moved to their own lines because replacing int with uint_t caused us to exceed the 80 character limit enforced by cstyle.pl. The following were kept signed because they are passed to taskq_create(), which expects signed values and modifying the OpenSolaris/Illumos DDI is out of scope of this patch: * metaslab_load_pct * zfs_sync_taskq_batch_pct * zfs_zil_clean_taskq_nthr_pct * zfs_zil_clean_taskq_minalloc * zfs_zil_clean_taskq_maxalloc * zfs_arc_prune_task_threads Also, negative values in those parameters was found to be harmless. The following were left signed because either negative values make sense, or more analysis was needed to determine whether negative values should be disallowed: * zfs_metaslab_switch_threshold * zfs_pd_bytes_max * zfs_livelist_min_percent_shared zfs_multihost_history was made static to be consistent with other parameters. A number of module parameters were marked as signed, but in reality referenced unsigned variables. upgrade_errlog_limit is one of the numerous examples. In the case of zfs_vdev_async_read_max_active, it was already uint32_t, but zdb had an extern int declaration for it. Interestingly, the documentation in zfs.4 was right for upgrade_errlog_limit despite the module parameter being wrongly marked, while the documentation for zfs_vdev_async_read_max_active (and friends) was wrong. It was also wrong for zstd_abort_size, which was unsigned, but was documented as signed. Also, the documentation in zfs.4 incorrectly described the following parameters as ulong when they were int: * zfs_arc_meta_adjust_restarts * zfs_override_estimate_recordsize They are now uint_t as of this patch and thus the man page has been updated to describe them as uint. dbuf_state_index was left alone since it does nothing and perhaps should be removed in another patch. If any module parameters were missed, they were not found by `grep -r 'ZFS_MODULE_PARAM' \| grep ', INT'`. I did find a few that grep missed, but only because they were in files that had hits. This patch intentionally did not attempt to address whether some of these module parameters should be elevated to 64-bit parameters, because the length of a long on 32-bit is 32-bit. Lastly, it was pointed out during review that uint_t is a better match for these variables than uint32_t because FreeBSD kernel parameter definitions are designed for uint_t, whose bit width can change in future memory models. As a result, we change the existing parameters that are uint32_t to use uint_t. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13875	2022-09-27 16:42:41 -07:00
ixhamza	fb087146de	Add support for per dataset zil stats and use wmsum counters ZIL kstats are reported in an inclusive way, i.e., same counters are shared to capture all the activities happening in zil. Added support to report zil stats for every datset individually by combining them with already exposed dataset kstats. Wmsum uses per cpu counters and provide less overhead as compared to atomic operations. Updated zil kstats to replace wmsum counters to avoid atomic operations. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13636	2022-07-20 17:14:06 -07:00
Tino Reichardt	1d3ba0bf01	Replace dead opensolaris.org license link The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13619	2022-07-11 14:16:13 -07:00
наб	a926aab902	Enable -Wwrite-strings Also, fix leak from ztest_global_vars_to_zdb_args() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-06-29 14:08:54 -07:00

1 2 3 4

158 Commits