mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-01-25 10:12:13 +03:00

Author	SHA1	Message	Date
Alexander Motin	5847626175	Pass flags to more DMU write/hold functions Over time, many DMU functions gained a flags argument to control prefetch, caching, etc. A few functions, though, were left without it, even though a closer look showed that many of them do not require prefetch due to their access pattern. This patch adds the flags argument to dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(), passing DMU_READ_NO_PREFETCH where applicable. I am going to also pass DMU_UNCACHEDIO to some of them later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17872	2025-11-12 13:04:58 -08:00
Shreshth3	b0106a1b74	zdb: fix bug with -A flag Fixes #10544. According to the manpage, zdb -A should ignore all assertions. But it currently does not do that. This commit fixes this bug. Signed-off-by: Shreshth Srivastava <shreshthsrivastava2@gmail.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17825	2025-10-21 09:50:43 -07:00
Alexander Motin	b9356f06ed	Explicit set ashift for non-leaf vdevs Before this change, the ashift property was applied only to leaf vdevs. As a result, it worked only as a minimal value for parent vdevs, since a bigger physical_ashift value reported by any child could be used instead when deciding the parent's ashift, as if the ashift property was never set. This change explicitly passes ZPOOL_CONFIG_ASHIFT to all vdevs, allowing override for parents only if the passed value is below logical_ashift and so unacceptable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17826	2025-10-21 09:50:43 -07:00
Ivan Shapovalov	cf9163f250	zdb: adjust block histogram binning strategy Previously, a bin included all blocks _starting_ from given size (e.g., a "4K" bin would include all blocks within the [4K; 8K) region). This is counter-intuitive and does not match the typical use-case of the block histogram (that is, to estimate disk usage considering how ZFS' block allocation works). In other words, if I'm looking at the "4K" row, I'm interested in records that _fit into_ a 4K block. Adjust the binning strategy such that a bin includes all blocks _up to_ given size, such that e.g. a "4K" bin would include all blocks within the (2K; 4K] region. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #16999	2025-10-21 09:50:43 -07:00
Ivan Shapovalov	250e2ec229	zdb: factor out block histogram bin number computation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #16999	2025-10-21 09:50:43 -07:00
Ivan Shapovalov	968cfc3df2	zdb: add `--class=(normal\|special\|...)` to filter blocks by alloc class When counting blocks to generate block size histograms (`-bb`), accept a `--class=` argument (as a comma-separated list of either "normal", "special", "dedup" or "other") to only consider blocks that belong to these metaslab classes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #16999	2025-10-21 09:50:43 -07:00
Ivan Shapovalov	627b530059	zdb: add `--bin=(lsize\|psize\|asize)` arg to control histogram binning When counting blocks to generate block size histograms (`-bb`), accept a `--bin=` argument to force placing blocks into all three bins based on this size. E.g. with `--bin=lsize`, a block with lsize=512K, psize=128K, asize=256K will be placed into the "512K" bin in all three output columns. This way, by looking at the "512K" row the user will be able to determine how well was ZFS able to compress blocks of this logical size. Conversely, with `--bin=psize`, by looking at the "128K" row the user will be able to determine how much overhead was incurred for storage of blocks of this physical size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #16999	2025-10-21 09:50:43 -07:00
Ivan Shapovalov	6809137db5	zdb: convert `ALLOCATED_OPT` into anonymous enum We are adding more long-only options, so use an enum for all of them to avoid manually numbering these constants. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name> Closes #16999	2025-10-21 09:50:43 -07:00
Rob Norris	3e7e19e028	pool_iter_refresh: don't refresh pools twice In "all pools" mode, pool_iter_refresh() will call zpool_iter(), which will call zpool_refresh_stats() before calling add_pool(). If we already have the pool, this is a different handle, so we just release it and return. Back in pool_iter_refresh(), we then call zpool_stats_refresh() again for our handle on the same pool. All together, this means we're doing two ZFS_IOC_POOL_STATS calls into the kernel for every pool in the system. This isn't wrong, but it does double the pressure on global locks. Instead, we add a new function zpool_refresh_stats_from_handle() that simply copies the pool config and state from one handle to another, and use it to update our handle before we release it in add_pool(), so we only have one call per pool per interval. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17807	2025-10-21 09:50:43 -07:00
Rob Norris	4c84b77bc4	pool_iter_refresh: don't flag existing pools as refreshed zpool_iter() passes the callback a new instance of zpool_handle_t each time, so the existing handle in the pool_list AVL never actually gets a refresh. Internally, that means its zpool_config is never updated, and the old config is never moved to zpool_old_config. As a result, print_iostat() never sees any updated config, and so repeats the first line forever. This is the simplest workaround: just don't mark existing pools as refreshed. pool_list_refresh() will see this and refresh them. The downside is a second call to ZFS_IOC_POOL_STATS for existing pools, because zpool_iter() just called it for the handle we threw away. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17807	2025-10-21 09:50:43 -07:00
Rob Norris	37d8d4619f	zpool iostat: update pool counter when skipping boot row When skipping the boot row (with -y), the early loop meant we weren't updating the "last_npools" count. That means the count never advanced past zero, so cb_iteration was always reset to 0, leading to it being "stuck" on the boot line, printing the header and nothing else forever. Updating the pool counter on every loop sorts that out: it advances, cb_iteration moves properly, and normal rows are printed. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17807	2025-10-21 09:50:43 -07:00
Ameer Hamza	1585a10a85	Make mount/share errors non-fatal for zfs create/clone If zfs_mount_and_share() fails, the error propagates to zfs create/clone commands despite successful operation. If create/clone operations were successful, there's no point in making zfs_mount_and_share() failures fatal. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17799	2025-10-21 09:50:43 -07:00
Robert Evans	ead0fb736d	zinject: Introduce ready delay fault injection This adds a pause to the ZIO pipeline in the ready stage for matching I/O (data, dnode, or raw bookmark). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Robert Evans <evansr@google.com> Closes #17787	2025-10-21 09:50:43 -07:00
Rob Norris	35ec4b14ab	zpool iostat: refresh pool list every interval When running zpool iostat in interval mode, it would not notice any new pools created or imported, and would forget any destroyed or exported, so would not notice if they came back. This leads to outputting "no pools available" every interval until killed. It looks like this was at least intended to work; the comment above zpool_do_iostat() indicates that it is expected to "deal with pool creation/destruction" and that pool_list_update() would detect new pools. That call however was removed in `3e43edd2c5`, though it's unclear if that broke this behaviour and it wasn't noticed, or if it never worked, or if something later broke it. That said, the lack of pool_list_update() is only part of the reason it doesn't work properly. The fundamental problem is that the various things involved in refreshing or updating the list of pools would aggressively ignore, remove, skip or fail on pools that stop existing, or that already exist. Mostly this meant that once a pool is removed from the list, it will never be seen again. Restoring pool_list_update() to the zpool_do_iostat() loop only partially fixes this - it would find "new" pools again, but only in the "all pools" (no args) mode, and because its iterator callback add_pool() would abort the iterator if it already has a pool listed, it would only add pools if there weren't any already. So, this commit reworks the structure somewhat. pool_list_update() becomes pool_list_refresh(), and will ensure the state of all pools in the list is updated. In the "all pools" mode, it will also add new pools and remove pools that disappear, but when a fixed list of pools is used, the list doesn't change, only the state of the pools within it. The rest of the commit is adjusting things for this much simpler structure. Regardless of the mode in use, pool_list_refresh() will always do the right thing, so the driver code can just get on with the display. 
Now that pools can appear and disappear, I've made it so the header (if enabled) is re-printed when the list changes, so that it's easier to see what's happening if the column widths change. Since this is all rather complicated, I've included tests for the "all pools" and "set of pools" modes. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17786	2025-09-29 16:50:49 -07:00
patrickxia	e1a6ec42d4	zdb: add ZFS_KEYFORMAT_RAW support for -K option This change adds support for ZFS_KEYFORMAT_RAW to zdb_derive_key in zdb.c. The implementation reads the raw key from the file specified by the -K option which is consistent with how raw keys are handled in the other parts of ZFS, along with a check to ensure that the keyfile doesn't have too many bytes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Patrick Xia <patrickx@google.com> Closes #17783	2025-09-25 12:08:20 -07:00
Brian Behlendorf	954fe5e1be	Add interface to interface spa_get_worst_case_min_alloc() function Provide an interface to retrieve the lowest and highest minimum allocation size for the normal allocation class. This can be used by external consumers of the DMU to estimate potential wasted capacity when setting the recordsize for an object. The new "min_alloc" and "max_alloc" keys are added to the pool configuration and used by default_volblocksize() to warn when an inefficient block size is requested. For older kmods which don't yet include the new keys, fall back to the previous logic. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17758	2025-09-25 12:08:14 -07:00
Brian Behlendorf	d33d0cac5a	Fix 'zpool add' safety check corner cases Three cases were discovered where 'zpool add' would fail to warn when adding vdevs to a pool with a mismatched replication level. These are: 1. When a pool contains mixed file and disk vdevs. 2. When a pool contains an active dRAID distributed spare. 3. When a pool contains an active hot spare. The lack of warnings is caused by get_replication() assessing the current pool configuration as inconsistent and disabling the mismatched replication check for the new pool configuration after 'zpool add'. This change updates get_replication() to be slightly more tolerant in the non-fatal case. The zpool_add_010_pos.ksh test case was split into separate tests: zpool_add_warn_create.ksh, pool_add_warn_degraded.ksh, and zpool_add_warn_removal. These tests were extended to include coverage for dRAID pools and the three scenarios described above. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17780	2025-09-25 12:08:09 -07:00
Alexander Motin	61a68554de	zdb: Fix asize overflow in verify_livelist_allocs() Spacemap entry might be too big to fit into a block pointer ashift. We hit an assertion trying to run `zdb -bvy` on a large pool. But it seems the code does not really need size there, since we only need to search for a range of offsets, so setting it to zero should just make btree return position just before the first entry. I suspect the previous code could actually miss the first entry due to this if its size was smaller. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17764	2025-09-25 12:07:55 -07:00
trick2011	9bcda0b5fe	Use "vdev" instead of "devices" when referring to vdevs Update documentation to use the correct terminology. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: trick2011 <trick2011@users.noreply.github.com> Closes #17734 Closes #17755	2025-09-25 12:07:52 -07:00
Alan Somers	ef9b7dde91	Fix a printf format specifier on FreeBSD/i386 This is breaking the build on FreeBSD/i386. Originally committed downstream as https://github.com/freebsd/freebsd-src/commit/2d76470b701 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored by: ConnectWise Closes #17705	2025-09-17 16:34:24 -07:00
buzzingwires	5f7253ca11	Refactor `zhack label repair` and fix `-c` regression on nonzero TXG This commit fixes a likely regression introduced by 64db435 where the checksum repair functionality (`-c` or default behavior) will perform checks and access data associated with the newer undetach (`-u`) functionality, resulting in a failure when an uberblock's TXG is not 0, as required by `-u` but not `-c`. Additionally, code is refactored for better separation of tasks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: buzzingwires <buzzingwires@outlook.com> Closes #17732	2025-09-17 16:33:59 -07:00
Allan Jude	6c4ede4026	ZFS allow send:encrypted A new `zfs allow` permission that ONLY allows sending replication streams in raw (encrypted) mode, so encrypted data will not be decrypted as part of the replication process. Sponsored-by: Klara, Inc. Sponsored-by: Karakun AG Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Co-authored-by: JT Pennington <jt.pennington@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #17543	2025-09-12 15:05:02 -07:00
Tony Hutter	4a7a04630d	zed: Add synchronous zedlets Historically, ZED has blindly spawned off zedlets in parallel and never worried about their completion order. This means that you can potentially have zedlets for event number 2 starting before zedlets for event number 1 had finished. Most of the time this is fine, and it actually helps a lot when the system is getting spammed with hundreds of events. However, there are times when you want your zedlets to be executed in sequence with the event ID. That is where synchronous zedlets come in. ZED will wait for all previously spawned zedlets to finish before running a synchronous zedlet. Synchronous zedlets are guaranteed to be the only zedlet running. No other zedlets may run in parallel with a synchronous zedlet. Users should be careful to only use synchronous zedlets when needed, since they decrease parallelism. To make a zedlet synchronous, simply add a "-sync-" immediately following the event name in the zedlet's file name: EVENT_NAME-sync-ZEDLETNAME.sh For example, if you wanted a synchronous statechange script: statechange-sync-myzedlet.sh Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17335	2025-09-11 15:58:59 -07:00
Paul Dagnelie	e2e708241a	Enable zhack to work properly with 4k sector size disks Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17576	2025-09-10 15:01:32 -07:00
Paul Dagnelie	26983d6fa7	Add allocation profile export and zhack subcommand for import When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. It iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though in that case the allocation map may actually be from several different TXGs as new ones are loaded on demand. The second is a new subcommand for zhack, zhack metaslab leak (and its supporting kernel changes). This is a zhack subcommand that imports a pool and then modifies the range trees of the metaslabs, allowing the sync process to write them out normally. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding free subcommand (which would be extremely dangerous); this is an irreversible process, only intended for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17576	2025-09-10 15:01:28 -07:00
Shengqi Chen	717c57c834	cmd: rename arcstat to zarcstat Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17712	2025-09-10 15:01:20 -07:00
Shengqi Chen	743866cd2a	cmd: rename arc_summary to zarcsummary Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17712	2025-09-10 15:01:16 -07:00
Shengqi Chen	5bf1500ee3	Remove renaming notice and symlinks for arcstat and arc_summary Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17712	2025-09-10 15:01:12 -07:00
Rob Norris	02fa962af0	cmd: force zarcstat/zarc_summary recreation at install If the target already exists, ln will fail. Force it to recreate the symlinks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17702	2025-09-09 17:06:29 -07:00
Shengqi Chen	f8e2152db7	Install zarcstat and zarcsummary symlinks in Makefile Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17695	2025-09-09 17:05:30 -07:00
Shengqi Chen	cbc6d57012	Add upcoming renaming notice for arc_summary and arcstat They will become zarcsummary and zarcstat in 2.4.0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #16357 Closes #17695	2025-09-09 17:05:26 -07:00
ofthesun9	5846a85155	Update compatibility.d files Add an openzfs-2.4 compatibility file for the next release. While there are no compatibility differences between Linux and FreeBSD for 2.4, symlinks for the -linux and -freebsd names are created for any scripts expecting that convention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: ofthesun9 <olivier@ofthesun.net> Closes #17672 Closes #17673	2025-09-09 17:04:01 -07:00
Mark Johnston	2fc6bf82b6	zdb: Fix format strings on 32-bit systems Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #17665	2025-09-09 17:03:31 -07:00
youzhongyang	774a34f3ff	Synchronize the update of feature refcount The concurrent execution of feature_sync() can lead to a panic due to an unprotected update of the feature refcount. Resolve this by using the spa->spa_feat_stats_lock to synchronize the update of the refcount. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #17184 Closes #17632	2025-09-09 17:03:27 -07:00
Alexander Motin	94413bc75d	zdb: Filter log spacemaps by vdev When requested to dump metaslabs only for specific vdev, apply the filter also to log spacemaps to reduce the output. Unfortunately filtering by metaslab numbers is more difficult so leave those. While there, tune the output formatting. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17643	2025-08-21 11:19:46 -04:00
Alan Somers	d3c1d27afd	zdb: better handling for corrupt block pointers When dumping indirect blocks, attempt to print corrupt block pointers rather than abort the program. When corruption is detected zdb will exit with an error code of 3. Sponsored by: ConnectWise Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Alek Pinchuk <alek.pinchuk@connectwise.com> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17166	2025-08-12 14:16:37 -07:00
René Wirnata	1d0b94c4e7	zed: prettify slack notification message This converts the body of a ZED slack notification from plain text to code block style to help with readability. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: René Wirnata <rene.wirnata@pandascience.net> Closes #17610	2025-08-11 09:44:51 -07:00
Rob Norris	72602f6ad9	ZIL: "crash" the ZIL if the pool suspends during fallback If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks on suspend. We want it to not block on suspend, instead returning an error. On the surface, this is simple: change all calls to txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error return back to the zil_commit() caller. Handling suspension means returning an error to all commit waiters. This is relatively straightforward, as zil_commit_waiter_t already has zcw_zio_error to hold the write IO error, which signals a fallback to txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the waiter can now return an error from zil_commit(). However, commit waiters are normally signalled when their associated write (LWB) completes. If the pool has suspended, those IOs may not return for some time, or maybe not at all. We still want to signal those waiters so they can return from zil_commit(). We have a list of those in-flight LWBs on zl_lwb_list, so we can run through those, detach them and signal them. The LWB itself is still in-flight, but no longer has attached waiters, so when it returns there will be nothing to do. (As an aside, ITXs can also supply completion callbacks, which are called when they are destroyed. These are directly connected to LWBs though, so are passed the error code and destroyed there too). At this point, all ZIL waiters have been ejected, so we only have to consider the internal state. We potentially still have ITXs that have not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL is in an unknown state; some writes may have been written but not returned to us. We really can't rely on any of it; the best thing to do is abandon it entirely and start over when the pool returns to service. But, since we may have IO out that won't return until the pool resumes, we need something for it to return to. 
The simplest solution I could find, implemented here, is to "crash" the ZIL: accept no new ITXs, make no further updates, and let it empty out on its normal schedule, that is, as txgs complete and zil_sync() and zil_clean() are called. We set a "restart txg" to three txgs in the future (syncing + TXG_CONCURRENT_STATES), at which point all the internal state will have been cleared out, and the ZIL can resume operation (handled at the top of zil_clean()). This commit adds zil_crash(), which handles all of the above: it sets the restart txg, captures and signals all waiters, and zeroes the header. zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND) returns because the pool suspended (ESHUTDOWN). The rest of the commit is just threading the errors through, and related housekeeping. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:26 -07:00
Rob Norris	99a5f5d1ba	ZIL: pass commit errors back to ITX callbacks ITX callbacks are used to signal that something can be cleaned up after an itx is committed. Presently that's only used when syncing out mapped pages (msync()) to mark dirty pages clean. This extends the callback interface so it can be passed an error, and take a different cleanup action if necessary. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:20 -07:00
Rob Norris	967b15b888	ZIL: allow zil_commit() to fail with error This changes zil_commit() to have an int return, and updates all callers to check it. There are no corresponding internal changes yet; it will always return 0. Since zil_commit() is an indication that the caller _really_ wants the associated data to be durably stored, I've annotated it with the __warn_unused_result__ compiler attribute (via __must_check), to emit a warning if it's ever used without doing something with the return code. I hope this will mean we never misuse it in the future. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:09 -07:00
Rob Norris	82d6f7b047	Prefer VERIFY0P(n) over VERIFY3P(n, ==, NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:42 -07:00
Rob Norris	f7bdd84328	Prefer VERIFY0P(n) over VERIFY(n == NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:37 -07:00
Rob Norris	611b95da18	Prefer VERIFY0(n) over VERIFY3S(n, ==, 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:32 -07:00
Rob Norris	5c7df3bcac	Prefer VERIFY0(n) over VERIFY3U(n, ==, 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:25 -07:00
Rob Norris	c39e076f23	Prefer VERIFY0(n) over VERIFY(n == 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:40:59 -07:00
Mariusz Zaborski	0c376d0f59	Document the new '-a' zpool option Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org> Closes #17585	2025-08-06 17:11:47 -07:00
Alek P	3e004369f7	Removed unused zio_decompress_fail_fraction variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alek Pinchuk <alek.pinchuk@connectwise.com> Closes #17599	2025-08-06 17:10:03 -07:00
Alexander Motin	60f714e6e2	Implement physical rewrites Based on the previous commit, this implements the `zfs rewrite -P` flag, making ZFS keep blocks' logical birth times while rewriting files. It should exclude the rewritten blocks from incremental sends, snapshot diffs, etc. At the same time, snapshot space usage will reflect the additional space usage from newly allocated blocks. Since this begins to use the new "rewrite" flag in the block pointers, this commit introduces a new read-compatible per-dataset feature physical_rewrite. It must be enabled for the command to not fail; it is activated on first use and deactivated on deletion of the last affected dataset. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:56 -07:00
Alexander Motin	4ae8bf406b	Allow physical rewrite without logical During regular block writes ZFS sets both logical and physical birth times equal to the current TXG. During dedup and block cloning logical birth time is still set to the current TXG, but physical may be copied from the original block that was used. This represents the fact that logically user data has changed, but physically it is the same old block. But block rewrite introduces a new situation, when a block is not changed logically, but is stored in a different place of the pool. From ARC, scrub and some other perspectives this is a new block, but for example for user applications or incremental replication it is not. A somewhat similar thing happens during the remap phase of device removal, but in that case space blocks are still accounted as allocated at their logical birth times. This patch introduces a new "rewrite" flag in the block pointer structure, allowing to differentiate physical rewrite (when the block is actually reallocated at the physical birth time) from the device removal case (when the logical birth time is used). The new functionality is not used at this point, and the only expected change is that error log is now kept in terms of physical birth times, rather than logical, since if a block with logged error was somehow rewritten, then the previous error does not matter any more. This change also introduces a new TRAVERSE_LOGICAL flag to the traverse code, allowing zfs send, redact and diff to work in context of logical birth times, ignoring physical-only rewrites. It also changes nothing at this point due to lack of those writes, but they will come in a following patch. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:07 -07:00
Mariusz Zaborski	894edd084e	Add TXG timestamp database This feature enables tracking of when TXGs are committed to disk, providing an estimated timestamp for each TXG. With this information, it becomes possible to perform scrubs based on specific date ranges, improving the granularity of data management and recovery operations. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #16853	2025-08-06 10:31:21 -07:00
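
As a rough illustration of the histogram binning change described in commit cf9163f250 above, the sketch below contrasts the old "starting from" rule with the new "up to" rule. It is not zdb's code: the function names are hypothetical, sizes are assumed to be positive integers in bytes, and each bin is identified by its power-of-two exponent (so 12 stands for the "4K" row).

```python
def bin_starting_from(size: int) -> int:
    """Old rule (hypothetical helper): bin n holds sizes in [2**n, 2**(n+1))."""
    if size <= 0:
        raise ValueError("size must be positive")
    # floor(log2(size)) for a positive integer
    return size.bit_length() - 1


def bin_up_to(size: int) -> int:
    """New rule (hypothetical helper): bin n holds sizes in (2**(n-1), 2**n]."""
    if size <= 0:
        raise ValueError("size must be positive")
    # ceil(log2(size)) for a positive integer
    return (size - 1).bit_length()


if __name__ == "__main__":
    for size in (2048, 2049, 4096, 4097, 8191):
        print(size, bin_starting_from(size), bin_up_to(size))
```

Under the old rule a 6K block lands in the "4K" row, while under the new rule it lands in the "8K" row, which is the smallest power-of-two size it fits into.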


1669 Commits