mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2025-10-26 18:05:04 +03:00

Author	SHA1	Message	Date
Rob Norris	b00bc81b05	ioctl: remove FICLONE/FICLONERANGE/FIDEDUPERANGE compat These are only required to support these ioctls on Linux <4.5. Since 4.18 is our cutoff, we don't need this code anymore. Also removing related test things that will never match again. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17308	2025-06-17 10:50:27 -07:00
Alexander Motin	f7e6dcc68d	Relax zfs_vnops_read_chunk_size limitations It makes no sense to limit read size below the block size, since DMU will any way consume resources for the whole block, while the current zfs_vnops_read_chunk_size is only 1MB, which is smaller that maximum block size of 16MB. Plus in case of misaligned Uncached I/O the buffer may get evicted between the chunks, requiring repeating I/Os. On 64-bit platforms increase zfs_vnops_read_chunk_size to 32MB. It allows to less depend on speculative prefetcher if application requests specific size, first not waiting for prefetcher to start and later not prefetching more than needed. Also while there, we don't need to align reads to the chunk size, but only to a block size, which is smaller and so more forgiving. My profiles show ~4% of CPU time saving when reading 16MB blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17415	2025-06-17 10:50:27 -07:00
Rob Norris	e2de00ca44	dmu_traverse: remove 'ignore_hole_birth' tunable alias It's been many years, we can probably do without. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17376	2025-06-17 10:50:27 -07:00
Allan Jude	8e9ffe1b4f	ARC: parallel eviction On systems with enormous amounts of memory, the single arc_evict thread can become a bottleneck if reads and writes are stuck behind it, waiting for old data to be evicted before new data can take its place. This commit adds support for evicting from multiple ARC lists in parallel, by farming the evict work out to some number of threads and then accumulating their results. A new tuneable, zfs_arc_evict_threads, sets the number of threads. By default, it will scale based on the number of CPUs. Sponsored-by: Expensify, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Youzhong Yang <youzhong@gmail.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com> Closes #16486	2025-06-17 10:50:26 -07:00
Don Brady	6b67a5bdd3	During pool export flush the ARC asynchronously This also includes removing L2 vdevs asynchronously. This commit also guarantees that spa_load_guid is unique. The zpool reguid feature introduced the spa_load_guid, which is a transient value used for runtime identification purposes in the ARC. This value is not the same as the spa's persistent pool guid. However, the value is seeded from spa_generate_load_guid() which does not check for uniqueness against the spa_load_guid from other pools. Although extremely rare, you can end up with two different pools sharing the same spa_load_guid value! So we guarantee that the value is always unique and additionally not still in use by an async arc flush task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16215	2025-06-17 10:50:26 -07:00
Rob Norris	a65225ec7e	FreeBSD: zfs_putpages: don't undirty pages until after write completes zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17445	2025-06-17 10:50:26 -07:00
Rob Norris	9c0f5bc183	zfs_log_write: only put the callback on the last itx If a write is split across mutliple itxs, we only want the callback on the last one, otherwise it will be called for every itx associated with this single write, which makes it very hard to know what to clean up. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17445	2025-06-17 10:50:26 -07:00
Rob Norris	e1dd433a44	zpl_sync_fs: work around kernels that ignore sync_fs errors If the kernel will honour our error returns, use them. If not, fool it by setting a writeback error on the superblock, if available. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17420	2025-06-17 10:50:26 -07:00
Rob Norris	08cec6532e	zfs_sync: return error when pool suspends If the pool is suspended, we'll just block in zil_commit(). If the system is shutting down, blocking wouldn't help anyone. So, we should keep this test for now, but at least return an error for anyone who is actually interested. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17420	2025-06-17 10:50:26 -07:00
Rob Norris	d944641502	zfs_sync: remove support for impossible scenarios The superblock pointer will always be set, as will z_log, so remove code supporting cases that can't occur (on Linux at least). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17420	2025-06-17 10:50:26 -07:00
Alexander Motin	0c9cdd1606	Improve block cloning transactions accounting Previous dmu_tx_count_clone() was broken, stating that cloning is similar to free. While they might be from some points, cloning is not net-free. It will likely consume space and memory, and unlike free it will do it no matter whether the destination has the blocks or not (usually not, so previous code did nothing). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17431	2025-06-17 10:50:26 -07:00
Alexander Motin	23fad19818	Reduce zfs_dmu_offset_next_sync penalty Looking on txg_wait_synced(, 0) I've noticed that it always syncs 5 TXGs: 3 TXG_CONCURRENT_STATES + 2 TXG_DEFER_SIZE. But in case of dmu_offset_next() we do not care about deferred frees. And even concurrent TXGs we might not need sync all 3 if the dnode was not dirtied in last few TXGs. This patch makes dmu_offset_next() to sync one TXG at a time until the dnode is clean, but no more than 3 TXG_CONCURRENT_STATES times. My tests with random simultaneous writes and seeks over many files on HDD pool show 7-14% performance increase. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17434	2025-06-17 10:50:26 -07:00
Alexander Motin	3897e86bd1	Make TX abort after assign safer It is not right, but there are few examples when TX is aborted after being assigned in case of error. To handle it better on production systems add extra cleanup steps. While here, replace couple dmu_tx_abort() in simple cases. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17438	2025-06-17 10:50:26 -07:00
Alexander Motin	4c8d0471fa	Allow zero compression if dedup is enabled Having high-refcount dedup entries for zero blocks is inefficient when they could be recorded as a holes instead. Normally, zero compression is not done if compression is disabled to not confuse naive benchmarks. But with dedup enabled, it is expected that the write will be skipped anyway, so we are just optimizing the way it is skipped. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17435	2025-06-17 10:50:26 -07:00
Attila Fülöp	e9c1e08e07	Linux build: silence objtool warnings After #17401 the Linux build produces some stack related warnings. Silence them with the `STACK_FRAME_NON_STANDARD` macro. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Attila Fülöp <attila@fueloep.org> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17410	2025-06-17 10:50:26 -07:00
Alexander Motin	bf4baee81e	Set spa_final_txg in spa_unload() I've noticed that after some dedup tests system reboot ends up in assertion about ms_defer tree not free. It seems to be caused by DDT flushing still freeing some blocks while ZFS trying to reach a final steady state due to spa_final_txg, while being set by spa_export_common() on pool export, is not set when spa_unload() is called by spa_evict_all() on system reboot/shutdown. Setting spa_final_txg in spa_unload() fixes this issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17395	2025-06-17 10:50:26 -07:00
Ameer Hamza	f292b0f146	vdev: skip faulting disks pending removal This patch fixes a race where vdev_remove_wanted may be set after probe initiation, which could otherwise trigger redundant fault and removal. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17400	2025-06-17 10:50:26 -07:00
Rob Norris	04493ca819	linux/zvol_os: don't try to set disk ops if alloc fails If the kernel fails to allocate the gendisk, zvo_disk will be NULL, and derefencing it will explode. So don't do that. Sponsored-by: Klara, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17396	2025-06-17 10:50:26 -07:00
Attila Fülöp	2c53fe7764	Linux build: always use objtool We silence `objtool` warnings on some object files using `OBJECT_FILES_NON_STANDARD_some_file.o`. Nowadays `objtool` is needed for CPU vulnerability mitigations and a lot more functionality so its use is desirable. Just remove the `OBJECT_FILES_NON_STANDARD` definitions. A follow-up commit is needed to make the offending files standard and address the compile time warnings. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #17401 Closes #17364	2025-06-17 10:50:26 -07:00
Rob Norris	d7bb6bbf13	tunables: fix spelling Three occurences with an 'e', and all of them mine. Maybe it's an British thing? Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	06fd6dc6f7	tunables: use Linux ullong param ops for u64 Since 3.17 Linux has provided param ops for 64-bit ints, so we don't need to use our own anymore. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	28ff5ff1c6	tunables: remove support for s64 tunables Nothing uses them now. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	e9002887e2	tunables: remove direct use of module_param_cb The use for spl_taskq_kick was the only use, and the comment that module_param_call is obsolete is no longer true - it's still very much used even in recent kernels. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	840b070ec7	tunables: remove FreeBSD compat macros for Linux module params Nothing in any FreeBSD code uses them. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	d02d3add0d	tunables: ensure tunable and variable have same define gate If a variable is only available in the kernel, then the tunable should also only be available there. This matters very little so long as we don't have userspace tunables, but its still good hygeine. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	cc5724f38d	tunables: don't assert initialisation in impl getters It actually doesn't matter if it's not initialised when we first query the current value; it just returns empty-string. A crash is quite obnoxious even if it is a rare case. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Rob Norris	8317244270	zfs_log: make zfs_immediate_write_sz uint Likely it's only int64 for comparison with ssize_t, which is signed. However, it would make no sense for it to be less than 0 or greater than 4G, so making it a regular uint will make it safe for comparison and remove the only S64 tunable in core. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-06-17 10:50:26 -07:00
Paul Dagnelie	65cf521353	Only interrupt active disk I/Os in failmode=continue failmode=continue is in a sorry state. Originally designed to fix a very specific problem, it causes crashes and panics for most people who end up trying to use it. At this point, we should either remove it entirely, or try to make it more usable. With this patch, I choose the latter. While the feature is fundamentally unpredictable and prone to race conditions, it should be possible to get it to the point where it can at least sometimes be useful for some users. This patch fixes one of the major issues with failmode=continue: it interrupts even ZIOs that are patiently waiting in line behind stuck IOs. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17372	2025-06-17 10:49:40 -07:00
Pavel Snajdr	08caad8257	zcp: get_prop: fix encryptionroot and encryption It was reported that channel programs' zfs.get_prop doesn't work for dataset properties encryption and encryptionroot. They are handled in get_special_prop due to the need to call dsl_dataset_crypt_stats to load those dsl props. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Co-authored-by: Graham Christensen <graham@grahamc.com> Closes #17280	2025-06-17 10:49:40 -07:00
Fedor Uporov	d187e3e1a7	ZVOL: Comment platform-specific empty functions bodies on FreeBSD side Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17383	2025-06-17 10:49:40 -07:00
Ameer Hamza	1215c3b609	Expose dataset encryption status via fast stat path In truenas_pylibzfs, we query list of encrypted datasets several times, which is expensive. This commit exposes a public API zfs_is_encrypted() to get encryption status from fast stat path without having to refresh the properties. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17368	2025-06-17 10:49:40 -07:00
Alexander Motin	2fe0d5df94	ZIL: Improve write log size accounting Before this change write log size TXG throttling mechanism was accounting only user payload bytes. But the actual ZIL both on disk and especially in memory include headers of hundred(s) of bytes. Not accouting those may allow applications like bonnie++, in their wisdom writing one byte at a time, to consume excessive amount of memory and ZIL/SLOG in one TXG. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17373	2025-06-17 10:49:40 -07:00
Paul Dagnelie	b9324a1e75	Fix off-by-one bug in range tree code Without this fix, zfs_range_tree_find_in could return an overlap when the found range starts immediately after the searched range, with no actual overlap. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17363	2025-06-17 10:49:40 -07:00
Alexander Motin	64e77fdf3b	Fix null dereference in spa_vdev_remove_cancel_sync() We don't really need to access space map to know where the metaslab ends, while msp->ms_sm might be NULL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Fixes #17164 Fixes #17359 Closes #17361 (cherry picked from commit `5c30b24381`)	2025-05-28 16:00:28 -07:00
Richard Yao	25ad9ce692	dmu_objset_hold_flags() should call dsl_dataset_rele_flags() on error This was caught when doing a manual check to see if #17352 needed to be improved to catch mismatches across stack frames of the kind that were first found in #17340. Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard@ryao.dev> Closes #17353 (cherry picked from commit `83fa80a550`)	2025-05-28 16:00:28 -07:00
Rob Norris	8c0f7619b2	Linux 6.2/6.15: del_timer_sync() renamed to timer_delete_sync() Renamed in 6.2, and the compat wrapper removed in 6.15. No signature or functional change apart from that, so a very minimal update for us. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17229 (cherry picked from commit `841be1d049`)	2025-05-28 16:00:28 -07:00
Rob Norris	e64d4718a7	Linux 6.15: mkdir now returns struct dentry * The intent is that the filesystem may have a reference to an "old" version of the new directory, eg if it was keeping it alive because a remote NFS client still had it open. We don't need anything like that, so this really just changes things so we return error codes encoded in pointers. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17229 (cherry picked from commit `bb740d66de`)	2025-05-28 16:00:28 -07:00
Richard Yao	a4de1d38da	icp: Use explicit_memset() exclusively in gcm_clear_ctx() `d634d20d1b` had been intended to fix a potential information leak issue where the compiler's optimization passes appeared to remove `memset()` operations that sanitize sensitive data before memory is freed for use by the rest of the kernel. When I wrote it, I had assumed that the compiler would not remove the other `memset()` operations, but upon reflection, I have realized that this was a bad assumption to make. I would rather have a very slight amount of additional overhead when calling `gcm_clear_ctx()` than risk a future compiler remove `memset()` calls. This is likely to happen if someone decides to try doing link time optimization and the person will not think to audit the assembly output for issues like this, so it is best to preempt the possibility before it happens. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <richard@ryao.dev> Closes #17343 (cherry picked from commit `d8a33bc0a5`)	2025-05-28 16:00:28 -07:00
George Amanakis	f28c685a84	Fix 2 bugs in non-raw send with encryption Bisecting identified the redacted send/receive as the source of the bug for issue #12014. Specifically the call to dsl_dataset_hold_obj(&fromds) has been replaced by dsl_dataset_hold_obj_flags() which passes a DECRYPT flag and creates a key mapping. A subsequent dsl_dataset_rele_flag(&fromds) is missing and the key mapping is not cleared. This may be inadvertedly used, which results in arc_untransform failing with ECKSUM in: arc_untransform+0x96/0xb0 [zfs] dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs] dbuf_read+0x56/0x770 [zfs] dmu_buf_hold_by_dnode+0x4a/0x80 [zfs] zap_lockdir+0x87/0xf0 [zfs] zap_lookup_norm+0x5c/0xd0 [zfs] zap_lookup+0x16/0x20 [zfs] zfs_get_zplprop+0x8d/0x1d0 [zfs] setup_featureflags+0x267/0x2e0 [zfs] dmu_send_impl+0xe7/0xcb0 [zfs] dmu_send_obj+0x265/0x360 [zfs] zfs_ioc_send+0x10c/0x280 [zfs] Fix this by restoring the call to dsl_dataset_hold_obj(). The same applies for to_ds: here replace dsl_dataset_rele(&to_ds) with dsl_dataset_rele_flags(). Both leaked key mappings will cause a panic when exporting the sending pool or unloading the zfs module after a non-raw send from an encrypted filesystem. Contributions-by: Hank Barta <hbarta@gmail.com> Contributions-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12014 Closes #17340 (cherry picked from commit `ea74cdedda`)	2025-05-28 16:00:28 -07:00
Paul Dagnelie	6517ebf4da	Cause zpool scan resume commands to get logged in history Currently, commands that resume a scrub/errorscrub from a paused state don't get logged in the pool history. This is because resumes actually return ECANCELED, instead of 0. This causes the tsd code in the common ioctl logic to not think the ioctl succeeded, which causes the log_history ioctl to fail with EPERM. However, for resuming a scrub from a paused state, ECANCELED is success. There are two options for how to deal with this. The first is the one that I implemented here; I can't find a good reason for dmu_scan to return ECANCELED on resume instead of 0, so let's just not. The only place we check for the ECANCELED value is in zpool_scan, where we just convert it back to zero. However, I am aware that this is changing an ioctl interface, which I believe is a breaking change. I don't think it's an important change, but maybe there is someone who relies on it. The other option that could be implemented is to either allow ECANCELED specifically from dsl_scan in the common ioctl code, or add a generic facility to the common ioctl code that allows each command to specify whether or not success happened, regardless of the return values. I am open to feedback on which option people think would be better. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17301 (cherry picked from commit `086105f4c4`)	2025-05-28 16:00:28 -07:00
Alexander Motin	97c1fb6ad5	ARC: Notify dbuf cache about target size reduction ARC target size might drop significantly under memory pressure, especially if current ARC size was much smaller than the target. Since dbuf cache size is a fraction of the target ARC size, it might need eviction too. Aside of memory from the dbuf eviction itself, it might help ARC by making more buffers evictable. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17314 (cherry picked from commit `89a8a91582`)	2025-05-28 16:00:28 -07:00
Mariusz Zaborski	1c11d3a54f	spa: clear checkpoint information during retry Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org> Closes #17319 (cherry picked from commit `8b9c4e643b`)	2025-05-28 16:00:28 -07:00
Rob Norris	db988fabfb	linux/uio: remove "skip" offset for UIO_ITER For UIO_ITER, we are just wrapping a kernel iterator. It will take care of its own offsets if necessary. We don't need to do anything, and if we do try to do anything with it (like advancing the iterator by the skip in zfs_uio_advance) we're just confusing the kernel iterator, ending up at the wrong position or worse, off the end of the memory region. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17298 (cherry picked from commit `2ee5b51a57`)	2025-05-28 16:00:28 -07:00
Alan Somers	fa2cdaa604	More aggressively assert that db_mtx protects db.db_data db.db_mtx must be held any time that db.db_data is accessed. All of these functions do have the lock held by a parent; add assertions to ensure that it stays that way. See https://github.com/openzfs/zfs/discussions/17118 * Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't required. * Refactor dbuf_hold_copy to eliminate the db_rwlock Copy data into the newly allocated buffer before assigning it to the db. That way, there will be no need to take db->db_rwlock. * Refactor dbuf_read_hole In the case of an indirect hole, initialize the newly allocated buffer before assigning it to the dmu_buf_impl_t. Sponsored by: ConnectWise Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17209 (cherry picked from commit `c17bdc4914`)	2025-05-28 16:00:28 -07:00
Olivier Certner	51ed9640e9	FreeBSD: Use new SYSCTL_SIZEOF() SYSCTL_SIZEOF() has been introduced in FreeBSD by commit "sysctl(9): Ease exporting struct sizes; Discourage doing that" (713abc9880aa) in branch 'main'. It will soon be backported to 'stable/14'. We will thus be able to remove the old, alternate version left in the '#else' branch as soon as 'stable/13' goes out of support (April 30, 2026). Sponsored-by: The FreeBSD Foundation Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olivier Certner <olce@FreeBSD.org> Closes #17309 (cherry picked from commit `78628a5c15`)	2025-05-28 16:00:28 -07:00
Alexander Motin	9b446fbb60	ARC: Avoid overflows in arc_evict_adj() (#17255 ) With certain combinations of target ARC states balance and ghost hit rates it was possible to get the fractions outside of allowed range. This patch limits maximum balance adjustment speed, which should make it impossible, and also asserts it. Fixes #17210 Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> (cherry picked from commit `b1ccab1721`)	2025-05-28 16:00:28 -07:00
Rob Norris	ce9cd12c97	txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284 ) txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We expect that future development will require similar "wait unless X" behaviour. This generalises the API as txg_wait_synced_flags(), where the provided flags describe the events that should cause the call to return. Instead of a boolean, the return is now an error code, which the caller can use to know which event caused the call to return. The existing call to txg_wait_synced_sig() is now txg_wait_synced_flags(TXG_WAIT_SIGNAL). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> (cherry picked from commit `a7de203c86`)	2025-05-28 16:00:28 -07:00
Alexander Motin	101edf7ed9	Fix race between resilver wait and offline/detach We should not clear scn_state and notify waiters until we call vdev_dtl_reassess(), otherwise following offline/detach request may fail with "no valid replicas". Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. (cherry picked from commit `f86d9af16b`)	2025-05-28 16:00:28 -07:00
Tony Hutter	4b014840ea	Fix double spares for failed vdev It's possible for two spares to get attached to a single failed vdev. This happens when you have a failed disk that is spared, and then you replace the failed disk with a new disk, but during the resilver the new disk fails, and ZED kicks in a spare for the failed new disk. This commit checks for that condition and disallows it. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes: #16547 Closes: #17231 (cherry picked from commit `f40ab9e399`)	2025-05-28 16:00:28 -07:00
Rob Norris	c85f2fd531	cred: properly pass and test creds on other threads (#17273 ) ### Background Various admin operations will be invoked by some userspace task, but the work will be done on a separate kernel thread at a later time. Snapshots are an example, which are triggered through zfs_ioc_snapshot() -> dsl_dataset_snapshot(), but the actual work is from a task dispatched to dp_sync_taskq. Many such tasks end up in dsl_enforce_ds_ss_limits(), where various limits and permissions are enforced. Among other things, it is necessary to ensure that the invoking task (that is, the user) has permission to do things. We can't simply check if the running task has permission; it is a privileged kernel thread, which can do anything. However, in the general case it's not safe to simply query the task for its permissions at the check time, as the task may not exist any more, or its permissions may have changed since it was first invoked. So instead, we capture the permissions by saving CRED() in the user task, and then using it for the check through the secpolicy_* functions. ### Current implementation The current code calls CRED() to get the credential, which gets a pointer to the cred_t inside the current task and passes it to the worker task. However, it doesn't take a reference to the cred_t, and so expects that it won't change, and that the task continues to exist. In practice that is always the case, because we don't let the calling task return from the kernel until the work is done. For Linux, we also take a reference to the current task, because the Linux credential APIs for the most part do not check an arbitrary credential, but rather, query what a task can do. See secpolicy_zfs_proc(). Again, we don't take a reference on the task, just a pointer to it. ### Changes We change to calling crhold() on the task credential, and crfree() when we're done with it. This ensures it stays alive and unchanged for the duration of the call. On the Linux side, we change the main policy checking function priv_policy_ns() to use override_creds()/revert_creds() if necessary to make the provided credential active in the current task, allowing the standard task-permission APIs to do the needed check. Since the task pointer is no longer required, this lets us entirely remove secpolicy_zfs_proc() and the need to carry a task pointer around as well. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kyle Evans <kevans@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> (cherry picked from commit `c8fa39b46c`)	2025-05-28 16:00:28 -07:00

1 2 3 4 5 ...

4887 Commits