mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2025-08-06 15:07:39 +03:00

Author	SHA1	Message	Date
Alan Somers	fa2cdaa604	More aggressively assert that db_mtx protects db.db_data db.db_mtx must be held any time that db.db_data is accessed. All of these functions do have the lock held by a parent; add assertions to ensure that it stays that way. See https://github.com/openzfs/zfs/discussions/17118 * Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't required. * Refactor dbuf_hold_copy to eliminate the db_rwlock Copy data into the newly allocated buffer before assigning it to the db. That way, there will be no need to take db->db_rwlock. * Refactor dbuf_read_hole In the case of an indirect hole, initialize the newly allocated buffer before assigning it to the dmu_buf_impl_t. Sponsored by: ConnectWise Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17209 (cherry picked from commit `c17bdc4914`)	2025-05-28 16:00:28 -07:00
Olivier Certner	51ed9640e9	FreeBSD: Use new SYSCTL_SIZEOF() SYSCTL_SIZEOF() has been introduced in FreeBSD by commit "sysctl(9): Ease exporting struct sizes; Discourage doing that" (713abc9880aa) in branch 'main'. It will soon be backported to 'stable/14'. We will thus be able to remove the old, alternate version left in the '#else' branch as soon as 'stable/13' goes out of support (April 30, 2026). Sponsored-by: The FreeBSD Foundation Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olivier Certner <olce@FreeBSD.org> Closes #17309 (cherry picked from commit `78628a5c15`)	2025-05-28 16:00:28 -07:00
Alexander Motin	9b446fbb60	ARC: Avoid overflows in arc_evict_adj() (#17255 ) With certain combinations of target ARC states balance and ghost hit rates it was possible to get the fractions outside of allowed range. This patch limits maximum balance adjustment speed, which should make it impossible, and also asserts it. Fixes #17210 Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> (cherry picked from commit `b1ccab1721`)	2025-05-28 16:00:28 -07:00
Rob Norris	ce9cd12c97	txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284 ) txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We expect that future development will require similar "wait unless X" behaviour. This generalises the API as txg_wait_synced_flags(), where the provided flags describe the events that should cause the call to return. Instead of a boolean, the return is now an error code, which the caller can use to know which event caused the call to return. The existing call to txg_wait_synced_sig() is now txg_wait_synced_flags(TXG_WAIT_SIGNAL). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> (cherry picked from commit `a7de203c86`)	2025-05-28 16:00:28 -07:00
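As a rough illustration of the "wait unless X" shape this commit describes, the sketch below models a single polling step in plain C. All of it is hypothetical: the `toy_` names, the `TOY_WAIT_SIGNAL` flag, and the use of `EINTR`/`EAGAIN` are assumptions made for the example. The real `txg_wait_synced_flags()` sleeps on kernel primitives, and the commit message does not list its error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the commit itself only names TXG_WAIT_SIGNAL. */
enum {
	TOY_WAIT_NONE = 0,
	TOY_WAIT_SIGNAL = 1 << 0	/* return early when a signal is pending */
};

/* Toy stand-in for the state a real waiter would sleep on. */
typedef struct toy_pool {
	uint64_t synced_txg;	/* highest txg already on stable storage */
	bool signal_pending;	/* set asynchronously in a real system */
} toy_pool_t;

/*
 * One polling step of a flag-driven wait.
 * Returns 0 once `txg` is synced, EINTR if an enabled event (here, a
 * signal) interrupts the wait, or EAGAIN if the caller should keep waiting.
 */
int
toy_wait_synced_flags(const toy_pool_t *p, uint64_t txg, int flags)
{
	assert(p != NULL);

	if (p->synced_txg >= txg)
		return (0);
	if ((flags & TOY_WAIT_SIGNAL) && p->signal_pending)
		return (EINTR);
	return (EAGAIN);
}
```

Returning an error code rather than a boolean lets the caller tell which enabled event ended the wait, which is the design point the commit message makes.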
Alexander Motin	101edf7ed9	Fix race between resilver wait and offline/detach We should not clear scn_state and notify waiters until we call vdev_dtl_reassess(), otherwise following offline/detach request may fail with "no valid replicas". Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. (cherry picked from commit `f86d9af16b`)	2025-05-28 16:00:28 -07:00
Tony Hutter	4b014840ea	Fix double spares for failed vdev It's possible for two spares to get attached to a single failed vdev. This happens when you have a failed disk that is spared, and then you replace the failed disk with a new disk, but during the resilver the new disk fails, and ZED kicks in a spare for the failed new disk. This commit checks for that condition and disallows it. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes: #16547 Closes: #17231 (cherry picked from commit `f40ab9e399`)	2025-05-28 16:00:28 -07:00
Rob Norris	c85f2fd531	cred: properly pass and test creds on other threads (#17273 ) ### Background Various admin operations will be invoked by some userspace task, but the work will be done on a separate kernel thread at a later time. Snapshots are an example, which are triggered through zfs_ioc_snapshot() -> dsl_dataset_snapshot(), but the actual work is from a task dispatched to dp_sync_taskq. Many such tasks end up in dsl_enforce_ds_ss_limits(), where various limits and permissions are enforced. Among other things, it is necessary to ensure that the invoking task (that is, the user) has permission to do things. We can't simply check if the running task has permission; it is a privileged kernel thread, which can do anything. However, in the general case it's not safe to simply query the task for its permissions at the check time, as the task may not exist any more, or its permissions may have changed since it was first invoked. So instead, we capture the permissions by saving CRED() in the user task, and then using it for the check through the secpolicy_* functions. ### Current implementation The current code calls CRED() to get the credential, which gets a pointer to the cred_t inside the current task and passes it to the worker task. However, it doesn't take a reference to the cred_t, and so expects that it won't change, and that the task continues to exist. In practice that is always the case, because we don't let the calling task return from the kernel until the work is done. For Linux, we also take a reference to the current task, because the Linux credential APIs for the most part do not check an arbitrary credential, but rather, query what a task can do. See secpolicy_zfs_proc(). Again, we don't take a reference on the task, just a pointer to it. ### Changes We change to calling crhold() on the task credential, and crfree() when we're done with it. This ensures it stays alive and unchanged for the duration of the call. 
On the Linux side, we change the main policy checking function priv_policy_ns() to use override_creds()/revert_creds() if necessary to make the provided credential active in the current task, allowing the standard task-permission APIs to do the needed check. Since the task pointer is no longer required, this lets us entirely remove secpolicy_zfs_proc() and the need to carry a task pointer around as well. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Kyle Evans <kevans@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> (cherry picked from commit `c8fa39b46c`)	2025-05-28 16:00:28 -07:00
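The hold/release pattern described in this commit can be sketched with a toy reference-counted credential. This is only an illustration under stated assumptions: `toy_cred_t`, `toy_crhold()`, `toy_crfree()`, `toy_dispatch()` and `toy_run()` are made-up names, not the kernel's `cred_t`, `crhold()` or `crfree()`, and the permission check is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy credential with a reference count; not the kernel's cred_t. */
typedef struct toy_cred {
	int refs;
	bool may_snapshot;
} toy_cred_t;

/* Take a reference so the credential outlives the caller's own use. */
static toy_cred_t *
toy_crhold(toy_cred_t *cr)
{
	assert(cr != NULL && cr->refs > 0);
	cr->refs++;
	return (cr);
}

/* Drop a reference; a real implementation would free at zero. */
static void
toy_crfree(toy_cred_t *cr)
{
	assert(cr != NULL && cr->refs > 0);
	cr->refs--;
}

/* Deferred work item that carries its own reference to the credential. */
typedef struct toy_task {
	toy_cred_t *cr;
} toy_task_t;

/* Runs in the user's context: capture and hold the credential. */
toy_task_t
toy_dispatch(toy_cred_t *caller_cr)
{
	toy_task_t t = { .cr = toy_crhold(caller_cr) };
	return (t);
}

/* Runs later on a worker: check the saved credential, then release it. */
bool
toy_run(toy_task_t *t)
{
	bool allowed = t->cr->may_snapshot;

	toy_crfree(t->cr);
	t->cr = NULL;
	return (allowed);
}
```

The check in `toy_run()` consults the saved credential rather than whatever thread happens to execute the task, which mirrors the reasoning in the Background section of the commit.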
Brian Atkinson	a77d641f01	Export correct symbols for Lustre Direct I/O Originally the Lustre ZFS OSD code was going to use zfs_uio_t structs for supporting Direct I/O with ZFS. However, this has changed to using abd_t structs instead. This exports the proper symbols that will be used by the Lustre ZFS OSD code. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #17256 (cherry picked from commit `7031a48c70`)	2025-05-28 16:00:28 -07:00
Alexander Motin	c2424f8d1a	Improve L2 caching control for prefetched indirects dbuf_prefetch_impl() should look on level of current indirect, not the target prefetch level. dbuf_prefetch_indirect_done() should call dnode_level_is_l2cacheable() if we have dpa_dnode to pass it. It should fix some both false positive and negative L2ARC caching. While there, fix redacted feature activation assertions. One was always true, while another could give false positive if dpa_dnode is NULL. George Amanakis <gamanakis@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17204 (cherry picked from commit `a497c5fc8b`)	2025-05-28 16:00:28 -07:00
Alexander Motin	602fecc316	Prefer embedded blocks to dedup Since embedded blocks introduction 11 years ago, their writing was blocked if dedup is enabled. After searching through the modern code I see no reason for this restriction to exist. Same time embedded blocks are dramatically cheaper. Even regular write of so small blocks would likely be cheaper than deduplication, even if the last is successful, not mentioning otherwise. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17113 (cherry picked from commit `09f4dd06c3`)	2025-05-28 16:00:28 -07:00
Alexander Motin	588fa16830	ZAP: Reduce leaf array and free chunks fragmentation Previous implementation of zap_leaf_array_free() put chunks on the free list in reverse order. Also zap_leaf_transfer_entry() and zap_entry_remove() were freeing name and value arrays in reverse order. Together this created a mess in the free list, making following allocations much more fragmented than necessary. This patch re-implements zap_leaf_array_free() to keep existing chunks order, and implements non-destructive zap_leaf_array_copy() to be used in zap_leaf_transfer_entry() to allow properly ordered freeing name and value arrays there and in zap_entry_remove(). With this change test of some writes and deletes shows percent of non-contiguous chunks in DDT reducing from 61% and 47% to 0% and 17% for arrays and frees respectively. Sure some explicit sorting could do even better, especially for ZAPs with variable-size arrays, but it would also cost much more, while this should be very cheap. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16766 (cherry picked from commit `9a81484e35`)	2025-05-28 16:00:28 -07:00
Tony Hutter	36864e3d77	GCC 15: Fix unterminated-string-initialization (#17244 ) Fix build errors on Fedora 42 like: module/zcommon/zfs_valstr.c:193:16: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (3 chars into 2 available) The arrays in zpool_vdev_os.c and zfs_valstr.c don't need to be NULL terminated, but we do so to make GCC happy. Closes: #17242 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:45 -07:00
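A minimal sketch of the kind of change this commit describes, assuming a two-character table similar to the one in the quoted error. The array name `toy_bits` and the helper `toy_bit_char()` are invented for illustration and do not come from `zfs_valstr.c` or `zpool_vdev_os.c`.

```c
#include <assert.h>
#include <stddef.h>

/*
 * An initializer of this shape leaves no room for the NUL terminator and
 * draws the diagnostic quoted in the commit ("3 chars into 2 available"):
 *
 *	static const char toy_bits[2] = "ab";
 *
 * Sizing the array one byte larger keeps the terminator.
 */
static const char toy_bits[3] = "ab";

static_assert(sizeof (toy_bits) == 3, "room for the trailing NUL");

/* Number of meaningful characters, excluding the trailing NUL. */
#define	TOY_NBITS	(sizeof (toy_bits) - 1)

/* Return the character for bit index `i`, or '?' when out of range. */
char
toy_bit_char(size_t i)
{
	return (i < TOY_NBITS ? toy_bits[i] : '?');
}
```

The alternative hinted at by the error text is to mark the array with the `nonstring` attribute; the commit message says the arrays were NUL-terminated instead.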
Rob Norris	fea534d1d0	gcm_avx_init: zero the ghash state after hashing the IV IVs != 96 bits get hashed with GHASH to bring them to 96 bits. Any call to GHASH will mix the ghash state in gcm_ghash. This is expected to be zero at first use in an encrypt or decrypt operation, so it needs to be zeroed after using GHASH in setup. gcm_init() does this, but gcm_avx_init() zeroed it before setup, not after, resulting in incorrect encrypt/decrypt results when using AVX GCM with an IV != 96 bits. OpenZFS _always_ uses a 96 bit IV (ZIO_DATA_IV_LEN) so this will never have been hit in any real-world use, which is extremely fortunate, as we would have incorrectly-encrypted data on-disk. Still, as long as we have this code here we should make sure it's correct. Thanks-to: Joel Low <joel@joelsplace.sg> Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Attila Fülöp <attila@fueloep.org>	2025-04-16 09:59:45 -07:00
Tony Hutter	20f00819f3	Linux 6.0 compat: Check for migratepage VFS (#17217 ) The 6.0 kernel removes the 'migratepage' VFS op. Check for migratepage. Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-04-16 09:59:45 -07:00
Paul Dagnelie	fbac52e1e9	Fix FDT rollback to not overwrite unnecessary fields (#17205 ) When a dedup write fails, we try to roll the DDT entry back to a known good state. However, this also rolls the refcounts and the last-update time back to the state they were at when we started this write. This doesn't appear to be able to cause any refcount leaks (after the fix in 17123). This PR prevents that from happening by only rolling back the parts of the DDT entry that have been updated by the write so far. Sponsored-by: iXsystems, Inc. Sponsored-by: Klara, Inc. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-04-16 09:59:45 -07:00
Rob Norris	8539bdf568	[2.3.2] uconv: add SPDX license tag Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-04-16 09:59:45 -07:00
Martin Matuška	c312a988b5	freebsd: unbreak module/Makefile.bsd build on 15-CURRENT-arm64 - don't include foreign machine assembly files - reduce diff to FreeBSD module Makefile Discovered in FreeBSD port filesystems/openzfs-kmod Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #17219	2025-04-16 09:59:45 -07:00
Paul Dagnelie	bd5465e4eb	Fix nonrot property being incorrectly unset (#17206 ) When opening a vdev and setting the nonrot property, we used to wait for each child to be opened before examining its nonrot property. When the change was made to open vdevs asynchronously, we didn't move the nonrot check out of the main loop. As a result, the nonrot property is almost always set to false, regardless of the actual type of the underlying disks. The fix is simply to move the nonrot check to a separate loop after the taskq has been waited for. Sponsored-by: Klara, Inc. Sponsored-by: Eshtek, Inc. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2025-04-16 09:59:45 -07:00
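The ordering problem in this commit can be illustrated with a single-threaded toy model, in which "dispatch" queues an open and "wait" completes all queued opens. The `toy_` types and functions are assumptions for the example, not the real vdev open path or taskq API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy child device; `nonrot` is only meaningful once `opened` is true. */
typedef struct toy_child {
	bool opened;
	bool hw_nonrot;		/* what the open will discover */
	bool nonrot;		/* result filled in by the open */
} toy_child_t;

/* Stand-in for queueing an asynchronous open: nothing is filled in yet. */
static void
toy_open_dispatch(toy_child_t *c)
{
	c->opened = false;
	c->nonrot = false;
}

/* Stand-in for waiting on the taskq: every queued open completes. */
static void
toy_open_wait(toy_child_t *c, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		c[i].nonrot = c[i].hw_nonrot;
		c[i].opened = true;
	}
}

/* Buggy shape: read each child's nonrot inside the dispatch loop. */
bool
toy_parent_nonrot_buggy(toy_child_t *c, size_t n)
{
	bool nonrot = true;

	for (size_t i = 0; i < n; i++) {
		toy_open_dispatch(&c[i]);
		nonrot &= c[i].nonrot;	/* too early: open not finished */
	}
	toy_open_wait(c, n);
	return (nonrot);
}

/* Fixed shape: dispatch, wait, then examine children in a second loop. */
bool
toy_parent_nonrot_fixed(toy_child_t *c, size_t n)
{
	bool nonrot = true;

	for (size_t i = 0; i < n; i++)
		toy_open_dispatch(&c[i]);
	toy_open_wait(c, n);
	for (size_t i = 0; i < n; i++) {
		assert(c[i].opened);
		nonrot &= c[i].nonrot;
	}
	return (nonrot);
}
```

With two non-rotational children, the buggy variant reports `false` because it samples the field before any open has completed, matching the "almost always set to false" symptom in the commit message.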
Alexander Motin	6f2080f1ab	Fix lock reversal on device removal cancel FreeBSD kernel's WITNESS code detected lock ordering violation in spa_vdev_remove_cancel_sync(). It took svr_lock while holding ms_lock, which is opposite to other places. I was thinking to resolve it similar to #17145, but looking closer I don't think we even need svr_lock at that point, since we already asserted svr_allocd_segs is empty, and we don't need to add there segments we are going to call free_mapped_segment_cb for. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17164	2025-04-16 09:59:45 -07:00
Paul Dagnelie	9f0be8fca0	Fix dspace underflow bug Since spa_dspace accounts only normal allocation class space, spa_nonallocating_dspace should do the same. Otherwise we may get negative overflow or respective assertion spa_update_dspace() if removed special/dedup vdev is bigger than all normal class space. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17183	2025-04-16 09:59:45 -07:00
aokblast	153c982aac	spl_vfs: fix vrele task runner signature mismatch Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org> Closes #17101	2025-04-16 09:59:45 -07:00
Ameer Hamza	ab455c7b80	zed: Ensure spare activation after kernel-initiated device removal In addition to hotplug events, the kernel may also mark a failing vdev as REMOVED. This was observed in a customer report and reproduced by forcing the NVMe host driver to disable the device after a failed reset due to command timeout. In such cases, the spare was not activated because the device had already transitioned to a REMOVED state before zed processed the event. To address this, explicitly attempt hot spare activation when the kernel marks a device as REMOVED. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17187	2025-04-16 09:59:45 -07:00
Alexander Motin	e6f8c1f612	Block remap for cloned blocks on device removal When after device removal we handle block pointers remap, skip blocks that might be cloned. BRTs are indexed by vdev id and offset from block pointer's DVA[0]. So if we start addressing the same block by some different DVA, we won't get the proper reference counter. As result, we might either remap the block twice, that may result in assertion during indirect mapping condense, or free it prematurely, that may result in data overwrite, or free it twice, that may result in assertion in spacemap code. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15604 Closes #17180	2025-04-16 09:59:45 -07:00
Pavel Snajdr	c22f5c1c55	Linux: Fix zfs_prune panics v2 (#17121 ) It turns out that approach taken in the original version of the patch was wrong. So now, we're taking approach in-line with how kernel actually does it - when sb is being torn down, access to it is serialized via sb->s_umount rwsem, only when that lock is taken is it okay to work with s_flags - and the other mistake I was doing was trying to make SB_ACTIVE work, but apparently the kernel checks the negative variant - not SB_DYING and not SB_BORN. Kernels pre-6.6 don't have SB_DYING, but check if sb is hashed instead. Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:45 -07:00
Alexander Motin	a848b05b13	Fix deadlock on I/O errors during device removal spa_vdev_remove_thread() should not hold svr_lock while loading a metaslab. It may block ZIO threads, required to handle metaslab loading, at least in case of read errors causing recovery writes. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17145	2025-04-16 09:59:45 -07:00
Alan Somers	7cc60afb0b	Always perform bounds-checking in metaslab_free_concrete The vd->vdev_ms access can overflow due to on-disk corruption, not just due to programming bugs. So it makes sense to check its boundaries even in production builds. Sponsored by: ConnectWise Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17136	2025-04-16 09:59:45 -07:00
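A hedged sketch of the idea in this commit: an index derived from on-disk data is range-checked with ordinary runtime code, so the check survives in builds where assertions are compiled out. The struct layout, the `toy_ms_lookup()` name and the `EINVAL` return are assumptions; the commit message does not say how `metaslab_free_concrete` reports the failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy vdev with an array of metaslab slots. */
typedef struct toy_vdev {
	uint64_t ms_shift;	/* log2 of the metaslab size */
	uint64_t ms_count;	/* number of entries in ms[] */
	int *ms;		/* toy metaslab array */
} toy_vdev_t;

/*
 * Map an offset read from disk to a metaslab slot. Because the offset may
 * be corrupt, the range check is a plain `if`, not a debug-only assertion.
 */
int
toy_ms_lookup(const toy_vdev_t *vd, uint64_t offset, int **msp)
{
	assert(vd != NULL && msp != NULL);	/* programming errors only */

	uint64_t idx = offset >> vd->ms_shift;
	if (idx >= vd->ms_count)
		return (EINVAL);	/* hypothetical error choice */
	*msp = &vd->ms[idx];
	return (0);
}
```

The split shown here, assertions for caller mistakes and an always-on check for untrusted input, is one common way to express the distinction the commit draws between programming bugs and on-disk corruption.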
Rob Norris	9e009acbdc	dmu_tx: rename dmu_tx_assign() flags from TXG_* to DMU_TX_* (#17143 ) This helps to avoids confusion with the similarly-named txg_wait_synced(). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-04-16 09:59:45 -07:00
Rob Norris	76d0c74c35	SPDX: license tags: LicenseRef-OpenZFS-ThirdParty-PublicDomain SPDX have repeatedly rejected the creation of a tag for a public domain dedication, as not all dedications are clear and unambiguious in their meaning and not all jurisdictions permit relinquishing a copyright anyway. A reasonably common workaround appears to be to create a local (project-specific) identifier to convey whatever meaning the project wishes it to. To cover OpenZFS' use of third-party code with a public domain dedication, we use this custom tag. Further reading: - https://github.com/spdx/old-wiki/blob/main/Pages/Legal%20Team/Decisions/Dealing%20with%20Public%20Domain%20within%20SPDX%20Files.md - https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/ - https://cr.yp.to/spdx.html Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:45 -07:00
Rob Norris	c30a228608	SPDX: license tags: OpenSSL-standalone Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:45 -07:00
Rob Norris	846796c424	SPDX: license tags: Brian-Gladman-3-Clause Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	e4a2ab7c90	SPDX: license tags: BSD-2-Clause OR GPL-2.0-only Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	38468bbad6	SPDX: license tags: BSD-3-Clause OR GPL-2.0-only Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	6b2c046d18	SPDX: license tags: GPL-2.0-or-later Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	9070f890e1	SPDX: license tags: Apache-2.0 Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	091da72c66	SPDX: license tags: MIT Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	8cacac7ed4	SPDX: license tags: BSD-3-Clause Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	865ca576ab	SPDX: license tags: BSD-2-Clause Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	9530eb64e0	SPDX: license tags: CDDL-1.0 Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-04-16 09:59:44 -07:00
Rob Norris	3062b3866c	spa_sync_props: remove pool userprops by setting empty-string People have noted there's no way to remove a pool userprop, only zero it. Turns vdev userprops had a method, by setting empty-string. So this makes pool userprops follow the same behaviour. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16887	2025-04-16 09:59:43 -07:00
shodanshok	52f3f92bbf	Add receive:append permission for limited receive Force receive (zfs receive -F) can rollback or destroy snapshots and file systems that do not exist on the sending side (see zfs-receive man page). This means an user having the receive permission can effectively delete data on receiving side, even if such user does not have explicit rollback or destroy permissions. This patch adds the receive:append permission, which only permits limited, non-forced receive. Behavior for users with full receive permission is not changed in any way. Fixes #16943 Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #17015	2025-04-02 17:06:40 -07:00
Alexander Motin	53cbf06d68	Fix deduplication of overridden blocks Implementation of DDT pruning introduced verification of DVAs in a block pointer during ddt_lookup() to not by mistake free previous pruned incarnation of the entry. But when writing a new block in zio_ddt_write() we might have the DVAs only from override pointer, which may never have "D" flag to be confused with pruned DDT entry, and we'll abandon those DVAs if we find a matching entry in DDT. This fixes deduplication for blocks written via dmu_sync() for purposes of indirect ZIL write records, that I have tested. And I suspect it might actually allow deduplication for Direct I/O, even though in an odd way -- first write block directly and then delete it later during TXG commit if found duplicate, which part I haven't tested. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17120	2025-04-02 17:05:24 -07:00
Rob Norris	6503f8c6f0	Linux/vnops: implement STATX_DIOALIGN This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both memory and IO alignment. For memory, it's always PAGE_SIZE. For IO, we return the current block size for the file, which is the required alignment for an arbitrary block, and for the first block we'll fall back to the ARC when necessary, so it should always work. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16972	2025-04-02 17:04:14 -07:00
Alan Somers	ad07b09cc3	Verify every block pointer is either embedded, hole, or has a valid DVA Now instead of crashing when attempting to read the corrupt block pointer, ZFS will return ECKSUM, in a stack that looks like this: ``` none:set-error zfs.ko`arc_read+0x1d82 zfs.ko`dbuf_read+0xa8c zfs.ko`dmu_buf_hold_array_by_dnode+0x292 zfs.ko`dmu_read_uio_dnode+0x47 zfs.ko`zfs_read+0x2d5 zfs.ko`zfs_freebsd_read+0x7b kernel`VOP_READ_APV+0xd0 kernel`vn_read+0x20e kernel`vn_io_fault_doio+0x45 kernel`vn_io_fault1+0x15e kernel`vn_io_fault+0x150 kernel`dofileread+0x80 kernel`sys_read+0xb7 kernel`amd64_syscall+0x424 kernel`0xffffffff810633cb ``` This patch should hopefully also prevent such corrupt block pointers from being written to disk in the first place. And in zdb, don't crash when printing a block pointer with no valid DVAs. If a block pointer isn't embedded yet doesn't have any valid DVAs, that's a data corruption bug. zdb should be able to handle the situation gracefully. Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR. This check, which compares the asizes of two different DVAs within the same BP, was added by illumos-gate commit b24ab67[^1], and I can't understand why. It doesn't appear to do anything useful, so remove it. [^1]: `b24ab67627` Fixes #17077 Sponsored by: ConnectWise Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17078	2025-04-02 17:03:01 -07:00
Alexander Motin	f145371660	Check portable objset MAC even if local is zeroed PR #14161 made spa_do_crypt_objset_mac_abd() to ignore MAC errors if local MAC can not be calculated at the time. But it does not mean we should also ignore portable MAC errors there. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17122	2025-04-02 17:03:01 -07:00
Rob Norris	5f7037067e	Revert "zinject: count matches and injections for each handler" (#17137 ) Adding fields to zinject_record_t unexpectedly extended zfs_cmd_t, preventing some things working properly with 2.3.1 userspace tools against 2.3.0 kernel module. This reverts commit `fabdd502f4`. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-03-24 13:49:10 -07:00
Ameer Hamza	637f918211	arc: avoid possible deadlock in arc_read In l2arc_evict(), the config lock may be acquired in reverse order (e.g., first the config lock (writer), then a hash lock) unlike in arc_read() during scenarios like L2ARC device removal. To avoid deadlocks, if the attempt to acquire the config lock (reader) fails in arc_read(), release the hash lock, wait for the config lock, and retry from the beginning. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17071	2025-02-28 00:42:29 +05:00
Paul Dagnelie	7e72312eff	Don't try to get mg of hole vdev in removal Don't try to get mg of hole vdev in removal Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17080	2025-02-28 00:42:29 +05:00
aokblast	383256c329	spa: fix signature mismatch for spa_boot_init as eventhandler required Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org> Closes #17088	2025-02-28 00:42:29 +05:00
Alexander Motin	c2668b2d10	Better fill empty metaslabs Before this change zfs_metaslab_switch_threshold tunable switched metaslabs each time ones index reduced by two (which means biggest contiguous chunk reduced to 1/4). It is a good idea to balance metaslabs fragmentation. But for empty metaslabs (having power- of-2 sizes) this means switching when they get just below the half of their capacity. Inspection with zdb after filling new pool to half capacity shown most of its metaslabs filled to half capacity. I consider this sub-optimal for pool fragmentation in a long run. This change blocks the metaslabs switching if most of the metaslab free space (15/16) is represented by a single contiguous range. Such metaslab should not be considered fragmented until it actually fail some big allocation. More contiguous filling should improve data locality and increase time before previously filled and partially freed metaslab is touched again, giving it more time to free more contiguous chunks for lower fragmentation. It should also slightly reduce spacemap traffic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17081	2025-02-28 00:42:29 +05:00
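The 15/16 rule in this commit can be sketched as a small predicate. This is an approximation under stated assumptions: the function names, the `index_dropped` input and the integer rounding are invented for the example, and the real allocator code may compute the threshold differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when one contiguous range accounts for roughly 15/16 or more of the
 * metaslab's free space.
 */
bool
toy_mostly_contiguous(uint64_t free_space, uint64_t largest_range)
{
	assert(largest_range <= free_space);

	/* free_space - free_space / 16 avoids overflowing a multiply. */
	return (largest_range >= free_space - free_space / 16);
}

/*
 * Decide whether to move on to another metaslab. `index_dropped` stands in
 * for the existing threshold test (hypothetical input for this sketch).
 */
bool
toy_should_switch(bool index_dropped, uint64_t free_space,
    uint64_t largest_range)
{
	return (index_dropped &&
	    !toy_mostly_contiguous(free_space, largest_range));
}
```

For example, with 1600 units free, a largest range of 1500 keeps allocation on the current metaslab, while a largest range of 800 allows the switch once the existing threshold has tripped.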
Rob Norris	7ea899be04	vdev_file: make FLUSH and TRIM asynchronous zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the zio_taskq thread is active and blocked both while waiting for the IO call and then while calling zio_execute() for the next stage. This is a particular issue for FLUSH, as the z_flush_iss queue typically only has one thread; multiple flushes arriving at once can cause long delays if the underlying fsync() response is particularly slow. To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq, just as we do for reads and writes. Further, we return all results through zio_interrupt(), so neither the issue nor the file taskqs are blocked. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17064	2025-02-28 00:42:29 +05:00

4844 Commits