Commit Graph

1378 Commits

Don Brady
6b67a5bdd3 During pool export flush the ARC asynchronously
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid(), which
does not check for uniqueness against the spa_load_guid of other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we now guarantee that
the value is always unique, and additionally that it is not still in
use by an async ARC flush task.
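
A minimal sketch of that guarantee, assuming a hypothetical helper
spa_load_guid_exists() that checks imported pools and pending async
ARC flush tasks (only spa_generate_load_guid() is named above):

```c
/*
 * Illustrative sketch only; spa_load_guid_exists() is an assumed
 * helper name, not confirmed to be the actual implementation.
 */
static uint64_t
spa_generate_unique_load_guid(void)
{
	uint64_t guid;

	do {
		guid = spa_generate_load_guid();
		/* Retry while another pool, or a still-running async
		 * ARC flush task, is using this value. */
	} while (spa_load_guid_exists(guid));

	return (guid);
}
```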

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16215
2025-06-17 10:50:26 -07:00
Alexander Motin
0c9cdd1606 Improve block cloning transactions accounting
The previous dmu_tx_count_clone() was broken, stating that cloning is
similar to free.  While they might be similar in some respects, cloning
is not net-free.  It will likely consume space and memory, and
unlike free it will do so whether or not the destination already has
the blocks (usually it does not, so the previous code did nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17431
2025-06-17 10:50:26 -07:00
Attila Fülöp
e9c1e08e07 Linux build: silence objtool warnings
After #17401 the Linux build produces some stack related warnings.

Silence them with the `STACK_FRAME_NON_STANDARD` macro.
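
For reference, the macro is applied after the function definition; the
function name below is illustrative:

```c
#include <linux/objtool.h>

static void
zfs_simd_heavy_function(void)	/* illustrative name */
{
	/* code with a stack layout objtool cannot verify */
}
/* Tell objtool to skip stack validation for this function. */
STACK_FRAME_NON_STANDARD(zfs_simd_heavy_function);
```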

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17410
2025-06-17 10:50:26 -07:00
Ameer Hamza
1215c3b609 Expose dataset encryption status via fast stat path
In truenas_pylibzfs, we query the list of encrypted datasets several
times, which is expensive. This commit exposes a public API,
zfs_is_encrypted(), to get the encryption status from the fast stat
path without having to refresh the properties.
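
A hedged usage sketch; the signature is assumed to be
boolean_t zfs_is_encrypted(zfs_handle_t *), which the message does not
spell out:

```c
#include <libzfs.h>
#include <stdio.h>

/* Sketch: report encryption status without refreshing properties. */
static void
print_encryption_status(libzfs_handle_t *g_zfs, const char *dsname)
{
	zfs_handle_t *zhp = zfs_open(g_zfs, dsname, ZFS_TYPE_DATASET);

	if (zhp == NULL)
		return;
	printf("%s: %s\n", dsname,
	    zfs_is_encrypted(zhp) ? "encrypted" : "not encrypted");
	zfs_close(zhp);
}
```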

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17368
2025-06-17 10:49:40 -07:00
Alexander Motin
97c1fb6ad5 ARC: Notify dbuf cache about target size reduction
The ARC target size might drop significantly under memory pressure,
especially if the current ARC size was much smaller than the target.
Since the dbuf cache size is a fraction of the target ARC size, it
might need eviction too.  Aside from the memory freed by the dbuf
eviction itself, this might help the ARC by making more buffers
evictable.
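
Conceptually, a sketch of the idea (not the actual code; the
notification function name is an assumption):

```c
/* Sketch: when the ARC target shrinks, the dbuf cache target
 * (a fraction of arc_c) shrinks too, so nudge dbuf eviction. */
static void
arc_reduce_target_size_sketch(uint64_t to_free)
{
	/* ... lower arc_c by to_free ... */

	/* New behaviour described above: notify the dbuf cache so it
	 * can evict down to its (now smaller) target. */
	dbuf_cache_reduce_target_size();	/* assumed name */
}
```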

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17314
(cherry picked from commit 89a8a91582)
2025-05-28 16:00:28 -07:00
Rob Norris
ce9cd12c97 txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.

This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.

Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.

The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).
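
A hedged sketch of the new calling convention (dp and txg come from the
caller's context; the EINTR return value for the signal case is an
assumption based on the description above):

```c
/* Wait for the txg to sync, but return early on a signal. */
int error = txg_wait_synced_flags(dp, txg, TXG_WAIT_SIGNAL);

if (error == EINTR) {	/* assumed error code for "signalled" */
	/* The wait was interrupted; unwind and report to the caller. */
}
```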

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit a7de203c86)
2025-05-28 16:00:28 -07:00
Rob Norris
c85f2fd531 cred: properly pass and test creds on other threads (#17273)
### Background

Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.

Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.

However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
instead, we capture the permissions by saving CRED() in the user task,
and then using it for the check through the secpolicy_* functions.

### Current implementation

The current code calls CRED() to get the credential, which gets a
pointer to the cred_t inside the current task and passes it to the
worker task. However, it doesn't take a reference to the cred_t, and so
expects that it won't change, and that the task continues to exist. In
practice that is always the case, because we don't let the calling task
return from the kernel until the work is done.

For Linux, we also take a reference to the current task, because the
Linux credential APIs for the most part do not check an arbitrary
credential, but rather, query what a task can do. See
secpolicy_zfs_proc(). Again, we don't take a reference on the task, just
a pointer to it.

### Changes

We change to calling crhold() on the task credential, and crfree() when
we're done with it. This ensures it stays alive and unchanged for the
duration of the call.

On the Linux side, we change the main policy checking function
priv_policy_ns() to use override_creds()/revert_creds() if necessary to
make the provided credential active in the current task, allowing the
standard task-permission APIs to do the needed check. Since the task
pointer is no longer required, this lets us entirely remove
secpolicy_zfs_proc() and the need to carry a task pointer around as
well.
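
The pattern described above, as a simplified sketch (not verbatim
OpenZFS code): crhold()/crfree() and Linux's
override_creds()/revert_creds() are real APIs, but the surrounding
flow is illustrative.

```c
/* In the invoking user task: capture and pin the credential. */
cred_t *cr = CRED();
crhold(cr);
/* ... dispatch the work, carrying cr along ... */

/* In the worker kernel thread (Linux side, simplified): make the
 * saved credential active so standard task-permission APIs apply. */
const cred_t *saved = override_creds(cr);
/* ... perform secpolicy_* permission checks ... */
revert_creds(saved);

/* When the work is fully done, drop the hold. */
crfree(cr);
```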

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
(cherry picked from commit c8fa39b46c)
2025-05-28 16:00:28 -07:00
Paul Dagnelie
fbac52e1e9 Fix FDT rollback to not overwrite unnecessary fields (#17205)
When a dedup write fails, we try to roll the DDT entry back to a known
good state. However, this also rolls the refcounts and the last-update
time back to the state they were in when we started this write. This
doesn't appear to be able to cause any refcount leaks (after the fix in
#17123). This PR prevents it from happening by only rolling back the
parts of the DDT entry that have been updated by the write so far.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-16 09:59:45 -07:00
Paul Dagnelie
9f0be8fca0 Fix dspace underflow bug
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same.  Otherwise we may get
a negative overflow, or trip the corresponding assertion in
spa_update_dspace(), if a removed special/dedup vdev is bigger than
all normal class space.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17183
2025-04-16 09:59:45 -07:00
Ameer Hamza
ab455c7b80 zed: Ensure spare activation after kernel-initiated device removal
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17187
2025-04-16 09:59:45 -07:00
Rob Norris
9e009acbdc dmu_tx: rename dmu_tx_assign() flags from TXG_* to DMU_TX_* (#17143)
This helps to avoid confusion with the similarly-named
txg_wait_synced().
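
A before/after sketch (tx is an already-created dmu_tx_t; the specific
flag pair TXG_WAIT -> DMU_TX_WAIT is inferred from the subject line):

```c
int error;

/* Before: flags shared the TXG_* namespace with txg_wait_synced(). */
error = dmu_tx_assign(tx, TXG_WAIT);

/* After: dedicated DMU_TX_* flags make the distinction clear. */
error = dmu_tx_assign(tx, DMU_TX_WAIT);
```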

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-16 09:59:45 -07:00
Rob Norris
76d0c74c35 SPDX: license tags: LicenseRef-OpenZFS-ThirdParty-PublicDomain
SPDX has repeatedly rejected the creation of a tag for a public domain
dedication, as not all dedications are clear and unambiguous in their
meaning, and not all jurisdictions permit relinquishing a copyright
anyway.

A reasonably common workaround appears to be to create a local
(project-specific) identifier to convey whatever meaning the project
wishes it to. To cover OpenZFS' use of third-party code with a public
domain dedication, we use this custom tag.
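
For example, such a file would carry a header comment like:

```c
/* SPDX-License-Identifier: LicenseRef-OpenZFS-ThirdParty-PublicDomain */
```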

Further reading:
- https://github.com/spdx/old-wiki/blob/main/Pages/Legal%20Team/Decisions/Dealing%20with%20Public%20Domain%20within%20SPDX%20Files.md
- https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/
- https://cr.yp.to/spdx.html

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:45 -07:00
Rob Norris
6b2c046d18 SPDX: license tags: GPL-2.0-or-later
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
091da72c66 SPDX: license tags: MIT
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
8cacac7ed4 SPDX: license tags: BSD-3-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
Rob Norris
9530eb64e0 SPDX: license tags: CDDL-1.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:59:44 -07:00
shodanshok
52f3f92bbf Add receive:append permission for limited receive
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see the zfs-receive
man page). This means a user with the receive permission can
effectively delete data on the receiving side, even if that user does
not have explicit rollback or destroy permissions.

This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.

Fixes #16943
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17015
2025-04-02 17:06:40 -07:00
Alexander Motin
53cbf06d68 Fix deduplication of overridden blocks
The implementation of DDT pruning introduced verification of the DVAs
in a block pointer during ddt_lookup(), so as not to mistakenly free a
previously pruned incarnation of the entry.  But when writing a new
block in zio_ddt_write() we might have DVAs only from the override
pointer, which may never have the "D" flag and so cannot be confused
with a pruned DDT entry, yet we would abandon those DVAs if we found
a matching entry in the DDT.

This fixes deduplication for blocks written via dmu_sync() for the
purposes of indirect ZIL write records, which I have tested.  And
I suspect it might actually allow deduplication for Direct I/O,
even though in an odd way -- first the block is written directly and
then deleted later during TXG commit if a duplicate is found, which
part I haven't tested.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17120
2025-04-02 17:05:24 -07:00
Rob Norris
6503f8c6f0 Linux/vnops: implement STATX_DIOALIGN
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.
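
A userspace sketch of querying these values via statx(2) (a standard
Linux API; error handling kept minimal):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
	struct statx stx;

	if (argc < 2)
		return (1);
	if (statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &stx) != 0) {
		perror("statx");
		return (1);
	}
	if (stx.stx_mask & STATX_DIOALIGN) {
		printf("O_DIRECT memory alignment: %u\n",
		    stx.stx_dio_mem_align);
		printf("O_DIRECT offset alignment: %u\n",
		    stx.stx_dio_offset_align);
	}
	return (0);
}
```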

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16972
2025-04-02 17:04:14 -07:00
Alan Somers
ad07b09cc3 Verify every block pointer is either embedded, hole, or has a valid DVA
Now, instead of crashing when attempting to read a corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:

```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```

This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.
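
The invariant in the subject line reads roughly as the following
predicate (the macros are real ZFS ones; the function itself is an
illustrative sketch, not the actual patch):

```c
/* Sketch: a well-formed block pointer is embedded, a hole, or
 * carries at least one valid DVA. */
static boolean_t
bp_is_wellformed(const blkptr_t *bp)
{
	return (BP_IS_EMBEDDED(bp) || BP_IS_HOLE(bp) ||
	    DVA_IS_VALID(&bp->blk_dva[0]));
}
```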

And in zdb, don't crash when printing a block pointer with no valid
DVAs.  If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug.  zdb should be able to handle the
situation gracefully.

Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR.  This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why.  It doesn't appear to do anything useful, so remove it.

[^1]: b24ab67627

Fixes		#17077
Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17078
2025-04-02 17:03:01 -07:00
Rob Norris
5f7037067e
Revert "zinject: count matches and injections for each handler" (#17137)
Adding fields to zinject_record_t unexpectedly extended zfs_cmd_t,
preventing some things working properly with 2.3.1 userspace tools
against 2.3.0 kernel module.

This reverts commit fabdd502f4.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-24 13:49:10 -07:00
Ameer Hamza
637f918211 arc: avoid possible deadlock in arc_read
In l2arc_evict(), the locks may be acquired in the reverse of the
arc_read() order (e.g., first the config lock as writer, then a hash
lock) during scenarios like L2ARC device removal. To avoid deadlocks,
if the attempt to acquire the config lock (reader) fails in
arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.
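
The retry scheme as a simplified fragment (spa_config_tryenter() and
SCL_L2ARC exist in OpenZFS; spa and hash_lock come from the caller's
context, and the control flow here is illustrative, not the verbatim
arc_read() code):

```c
top:
	mutex_enter(hash_lock);
	/* Try-lock keeps the arc_read() order: hash lock, then config
	 * lock.  On failure, back off to avoid deadlocking against
	 * l2arc_evict(), which takes them in the opposite order. */
	if (!spa_config_tryenter(spa, SCL_L2ARC, FTAG, RW_READER)) {
		mutex_exit(hash_lock);
		spa_config_enter(spa, SCL_L2ARC, FTAG, RW_READER);
		spa_config_exit(spa, SCL_L2ARC, FTAG);
		goto top;	/* retry from the beginning */
	}
```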

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17071
2025-02-28 00:42:29 +05:00
aokblast
383256c329 spa: fix signature mismatch for spa_boot_init as eventhandler required
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes #17088
2025-02-28 00:42:29 +05:00
Rob Norris
92d1686a2a include: move zio_priority_t into zfs.h
It's included by zfs.h, so it's effectively already part of it, but
it's not always installed as a userspace header, making zfs.h
effectively useless on its own. Might as well just combine them.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17066
2025-02-28 00:42:29 +05:00
Rob Norris
1bdce0410c range_tree: convert remaining range_* defs to zfs_range_*
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-02-28 00:42:29 +05:00
Ivan Volosyuk
55b21552d3 Linux 6.12 compat: Rename range_tree_* to zfs_range_tree_*
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.

Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-02-28 00:42:29 +05:00
Alan Somers
6e9911212e Make the vfs.zfs.vdev.raidz_impl sysctl cross-platform
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Sponsored by:	ConnectWise
Closes #16980
2025-02-25 22:32:11 +05:00
Rob Norris
a28f5a94f4 zinject: add "probe" device injection type
Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which are explicitly
excluded from injection. Even if it were possible, it would be awkward
to do, because a probe is a sequence of reads and writes.

This commit adds a new IO "type" to match for injection, which looks
for the ZIO_FLAG_PROBE flag instead. Any probe IO will match the
injection record and receive the wanted error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16947
2025-02-25 22:29:33 +05:00
Rob Norris
0dfcfe023e zinject: make iotype extendable
I'm about to add a new "type", and I need somewhere to put it!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16947
2025-02-25 22:29:02 +05:00
Peng Liu
404254bacb style: remove unnecessary spaces in sa.h
Removed three unnecessary spaces in the definition of the
sa_attr_reg_t structure to improve code style consistency
and adhere to OpenZFS coding standards.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Peng Liu <littlenewton6@gmail.com>
Closes #16955
2025-02-25 22:26:45 +05:00
Rob Norris
fabdd502f4 zinject: count matches and injections for each handler
When building tests with zinject, it can be quite difficult to work out
if you're producing the right kind of IO to match the rules you've set
up.

So, here we extend injection records to count the number of times a
handler matched the operation, and how often an error was actually
injected (i.e. after frequency and other exclusions are applied).

Then, display those counts in the `zinject` output.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16938
2025-02-25 22:25:24 +05:00
Ameer Hamza
b952e061df zvol: implement platform-independent part of block cloning
In Linux, block devices currently lack support for the
`copy_file_range` API because the kernel does not provide the
necessary functionality.
However, there is an ongoing upstream effort to address this
limitation: https://patchwork.kernel.org/project/dm-devel/cover/20240520102033.9361-1-nj.shetty@samsung.com/.
We have adopted this upstream kernel patch into the TrueNAS kernel and
made some additional modifications to enable block cloning specifically
for the zvol block device. This patch implements the platform-
independent portions of these changes for inclusion in OpenZFS.
This patch does not introduce any new functionality directly into
OpenZFS. The `TX_CLONE_RANGE` replay capability is only relevant when
zvols are migrated to non-TrueNAS systems that support Clone Range
replay in the ZIL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16901
2024-12-29 11:53:45 -08:00
Alexander Motin
d90042dedb Allow dsl_deadlist_open() to return errors
In some cases, like dsl_dataset_hold_obj(), it is possible to handle
these errors, so failing to hold a dataset is better than a kernel
panic.  Some other places where these errors are still not handled,
only asserted, should be less dangerous, as they are believed
unreachable.

We have a user report about pool corruption leading to assertions
on these errors.  Hopefully this will make the behavior a bit nicer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16836
2024-12-05 09:33:21 -08:00
Mariusz Zaborski
3b0c1131ef Add ability to scrub from last scrubbed txg
Some users might want to scrub only new data because they would like
to know whether their new writes were corrupted.  This PR adds the
possibility to scrub only newly written data.

It introduces a new `last_scrubbed_txg` property, indicating the
transaction group (TXG) up to which the most recent scrub operation
has checked and repaired the dataset, so users can run a scrub only
from the last saved point. We use scn_max_txg and scn_min_txg,
which are already built into scrub, to accomplish that.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Closes #16301
2024-12-05 09:33:21 -08:00
Alexander Motin
1a5414ba2f BRT: More optimizations after per-vdev splitting
- With both pending and current AVL-trees being per-vdev and having
effectively identical comparison functions (the pending tree also
compared birth time, but I don't believe it is possible for them to
be different for the same offset within one transaction group), it
makes no sense to move entries from one to another.  Instead, inline
a dramatically simplified brt_entry_addref() into brt_pending_apply().
It no longer requires bv_lock, since there is nothing concurrent
with it at the time.  And it does not need to search the tree for the
previous entries, since it is the same tree: we already have the
entry and we know it is unique.
- Put brt_vdev_lookup() and brt_vdev_addref() into different tree
traversals to avoid false positives in the first due to the second's
entcount modifications.  This saves a dramatic amount of time when a
file is cloned for the first time, by not looking for non-existent
ZAP entries.
- Remove the avl_is_empty(bv_tree) check from brt_maybe_exists().  I
don't think it is needed, since by that time all added entries are
already accounted in bv_entcount.  The extra check must be producing
too many false positives for no reason.  Also we don't need bv_lock
there, since the bv_entcount pointer must be stable at this point,
and we don't care about false-positive races here, while a false
negative should be impossible, since all brt_vdev_addref() calls have
already completed by this point.  This dramatically reduces lock
contention on massive deletes of cloned blocks.  The only remaining
contention is between multiple parallel free threads calling
brt_entry_decref().
- Do not update the ZAP if the net change for a block over the TXG
was 0.  In combination with the above, this makes a file move between
datasets as cheap an operation as originally intended, if it fits
into one TXG.
- Do not allocate the per-vdev structures on pool creation or import
if the pool did not have active block cloning.  This saves a bit of
memory in a few cases.
- While here, add proper error handling in brt_load() on pool
import instead of assertions.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16773
2024-11-21 08:24:37 -08:00
Alexander Motin
409aad3f33 BRT: Rework structures and locks to be per-vdev
While the block cloning operation was made per-vdev from the
beginning, before this change most of its data was protected by two
pool-wide locks.  That created lots of lock contention in many
workloads.

This change makes most of the block cloning data structures per-vdev,
which allows them to be locked separately.  The only pool-wide lock
now is spa_brt_lock, protecting the array of per-vdev pointers and in
most cases taken as reader.  This also splits the per-vdev locks into
three different ones: bv_pending_lock protects the AVL-tree of pending
operations in open context, bv_mos_entries_lock protects the BRT ZAP
object while it is being prefetched, and bv_lock protects the rest
of the per-vdev context during the TXG commit process.  There should
be no functional difference aside from some optimizations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-21 08:24:37 -08:00
Alexander Motin
1917c26944 ZAP: Add by_dnode variants to lookup/prefetch_uint64
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-21 08:24:37 -08:00
Alexander Motin
3f9af023f6 L2ARC: Stop rebuild before setting spa_final_txg
Without doing that, there is a race window on export when a history
log write by a completed rebuild dirties a transaction beyond the
final one, triggering an assertion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16714
Closes #16782
2024-11-21 08:24:37 -08:00
Alexander Motin
f7675ae30f Remove hash_elements_max accounting from DBUF and ARC
Those values require global atomics to get the current hash_elements
values in a few of the hottest code paths, while in all these years I
never cared about them.  If somebody wants this, it should be easy to
get by periodic sampling, since neither ARC header nor DBUF
counts change so fast that it would be difficult to catch.

For now I've left the hash_elements_max kstat for the ARC, since it
was used/reported by arc_summary and removing it would break older
versions, but now it just reports the current value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16759
2024-11-21 08:24:37 -08:00
Alexander Motin
55cbd1f9bd Reduce dirty records memory usage
Small block workloads may use a very large number of dirty records.
During a simple block cloning test, due to BRT still using 4KB blocks,
I can easily see up to 2.5M of them in use.  Before this change the
dbuf_dirty_record_t structures representing them were allocated via
kmem_zalloc(), which rounded their size up to 512 bytes.

Introducing a specialized kmem cache allows the size to be reduced
from 512 to 408 bytes.  Additionally, since the override and raw
params in dirty records are mutually exclusive, putting them into a
union allows the structure size to be reduced down to 368 bytes,
increasing the saving to 28%, which can be 0.5GB or more of RAM.
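
A sketch of the allocation change under the assumption of a dedicated
cache (the cache name and init hook below are illustrative, not the
verbatim patch):

```c
static kmem_cache_t *dbuf_dirty_kmem_cache;	/* illustrative */

void
dbuf_dirty_cache_init(void)	/* hypothetical init hook */
{
	/* Exactly-sized cache: no rounding up to 512 bytes as with
	 * kmem_zalloc(sizeof (dbuf_dirty_record_t), KM_SLEEP). */
	dbuf_dirty_kmem_cache = kmem_cache_create("dbuf_dirty_record_t",
	    sizeof (dbuf_dirty_record_t), 0, NULL, NULL, NULL, NULL,
	    NULL, 0);
}
```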

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16694
2024-11-04 16:46:39 -08:00
Alexander Motin
7e3ce4efaa Pack dmu_buf_impl_t by 16 bytes
On 64-bit FreeBSD this reduces its size from 296 to 280 bytes.  On
small block workloads dbufs may consume gigabytes of ARC, and this
saves 5% of it.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16684
2024-11-01 09:48:42 -07:00
Brian Atkinson
26ecd8b993 Always validate checksums for Direct I/O reads
This fixes an oversight in the Direct I/O PR. There is nothing that
stops a process from manipulating the contents of a buffer for a
Direct I/O read while the I/O is in flight. This can lead to checksum
verify failures. However, the disk contents are still correct, and
this would lead to false reporting of checksum validation failures.

To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, the I/O request will be
reissued through the ARC. This allows for actual validation to happen
and removes any possibility of the buffer being manipulated after the
I/O has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported through zpool status -d
in the DIO column. Also, the zevent has been updated to have both:
1. dio_verify_wr -> checksum verification failure for writes
2. dio_verify_rd -> checksum verification failure for reads
This allows determining which I/O operation was the culprit for the
checksum verification failure. All DIO errors are reported only on the
top-level VDEV.

Even though FreeBSD can write-protect pages (stable pages), it still
has the same issue as Linux with Direct I/O reads.

This commit updates the following:
1. Propagates checksum failures for reads all the way up to the
   top-level VDEV.
2. Reports errors through zpool status -d as DIO.
3. Has two zevents for checksum verify errors with Direct I/O. One for
   read and one for write.
4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and
   handle ABD buffer contents validation the same as Linux.
5. Updated manipulate_user_buffer.c to also manipulate a buffer while a
   Direct I/O read is taking place.
6. Adds a new ZTS test case dio_read_verify that stress tests the new
   code.
7. Updated man pages.
8. Added an IMPLY statement to zio_checksum_verify() to make sure that
   Direct I/O reads are not issued as speculative.
9. Removed self healing through mirror, raidz, and dRAID VDEVs for
   Direct I/O reads.

This issue was first observed when installing a Windows 11 VM on a ZFS
dataset with the dataset property direct set to always. The zpool
devices would report checksum failures, but running a subsequent zpool
scrub would not repair any data and report no errors.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16598
2024-10-09 13:45:06 -07:00
Alexander Motin
42ce4b11e7
ZAP: Align za_name in zap_attribute_t to 8 bytes
Our code reading/writing there may not handle misaligned accesses
on platforms that care about it.  I don't see a point in complicating
it to satisfy UBSan in CI.  This alignment costs nothing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15921
Closes #16606
2024-10-04 11:06:26 -07:00
Rob Norris
224393a321
feature: large_microzap
In a4b21eadec we added the zap_micro_max_size tuneable to raise the
size at which "micro" (single-block) ZAPs are upgraded to "fat"
(multi-block) ZAPs. Before this, a microZAP was limited to 128KiB,
which was the old largest block size. The side effect of raising the
max size past 128KiB is that the microZAP may be stored in a large
block, requiring the large_blocks feature.

Unfortunately, this means that a backup stream created without the
--large-block (-L) flag to zfs send would split the microZAP block into
smaller blocks and send those, as is normal behaviour for large blocks.
This would be received correctly, but since microZAPs are limited to the
first block in the object by definition, the entries in the later blocks
would be inaccessible. For directory ZAPs, this gives the appearance of
files being lost.

This commit adds a feature flag, large_microzap, that must be enabled
for microZAPs to grow beyond 128KiB, and which will be activated the
first time that occurs. This feature is later checked when generating
the stream and if active, the send operation will abort unless
--large-block has also been requested.

Changing the limit still requires zap_micro_max_size to be changed. The
state of this flag effectively sets the upper value for this tuneable,
that is, if the feature is disabled, the tuneable will be clamped to
128KiB.

A stream flag is also added to ensure that the receiver also activates
its own feature flag upon receiving the stream. This is not strictly
necessary to _use_ the received microZAP, since it doesn't care how
large its block is, but it is required to send the microZAP object on,
otherwise the original problem occurs again.

Because it's difficult to reliably distinguish a microZAP from a fatZAP
from outside the ZAP code, and because it seems unlikely that most
users are affected (a fairly niche tuneable combined with what should be
an uncommon use of send), and for the sake of expediency, this change
activates the feature the first time a microZAP grows to use a large
block, and is never deactivated after that. This can be improved in the
future.

This commit changes nothing for existing pools that already have large
microZAPs. The feature will not be retroactively applied, but will be
activated the next time a microZAP grows past the limit.

Don't use large_blocks feature for enable/disable tests.  The
large_microzap depends on large_blocks, so it gets enabled as a
dependency, breaking the test. Instead use feature "longname", which has
the exact same feature characteristics.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16593
2024-10-02 20:47:11 -07:00
Brian Behlendorf
d34d4f97a8
snapdir: add 'disabled' value to make .zfs inaccessible
In some environments, just making the .zfs control dir hidden from sight
might not be enough. In particular, the following scenarios might
warrant not allowing access at all:
- old snapshots with wrong permissions/ownership
- old snapshots with exploitable setuid/setgid binaries
- old snapshots with sensitive contents

Introducing a new 'disabled' value that not only hides the control dir,
but prevents access to its contents by returning ENOENT solves all of
the above.

The new property value takes advantage of 'iuv' semantics ("ignore
unknown value") to automatically fall back to the old default value when
a pool is accessed by an older version of ZFS that doesn't yet know
about 'disabled' semantics.

I think that technically the zfs_dirlook change is enough to prevent
access, but preventing lookups and dir entries in an already opened .zfs
handle might also be a good idea to prevent races when modifying the
property at runtime.

Also add a zfs_snapshot_no_setuid parameter to control whether
automatically mounted snapshots have the setuid mount option set or
not.

This could be considered a partial fix for one of the scenarios
mentioned above.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Co-authored-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #3963
Closes #16587
2024-10-02 09:12:02 -07:00
Sanjeev Bagewadi
20232ecfaa Support for longnames for files/directories (Linux part)
This patch adds the ability for ZFS to support file/dir names up to
1023 bytes. This number is chosen so we can support up to 255 4-byte
characters. This new feature is represented by the new feature flag
feature@longname.

A new dataset property "longname" is also introduced to toggle
longname support for each dataset individually. This property can be
disabled, even if the dataset contains longname files. In that case,
new files cannot be created with long names, but existing longname
files can still be looked up.

Note that, to my knowledge, native Linux filesystems don't support
names longer than 255 bytes, so there might be programs unable to
work with longnames.

Note that the NFS server may need to use exportfs_get_name to
reconnect dentries, and the buffer being passed is limited to
NAME_MAX+1 (256). So NFS may not work when longname is enabled.

Note that the FreeBSD VFS layer imposes a name length limit of 255,
so even though we add code to support longnames here, they won't
actually work there.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15921
2024-10-01 13:40:27 -07:00
Sanjeev Bagewadi
3cf2bfa570 Allocate zap_attribute_t from kmem instead of stack
This patch is preparatory work for the long name feature. It changes
all users of zap_attribute_t to allocate it from kmem instead of the
stack. It also makes the zap_attribute_t and zap_name_t structures
variable length.
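
A hedged sketch of the resulting idiom; the
zap_attribute_alloc()/zap_attribute_free() helper names are an
assumption, while the zap_cursor_* calls are the existing API:

```c
/* Sketch: iterate a ZAP with a heap-allocated attribute struct
 * instead of a (now variable-length) stack variable. */
static void
dump_zap_entries(objset_t *os, uint64_t zapobj)
{
	zap_cursor_t zc;
	zap_attribute_t *za = zap_attribute_alloc();	/* assumed */

	for (zap_cursor_init(&zc, os, zapobj);
	    zap_cursor_retrieve(&zc, za) == 0;
	    zap_cursor_advance(&zc))
		printf("%s\n", za->za_name);
	zap_cursor_fini(&zc);
	zap_attribute_free(za);				/* assumed */
}
```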

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15921
2024-10-01 13:39:08 -07:00
Rob Norris
7e957fde73
send/recv: open up additional stream feature flags
The docs for drr_versioninfo have marked the top 32 bits as "reserved"
since its introduction (illumos/illumos-gate@9e69d7d). There's no
indication of why they're reserved, so it seems uncontroversial to make
a lot more flags available.

I'm keeping the top eight reserved, and explicitly calling them out as
such, so we can extend the header further in the future if we run out of
flags or want to do some kind of change that isn't about feature flags.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15454
2024-10-01 09:08:53 -07:00
Rob Norris
6f50f8e16b
zfs_log: add flex array fields to log record structs
ZIL log record structs (lr_XX_t) are frequently allocated with extra
space after the struct to carry variable-sized "payload" items.

Linux 6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime
bounds checking on memcpy() calls. Because these types had no indicator
that they might use more space than their simple definition,
__fortify_memcpy_chk will frequently complain about overruns, e.g.:

    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)

To fix this, this commit adds flex array fields to all lr_XX_t structs
that require them, and then uses those fields to access that
end-of-struct area rather than more complicated casts and pointer
addition.
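
The shape of the fix, as an illustrative struct (not one of the actual
lr_XX_t definitions):

```c
/* Before: the payload was written past the end of the struct via
 * memcpy((char *)(lr + 1), name, len), which FORTIFY flags. */
typedef struct {
	uint64_t	lr_foid;	/* fixed fields (placeholder) */
	uint64_t	lr_length;
	char		lr_data[];	/* flexible array member */
} lr_example_t;

/* After: the same bytes land in a declared field, so
 * memcpy(lr->lr_data, name, len) passes __fortify_memcpy_chk. */
```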

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16501
Closes #16539
2024-09-27 09:18:11 -07:00
George Melikov
b32d48a625 ZLE compression: don't use BPE_PAYLOAD_SIZE
The ZLE compressor needs additional bytes to process its d_len
argument efficiently, so don't use BPE_PAYLOAD_SIZE as d_len with it
until we rework the ZLE compressor somehow.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9416
2024-09-19 17:24:51 -07:00