mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-23 02:44:41 +03:00

Author	SHA1	Message	Date
Shengqi Chen	bce36e21ca	module/icp/asm-arm/sha2: auto detect __ARM_ARCH This patch uses __ARM_ARCH set by compiler (both GCC and Clang have this) whenever possible instead of hardcoding it to 7. This change allows code to compile on earlier ARM architectures such as armv5te. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Closes #15557	2024-08-26 15:10:16 -07:00
Ameer Hamza	07f0465742	linux/zvol_os.c: cleanup limits for non-blk mq case Rob Noris suggested that we could clean up redundant limits for the case of non-blk mq scenario. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-22 15:43:34 -07:00
Ameer Hamza	0f9457d1dd	linux/zvol_os.c: Fix max_discard_sectors limit for 6.8+ kernel In kernels 6.8 and later, the zvol block device is allocated with qlimits passed during initialization. However, the zvol driver does not set `max_hw_discard_sectors`, which is necessary to properly initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting `max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolve the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-22 15:43:34 -07:00
Justin Gottula	859f906a4b	Fix null ptr deref when renaming a zvol with snaps and snapdev=visible (#16316 ) If a zvol is renamed, and it has one or more snapshots, and snapdev=visible is true for the zvol, then the rename causes a kernel null pointer dereference error. This has the effect (on Linux, anyway) of killing the z_zvol taskq kthread, with locks still held; which in turn causes a variety of zvol-related operations afterward to hang indefinitely (such as udev workers, among other things). The problem occurs because of an oversight in #15486 (`e36ff84c33`). As documented in dataset_kstats_create, some datasets may not actually have kstats allocated for them; and at least at the present time, this is true for snapshots. In practical terms, this means that for snapshots, dk->dk_kstats will be NULL. The dataset_kstats_rename function introduced in the patch above does not first check whether dk->dk_kstats is NULL before proceeding, unlike e.g. the nearby dataset_kstats_update_* functions. In the very particular circumstance in which a zvol is renamed, AND that zvol has one or more snapshots, AND that zvol also has snapdev=visible, zvol_rename_minors_impl will loop over not just the zvol dataset itself, but each of the zvol's snapshots as well, so that their device nodes will be renamed as well. This results in dataset_kstats_create being called for snapshots, where, as we've established, dk->dk_kstats is NULL. Fix this by simply adding a NULL check before doing anything in dataset_kstats_rename. This still allows the dataset_name kstat value for the zvol to be updated (as was the intent of the original patch), and merely blocks attempts by the code to act upon the zvol's non-kstat-having snapshots. If at some future time, kstats are added for snapshots, then things should work as intended in that case as well. Signed-off-by: Justin Gottula <justin@jgottula.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Alan Somers <asomers@gmail.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-22 15:42:49 -07:00
Tony Hutter	84a9861536	Linux 6.10 compat: Fix zvol NULL pointer deference zvol_alloc_non_blk_mq()->blk_queue_set_write_cache() needs the disk queue setup to prevent a NULL pointer deference. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16453	2024-08-22 15:42:49 -07:00
Tony Hutter	9d64d1bfad	Linux 6.10 compat: fix rpm-kmod and builtin The 6.10 kernel broke our rpm-kmod builds. The 6.10 kernel really wants the source files in the same directory as the object files. This workaround makes rpm-kmod work again. It also updates the builtin kernel codepath to work correctly with 6.10. See kernel commits: b1992c3772e6 kbuild: use $(src) instead of $(srctree)/$(src) for source directory 9a0ebe5011f4 kbuild: use $(obj)/ instead of $(src)/ for common pattern rules Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16439 Closes #16450	2024-08-22 15:42:49 -07:00
Rob Norris	8479a45abe	Linux 6.11: avoid passing "end" sentinel to register_sysctl() Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	8156099cf2	Linux 6.11: add compat macro for page_mapping() Since the change to folios it has just been a wrapper anyway. Linux has removed their wrapper, so we add one. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	11ad6124c3	Linux 6.11: add more queue_limit fields with removed setters These fields are very old, so no detection necessary; we just move them into the limit setup functions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	11de432c8b	Linux 6.11: IO stats is now a queue feature flag Apply them with with the rest of the settings. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	464747ffd3	Linux 6.11: first arg to proc_handler is now const Detect it, and use a macro to make sure we always match the prototype. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	4fa84563b8	Linux 6.11: enable queue flush through queue limits In 6.11 struct queue_limits gains a 'features' field, where, among other things, flush and write-cache are enabled. Detect it and use it. Along the way, the blk_queue_set_write_cache() compat wrapper gets a little cleanup. Since both flags are alway set together, its now a single bool. Also the very very ancient version that sets q->flush_flags directly couldn't actually turn it off, so I've fixed that. Not that we use it, but still. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Mark Johnston	3a36797ad6	FreeBSD: Fix RLIMIT_FSIZE handling for block cloning ZFS implements copy_file_range(2) using block cloning when possible. This implementation must respect the RLIMIT_FSIZE limit. zfs_clone_range() already checks the limit, so it is safe to remove this check in zfs_freebsd_copy_file_range(). Moreover, the removed check produces false positives: the length passed to copy_file_range(2) may be larger than the input file size; as the man page notes, "for best performance, call copy_file_range() with the largest len value possible." In particular, some existing code passes SSIZE_MAX there. The check in zfs_clone_range() clamps the length to the input file's size before checking, but the removed check uses the caller supplied length, so something like $ echo a > /tmp/foo $ limits -f 1024 cat /tmp/foo > /tmp/bar fails because FreeBSD's cat(1) uses copy_file_range(2) in the manner described above. Reported-by: Philip Paeps <philip@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-22 15:17:21 -07:00
c1ick	ac6500389b	zfs: add bounds checking to zil_parse (#16308 ) Make sure log record don't stray beyond valid memory region. There is a lack of verification of the space occupied by fixed members of lr_t in the zil_parse. We can create a crafted image to trigger an out of bounds read by following these steps: 1) Do some file operations and reboot to simulate abnormal exit without umount 2) zil_chain.zc_nused: 0x1000 3) First lr_t lr_t.lrc_txtype: 0x0 lr_t.lrc_reclen: 0x1000-0xb8-0x1 lr_t.lrc_txg: 0x0 lr_t.lrc_seq: 0x1 4) Update checksum in zil_chain.zc_eck Fix: Add some checks to make sure the remaining bytes are large enough to hold an log record. Signed-off-by: XDTG <click1799@163.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-08-22 15:12:54 -07:00
Rob Norris	1f055436f3	linux/zvol_os: fix SET_ERROR with negative return codes SET_ERROR is our facility for tracking errors internally. The negation is to match the what the kernel expects from us. Thus, the negation should happen outside of the SET_ERROR. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16364	2024-08-22 15:11:44 -07:00
Tony Hutter	6f27c4cadd	[2.2.5-only] Make 'rmmod zfs' work after zfs-2.2.4 (#16406 ) `db65272ae` was added to zfs-2.2.4 to stub in the VDEV_PROP_RAIDZ_EXPANDING enum without adding the RAIDz expansion feature. This was needed to provide the right enum count for when the VDEV_PROP_SLOW_IO proprieties got added. This had the unfortunate side effect of breaking module removal though. Specifically, with the VDEV_PROP_RAIDZ_EXPANDING stub added, the module would correctly omit making kobjects for the RAIDz expansion vdev property, but then would try to blindly remove its non-existent kobjects during module unload. This commit fixes the issue by checking for an uninitialized kobject. Fixes: #16249 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>	2024-08-02 18:03:09 -07:00
Tony Hutter	b5835ed137	Linux 6.9: Fix UBSAN errors in sa.c (#16380 ) This is a follow-on to `156a64161b` that ignores UBSAN errors in sa.c. Thank you @thwalker3 for the fix. Original-patch-by: @thwalker3 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16278 Closes #16330 Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-23 17:13:57 -07:00
Chunwei Chen	ef08cb26da	Fix long_free_dirty accounting for small files (#16264 ) For files smaller than recordsize, it's most likely that they don't have L1 blocks. However, current calculation will always return at least 1 L1 block. In this change, we check dnode level to figure out if it has L1 blocks or not, and return 0 if it doesn't. This will reduce the chance of unnecessary throttling when deleting a large number of small files. Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-23 12:02:10 -07:00
Rob Norris	4d2f7f9839	vdev_open: clear async fault flag after reopen After `c3f2f1aa2`, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev. In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again. The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well! Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Co-authored-by: Don Brady <don.brady@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@klarasystems.com>	2024-07-17 14:54:47 -07:00
Alexander Motin	13ccbbb47a	Some improvements to metaslabs eviction - Add old eviction for special and dedup metaslab classes. Those vdevs may be potentially big and fragmented with large metaslabs, while their asynchronous write pattern is not really different from normal class. It seems an omission to not evict old metaslabs from them. - If we have metaslab preload enabled, which means we are not too low on memory, do not evict active metaslabs even if they are not used for some time. Eviction of active metaslabs means we won't be able to write anything until we load them, that may take some time, that is straight opposite to metaslab preload goals. For small systems the memory saving should be less important after recent reduction in number of allocators and so open metaslabs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16214	2024-07-17 14:54:47 -07:00
Alexander Motin	ba3c7692cd	Destroy ARC buffer in case of fill error In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665 Closes #15802 Closes #16216	2024-07-17 14:54:47 -07:00
Rob N	27cc6df760	Use memset to zero stack allocations containing unions C99 6.7.8.17 says that when an undesignated initialiser is used, only the first element of a union is initialised. If the first element is not the largest within the union, how the remaining space is initialised is up to the compiler. GCC extends the initialiser to the entire union, while Clang treats the remainder as padding, and so initialises according to whatever automatic/implicit initialisation rules are currently active. When Linux is compiled with CONFIG_INIT_STACK_ALL_PATTERN, -ftrivial-auto-var-init=pattern is added to the kernel CFLAGS. This flag sets the policy for automatic/implicit initialisation of variables on the stack. Taken together, this means that when compiling under CONFIG_INIT_STACK_ALL_PATTERN on Clang, the "zero" initialiser will only zero the first element in a union, and the rest will be filled with a pattern. This is significant for aes_ctx_t, which in aes_encrypt_atomic() and aes_decrypt_atomic() is initialised to zero, but then used as a gcm_ctx_t, which is the fifth element in the union, and thus gets pattern initialisation. Later, it's assumed to be zero, resulting in a hang. As confusing and undiscoverable as it is, by the spec, we are at fault when we initialise a structure containing a union with the zero initializer. As such, this commit replaces these uses with an explicit memset(0). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16135 Closes #16206	2024-07-17 14:54:47 -07:00
Rob N	fa2480f5b3	abd_iter_page: rework to handle multipage scatterlists Previously, abd_iter_page() would assume that every scatterlist would contain a single page (compound or no), because that's all we ever create in abd_alloc_chunks(). However, scatterlists can contain multiple pages of arbitrary provenance, and if we get one of those, we'd get all the math wrong. This reworks things to handle multiple pages in a scatterlist, by properly finding the right page within it for the given offset, and understanding better where the end of the page is and not crossing it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reported-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16108	2024-07-17 14:54:46 -07:00
Rob Norris	7d8e2a7f73	Linux 5.16: use bdev_nr_bytes() to get device capacity This helper was introduced long ago, in 5.16. Since 6.10, bd_inode no longer exists, but the helper has been updated, so detect it and use it in all versions where it is available. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-16 15:40:29 -07:00
Rob Norris	3ea3649755	Linux 6.10: work harder to avoid kmem_cache_alloc reuse Linux 6.10 change kmem_cache_alloc to be a macro, rather than a function, such that the old #undef for it in spl-kmem-cache.c would remove its definition completely, breaking the build. This inverts the model used before. Rather than always defining the kmem_cache_* macro, then undefining then inside spl-kmem-cache.c, instead we make a special tag to indicate we're currently inside spl-kmem-cache.c, and not defining those in macros in the first place, so we can use the kernel-supplied kmem_cache_* functions to implement spl_kmem_cache_*, as we expect. For all other callers, we create the macros as normal and remove access to the kernel's own conflicting names. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-16 15:33:46 -07:00
Rob Norris	0342c4a6b2	Linux 6.10: rework queue limits setup Linux has started moving to a model where instead of applying block queue limits through individual modification functions, a complete limits structure is built up and applied atomically, either when the block device or open, or some time afterwards. As of 6.10 this transition appears only partly completed. This commit matches that model within OpenZFS in a way that should work for past and future kernels. We set up a queue limits structure with any limits that have had their modification functions removed. For newer kernels that can have limits applied at block device open (HAVE_BLK_ALLOC_DISK_2ARG), we have a conversion function to turn the OpenZFS queue limits structure into Linux's queue_limits structure, which can then be passed in. For older kernels, we provide an application function that just calls the old functions for each limit in the structure. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-16 15:33:37 -07:00
Tony Hutter	d7bf0e5259	Linux 6.9: Fix UBSAN errors in zap_micro.c You can use the UBSAN_SANITIZE_* Kbuild options to exclude certain kernel objects from the UBSAN checks. We previously excluded zap_micro.o with: UBSAN_SANITIZE_zap_micro.o := n For some reason that didn't work for the 6.9 kernel, which wants us to use: UBSAN_SANITIZE_zfs/zap_micro.o := n Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16278 Closes #16330	2024-07-16 15:33:31 -07:00
Tony Hutter	c24a039042	Linux 6.9: Call add_disk() from workqueue to fix zfs_allow_010_pos (#16282 ) The 6.9 kernel behaves differently in how it releases block devices. In the common case it will async release the device only after the return to userspace. This is different from the 6.8 and older kernels which release the block devices synchronously. To get around this, call add_disk() from a workqueue so that the kernel uses a different codepath to release our zvols in the way we expect. This stops zfs_allow_010_pos from hanging. Fixes: #16089 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>	2024-07-16 15:33:23 -07:00
George Amanakis	54ef0fdf60	head_errlog: fix use-after-free In the commit of the head_errlog feature we introduced a bug in dsl_dataset_promote_sync(): we may dereference origin_head and hds, both dereferencing ddpa after calling promote_sync() on ddpa. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #16272 Closes #16273	2024-07-15 09:07:33 -07:00
George Amanakis	2eab4f7b39	Fix assertion in Persistent L2ARC At the end of l2arc_evict() fix an assertion in the case that l2ad_hand + distance == l2ad_end. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #16202 Closes #16207	2024-05-29 13:35:14 -07:00
Alexander Motin	4c0fbd8d6d	FreeBSD: Add zfs_link_create() error handling Originally Solaris didn't expect errors there, but they may happen if we fail to add entry into ZAP. Linux fixed it in #7421, but it was never fully ported to FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13215 Closes #16138	2024-05-29 08:54:19 -07:00
Alexander Motin	fa4b1a404e	ZAP: Fix leaf references on zap_expand_leaf() errors Depending on kind of error zap_expand_leaf() may return with or without valid leaf reference held. Make sure it returns NULL if due to error it has no leaf to return. Make its callers to check the returned leaf pointer, and release the leaf if it is not NULL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #12366 Closes #16159	2024-05-29 08:54:19 -07:00
Alexander Motin	4c484d66b7	Fix ZIL clone records for legacy holes Previous code overengineered cloned range calculation by using BP_GET_LSIZE(). The problem is that legacy holes don't have the logical size, so result will be wrong. But we also don't need to look on every block size, since they all must be identical. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16165	2024-05-29 08:54:19 -07:00
Alexander Motin	41f2a9c81f	Fix scn_queue races on very old pools Code for pools before version 11 uses dmu_objset_find_dp() to scan for children datasets/clones. It calls enqueue_clones_cb() and enqueue_cb() callbacks in parallel from multiple taskq threads. It ends up bad for scan_ds_queue_insert(), corrupting scn_queue AVL-tree. Fix it by introducing a mutex to protect those two scan_ds_queue_insert() calls. All other calls are done from the sync thread and so serialized. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16162	2024-05-29 08:54:19 -07:00
Alexander Motin	6724746596	Slightly improve dnode hash As I understand just for being less predictable dnode hash includes 8 bits of objset pointer, starting at 6. But since objset_t is more than 1KB in size, its allocations are likely aligned to 2KB, that means 11 lower bits provide no entropy. Just take the 8 bits starting from 11. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16131	2024-05-29 08:54:19 -07:00
Alexander Motin	938d1588eb	Make more taskq parameters writable There is no reason for these module parameters to be read-only. Being modified they just apply on next pool import/creation, that is useful for testing different values. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16118	2024-05-29 08:54:19 -07:00
Alexander Motin	0f1e8ba2f8	L2ARC: Cleanup buffer re-compression When compressed ARC is disabled, we may have to re-compress when writing into L2ARC. If doing so we can't fit it into the original physical size, we should just fail immediately, since even if it may still fit into allocation size, its checksum will never match. While there, refactor the code similar to other compression places without using abd_return_buf_copy(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16038	2024-05-29 08:54:19 -07:00
Alexander Motin	b474dfad0d	Refactor dbuf_read() for safer decryption In dbuf_read_verify_dnode_crypt(): - We don't need original dbuf locked there. Instead take a lock on a dnode dbuf, that is actually manipulated. - Block decryption for a dnode dbuf if it is currently being written. ARC hash lock does not protect anonymous buffers, so arc_untransform() is unsafe when used on buffers being written, that may happen in case of encrypted dnode buffers, since they are not copied by dbuf_dirty()/dbuf_hold_copy(). In dbuf_read(): - If the buffer is in flight, recheck its compression/encryption status after it is cached, since it may need arc_untransform(). Tested-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16104	2024-05-29 08:54:19 -07:00
chenqiuhao1997	9edf6af4ae	Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN. In P2ALIGN, the result would be incorrect when align is unsigned integer and x is larger than max value of the type of align. In that case, -(align) would be a positive integer, which means high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com> Closes #15940	2024-05-13 10:27:38 -05:00
Alan Somers	3d4d61988a	Fix updating the zvol_htable when renaming a zvol When renaming a zvol, insert it into zvol_htable using the new name, not the old name. Otherwise some operations won't work. For example, "zfs set volsize" while the zvol is open. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: Alan Somers <asomers@FreeBSD.org> Closes #16127 Closes #16128	2024-04-30 10:01:15 -07:00
Brian Behlendorf	61f3638a34	Add prefetch property ZFS prefetch is currently governed by the zfs_prefetch_disable tunable. However, this is a module-wide settings - if a specific dataset benefits from prefetch, while others have issue with it, an optimal solution does not exists. This commit introduce the "prefetch" tri-state property, which enable granular control (at dataset/volume level) for prefetching. This patch does not remove the zfs_prefetch_disable, which remains a system-wide switch for enable/disable prefetch. However, to avoid duplication, it would be preferable to deprecate and then remove the module tunable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Co-authored-by: Gionatan Danti <g.danti@assyoma.it> Closes #15237 Closes #15436	2024-04-30 10:01:15 -07:00
Don Brady	706307445e	vdev probe to slow disk can stall mmp write checker Simplify vdev probes in the zio_vdev_io_done context to avoid holding the spa config lock for a long duration. Also allow zpool clear if no evidence of another host is using the pool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #15839	2024-04-30 10:01:15 -07:00
Don Brady	ea3f7c12a9	Extend import_progress kstat with a notes field Detail the import progress of log spacemaps as they can take a very long time. Also grab the spa_note() messages to, as they provide insight into what is happening Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Closes #15539	2024-04-29 17:45:53 -07:00
George Wilson	6f323353d2	Add ashift validation when adding devices to a pool Currently, zpool add allows users to add top-level vdevs that have different ashifts but doing so prevents users from being able to perform a top-level vdev removal. Often times consumers may not realize that they have mismatched ashifts until the top-level removal fails. This feature adds ashift validation to the zpool add command and will fail the operation if the sector size of the specified vdev does not match the existing pool. This behavior can be disabled by using the -f flag. In addition, new flags have been added to provide fine-grained control to disable specific checks. These flags are: --allow-in-use --allow-ashift-mismatch --allow-replicaton-mismatch The force flag will disable all of these checks. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Maybee <mmaybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #15509	2024-04-29 13:50:05 -07:00
Dag-Erling Smørgrav	5972bb856c	Use ASSERT0P() to check that a pointer is NULL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org> Closes #15225	2024-04-29 13:50:05 -07:00
Tony Hutter	ef3fea63eb	GCC: Fixes for gcc 14 on Fedora 40 - Workaround dangling pointer in uu_list.c (#16124) - Fix calloc() transposed arguments in zpool_vdev_os.c - Make some temp variables unsigned to prevent triggering a '-Werror=alloc-size-larger-than' error. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16124 Closes #16125	2024-04-29 13:50:05 -07:00
Tino Reichardt	16c223eec9	Do no use .cfi_negate_ra_state within the assembly on Arm64 Compiling openzfs on aarch64 with gcc-8 and gcc-9 is failing currently. See issue #14965 for deeper context. On platforms without pointer authentication, .cfi_negate_ra_state can be defined to a no-op: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/aarch64-tdep.c#l1413 I have tested this on Arm64 FreeBSD 13.2 and AlmaLinux-8. Reviewed-by: Andrew Turner <andrew.turner4@arm.com> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #14965 Closes #15784	2024-04-29 13:50:05 -07:00
Andrew Turner	7aaf6ce9d8	Add the BTI elf note to the AArch64 SHA2 assembly On ELF platforms there is a note to specify when an application or library supports BTI. When linking one of these the linker needs all input object files to have the note. If not it will not include it in the output file. Normally the compiler would generate it, but for assembly files we need to do it our selves. Add the note to the aarch64 sha256 and sha512 assembly files. Tested by building with BTI enabled and using the -zbti-report=error flag to lld that makes it an error if the note is missing. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Turner <andrew.turner4@arm.com> Closes #16086	2024-04-29 13:50:05 -07:00
Pavel Snajdr	531572b590	Fix panics when truncating/deleting files There's an union in dbuf_dirty_record_t; dr_brtwrite could evaluate to B_TRUE if the dirty record is of another type than dl. Adding more explicit dr type check before trying to access dr_brtwrite. Fixes two similar panics: [ 1373.806119] VERIFY0(db->db_level) failed (0 == 1) [ 1373.807232] PANIC at dbuf.c:2549:dbuf_undirty() [ 1373.814979] dump_stack_lvl+0x71/0x90 [ 1373.815799] spl_panic+0xd3/0x100 [spl] [ 1373.827709] dbuf_undirty+0x62a/0x970 [zfs] [ 1373.829204] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs] [ 1373.831010] dnode_free_range+0x532/0x1220 [zfs] [ 1373.833922] dmu_free_long_range+0x4e0/0x930 [zfs] [ 1373.835277] zfs_trunc+0x75/0x1e0 [zfs] [ 1373.837958] zfs_freesp+0x9b/0x470 [zfs] [ 1373.847236] zfs_setattr+0x161a/0x3500 [zfs] [ 1373.855267] zpl_setattr+0x125/0x320 [zfs] [ 1373.856725] notify_change+0x1ee/0x4a0 [ 1373.859207] do_truncate+0x7f/0xd0 [ 1373.859968] do_sys_ftruncate+0x28e/0x2e0 [ 1373.860962] do_syscall_64+0x38/0x90 [ 1373.861751] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1822.381337] VERIFY0(db->db_level) failed (0 == 1) [ 1822.382376] PANIC at dbuf.c:2549:dbuf_undirty() [ 1822.389232] dump_stack_lvl+0x71/0x90 [ 1822.389920] spl_panic+0xd3/0x100 [spl] [ 1822.399567] dbuf_undirty+0x62a/0x970 [zfs] [ 1822.400583] dmu_buf_will_dirty_impl+0x1e9/0x5b0 [zfs] [ 1822.401752] dnode_free_range+0x532/0x1220 [zfs] [ 1822.402841] dmu_object_free+0x74/0x120 [zfs] [ 1822.403869] zfs_znode_delete+0x75/0x120 [zfs] [ 1822.404906] zfs_rmnode+0x3f6/0x7f0 [zfs] [ 1822.405870] zfs_inactive+0xa3/0x610 [zfs] [ 1822.407803] zpl_evict_inode+0x3e/0x90 [zfs] [ 1822.408831] evict+0xc1/0x1c0 [ 1822.409387] do_unlinkat+0x147/0x300 [ 1822.410060] __x64_sys_unlinkat+0x33/0x60 [ 1822.410802] do_syscall_64+0x38/0x90 [ 1822.411458] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #15983	2024-04-29 13:50:05 -07:00
Don Brady	c1c26a77ff	Add slow disk diagnosis to ZED Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, like is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED which is opt-in using new VDEV properties: VDEV_PROP_SLOW_IO_N VDEV_PROP_SLOW_IO_T If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade(). Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #15469	2024-04-29 13:50:05 -07:00

1 2 3 4 5 ...

4540 Commits