mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Rob Norris	8bc4c13fac	config: remove HAVE_IO_SCHEDULE_TIMEOUT Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:58 -08:00
Rob Norris	22131b9a6e	config: remove HAVE_INODE_SET_FLAGS Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:58 -08:00
Rob Norris	3c370f09fa	config: remove HAVE_GENERIC_WRITE_CHECKS_KIOCB Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:58 -08:00
Rob Norris	b6223a572e	config: remove HAVE_FSYNC_RANGE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:58 -08:00
Rob Norris	ccb59e31e4	config: remove HAVE_FALLOC_FL_ZERO_RANGE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:58 -08:00
Rob Norris	bfc7e03ec6	config: remove HAVE_ENCODE_FH_WITH_INODE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:57 -08:00
Rob Norris	984a836986	config: remove HAVE_D_PRUNE_ALIASES Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:57 -08:00
Rob Norris	a054829579	config: remove HAVE_DIRTY_INODE_WITH_FLAGS Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:57 -08:00
Rob Norris	8a33624ec6	config: remove HAVE_1ARG_BIO_END_IO_T Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-11-15 10:14:57 -08:00
Jason Lee	4f0a8eb7c1	Use kmap_local_page instead of kmap_atomic (#16329 ) Changed zfs_k(un)map_atomic to zfs_k(un)map_local Signed-off-by: Jason Lee <jasonlee@lanl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov>	2024-11-15 10:14:57 -08:00
Rob Norris	cf80a803d5	zvol: ensure device minors are properly cleaned up Currently, if a minor is in use when we try to remove it, we'll skip it and never come back to it again. Since the zvol state is hung off the minor in the kernel, this can get us into weird situations if something tries to use it after the removal fails. It's even worse at pool export, as there's now a vestigial zvol state with no pool under it. It's weirder again if the pool is subsequently reimported, as the zvol code (reasonably) assumes the zvol state has been properly setup, when it's actually left over from the previous import of the pool. This commit attempts to tackle that by setting a flag on the zvol if its minor can't be removed, and then checking that flag when a request is made and rejecting it, thus stopping new work coming in. The flag also causes a condvar to be signaled when the last client finishes. For the case where a single minor is being removed (eg changing volmode), it will wait for this signal before proceeding. Meanwhile, when removing all minors, a background task is created for each minor that couldn't be removed on the spot, and those tasks then wake and clean up. Since any new tasks are queued on to the pool's spa_zvol_taskq, spa_export_common() will continue to wait at export until all minors are removed. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14872 Closes #16364	2024-11-15 10:14:50 -08:00
Jitendra Patidar	f6fce8e12a	Fix issig() to check signal_pending after dequeue SIGSTOP/SIGTSTP When process got SIGSTOP/SIGTSTP, issig() dequeue them and return 0. But process could still have another signal pending after dequeue. So, after dequeue, check and return 1, if signal_pending. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #16464	2024-11-14 15:20:10 -08:00
Pawel Jakub Dawidek	a6198f34bd	Simplify issig(). We always call it twice with JUSTLOOKING and then FORREAL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #16225	2024-11-14 15:20:10 -08:00
Alexander Motin	ca95fa3531	Linux: Report reclaimable memory to kernel as such (#16385 ) Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to mark memory allocations that can be freed via shinker calls. It should allow kernel to tune and group such allocations for lower memory fragmentation and better reclamation under pressure. This patch marks as reclaimable most of ARC memory, directly evictable via ZFS shrinker, plus also dnode/znode/sa memory, indirectly evictable via kernel's superblock shrinker. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com>	2024-11-14 15:20:06 -08:00
Daniel Perry	05d6f621a9	Replace usage of schedule_timeout with schedule_timeout_interruptible (#16150 ) This commit replaces current usages of schedule_timeout() with schedule_timeout_interruptible() in code paths that expect the running task to sleep for a short period of time. When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. By calling schedule_timeout_interruptible() to set the task state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of putting the task to sleep for the specified timeout. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Daniel Perry <dtperry@amazon.com> Closes #16150	2024-11-13 10:17:08 -08:00
Rob Norris	f4e66db401	vdev_disk: move abd return and free off the interrupt handler Freeing an ABD can take sleeping locks to update various stats. We aren't allowed to sleep on an interrupt handler. So, move the free off to the io_done callback. We should never have been freeing things in the interrupt handler, but we got away with it because we were usually freeing a linear ABD, which at most is returning two objects to a cache and never sleeping. Scatter ABDs can be used now, and those have more complex locking. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16687	2024-11-06 10:07:23 -08:00
Rob Norris	f237b8e2a4	vdev_disk: try harder to ensure IO alignment rules It seems out our notion of "properly" aligned IO was incomplete. In particular, dm-crypt does its own splitting, and assumes that a logical block will never cross an order-0 page boundary (ie, the physical page size, not compound size). This effectively means that it needs to be possible to split a BIO at any page or block size boundary and have it work correctly. This updates the alignment check function to enforce these rules (to the extent possible). Our response to misaligned data is to make some new allocation that is properly aligned, and copy the data into it. It turns out that linearising (via abd_borrow_buf()) is not enough, because we allocate eg 4K blocks from a general purpose slab, and so may receive (or already have) a 4K block that crosses pages. So instead, we allocate a new ABD, which is guaranteed to be aligned properly to block sizes, and then copy everything into it, and back out on the way back. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16687 #16631 #15646 #15533 #14533 (cherry picked from commit `63bafe60ec`)	2024-11-06 10:06:30 -08:00
Dimitry Andric	73b3e8acef	Fix gcc uninitialized warning in FreeBSD zio_crypt.c In FreeBSD's `zio_do_crypt_data()`, ensure that two `struct uio` variables are cleared before copying data out of them. This avoids accessing garbage data, and fixes gcc `-Wuninitialized` warnings. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Toomas Soome <tsoome@me.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #16688	2024-11-06 10:06:29 -08:00
Umer Saleem	308d04ac37	Fix inconsistent mount options for ZFS root While mounting ZFS root during boot on Linux distributions from initrd, mount from busybox is effectively used which executes mount system call directly. This skips the ZFS helper mount.zfs, which checks and enables the mount options as specified in dataset properties. As a result, datasets mounted during boot from initrd do not have correct mount options as specified in ZFS dataset properties. There has been an attempt to use mount.zfs in zfs initrd script, responsible for mounting the ZFS root filesystem (PR#13305). This was later reverted (PR#14908) after discovering that using mount.zfs breaks mounting of snapshots on root (/) and other child datasets of root have the same issue (Issue#9461). This happens because switching from busybox mount to mount.zfs correctly parses the mount options but also adds 'mntpoint=/root' to the mount options, which is then prepended to the snapshot mountpoint in '.zfs/snapshot'. '/root' is the directory on Debian with initramfs-tools where root filesystem is mounted before pivot_root. When Linux runtime is reached, trying to access the snapshots on root results in automounting the snapshot on '/root/.zfs/*', which fails. This commit attempts to fix the automounting of snapshots on root, while using mount.zfs in initrd script. Since the mountpoint of dataset is stored in vfs_mntpoint field, we can check if current mountpoint of dataset and vfs_mntpoint are same or not. If they are not same, reset the vfs_mntpoint field with current mountpoint. This fixes the mountpoints of root dataset and children in respective vfs_mntpoint fields when we try to access the snapshots of root dataset or its children. With correct mountpoint for root dataset and children stored in vfs_mntpoint, all snapshots of root dataset are mounted correctly and become accessible. This fix will come into play only if current process, that is trying to access the snapshots is not in chroot context. The Linux kernel API that is used to convert struct path into char format (d_path), returns the complete path for given struct path. It works in chroot environment as well and returns the correct path from original filesystem root. However d_path fails to return the complete path if any directory from original root filesystem is mounted using --bind flag or --rbind flag in chroot environment. In this case, if we try to access the snapshot from outside the chroot environment, d_path returns the path correctly, i.e. it returns the correct path to the directory that is mounted with --bind flag. However inside the chroot environment, it only returns the path inside chroot. For now, there is not a better way in my understanding that gives the complete path in char format and handles the case where directories from root filesystem are mounted with --bind or --rbind on another path which user will later chroot into. So this fix gets enabled if current process trying to access the snapshot is not in chroot context. With the snapshots issue fixed for root filesystem, using mount.zfs in ZFS initrd script, mounts the datasets with correct mount options. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #16646	2024-11-05 15:43:53 -08:00
Alexander Motin	21c40e6d9e	FreeBSD: Sync taskq_cancel_id() returns with Linux Couple places in the code depend on 0 returned only if the task was actually cancelled. Doing otherwise could lead to extra references being dropped. The race could be small, but I believe CI hit it from time to time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16565	2024-11-05 15:43:52 -08:00
Alan Somers	bc0d89bfc1	Fix an uninitialized data access (#16511 ) zfs_acl_node_alloc allocates an uninitialized data buffer, but upstack zfs_acl_chmod only partially initializes it. KMSAN reported that this memory remained uninitialized at the point when it was read by lzjb_compress, which suggests a possible kernel memory disclosure bug. The full KMSAN warning may be found in the PR. https://github.com/openzfs/zfs/pull/16511 Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored by: Axcient Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-11-05 15:43:52 -08:00
Ameer Hamza	c60df6a801	linux/zvol_os: fix zvol queue limits initialization zvol queue limits initialization depends on `zv_volblocksize`, but it is initialized later, leading to several limits being initialized with incorrect values, including `max_discard_*` limits. This also causes `blkdiscard` command to consistently fail, as `blk_ioctl_discard` reads `bdev_max_discard_sectors()` limits as 0, leading to failure. The fix is straightforward: initialize `zv->zv_volblocksize` early, before setting the queue limits. This PR should fix `zvol/zvol_misc/zvol_misc_trim` failure on recent PRs, as the test case issues `blkdiscard` for a zvol. Additionally, `zvol_misc_trim` was recently enabled in `6c7d41a`, which is why the issue wasn't identified earlier. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16454	2024-08-26 15:10:16 -07:00
Rob Norris	d8fa32a79d	linux/zvol_os: tidy and document queue limit/config setup It gets hairier again in Linux 6.11, so I want some actual theory of operation laid out for next time. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-26 15:10:16 -07:00
Ameer Hamza	07f0465742	linux/zvol_os.c: cleanup limits for non-blk mq case Rob Noris suggested that we could clean up redundant limits for the case of non-blk mq scenario. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-22 15:43:34 -07:00
Ameer Hamza	0f9457d1dd	linux/zvol_os.c: Fix max_discard_sectors limit for 6.8+ kernel In kernels 6.8 and later, the zvol block device is allocated with qlimits passed during initialization. However, the zvol driver does not set `max_hw_discard_sectors`, which is necessary to properly initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting `max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolve the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-22 15:43:34 -07:00
Tony Hutter	84a9861536	Linux 6.10 compat: Fix zvol NULL pointer deference zvol_alloc_non_blk_mq()->blk_queue_set_write_cache() needs the disk queue setup to prevent a NULL pointer deference. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16453	2024-08-22 15:42:49 -07:00
Rob Norris	8479a45abe	Linux 6.11: avoid passing "end" sentinel to register_sysctl() Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	8156099cf2	Linux 6.11: add compat macro for page_mapping() Since the change to folios it has just been a wrapper anyway. Linux has removed their wrapper, so we add one. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	11ad6124c3	Linux 6.11: add more queue_limit fields with removed setters These fields are very old, so no detection necessary; we just move them into the limit setup functions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	11de432c8b	Linux 6.11: IO stats is now a queue feature flag Apply them with with the rest of the settings. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	464747ffd3	Linux 6.11: first arg to proc_handler is now const Detect it, and use a macro to make sure we always match the prototype. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Rob Norris	4fa84563b8	Linux 6.11: enable queue flush through queue limits In 6.11 struct queue_limits gains a 'features' field, where, among other things, flush and write-cache are enabled. Detect it and use it. Along the way, the blk_queue_set_write_cache() compat wrapper gets a little cleanup. Since both flags are alway set together, its now a single bool. Also the very very ancient version that sets q->flush_flags directly couldn't actually turn it off, so I've fixed that. Not that we use it, but still. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-22 15:42:14 -07:00
Mark Johnston	3a36797ad6	FreeBSD: Fix RLIMIT_FSIZE handling for block cloning ZFS implements copy_file_range(2) using block cloning when possible. This implementation must respect the RLIMIT_FSIZE limit. zfs_clone_range() already checks the limit, so it is safe to remove this check in zfs_freebsd_copy_file_range(). Moreover, the removed check produces false positives: the length passed to copy_file_range(2) may be larger than the input file size; as the man page notes, "for best performance, call copy_file_range() with the largest len value possible." In particular, some existing code passes SSIZE_MAX there. The check in zfs_clone_range() clamps the length to the input file's size before checking, but the removed check uses the caller supplied length, so something like $ echo a > /tmp/foo $ limits -f 1024 cat /tmp/foo > /tmp/bar fails because FreeBSD's cat(1) uses copy_file_range(2) in the manner described above. Reported-by: Philip Paeps <philip@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-22 15:17:21 -07:00
Rob Norris	1f055436f3	linux/zvol_os: fix SET_ERROR with negative return codes SET_ERROR is our facility for tracking errors internally. The negation is to match the what the kernel expects from us. Thus, the negation should happen outside of the SET_ERROR. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16364	2024-08-22 15:11:44 -07:00
Tony Hutter	6f27c4cadd	[2.2.5-only] Make 'rmmod zfs' work after zfs-2.2.4 (#16406 ) `db65272ae` was added to zfs-2.2.4 to stub in the VDEV_PROP_RAIDZ_EXPANDING enum without adding the RAIDz expansion feature. This was needed to provide the right enum count for when the VDEV_PROP_SLOW_IO proprieties got added. This had the unfortunate side effect of breaking module removal though. Specifically, with the VDEV_PROP_RAIDZ_EXPANDING stub added, the module would correctly omit making kobjects for the RAIDz expansion vdev property, but then would try to blindly remove its non-existent kobjects during module unload. This commit fixes the issue by checking for an uninitialized kobject. Fixes: #16249 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>	2024-08-02 18:03:09 -07:00
Rob N	fa2480f5b3	abd_iter_page: rework to handle multipage scatterlists Previously, abd_iter_page() would assume that every scatterlist would contain a single page (compound or no), because that's all we ever create in abd_alloc_chunks(). However, scatterlists can contain multiple pages of arbitrary provenance, and if we get one of those, we'd get all the math wrong. This reworks things to handle multiple pages in a scatterlist, by properly finding the right page within it for the given offset, and understanding better where the end of the page is and not crossing it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reported-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16108	2024-07-17 14:54:46 -07:00
Rob Norris	7d8e2a7f73	Linux 5.16: use bdev_nr_bytes() to get device capacity This helper was introduced long ago, in 5.16. Since 6.10, bd_inode no longer exists, but the helper has been updated, so detect it and use it in all versions where it is available. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-16 15:40:29 -07:00
Rob Norris	3ea3649755	Linux 6.10: work harder to avoid kmem_cache_alloc reuse Linux 6.10 change kmem_cache_alloc to be a macro, rather than a function, such that the old #undef for it in spl-kmem-cache.c would remove its definition completely, breaking the build. This inverts the model used before. Rather than always defining the kmem_cache_* macro, then undefining then inside spl-kmem-cache.c, instead we make a special tag to indicate we're currently inside spl-kmem-cache.c, and not defining those in macros in the first place, so we can use the kernel-supplied kmem_cache_* functions to implement spl_kmem_cache_*, as we expect. For all other callers, we create the macros as normal and remove access to the kernel's own conflicting names. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-16 15:33:46 -07:00
Rob Norris	0342c4a6b2	Linux 6.10: rework queue limits setup Linux has started moving to a model where instead of applying block queue limits through individual modification functions, a complete limits structure is built up and applied atomically, either when the block device or open, or some time afterwards. As of 6.10 this transition appears only partly completed. This commit matches that model within OpenZFS in a way that should work for past and future kernels. We set up a queue limits structure with any limits that have had their modification functions removed. For newer kernels that can have limits applied at block device open (HAVE_BLK_ALLOC_DISK_2ARG), we have a conversion function to turn the OpenZFS queue limits structure into Linux's queue_limits structure, which can then be passed in. For older kernels, we provide an application function that just calls the old functions for each limit in the structure. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-16 15:33:37 -07:00
Tony Hutter	c24a039042	Linux 6.9: Call add_disk() from workqueue to fix zfs_allow_010_pos (#16282 ) The 6.9 kernel behaves differently in how it releases block devices. In the common case it will async release the device only after the return to userspace. This is different from the 6.8 and older kernels which release the block devices synchronously. To get around this, call add_disk() from a workqueue so that the kernel uses a different codepath to release our zvols in the way we expect. This stops zfs_allow_010_pos from hanging. Fixes: #16089 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>	2024-07-16 15:33:23 -07:00
Alexander Motin	4c0fbd8d6d	FreeBSD: Add zfs_link_create() error handling Originally Solaris didn't expect errors there, but they may happen if we fail to add entry into ZAP. Linux fixed it in #7421, but it was never fully ported to FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13215 Closes #16138	2024-05-29 08:54:19 -07:00
chenqiuhao1997	9edf6af4ae	Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN. In P2ALIGN, the result would be incorrect when align is unsigned integer and x is larger than max value of the type of align. In that case, -(align) would be a positive integer, which means high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com> Closes #15940	2024-05-13 10:27:38 -05:00
Alan Somers	3d4d61988a	Fix updating the zvol_htable when renaming a zvol When renaming a zvol, insert it into zvol_htable using the new name, not the old name. Otherwise some operations won't work. For example, "zfs set volsize" while the zvol is open. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: Alan Somers <asomers@FreeBSD.org> Closes #16127 Closes #16128	2024-04-30 10:01:15 -07:00
Rob N	5d859a2e22	xdr: header cleanup #16047 notes that include/os/freebsd/spl/rpc/xdr.h carried an (apparently) incompatible license. While looking into it, it seems that this file is actually unnecessary these days - FreeBSD's kernel XDR has XDR_CONTROL, xdrmem_control and XDR_GET_BYTES_AVAIL, while userspace has XDR_CONTROL and xdrmem_control, and our implementation of XDR_GET_BYTES_AVAIL for libspl works nicely with it. So this removes that file outright. To keep the includes in nvpair.c tidy, I've made a few small adjustments to the Linux headers. By definition, rpc/types.h provides bool_t and is included before rpc/xdr.h, so I've created rpc/types.h for Linux. This isn't necessary for userspace; both FreeBSD native and tirpc on Linux already have these headers set up correctly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16047 Closes #16051	2024-04-29 13:50:05 -07:00
Rob Norris	9a7ef02f4d	Linux 6.9 compat: blk_alloc_disk() now takes two args There's an extra nullable arg for queue limits. Detect it, and set it to NULL. Similar change for blk_mq_alloc_disk(), now three args, same treatment. Error return now has error encoded in the return, so detect with IS_ERR() and explicitly NULL our own return. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033	2024-04-22 09:23:23 -07:00
Rob Norris	3bd7cd06b7	Linux 6.9 compat: bdev handles are now struct file bdev_open_by_path() is replaced by bdev_file_open_by_path(), which returns a plain old struct file*. Release function is gone entirely; the regular file release function fput() will take care of the bdev specifics. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16027 Closes #16033	2024-04-22 09:23:23 -07:00
Rob N	b9c3040b10	vdev_disk: clean up spa/bdev mode conversion `43e8f6e37` introduced a subtle API misuse, in that it passed the output from vdev_bdev_mode() back into itself. Fortunately, the SPA_MODE_(READ\|WRITE) bit values exactly map to the FMODE_(READ\|WRITE) & BLK_OPEN_(READ\|WRITE) bit values, so it didn't result in a bug, but it was hard to read and understand, so I cleaned it up. In doing so, I noticed that the only call to vdev_bdev_mode() without the "exclusive" flag set was in that misuse, and actually, we never do a non-exclusive blkdev_get_by_path(). So I've just made exclusive be always-on. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #15995	2024-04-22 09:23:23 -07:00
Fabian-Gruenbichler	fa2cbd4007	zvols: prevent overflow of minor device numbers currently, the linux kernel allows 2^20 minor devices per major device number. ZFS reserves blocks of 2^4 minors per zvol: 1 for the zvol itself, the other 15 for the first partitions of that zvol. as a result, only 2^16 such blocks are available for use. there are no checks in place to avoid overflowing into the major device number when more than 2^16 zvols are allocated (with volmode=dev or default). instead of ignoring this limit, which comes with all sorts of weird knock-on effects, detect this situation and simply fail allocating the zvol block device early on. without this safeguard, the kernel will reject the attempt to create an already existing block device, but ZFS doesn't handle this error and gets confused about which zvol occupies which minor slot, potentially resulting in kernel NULL derefs and other issues later on. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Closes #16006	2024-04-22 09:23:23 -07:00
Alexander Motin	793a2cff2a	Linux: Cleanup taskq threads spawn/exit This changes taskq_thread_should_stop() to limit maximum exit rate for idle threads to one per 5 seconds. I believe the previous one was broken, not allowing any thread exits for tasks arriving more than one at a time and so completing while others are running. Also while there: - Remove taskq_thread_spawn() calls on task allocation errors. - Remove extra taskq_thread_should_stop() call. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15873	2024-04-19 10:13:38 -07:00
Alexander Motin	fdd97e0093	Refactor dmu_prefetch(). - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15076	2024-04-19 10:13:38 -07:00
Rob N	3c5f354a8c	zvol_os: fix compile with blk-mq on Linux 4.x `99741bde5` accesses a cached blk-mq hardware context through the mq_hctx field of struct request. However, this field did not exist until 5.0. Before that, the private function blk_mq_map_queue() was used to dig it out of broader queue context. This commit detects this situation, and handles it with a poor-man's simulation of that function. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16069	2024-04-17 10:10:24 -07:00
Rob N	5c0fe099ec	zvol_os: fix build on Linux <3.13 `99741bde5` introduced zvol_num_taskqs, but put it behind the HAVE_BLK_MQ define, preventing builds on versions of Linux that don't have it (<3.13, incl EL7). Nothing about it seems dependent on blk-mq, so this just moves it out from behind that define and so fixes the build. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16062	2024-04-17 10:10:24 -07:00
Ameer Hamza	5fc134ff2f	zvol: use multiple taskq Currently, zvol uses a single taskq, resulting in throughput bottleneck under heavy load due to lock contention on the single taskq. This patch addresses the performance bottleneck under heavy load conditions by utilizing multiple taskqs, thus mitigating lock contention. The number of taskqs scale dynamically based on the available CPUs in the system, as illustrated below: taskq total cpus taskqs threads threads ------- ------- ------- ------- 1 1 32 32 2 1 32 32 4 1 32 32 8 2 16 32 16 3 11 33 32 5 7 35 64 8 8 64 128 11 12 132 256 16 16 256 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15992	2024-04-17 10:10:24 -07:00
Rob Norris	7ad2616d37	vdev_disk: fix alignment check when buffer has non-zero starting offset If a linear buffer spans multiple pages, and the first page has a non-zero starting offset, the checker would not include the offset, and so would think there was an alignment gap at the end of the first page, rather than at the start. That is, for a 16K buffer spread across five pages with an initial 512B offset: [.XXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXXX][XXXXXXX.] It would be interpreted as: [XXXXXXX.][XXXXXXXX]... And be rejected as misaligned. Since it's already a linear ABD, the "linearising" copy would just reuse the buffer as-is, and the second check would failing, tripping the VERIFY in vdev_disk_io_rw(). This commit fixes all this by including the offset in the check for end-of-page alignment. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> (cherry picked from commit `1bf649cb0a`)	2024-04-12 08:53:48 -07:00
Rob N	d0d9dccc61	vdev_disk: ensure trim errors are returned immediately After `08fd5ccc3`, the discard issuing code was organised such that if requesting an async discard or secure erase failed before the IO was issued (that is, calling __blkdev_issue_discard() returned an error), the failed zio would never be executed, resulting in txg_sync hanging forever waiting for IO to finish. This commit fixes that by immediately executing a failed zio on error. To handle the successful synchronous op case, we fake an async op by, when not using an asynchronous submission method, queuing the successful result zio as part of the discard handler. Since it was hard to understand the differences between discard and secure erase, and sync and async, across different kernel versions, I've commented and reorganised the code a bit to try and make everything more contained and linear. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> (cherry picked from commit `ba9f587a77`)	2024-04-11 12:25:40 -07:00
Rob Norris	28520cad25	vdev_disk: don't touch vbio after its handed off to the kernel After IO is unplugged, it may complete immediately and vbio_completion be called on interrupt context. That may interrupt or deschedule our task. If its the last bio, the vbio will be freed. Then, we get rescheduled, and try to write to freed memory through vbio->. This patch just removes the the cleanup, and the corresponding assert. These were leftovers from a previous iteration of vbio_submit() and were always "belt and suspenders" ops anyway, never strictly required. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc Reported-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> (cherry picked from commit `917ff75e95`)	2024-04-08 10:13:55 -07:00
Robert Evans	deb7a84231	Fix corruption caused by mmap flushing problems 1) Make mmap flushes synchronous. Linux may skip flushing dirty pages already in writeback unless data-integrity sync is requested. 2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be skipped due to DMU pushing back on TX assign. 3) Add missing mmap flush when doing block cloning. 4) While here, pass errors from putpage to writepage/writepages. This change fixes corruption edge cases, but unfortunately adds synchronous ZIL flushes for dirty mmap pages to llseek and bclone operations. It may be possible to avoid these sync writes later but would need more tricky refactoring of the writeback code. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Robert Evans <evansr@google.com> Closes #15933 Closes #16019	2024-03-29 17:10:04 -07:00
Rob Norris	eebf00bee9	vdev_disk: default to classic submission for 2.2.x We don't want to change to brand-new code in the middle of a stable series, but we want it available to test for people running into page splitting issues. This commits make zfs_vdev_disk_classic=1 the default, and updates the documentation to better explain what's going on. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.	2024-03-28 13:29:46 -07:00
Rob Norris	d0b3be763f	abd_iter_page: don't use compound heads on Linux <4.5 Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages in a compound page were refcounted separately. This means that using the head page without taking a reference to it could see it cleaned up later before we're finished with it. Specifically, bio_add_page() would take a reference, and drop its reference after the bio completion callback returns. If the zio is executed immediately from the completion callback, this is usually ok, as any data is referenced through the tail page referenced by the ABD, and so becomes "live" that way. If there's a delay in zio execution (high load, error injection), then the head page can be freed, along with any dirty flags or other indicators that the underlying memory is used. Later, when the zio completes and that memory is accessed, its either unmapped and an unhandled fault takes down the entire system, or it is mapped and we end up messing around in someone else's memory. Both of these are very bad. The solution on these older kernels is to take a reference to the head page when we use it, and release it when we're done. There's not really a sensible way under our current structure to do this; the "best" would be to keep a list of head page references in the ABD, and release them when the ABD is freed. Since this additional overhead is totally unnecessary on 4.5+, where head and tail pages share refcounts, I've opted to simply not use the compound head in ABD page iteration there. This is theoretically less efficient (though cleaning up head page references would add overhead), but its safe, and we still get the other benefits of not mapping pages before adding them to a bio and not mis-splitting pages. There doesn't appear to be an obvious symbol name or config option we can match on to discover this behaviour in configure (and the mm/page APIs have changed a lot since then anyway), so I've gone with a simple version check. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `c6be6ce175`)	2024-03-28 13:29:46 -07:00
Rob Norris	cb599d27ed	vdev_disk: use bio_chain() to submit multiple BIOs Simplifies our code a lot, so we don't have to wait for each and reassemble them. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `72fd834c47`)	2024-03-28 13:29:46 -07:00
Rob Norris	af3a5bb40d	vdev_disk: add module parameter to select BIO submission method This makes the submission method selectable at module load time via the `zfs_vdev_disk_classic` parameter, allowing this change to be backported to 2.2 safely, and disabled in favour of the "classic" submission method if new problems come up. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `df2169d141`)	2024-03-28 13:29:46 -07:00
Rob Norris	51c2bd0def	vdev_disk: rewrite BIO filling machinery to avoid split pages This commit tackles a number of issues in the way BIOs (`struct bio`) are constructed for submission to the Linux block layer. The kernel has a hard upper limit on the number of pages/segments that can be added to a BIO, as well as a separate limit for each device (related to its queue depth and other scheduling characteristics). ZFS counts the number of memory pages in the request ABD (`abd_nr_pages_off()`, and then uses that as the number of segments to put into the BIO, up to the hard upper limit. If it requires more than the limit, it will create multiple BIOs. Leaving aside the fact that page count method is wrong (see below), not limiting to the device segment max means that the device driver will need to split the BIO in half. This is alone is not necessarily a problem, but it interacts with another issue to cause a much larger problem. The kernel function to add a segment to a BIO (`bio_add_page()`) takes a `struct page` pointer, and offset+len within it. `struct page` can represent a run of contiguous memory pages (known as a "compound page"). In can be of arbitrary length. The ZFS functions that count ABD pages and load them into the BIO (`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never consider a page to be more than `PAGE_SIZE` (4K), even if the `struct page` is for multiple pages. In this case, it will load the same `struct page` into the BIO multiple times, with the offset adjusted each time. With a sufficiently large ABD, this can easily lead to the BIO being entirely filled much earlier than it could have been. This is also further contributes to the problem caused by the incorrect segment limit calculation, as its much easier to go past the device limit, and so require a split. Again, this is not a problem on its own. The logic for "never submit more than `PAGE_SIZE`" is actually a little more subtle. It will actually never submit a buffer that crosses a 4K page boundary. In practice, this is fine, as most ABDs are scattered, that is a list of complete 4K pages, and so are loaded in as such. Linear ABDs are typically allocated from slabs, and for small sizes they are frequently not aligned to page boundaries. For example, a 12K allocation can span four pages, eg: -- 4K -- -- 4K -- -- 4K -- -- 4K -- \| \| \| \| \| :## ######## ######## ######: [1K, 4K, 4K, 3K] Such an allocation would be loaded into a BIO as you see: [1K, 4K, 4K, 3K] This tends not to be a problem in practice, because even if the BIO were filled and needed to be split, each half would still have either a start or end aligned to the logical block size of the device (assuming 4K at least). --- In ideal circumstances, these shortcomings don't cause any particular problems. Its when they start to interact with other ZFS features that things get interesting. Aggregation will create a "gang" ABD, which is simply a list of other ABDs. Iterating over a gang ABD is just iterating over each ABD within it in turn. Because the segments are simply loaded in order, we can end up with uneven segments either side of the "gap" between the two ABDs. For example, two 12K ABDs might be aggregated and then loaded as: [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K] Should a split occur, each individual BIO can end up either having an start or end offset that is not aligned to the logical block size, which some drivers (eg SCSI) will reject. However, this tends not to happen because the default aggregation limit usually keeps the BIO small enough to not require more than one split, and most pages are actually full 4K pages, so hitting an uneven gap is very rare anyway. If the pool is under particular memory pressure, then an IO can be broken down into a "gang block", a 512-byte block composed of a header and up to three block pointers. Each points to a fragment of the original write, or in turn, another gang block, breaking the original data up over and over until space can be found in the pool for each of them. Each gang header is a separate 512-byte memory allocation from a slab, that needs to be written down to disk. When the gang header is added to the BIO, its a single 512-byte segment. Pulling all this together, consider a large aggregated write of gang blocks. This results a BIO containing lots of 512-byte segments. Given our tendency to overfill the BIO, a split is likely, and most possible split points will yield a pair of BIOs that are misaligned. Drivers that care, like the SCSI driver, will reject them. --- This commit is a substantial refactor and rewrite of much of `vdev_disk` to sort all this out. `vdev_bio_max_segs()` now returns the ideal maximum size for the device, if available. There's also a tuneable `zfs_vdev_disk_max_segs` to override this, to assist with testing. We scan the ABD up front to count the number of pages within it, and to confirm that if we submitted all those pages to one or more BIOs, it could be split at any point with creating a misaligned BIO. If the pages in the BIO are not usable (as in any of the above situations), the ABD is linearised, and then checked again. This is the same technique used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size and allocator quirks. `vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The idea is simply that it can hold all the state needed to create, submit and return multiple BIOs, including all the refcounts, the ABD copy if it was needed, and so on. Apart from what I hope is a clearer interface, the major difference is that because we know how many BIOs we'll need up front, we don't need the old overflow logic that would grow the BIO array, throw away all the old work and restart. We can get it right from the start. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `06a196020e`)	2024-03-28 13:29:46 -07:00
Rob Norris	03ff875e09	vdev_disk: make read/write IO function configurable This is just setting up for the next couple of commits, which will add a new IO function and a parameter to select it. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `c4a13ba483`)	2024-03-28 13:29:46 -07:00
Rob Norris	13b5348848	vdev_disk: reorganise vdev_disk_io_start Light reshuffle to make it a bit more linear to read and get rid of a bunch of args that aren't needed in all cases. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `867178ae1d`)	2024-03-28 13:29:46 -07:00
Rob Norris	4820185031	vdev_disk: rename existing functions to vdev_classic_* This is just renaming the existing functions we're about to replace and grouping them together to make the next commits easier to follow. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `f3b85d706b`)	2024-03-28 13:29:46 -07:00
Rob Norris	52a2af6fd1	abd: add page iterator The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds adb_iterate_page_func, which yields unmapped struct page instead. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588 (cherry picked from commit `390b448726`)	2024-03-28 13:29:46 -07:00
Rob N	58211157bf	Linux 6.8 compat: use splice_copy_file_range() for fallback Linux 6.8 removes generic_copy_file_range(), which had been reduced to a simple wrapper around splice_copy_file_range(). Detect that function directly and use it if generic_ is not available. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15930 Closes #15931 (cherry picked from commit `ef08a4d406`)	2024-03-21 09:35:17 -07:00
Alexander Motin	c0c4866f8a	dmu: Allow buffer fills to fail When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one. On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance. This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not not before it, that would happen if we just call dbuf_undirty(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665	2024-02-20 15:53:02 -08:00
Umer Saleem	08fd5ccc38	Improve performance for zpool trim on linux On Linux, ZFS uses blkdev_issue_discard in vdev_disk_io_trim to issue trim command which is synchronous. This commit updates vdev_disk_io_trim to use __blkdev_issue_discard, which is asynchronous. Unfortunately there isn't any asynchronous version for blkdev_issue_secure_erase, so performance of secure trim will still suffer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #15843	2024-02-06 12:58:55 -08:00
Tony Hutter	00d85a98ea	BRT: Fix FICLONE/FICLONERANGE shortened copy On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never\|auto\|always` works correctly. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15728 Closes #15842	2024-02-06 10:01:15 -08:00
Rob Norris	09e6724e1e	Linux 6.8 compat: replace MAX_ORDER define MAX_ORDER has been renamed to MAX_PAGE_ORDER. Rather than just redefining it, instead define our own name and set it consistently from the start. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805	2024-01-29 14:53:29 -08:00
Rob Norris	7466e09a49	Linux 6.8 compat: implement strlcpy fallback Linux has removed strlcpy in favour of strscpy. This implements a fallback implementation of strlcpy for this case. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805	2024-01-29 14:53:29 -08:00
Rob Norris	ce782d0804	Linux 6.8 compat: update for new bdev access functions blkdev_get_by_path() and blkdev_put() have been replaced by bdev_open_by_path() and bdev_release(), which return a "handle" object with the bdev object itself inside. This adds detection for the new functions, and macros to handle the old and new forms consistently. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15805	2024-01-29 14:53:29 -08:00
Tino Reichardt	276be5357c	linux spl: fix typo in top comment of spl-condvar.c Credential Implementation -> Condition Variables Implementation Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #15782	2024-01-29 14:53:29 -08:00
youzhongyang	6b64acc157	Make spl_kmem_cache size check consistent On Linux x86_64, kmem cache can have size up to 4M, however increasing spl_kmem_cache_slab_limit can lead to crash due to the size check inconsistency. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #15757	2024-01-29 14:53:29 -08:00
Mark Johnston	22e4f08c30	Linux: Defer loading the object set in zfs_setattr() We need to wait until after having done a zfs_enter() to load some fields from the zfsvfs structure. Otherwise a use-after-free is possible in the face of a concurrent rollback. Other functions in this file are careful to avoid this bug, I believe this is the only instance. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15752	2024-01-29 14:53:29 -08:00
Pawel Jakub Dawidek	3425484eb9	Fix file descriptor leak on pool import. Descriptor leak can be easily reproduced by doing: # zpool import tank # sysctl kern.openfiles # zpool export tank; zpool import tank # sysctl kern.openfiles We were leaking four file descriptors on every import. Similar leak most likely existed when using file-based VDEVs. External-issue: https://reviews.freebsd.org/D43529 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #15630	2024-01-26 13:38:25 -08:00
Rob N	2ecc2dfe42	Linux 6.7 compat: zfs_setattr fix atime update In `db4fc559c` I messed up and changed this bit of code to set the inode atime to an uninitialised value, when actually it was just supposed to loading the atime from the inode to be stored in the SA. This changes it to what it should have been. Ensure times change by the right amount Previously, we only checked if the times changed at all, which missed a bug where the atime was being set to an undefined value. Now ensure the times change by two seconds (or thereabouts), ensuring we catch cases where we set the time to something bonkers Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #15762 Closes #15773	2024-01-17 08:59:28 -08:00
Alexander Motin	ad47eca195	ZIL: Assert record sizes in different places This should make sure we have log written without overflows. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15517	2024-01-08 16:11:39 -08:00
Alexander Motin	a8c29a79df	Linux: Reclaim unused spl_kmem_cache_reclaim It is unused for 3 years since #10576. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15507	2024-01-08 16:11:39 -08:00
Alexander Motin	f13593619b	FreeBSD: Optimize large kstat outputs - Use sbuf_new_for_sysctl() to reduce double-buffering on sysctl output. - Use much faster sbuf_cat() instead of sbuf_printf("%s"). Together it reduces `sysctl kstat.zfs.misc.dbufs` time from minutes to seconds, making dbufstat almost usable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15495	2024-01-08 16:11:39 -08:00
Alan Somers	c34fe8dcbc	Update the kstat dataset_name when renaming a zvol Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15482 Closes #15486	2024-01-08 16:11:39 -08:00
Brian Behlendorf	d530d5d8a5	Linux 6.5 compat: check BLK_OPEN_EXCL is defined On some systems we already have blkdev_get_by_path() with 4 args but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined. The vdev_bdev_mode() function was added to handle this case but there was no generic way to specify exclusive access. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15692	2023-12-21 16:19:48 -08:00
Rob Norris	03b84099d9	linux 6.7 compat: rework shrinker setup for heap allocations 6.7 changes the shrinker API such that shrinkers must be allocated dynamically by the kernel. To accomodate this, this commit reworks spl_register_shrinker() to do something similar against earlier kernels. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn	2023-12-21 11:03:08 -08:00
Rob Norris	18a9185165	linux 6.7 compat: handle superblock shrinker member change In 6.7 the superblock shrinker member s_shrink has changed from being an embedded struct to a pointer. Detect this, and don't take a reference if it already is one. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn	2023-12-21 11:03:08 -08:00
Rob Norris	3c13601a12	linux 6.7 compat: use inode atime/mtime accessors 6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and i_mtime. This extends the method used for ctime in `b37f29341` to atime and mtime as well. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn	2023-12-21 11:03:08 -08:00
rmacklem	522414da3b	FreeBSD: Fix ZFS so that snapshots under .zfs/snapshot are NFS visible Call vfs_exjail_clone() for mounts created under .zfs/snapshot to fill in the mnt_exjail field for the mount. If this is not done, the snapshots under .zfs/snapshot with not be accessible over NFS. This version has the argument name in vfs.h fixed to match that of the name in spl_vfs.c, although it really does not matter. External-issue: https://reviews.freebsd.org/D42672 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #15563	2023-11-29 14:08:46 -08:00
Alan Somers	349fb77f11	FreeBSD: Fix the build on FreeBSD 12 It was broken for several reasons: * VOP_UNLOCK lost an argument in 13.0. So OpenZFS should be using VOP_UNLOCK1, but a few direct calls to VOP_UNLOCK snuck in. * The location of the zlib header moved in 13.0 and 12.1. We can drop support for building on 12.0, which is EoL. * knlist_init lost an argument in 13.0. OpenZFS change `9d0887402b` assumed 13.0 or later. * FreeBSD 13.0 added copy_file_range, and OpenZFS change `67a1b03791` assumed 13.0 or later. Sponsored-by: Axcient Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #15551	2023-11-28 15:19:07 -08:00
Alexander Motin	56a2a0981e	ZIL: Do not encrypt block pointers in lr_clone_range_t In case of crash cloned blocks need to be claimed on pool import. It is only possible if they (lr_bps) and their count (lr_nbps) are not encrypted but only authenticated, similar to block pointer in lr_write_t. Few other fields can be and are still encrypted. This should fix panic on ZIL claim after crash when block cloning is actively used. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513	2023-11-28 11:17:52 -08:00
Tony Hutter	479dca51c6	zfs-2.2.1: Disable block cloning by default Disable block cloning by default to mitigate possible data corruption (see #15529 and #15526). Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2023-11-16 14:23:03 -08:00
Rich Ercolani	87e9e82865	Add a tunable to disable BRT support. Copy the disable parameter that FreeBSD implemented, and extend it to work on Linux as well, until we're sure this is stable. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #15529	2023-11-16 14:23:03 -08:00
Low-power	f2fe4d51a8	Linux: reject read/write mapping to immutable file only on VM_SHARED Private read/write mapping can't be used to modify the mapped files, so they will remain be immutable. Private read/write mappings are usually used to load the data segment of executable files, rejecting them will rendering immutable executable files to stop working. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #15344	2023-11-16 14:23:03 -08:00
Umer Saleem	44c8ff9b0c	Linux 6.6 compat: fix implicit conversion error with debug build With Linux v6.6.0 and GCC 12, when debug build is configured, implicit conversion error is raised while converting 'enum <anonymous>' to 'boolean_t'. Use 'B_TRUE' instead of 'true' to fix the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #15489	2023-11-16 14:23:03 -08:00
Alexander Motin	3ec4ea68d4	Unify arc_prune_async() code There is no sense to have separate implementations for FreeBSD and Linux. Make Linux code shared as more functional and just register FreeBSD-specific prune callback with arc_add_prune_callback() API. Aside of code cleanup this should fix excessive pruning on FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15456	2023-11-08 12:15:41 -08:00
Coleman Kane	3f67e012e4	Linux 6.6 compat: fsync_bdev() has been removed in favor of sync_blockdev() In Linux commit 560e20e4bf6484a0c12f9f3c7a1aa55056948e1e, the fsync_bdev() function was removed in favor of sync_blockdev() to do (roughly) the same thing, given the same input. This change conditionally attempts to call sync_blockdev() if fsync_bdev() isn't discovered during configure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15263	2023-11-08 12:15:41 -08:00
Coleman Kane	21875dd090	Linux 6.6 compat: generic_fillattr has a new u32 request_mask added at arg2 In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument for the request_mask was added to generic_fillattr. This is the same request_mask for statx that's present in the most recent API implemented by zpl_getattr_impl. This commit conditionally adds it to the zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...) implementation, when configure determines it's present in the kernel's generic_fillattr(...). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15263	2023-11-08 12:15:41 -08:00
Coleman Kane	fe9d409e90	Linux 6.6 compat: use inode_get/set_ctime(...) In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access to the i_ctime member of struct inode was removed. The new approach is to use accessor methods that exclusively handle passing the timestamp around by value. This change adds new tests for each of these functions and introduces zpl_ equivalents in include/os/linux/zfs/sys/zpl.h. In where the inode_get/set_ctime() functions exist, these zpl_ calls will be mapped to the new functions. On older kernels, these macros just wrap direct-access calls. The code that operated on an address of ip->i_ctime to call ZFS_TIME_DECODE() now will take a local copy using zpl_inode_get_ctime(), and then pass the address of the local copy when performing the ZFS_TIME_DECODE() call, in all cases, rather than directly accessing the member. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15263 Closes #15257	2023-11-08 12:15:41 -08:00
Tony Hutter	8ba748d414	Revert "zvol: Temporally disable blk-mq" This reverts commit `aefb6a2bd6`. `aefb6a2bd` temporally disabled blk-mq until we could fix a fix for Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15439	2023-11-06 16:47:32 -08:00
Tony Hutter	e860cb0200	zvol: Remove broken blk-mq optimization This fix removes a dubious optimization in zfs_uiomove_bvec_rq() that saved the iterator contents of a rq_for_each_segment(). This optimization allowed restoring the "saved state" from a previous rq_for_each_segment() call on the same uio so that you wouldn't need to iterate though each bvec on every zfs_uiomove_bvec_rq() call. However, if the kernel is manipulating the requests/bios/bvecs under the covers between zfs_uiomove_bvec_rq() calls, then it could result in corruption from using the "saved state". This optimization results in an unbootable system after installing an OS on a zvol with blk-mq enabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #15351	2023-11-06 16:47:24 -08:00
Olivier Certner	edebca5dfc	FreeBSD: taskq: Remove unused declaration Variable 'uma_align_cache' has not been used since commit "FreeBSD: Use a hash table for taskqid lookups" (`3933305ea`). Moreover, it is soon going to become private to FreeBSD's UMA in 15.0-CURRENT (main), 14.0-STABLE (stable/14) and 13.2-STABLE (stable/13). Should accessing this information become necessary again, one will have to use the new accessors for recent versions. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olivier Certner <olce.freebsd@certner.fr> Closes #15416	2023-11-06 16:46:32 -08:00

1 2 3 4 5 ...

822 Commits