mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-22 08:51:30 +03:00

Author	SHA1	Message	Date
Ryan Lahfa	fdb8fff916	linux/spl/kmem_cache: undefine `kmem_cache_alloc` before defining it When compiling a kernel with bcachefs and zfs, the two macros will collide, making it impossible to have both filesystems. It is sufficient to just undefine the macro before calling it. On why this should be in ZFS rather than bcachefs, currently, bcachefs is not a in-tree filesystem, but, it has a reasonably high chance of getting included soon. This avoids the breakage in ZFS early, this patch may be distributed downstream in NixOS and is already used there. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz> Closes #15144	2023-08-07 13:55:59 -07:00
Alexander Motin	6c94e64963	Refactor dmu_prefetch(). - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15076	2023-08-07 13:54:41 -07:00
Mateusz Piotrowski	a97b8fc2dd	Fix some typos Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org> Closes #15141	2023-08-07 13:53:59 -07:00
Rob N	114a39964f	zdb: include cloned blocks in block statistics This gives `zdb -b` support for clone blocks. Previously, it didn't know what clones were, so would count their space allocation multiple times and then report leaked space (or, in debug, would assert trying to claim blocks a second time). This commit fixes those bugs, and reports the number of clones and the space "used" (saved) by them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15123	2023-08-01 08:56:30 -07:00
Coleman Kane	6751634d77	Linux 4.20 compat: wrapper function for iov_iter type access An iov_iter_type() function to access the "type" member of the struct iov_iter was added at one point. Move the conditional logic to decide which method to use for accessing it into a macro and simplify the zpl_uio_init code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15100	2023-08-01 08:42:33 -07:00
Coleman Kane	325505e5c4	Linux 6.4 compat: iter_iov() function now used to get old iov member The iov_iter->iov member is now iov_iter->__iov and must be accessed via the accessor function iter_iov(). Create a wrapper that is conditionally compiled to use the access method appropriate for the target kernel version. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15100	2023-08-01 08:42:26 -07:00
Coleman Kane	43e8f6e37f	Linux 6.5 compat: blkdev changes Multiple changes to the blkdev API were introduced in Linux 6.5. This includes passing (void* holder) to blkdev_put, adding a new blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type that replaces uses of fmode_t, and removing an argument from the release handler on block_device_operations that we weren't using. The open function definition has also changed to take gendisk* and blk_mode_t, so update it accordingly, too. Implement local wrappers for blkdev_get_by_path() and vdev_blkdev_put() so that the in-line calls are cleaner, and place the conditionally-compiled implementation details inside of both of these local wrappers. Both calls are exclusively used within vdev_disk.c, at this time. Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE The wrapper function is now used for testing using the appropriate method for the kernel, whether the open mode is writable or not. Emphasize fmode_t arg in zvol_release is not used Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15099	2023-08-01 08:37:20 -07:00
Coleman Kane	3b8e318b77	Linux 6.5 compat: use disk_check_media_change when it exists When disk_check_media_change() exists, then define zfs_check_media_change() to simply call disk_check_media_change() on the bd_disk member of its argument. Since disk_check_media_change() is newer than when revalidate_disk was present in bops, we should be able to safely do this via a macro, instead of recreating a new implementation of the inline function that forces revalidation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15101	2023-08-01 08:32:38 -07:00
Alexander Motin	b22bab2547	Remove fastwrite mechanism. Fastwrite was introduced many years ago to improve ZIL writes spread between multiple top-level vdevs by tracking number of allocated but not written blocks and choosing vdev with smaller count. It suposed to reduce ZIL knowledge about allocation, but actually made ZIL to even more actively report allocation code about the allocations, complicating both ZIL and metaslabs code. On top of that, it seems ZIO_FLAG_FASTWRITE setting in dmu_sync() was lost many years ago, that was one of the declared benefits. Plus introduction of embedded log metaslab class solved another problem with allocation rotor accounting both normal and log allocations, since in most cases those are now in different metaslab classes. After all that, I'd prefer to simplify already too complicated ZIL, ZIO and metaslab code if the benefit of complexity is not obvious. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15107	2023-07-28 13:30:33 -07:00
Rob Norris	6b0a4be5fe	linux: implement filesystem-side copy/clone functions for EL7 Redhat have backported copy_file_range and clone_file_range to the EL7 kernel using an "extended file operations" wrapper structure. This connects all that up to let cloning work there too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-24 16:37:04 -07:00
Rob Norris	9927f219f1	linux: implement filesystem-side clone ioctls Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and were implemented as regular filesystem-specific ioctls. This implements those ioctls directly in OpenZFS, allowing cloning to work on older kernels. There's no need to gate these behind version checks; on later kernels Linux will simply never deliver these ioctls, instead calling the approprate VFS op. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-24 16:36:54 -07:00
Rob Norris	5a35c68b67	linux: implement filesystem-side copy/clone functions This implements the Linux VFS ops required to service the file copy/clone APIs: .copy_file_range (4.5+) .clone_file_range (4.5-4.19) .dedupe_file_range (4.5-4.19) .remap_file_range (4.20+) Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are hooked up here, but are not implemented yet. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Closes #15050	2023-07-24 16:36:38 -07:00
Alexander Motin	34b3d498a9	Adjust prefetch parameters. - Reduce maximum prefetch distance for 32bit platforms to 8MB as it was previously. Those systems didn't grow much probably, so better stay conservative there. - Retire array_rd_sz tunable, blocking prefetch for large requests. We should not penalize applications trying to be more efficient. The speculative prefetcher by itself has reasonable distance limits, and 1MB is not much at all these days. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15072	2023-07-21 11:51:47 -07:00
Alexander Motin	28430b51e3	Add explicit prefetches to bpobj_iterate(). To simplify error handling bpobj_iterate_blkptrs() iterates through the list of block pointers backwards. Unfortunately speculative prefetcher is currently unable to detect such patterns, that makes each block read there synchronous and very slow on HDD pools. According to my tests, added explicit prefetch reduces time needed to asynchronously delete 8 snapshots of 4 million blocks each from 20 seconds to less than one, that should free sync thread for other useful work, such as async writes, scrub, etc. While there, plug one memory leak in case of bpobj_open() error and harmonize some variable names. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15071	2023-07-21 11:50:48 -07:00
Alan Somers	6fd87e1d8d	Don't emit cksum_{actual_expected} in ereport.fs.zfs.checksum events With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes #14717 for mirrored pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #14717 Closes #15052	2023-07-21 11:49:26 -07:00
Alan Somers	cf2a225b24	Don't emit checksum histograms in ereport.fs.zfs.checksum events The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue #14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15052	2023-07-21 11:48:17 -07:00
Ameer Hamza	d9bb583c25	spa_min_alloc should be GCD, not min Since spa_min_alloc may not be a power of 2, unlike ashifts, in the case of DRAID, we should not select the minimal value among several vdevs. Rounding to a multiple of it is unlikely to work for other vdevs. Instead, using the greatest common divisor produces smaller yet more reasonable results. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15067	2023-07-20 10:23:52 -07:00
Ameer Hamza	4d2dad04aa	Ignore pool ashift property during vdev attachment Ashift can be set for a vdev only during its creation, and the top-level vdev does not change when a vdev is attached or replaced. The ashift property should not be used during attachment, as it does not allow attaching/replacing a vdev if the pool's ashift property is increased after the existing vdev was created. Instead, we should be able to attach the vdev if the attached vdev can satisfy the ashift requirement with its parent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15061	2023-07-20 09:57:16 -07:00
Coleman Kane	74f8ce4ca5	Linux 6.5 compat: disk_check_media_change() was added The disk_check_media_change() function was added which replaces bdev_check_media_change. This change was introduced in 6.5rc1 444aa2c58cb3b6cfe3b7cc7db6c294d73393a894 and the new function takes a gendisk* as its argument, no longer a block_device*. Thus, bdev->bd_disk is now used to pass the expected data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15060	2023-07-20 09:09:25 -07:00
Yuri Pankov	8beabfd3bf	set autotrim default to 'off' everywhere As it turns out having autotrim default to 'on' on FreeBSD never really worked due to mess with defines where userland and kernel module were getting different default values (userland was defaulting to 'off', module was thinking it's 'on'). Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org> Closes #15079	2023-07-20 09:06:55 -07:00
Coleman Kane	d3d63cac4d	Linux 6.5 compat: BLK_STS_NEXUS renamed to BLK_STS_RESV_CONFLICT This change was introduced in Linux commit 7ba150834b840f6f5cdd07ca69a4ccf39df59a66 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15059	2023-07-14 16:33:51 -07:00
Coleman Kane	3a3e0d6fbc	intptr_t definition is canonically signed Make the version here match that elsewhere in the kernel and system headers. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #15058	2023-07-14 16:32:49 -07:00
Mateusz Guzik	6c9aa1d2a6	FreeBSD: catch up to __FreeBSD_version 1400093 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #15036	2023-07-13 09:06:57 -07:00
Brian Behlendorf	ac8ae18d22	Revert "spa.h: use IN_BASE instead of IN_FREEBSD_BASE" This reverts commit `77a3bb1f47`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2023-06-30 10:03:46 -07:00
Alexander Motin	b4a0873092	Some ZIO micro-optimizations. - Pack struct zio_prop by 4 bytes from 84 to 80. - Skip new child ZIO locking while linking to parent. The newly allocated ZIO is not externally visible yet, so nobody should care. - Skip io_bp_copy writes when not used (write && non-debug). Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14985	2023-06-30 08:54:00 -07:00
Alexander Motin	fa7b2390d4	Do not report bytes skipped by scan as issued. Scan process may skip blocks based on their birth time, DVA, etc. Traditionally those blocks were accounted as issued, that caused reporting of hugely over-inflated numbers, having nothing to do with actual disk I/O. This change utilizes never used field in struct dsl_scan_phys to account such skipped bytes, allowing to report how much data were actually scrubbed/resilvered and what is the actual I/O speed. While formally it is an on-disk format change, it should be compatible both ways, so should not need a feature flag. This should partially address the same issue as `c85ac731a0`, but from a different perspective, complementing it. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15007	2023-06-30 08:47:13 -07:00
Yuri Pankov	77a3bb1f47	spa.h: use IN_BASE instead of IN_FREEBSD_BASE Consistently get the proper default value for autotrim. Currently, only the kernel module is built with IN_FREEBSD_BASE, and libzfs get the wrong default value, leading to confusion and incorrect output when autotrim value was not set explicitly. Reviewed-by: Warner Losh <imp@bsdimp.com> Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org> Closes #15016	2023-06-29 11:50:52 -07:00
Alexander Motin	8469b5aac0	Another set of vdev queue optimizations. Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from time-sorted AVL-trees to simple lists. AVL-trees are too expensive for such a simple task. To change I/O priority without searching through the trees, add io_queue_state field to struct zio. To not check number of queued I/Os for each priority add vq_cqueued bitmap to struct vdev_queue. Update it when adding/removing I/Os. Make vq_cactive a separate array instead of struct vdev_queue_class member. Together those allow to avoid lots of cache misses when looking for work in vdev_queue_class_to_issue(). Introduce deadline of ~0.5s for LBA-sorted queues. Before this I saw some I/Os waiting in a queue for up to 8 seconds and possibly more due to starvation. With this change I no longer see it. I had to slightly more complicate the comparison function, but since it uses all the same cache lines the difference is minimal. For a sequential I/Os the new code in vdev_queue_io_to_issue() actually often uses more simple avl_first(), falling back to avl_find() and avl_nearest() only when needed. Arrange members in struct zio to access only one cache line when searching through vdev queues. While there, remove io_alloc_node, reusing the io_queue_node instead. Those two are never used same time. Remove zfs_vdev_aggregate_trim parameter. It was disabled for 4 years since implemented, while still wasted time maintaining the offset-sorted tree of TRIM requests. Just remove the tree. Remove locking from txg_all_lists_empty(). It is racy by design, while 2 pair of locks/unlocks take noticeable time under the vdev queue lock. With these changes in my tests with volblocksize=4KB I measure vdev queue lock spin time reduction by 50% on read and 75% on write. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14925	2023-06-27 09:09:48 -07:00
Rich Ercolani	35a6247c5f	Add a delay to tearing down threads. It's been observed that in certain workloads (zvol-related being a big one), ZFS will end up spending a large amount of time spinning up taskqs only to tear them down again almost immediately, then spin them up again... I noticed this when I looked at what my mostly-idle system was doing and wondered how on earth taskq creation/destroy was a bunch of time... So I added a configurable delay to avoid it tearing down tasks the first time it notices them idle, and the total number of threads at steady state went up, but the amount of time being burned just tearing down/turning up new ones almost vanished. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14938	2023-06-26 13:57:12 -07:00
Alexander Motin	ccec7fbe1c	Remove ARC/ZIO physdone callbacks. Those callbacks were introduced many years ago as part of a bigger patch to smoothen the write throttling within a txg. They allow to account completion of individual physical writes within a logical one, improving cases when some of physical writes complete much sooner than others, gradually opening the write throttle. Few years after that ZFS got allocation throttling, working on a level of logical writes and limiting number of writes queued to vdevs at any point, and so limiting latency distribution between the physical writes and especially writes of multiple copies. The addition of scheduling deadline I proposed in #14925 should further reduce the latency distribution. Grown memory sizes over the past 10 years should also reduce importance of the smoothing. While the use of physdone callback may still in theory provide some smoother throttling, there are cases where we simply can not afford it. Since dirty data accounting is protected by pool-wide lock, in case of 6-wide RAIDZ, for example, it requires us to take it 8 times per logical block write, creating huge lock contention. My tests of this patch show radical reduction of the lock spinning time on workloads when smaller blocks are written to RAIDZ pools, when each of the disks receives 8-16KB chunks, but the total rate reaching 100K+ blocks per second. Same time attempts to measure any write time fluctuations didn't show anything noticeable. While there, remove also io_child_count/io_parent_count counters. They are used only for couple assertions that can be avoided. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14948	2023-06-15 10:49:03 -07:00
Alexander Motin	d057807ede	Switch refcount tracking from lists to AVL-trees. With large number of tracked references list searches under the lock become too expensive, creating enormous lock contention. On my tests with ZFS_DEBUG enabled this increases write throughput with 32KB blocks from ~1.2GB/s to ~7.5GB/s. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14970	2023-06-14 08:02:27 -07:00
Alexander Motin	70ea484e3e	Finally drop long disabled vdev cache. It was a vdev level read cache, designed to aggregate many small reads by speculatively issuing bigger reads instead and caching the result. But since it has almost no idea about what is going on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers, it was found to make more harm than good, for which reason it was disabled for the past 12 years. These days we have much better instruments to enlarge the I/Os, such as speculative and prescient prefetches, I/O scheduler, I/O aggregation etc. Besides just the dead code removal this removes one extra mutex lock/unlock per write inside vdev_cache_write(), not otherwise disabled and trying to do some work. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14953	2023-06-09 12:40:55 -07:00
Rob Norris	2b9f8ba673	znode: expose zfs_get_zplprop to libzpool There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14642	2023-06-05 11:53:44 -07:00
Alexander Motin	5ba4025a8d	Introduce zfs_refcount_(add\|remove)_few(). There are two places where we need to add/remove several references with semantics of zfs_refcount_(add\|remove). But when debug/tracing is disabled, it is a crime to run multiple atomic_inc() in a loop, especially under congested pool-wide allocator lock. Introduced new functions implement the same semantics as the loop, but without overhead in production builds. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14934	2023-06-05 11:51:44 -07:00
Richard Yao	0f03a41161	Use __attribute__((malloc)) on memory allocation functions This informs the C compiler that pointers returned from these functions do not alias other functions, which allows it to do better code optimization and should make the compiled code smaller. References: https://stackoverflow.com/a/53654773 https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute https://clang.llvm.org/docs/AttributeReference.html#malloc Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14827	2023-05-26 15:47:52 -07:00
Richard Yao	677c6f8457	btree: Implement faster binary search algorithm This implements a binary search algorithm for B-Trees that reduces branching to the absolute minimum necessary for a binary search algorithm. It also enables the compiler to inline the comparator to ensure that the only slowdown when doing binary search is from waiting for memory accesses. Additionally, it instructs the compiler to unroll the loop, which gives an additional 40% improve with Clang and 8% improvement with GCC. Consumers must opt into using the faster algorithm. At present, only B-Trees used inside kernel code have been modified to use the faster algorithm. Micro-benchmarks suggest that this can improve binary search performance by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when compiling with GCC 12.2. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14866	2023-05-26 10:03:12 -07:00
Alexander Motin	b6fbe61fa6	zil: Add some more statistics. In addition to a number of actual log bytes written, account also a total written bytes including padding and total allocated bytes (bytes <= write <= alloc). It should allow to monitor zil traffic and space efficiency. Add dtrace probe for zil block size selection. Make zilstat report more information and fit it into less width. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14863	2023-05-25 13:51:53 -07:00
Alexander Motin	f63811f072	ZIL: Reduce scope of per-dataset zl_issuer_lock. Before this change ZIL copied all log data while holding the lock. It caused huge lock contention on workloads with many big parallel writes. This change splits the process into two parts: first, zil_lwb_assign() estimates the log space needed for all transactions, and zil_lwb_write_close() allocates blocks and zios while holding the lock, then, after the lock in dropped, zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues the I/Os. Also while there slightly reduce scope of zl_lock. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14841	2023-05-25 09:48:43 -07:00
George Amanakis	482eeef804	Teach zpool scrub to scrub only blocks in error log Added a flag '-e' in zpool scrub to scrub only blocks in error log. A user can pause, resume and cancel the error scrub by passing additional command line arguments -p -s just like a regular scrub. This involves adding a new flag, creating new libzfs interfaces, a new ioctl, and the actual iteration and read-issuing logic. Error scrubbing is executed in multiple txg to make sure pool performance is not affected. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain tulsi.jain@delphix.com Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #8995 Closes #12355	2023-05-18 11:59:42 -07:00
Brian Behlendorf	e34e15ed6d	Add the ability to uninitialize zpool initialize functions well for touching every free byte...once. But if we want to do it again, we're currently out of luck. So let's add zpool initialize -u to clear it. Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12451 Closes #14873	2023-05-18 10:02:20 -07:00
Pawel Jakub Dawidek	d0d91f185e	Don't use dmu_buf_is_dirty() for unassigned transaction. The dmu_buf_is_dirty() call doesn't make sense here for two reasons: 1. txg is 0 for unassigned tx, so it was a no-op. 2. It is equivalent of checking if we have dirty records and we are doing this few lines earlier. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:06:57 -07:00
Pawel Jakub Dawidek	bd8c6bd66f	Deny block cloning is dbuf size doesn't match BP size. I don't know an easy way to shrink down dbuf size, so just deny block cloning into dbufs that don't match our BP's size. This fixes the following situation: 1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB. 2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB. 3. Truncate the large file to 0. Its dbuf will remain 2kB. 4. Clone the small file into the large file. Small file's BP lsize is 1kB, but the large file's dbuf is 2kB. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:06:52 -07:00
Pawel Jakub Dawidek	555ef90c5c	Additional block cloning fixes. Reimplement some of the block cloning vs dbuf logic, mostly to fix situation where we clone a block and in the same transaction group we want to partially overwrite the clone. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14825	2023-05-11 16:06:36 -07:00
Brian Behlendorf	903c3613d4	Add dmu_tx_hold_append() interface Provides an interface which callers can use to declare a write when the exact starting offset in not yet known. Since the full range being updated is not available only the first L0 block at the provided offset will be prefetched. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14819	2023-05-09 09:03:10 -07:00
George Amanakis	6839ec6f10	Enable the head_errlog feature to remove errors In case check_filesystem() does not error out and does not report an error, remove that error block from error lists and logs without requiring a scrub. This can happen when the original file and all snapshots/clones referencing it have been removed. Otherwise zpool status will still report that "Permanent errors have been detected..." without actually reporting any of them. To implement this change the functions introduced in corrective receive were modified to take into account the head_errlog feature. Before this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: Permanent errors have been detected in the following files: ============================= After this change: ============================= pool: test state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 /home/user/vdev_a ONLINE 0 0 2 errors: No known data errors ============================= Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14813	2023-05-09 08:53:27 -07:00
Matthew Ahrens	3095ca91c2	Verify block pointers before writing them out If a block pointer is corrupted (but the block containing it checksums correctly, e.g. due to a bug that overwrites random memory), we can often detect it before the block is read, with the `zfs_blkptr_verify()` function, which is used in `arc_read()`, `zio_free()`, etc. However, such corruption is not typically recoverable. To recover from it we would need to detect the memory error before the block pointer is written to disk. This PR verifies BP's that are contained in indirect blocks and dnodes before they are written to disk, in `dbuf_write_ready()`. This way, we'll get a panic before the on-disk data is corrupted. This will help us to diagnose what's causing the corruption, as well as being much easier to recover from. To minimize performance impact, only checks that can be done without holding the spa_config_lock are performed. Additionally, when corruption is detected, the raw words of the block pointer are logged. (Note that `dprintf_bp()` is a no-op by default, but if enabled it is not safe to use with invalid block pointers.) Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14817	2023-05-08 11:20:23 -07:00
Brian Behlendorf	012829df0c	Wrap clang specific pragma Clang specific pragmas need to be wrapped to prevent a build warning when compiling with gcc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14814	2023-05-02 09:21:47 -07:00
Justin Hibbits	5a83f761c7	powerpc64: Support ELFv2 asm on Big Endian FreeBSD/powerpc64 is all ELFv2 since FreeBSD 13, even big endian. The existing sha256 and sha512 asm code assumes that BE is all ELFv1, and LE is ELFv2. Minor changes to add ELFv2 in the BE side gets this working correctly on FreeBSD with latest OpenZFS import. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Justin Hibbits <chmeeedalf@gmail.com> Closes #14779	2023-04-27 12:49:21 -07:00
Brian Behlendorf	0e8a42bbee	Revert "Fix data race between zil_commit() and zil_suspend()" This reverts commit `4c856fb333` to resolve a newly introduced deadlock which in practice in more disruptive that the issue this commit intended to address. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #14775 Closes #14790	2023-04-25 16:40:55 -07:00
Han Gao	6d59d5df98	Add loongarch64 support Add loongarch64 definitions & lua module setjmp asm LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Han Gao <gaohan@uniontech.com> Signed-off-by: WANG Xuerui <xen0n@gentoo.org> Closes #13422	2023-04-25 16:05:45 -07:00
Allan Jude	8eae2d214c	Add support for zpool user properties Usage: zpool set org.freebsd:comment="this is my pool" poolname Tests are based on zfs_set's user property tests. Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Sponsored-by: Beckhoff Automation GmbH & Co. KG. Sponsored-by: Klara Inc. Closes #11680	2023-04-21 10:20:36 -07:00
Richard Yao	135d9a9048	Linux: Suppress -Wordered-compare-function-pointers in tracepoint code Clang points out that there is a comparison against -1, but we cannot fix it because that is from the kernel headers, which we must support. We can workaround this by using a pragma. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Closes #14738	2023-04-20 10:31:01 -07:00
rob-wing	3e4ed4213d	Create zap for root vdev And add it to the AVZ, this is not backwards compatible with older pools due to an assertion in spa_sync() that verifies the number of ZAPs of all vdevs matches the number of ZAPs in the AVZ. Granted, the assertion only applies to #DEBUG builds - still, a feature flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2 Notably, this allows to get/set properties on the root vdev: % zpool set user:prop=value <pool> root-0 Before this commit, it was already possible to get/set properties on top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0): % zpool set user:prop=value <pool> mirror-0 This syntax also applies to the root vdev as it is is of type 'root' with a vdev_id of 0, root-0. The keyword 'root' as an alias for 'root-0'. The following tests have been added: - zpool get all properties from root vdev - zpool set a property on root vdev - verify root vdev ZAP is created Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology Submitted-by: Klara, Inc. Closes #14405	2023-04-20 10:07:56 -07:00
Herb Wartens	71d191ef25	Allow MMP to bypass waiting for other threads At our site we have seen cases when multi-modifier protection is enabled (multihost=on) on our pool and the pool gets suspended due to a single disk that is failing and responding very slowly. Our pools have 90 disks in them and we expect disks to fail. The current version of MMP requires that we wait for other writers before moving on. When a disk is responding very slowly, we observed that waiting here was bad enough to cause the pool to suspend. This change allows the MMP thread to bypass waiting for other threads and reduces the chances the pool gets suspended. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Herb Wartens <hawartens@gmail.com> Closes #14659	2023-04-19 13:22:59 -07:00
Ameer Hamza	719534ca8e	Fix "Detach spare vdev in case if resilvering does not happen" Spare vdev should detach from the pool when a disk is reinserted. However, spare detachment depends on the completion of resilvering, and if resilver does not schedule, the spare vdev keeps attached to the pool until the next resilvering. When a zfs pool contains several disks (25+ mirror), resilvering does not always happen when a disk is reinserted. In this patch, spare vdev is manually detached from the pool when resilvering does not occur and it has been tested on both Linux and FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14722	2023-04-19 09:04:32 -07:00
youzhongyang	23f84d161e	Silence clang warning of flexible array not at end Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14764	2023-04-18 18:10:40 -07:00
youzhongyang	27a82cbb3e	Linux 6.3 compat: Fix memcpy "detected field-spanning write" error Add a new union member of flexible array to dnode_phys_t and use it in the macro so we can silence the memcpy() fortify error. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14737	2023-04-13 09:12:03 -07:00
youzhongyang	d4dc53dad2	Linux 6.3 compat: idmapped mount API changes Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap type for mounts (struct mnt_idmap), we need to detect these changes and make zfs work with the new APIs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14682	2023-04-10 14:15:36 -07:00
Kyle Evans	d0cbd9feaf	module: freebsd: fix aarch64 fpu handling Just like x86, aarch64 needs to use the fpu_kern(9) API around FPU usage, otherwise we panic promptly at boot as soon as ZFS attempts to do checksum benchmarking. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Kyle Evans <kevans@FreeBSD.org> Closes #14715	2023-04-10 12:40:18 -07:00
Rob N	baca06c258	libzfs: add v2 iterator interfaces `f6a0dac84` modified the zfs_iter_* functions to take a new "flags" parameter, and introduced a variety of flags to ask the kernel to limit the results in various ways, reducing the amount of work the caller needed to do to filter out things they didn't need. Unfortunately this change broke the ABI for existing clients (read: older versions of the `zfs` program), and was reverted `399b98198`. `dc95911d2` reintroduced the original patch, with the understanding that a backwards-compatible fix would be made before the 2.2 release branch was tagged. This commit is that fix. This introduces zfs_iter_*_v2 functions that have the new flags argument, and reverts the existing functions to not have the flags parameter, as they were before. The old functions are now reimplemented in terms of the new, with flags set to 0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Original-patch-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Closes #14597	2023-04-10 11:53:02 -07:00
Martin Matuška	a3f82aec93	Miscellaneous FreBSD compilation bugfixes Add missing machine/md_var.h to spl/sys/simd_aarch64.h and spl/sys/simd_arm.h In spl/sys/simd_x86.h, PCB_FPUNOSAVE exists only on amd64, use PCB_NPXNOSAVE on i386 In FreeBSD sys/elf_common.h redefines AT_UID and AT_GID on FreeBSD, we need a hack in vnode.h similar to Linux. sys/simd.h needs to be included early. In zfs_freebsd_copy_file_range() we pass a (size_t )lenp to zfs_clone_range() that expects a (uint64_t ) Allow compiling armv6 world by limiting ARM macros in sha256_impl.c and sha512_impl.c to __ARM_ARCH > 6 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by: Signed-off-by: WHR <msl0000023508@gmail.com> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #14674	2023-04-06 10:35:02 -07:00
Tino Reichardt	6ecdd35bdb	Fix "Add colored output to zfs list" Running `zfs list -o avail rpool` resulted in a core dump. This commit will fix this. Run the needed overhead only, when `use_color()` is true. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #14712	2023-04-05 09:57:01 -07:00
youzhongyang	c5431f1465	linux 6.3 compat: needs REQ_PREFLUSH \| REQ_OP_WRITE Modify bio_set_flush() so if kernel version is >= 4.10, flags REQ_PREFLUSH and REQ_OP_WRITE are set together. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #14695	2023-03-31 09:46:22 -07:00
George Amanakis	431083f75b	Fixes in persistent error log Address the following bugs in persistent error log: 1) Check nested clones, eg "fs->snap->clone->snap2->clone2". 2) When deleting files containing error blocks in those clones (from "clone" the example above), do not break the check chain. 3) When deleting files in the originating fs before syncing the errlog to disk, do not break the check chain. This happens because at the time of introducing the error block in the error list, we do not have its birth txg and the head filesystem. If the original file is deleted before the error list is synced to the error log (which is when we actually lookup the birth txg and the head filesystem), then we do not have access to this info anymore and break the check chain. The most prominent change is related to achieving (3). We expand the spa_error_entry_t structure to accommodate the newly introduced zbookmark_err_phys_t structure (containing the birth txg of the error block).Due to compatibility reasons we cannot remove the zbookmark_phys_t structure and we also need to place the new structure after se_avl, so it is not accounted for in avl_find(). Then we modify spa_log_error() to also provide the birth txg of the error block. With these changes in place we simplify the previously introduced function get_head_and_birth_txg() (now named get_head_ds()). We chose not to follow the same approach for the head filesystem (thus completely removing get_head_ds()) to avoid introducing new lock contentions. The stack sizes of nested functions (as measured by checkstack.pl in the linux kernel) are: check_filesystem [zfs]: 272 (was 912) check_clones [zfs]: 64 We also introduced two new tests covering the above changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14633	2023-03-28 16:51:58 -07:00
Kevin Jin	65d10bd87c	Fix short-lived txg caused by autotrim Current autotrim causes short-lived txg through: 1. calling txg_wait_synced() in metaslab_enable() 2. calling txg_wait_open() with should_quiesce = true This patch addresses all the issues mentioned above. A new cv, vdev_autotrim_kick_cv is added to kick autotrim activity. It will be signaled once a txg is synced so that it does not change the original autotrim pace. Also because it is a cv, the wait is interruptible which speeds up the vdev_autotrim_stop_wait() call. Finally, combining big zfs_txg_timeout, txg_wait_open() also causes delay when exporting a pool. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: jxdking <lostking2008@hotmail.com> Issue #8993 Closes #12194	2023-03-28 08:43:41 -07:00
Rich Ercolani	ae0b1f66c7	linux 6.3 compat: add another bdev_io_acct case Linux 6.3+, and backports from it (6.2.8+), changed the signatures on bdev_io_{start,end}_acct. Add a case for it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14658 Closes #14668	2023-03-27 11:29:19 -07:00
Rich Ercolani	0ad5f43442	Drop lying to the compiler in the fletcher4 code This is probably the uncontroversial part of #13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14649	2023-03-24 10:29:19 -07:00
George Wilson	460d887c43	panic loop when removing slog device There is a window in the slog removal code where a panic loop could ensue if the system crashes during that operation. The original design of slog removal did not persisted any state because the removal happened synchronously. This was changed by a later commit which persisted the vdev_removing flag and exposed this bug. If a slog removal is in progress and happens to crash after persisting the vdev_removing flag to the label but before the vdev is removed from the spa config, then the pool will continue to panic on import. Here's a sample of the panic: [ 134.387411] VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22) [ 134.393865] PANIC at dmu.c:1135:dmu_write() [ 134.396035] Kernel panic - not syncing: VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22) [ 134.397857] CPU: 2 PID: 5914 Comm: txg_sync Kdump: loaded Tainted: P OE 5.4.0-1100-dx2023020205-b3751f8c2-azure #106 [ 134.407938] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 [ 134.407938] Call Trace: [ 134.407938] dump_stack+0x57/0x6d [ 134.407938] panic+0xfb/0x2d7 [ 134.407938] spl_panic+0xcf/0x102 [spl] [ 134.407938] ? traverse_impl+0x1ca/0x420 [zfs] [ 134.407938] ? dmu_object_alloc_impl+0x3b4/0x3c0 [zfs] [ 134.407938] ? dnode_hold+0x1b/0x20 [zfs] [ 134.407938] dmu_write+0xc3/0xd0 [zfs] [ 134.407938] ? space_map_alloc+0x55/0x80 [zfs] [ 134.407938] metaslab_sync+0x61a/0x830 [zfs] [ 134.407938] ? queued_spin_unlock+0x9/0x10 [zfs] [ 134.407938] vdev_sync+0x72/0x190 [zfs] [ 134.407938] spa_sync_iterate_to_convergence+0x160/0x250 [zfs] [ 134.407938] spa_sync+0x2f7/0x670 [zfs] [ 134.407938] txg_sync_thread+0x22d/0x2d0 [zfs] [ 134.407938] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 134.407938] thread_generic_wrapper+0x83/0xa0 [spl] [ 134.407938] kthread+0x104/0x140 [ 134.407938] ? kasan_check_write.constprop.0+0x10/0x10 [spl] [ 134.407938] ? kthread_park+0x90/0x90 [ 134.457802] ret_from_fork+0x1f/0x40 This change no longer persists the vdev_removing flag when removing slog devices and also cleans up some code that was added which is not used. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #14652	2023-03-24 10:27:07 -07:00
Tino Reichardt	80f2cdcd67	Add more ANSI colors to libzfs Reviewed-by: WHR <msl0000023508@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ethan Coe-Renner <coerenner1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #14621	2023-03-24 10:21:19 -07:00
Pawel Jakub Dawidek	ce0e1cc402	Fix cloning into already dirty dbufs. Undirty the dbuf and destroy its buffer when cloning into it. Coverity ID: CID-1535375 Reported-by: Richard Yao Reported-by: Benjamin Coddington Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14655	2023-03-24 10:18:35 -07:00
Tino Reichardt	0f9e735414	Remove unused constant EdonR256_BLOCK_BITSIZE Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #14650	2023-03-22 08:39:48 -07:00
Attila Fülöp	5f3611121d	spl: cmn_err_once() should be usable in brace-less if else statements Commit `11913870` (#14567) added cmn_err_once() by #define'ing a compound statement but failed to consider usage in a single statement brace-less if else. Fix the problem by using the common "do {} while (0)" construct. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14629	2023-03-15 11:13:25 -07:00
WHR	d74f123045	Refactor CONFIG_SPE check on Linux/powerpc Commit `5401472` adds a check to call enable_kernel_spe and disable_kernel_spe only if CONFIG_SPE is defined. Refactor this check in a way similar to what CONFIG_ALTIVEC and CONFIG_VSX are checked, in order to remove redundant kfpu_begin() and kfpu_end() implementations. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #14623	2023-03-15 10:30:42 -07:00
WHR	f31b0d4e88	Fix missing semicolons in commit `1f196e3` Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #14623	2023-03-15 10:30:02 -07:00
Tino Reichardt	fe6a7b787f	Remove unused Edon-R variants This commit removes the edonr_byteorder.h file and all unused variants of Edon-R. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13618	2023-03-14 15:59:58 -07:00
Richard Yao	dbfc622345	nvpair: Use flexible array member for nvpair name strings Coverity reported possible out-of-bounds reads from doing `((char *)(nvp) + sizeof (nvpair_t))` to get the nvpair name string. These were initially marked as false positives, but since we are now using C99 flexible array members elsewhere, we could use them here too as cleanup to make the code easier to understand. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-977165) Reported-by: Coverity (CID-1524109) Reported-by: Coverity (CID-1524642) Closes #14612	2023-03-14 15:25:55 -07:00
Richard Yao	d1807f168e	nvpair: Constify string functions After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This lead to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14612	2023-03-14 15:25:50 -07:00
Richard Yao	47b994049f	Silence clang static analyzer warnings about stored stack addresses Clang's static analyzer complains that nvs_xdr() and nvs_native() functions return pointers to stack memory. That is technically true, but the pointers are stored in stack memory from the caller's stack frame, are not read by the caller and are deallocated when the caller returns, so this is harmless. We set the pointers to NULL to silence the warnings. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14612	2023-03-14 15:25:01 -07:00
Tino Reichardt	3a03c96381	Replace dead opensolaris.org license links The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #14625	2023-03-14 14:44:01 -07:00
Attila Fülöp	78289b8458	zcommon: Refactor FPU state handling in fletcher4 Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in #14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS. To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14600	2023-03-14 09:45:28 -07:00
Pawel Jakub Dawidek	67a1b03791	Implementation of block cloning for ZFS Block Cloning allows to manually clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13392	2023-03-10 11:59:53 -08:00
Low-power	589f59b52a	Workaround for Linux PowerPC GPL-only cpu_has_feature() Linux since 4.7 makes interface 'cpu_has_feature' to use jump labels on powerpc if CONFIG_JUMP_LABEL_FEATURE_CHECKS is enabled, in this case however the inline function references GPL-only symbol 'cpu_feature_keys'. ZFS currently uses 'cpu_has_feature' either directly or indirectly from several places; while it is unknown how this issue didn't break ZFS on 64-bit little-endian powerpc, it is known to break ZFS with many Linux versions on both 32-bit and 64-bit big-endian powerpc. Until this issue is fixed in Linux, we have to workaround it by overriding affected inline functions without depending on 'cpu_feature_keys'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #14590	2023-03-10 09:35:00 -08:00
Richard Yao	0b831cabc6	Suppress Clang Static Analyzer warning about SNPRINTF_BLKPTR() Clang's static analyzer pointed out that if we can pass a -1 array index to copyname[copies] if there are no valid DVAs. This is an absurd situation, but it suggests that we are missing an assertion, so we add it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:26 -08:00
Alexander Motin	a8d83e2a24	More adaptive ARC eviction Traditionally ARC adaptation was limited to MRU/MFU distribution. But for years people with metadata-centric workload demanded mechanisms to also manage data/metadata distribution, that in original ZFS was just a FIFO. As result ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits etc, but it all required manual tuning, was not adaptive and in its heart remained a bad FIFO. This change removes most of existing eviction logic, rewriting it from scratch. This makes MRU/MFU adaptation individual for data and meta- data, same as the distribution between data and metadata themselves. Since most of required states separation was already done, it only required to make arcs_size state field specific per data/metadata. The adaptation logic is still based on previous concept of ghost hits, just now it balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata. To simplify arc_c changes instead of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd and arc_pm, representing ARC balance between metadata and data, MRU and MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed point fractions. Since we care about the math result only when need to evict, this moves all the logic from arc_adapt() to arc_evict(), that reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag from many places. This change also removes number of metadata specific tunables, part of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduced single opaque knob zfs_arc_meta_balance, tuning ARC's reaction on ghost hits, allowing administrator give more or less preference to metadata without setting strict limits. Some of old code parts like arc_evict_meta() are just removed, because since introduction of ABD ARC they really make no sense: only headers referenced by small number of buffers are not evictable, and they are really not evictable no matter what this code do. Instead just call arc_prune_async() if too much metadata appear not evictable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14359	2023-03-08 11:17:23 -08:00
Low-power	1f196e3107	Fix build for Linux/powerpc without CONFIG_ALTIVEC or CONFIG_VSX This fixes building ZFS for Linux 4.7+ powerpc* architecture, where Linux was configured without CONFIG_ALTIVEC or CONFIG_VSX. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <msl0000023508@gmail.com> Closes #14591	2023-03-07 14:06:52 -08:00
Rob N	b988f32c70	Better handling for future crypto parameters The intent is that this is like ENOTSUP, but specifically for when something can't be done because we have no support for the requested crypto parameters; eg unlocking a dataset or receiving a stream encrypted with a suite we don't support. Its not intended to be recoverable without upgrading ZFS itself. If the request could be made to work by enabling a feature or modifying some other configuration item, then some other code should be used. load-key: In the future we might have more crypto suites (ie new values for the `encryption` property. Right now trying to load a key on such a future crypto suite will look up suite parameters off the end of the crypto table, resulting in misbehaviour and/or crashes (or, with debug enabled, trip the assertion in `zio_crypt_key_unwrap`). Instead, lets check the value we got from the dataset, and if we can't handle it, abort early. recv: When receiving a raw stream encrypted with an unknown crypto suite, `zfs recv` would report a generic `invalid backup stream` (EINVAL). While technically correct, its not super helpful, so lets ship a more specific error code and message. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #14577	2023-03-07 14:05:14 -08:00
Attila Fülöp	1191387012	spl: Add cmn_err_once() to log a message only on the first call Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14567	2023-03-07 13:44:11 -08:00
Tino Reichardt	f9f9bef22f	Update BLAKE3 for using the new impl handling This commit changes the BLAKE3 implementation handling and also the calls to it from the ztest command. Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:27 -08:00
Tino Reichardt	4c5fec01a4	Add generic implementation handling and SHA2 impl The skeleton file module/icp/include/generic_impl.c can be used for iterating over different implementations of algorithms. It is used by SHA256, SHA512 and BLAKE3 currently. The Solaris SHA2 implementation got replaced with a version which is based on public domain code of cppcrypto v0.10. These assembly files are taken from current openssl master: - sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64) - sha512-x86_64.S: x64, AVX, AVX2 (x86_64) - sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm) - sha512-armv7.S: ARMv7, NEON (arm) - sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64) - sha512-armv8.S: ARMv7, ARMv8-CE (aarch64) - sha256-ppc.S: Generic PPC64 LE/BE (ppc64) - sha512-ppc.S: Generic PPC64 LE/BE (ppc64) - sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) - sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:21 -08:00
Tino Reichardt	cef1253135	Add SHA2 SIMD feature tests for Linux These are added: - zfs_neon_available() for arm and aarch64 - zfs_sha256_available() for arm and aarch64 - zfs_sha512_available() for aarch64 - zfs_shani_available() for x86_64 Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Co-Authored-By: Sebastian Gottschall <s.gottschall@dd-wrt.com> Closes #13741	2023-03-02 13:52:04 -08:00
Tino Reichardt	589143c225	Add SHA2 SIMD feature tests for FreeBSD These are added: - zfs_neon_available() for arm and aarch64 - zfs_sha256_available() for arm and aarch64 - zfs_sha512_available() for aarch64 - zfs_shani_available() for x86_64 Changes: - simd_powerpc.h: change license from CDDL to BSD Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:51:56 -08:00
Tino Reichardt	3e254aaad0	Remove old or redundant SHA2 files We had three sha2.h headers in different places. The FreeBSD version, the Linux version and the generic solaris version. The only assembly used for acceleration was some old x86-64 openssl implementation for sha256 within the icp module. For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS, these files got removed also. Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:50:21 -08:00
Alexander Motin	5f42d1dbf2	System-wide speculative prefetch limit. With some pathological access patterns it is possible to make ZFS accumulate almost unlimited amount of speculative prefetch ZIOs. Combined with linear ABD allocations in RAIDZ code, it appears to be possible to exhaust system KVA, triggering kernel panic. Address this by introducing a system-wide counter of active prefetch requests and blocking prefetch distance doubling per stream hits if the number of active requests is higher that ~6% of ARC size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14516	2023-03-01 15:27:40 -08:00
Richard Yao	4c856fb333	Fix data race between zil_commit() and zil_suspend() openzfsonwindows/openzfs#206 found that it is possible to trip `VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed by the scheduler long enough for a parallel `zil_suspend()` operation to exit `zil_commit_impl()`. This is a data race. To prevent this, we introduce a `zilog->zl_suspend_lock` rwlock to ensure that all outstanding `zil_commit()` operations finish before `zil_suspend()` begins and that subsequent operations fallback to `txg_wait_synced()` after `zil_suspend()` has begun. On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers from writer starvation. This means that a ZIL intensive system can delay `zil_suspend()` indefinitely. This is a pre-existing problem that affects everything that uses rw locks, so it needs to be addressed in the SPL. However, builds against `PREEMPT_RT` Linux kernels are currently broken due to a GPL symbol issue (#11097), so we can safely disregard that issue for now. Reported-by: Arun KV <arun.kv@datacore.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14514	2023-03-01 13:23:09 -08:00
Richard Yao	2f76797ad9	Linux: Assert mutex is held in mutex_exit() A spurious mutex_exit() in a development branch caused weird issues until I identified it. An assertion prior to mutex_exit() would have caught it. Rather than adding assertions before invocations of mutex_exit() in the code, let us simply add an assertion to mutex_exit(). It is cheap and will likely improve developer productivity. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Closes #14541	2023-02-28 17:27:20 -08:00
George Amanakis	13ff72ba0a	Revert zfeature_active() to static Commit `34ce4c4` made zfeature_active() non-static. This is not required. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14546	2023-02-28 14:03:52 -08:00
Richard Yao	bff26b0220	Skip memory allocation when compressing holes Hole detection in the zio compression code allows us to opportunistically skip compression on holes. We can go a step further by not doing memory allocations on holes either. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14500	2023-02-27 14:41:02 -08:00
Dimitry Andric	bf1bec394e	Use .section .rodata instead of .rodata on FreeBSD In commit `0a5b942d4` the FreeBSD SECTION_STATIC macro was set to ".rodata". This assembler directive is supported by LLVM (as a convenience alias for ".section .rodata") by not by GNU as. This caused the FreeBSD builds that are done with gcc to fail. Therefore, use ".section .rodata" instead, similar to the other asm_linkage.h headers. Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Dimitry Andric <dimitry@andric.com> Closes #14526	2023-02-24 16:45:48 -08:00
Brian Behlendorf	89cd2197b9	Fix buffered/direct/mmap I/O race When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13608 Closes #14498	2023-02-23 10:57:24 -08:00
Brian Behlendorf	57cfae4a2f	zdb: zero-pad checksum output follow up Apply zero padding for checksums consistently. The SNPRINTF_BLKPTR macro was not updated in commit `ac7648179c` which results in the `cli_root/zdb/zdb_checksum.ksh` test case reliably failing. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14497	2023-02-15 09:06:29 -08:00

1 2 3 4 5 ...

2258 Commits