mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Rob Norris	a1832d1ecb	config: remove HAVE_KERNEL_GET_ACL_HANDLE_CACHE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	9b6f93a72f	config: remove HAVE_GROUP_INFO_GID Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	bbc52ed501	config: remove HAVE_CPU_HOTPLUG Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	afcc0fb0fa	config: remove HAVE_1ARG_SUBMIT_BIO Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	9a1c7240ba	config: remove HAVE_RENAME2_OPERATIONS_WRAPPER Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	230bc538cb	config: remove HAVE_VFS_FILE_OPERATIONS_EXTEND Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	9914684d36	config: remove HAVE_NEW_SYNC_READ Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	733317966f	config: remove HAVE_XATTR_(GET\|SET\|LIST)_DENTRY Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	eb73000dbb	config: remove HAVE_WAIT_ON_BIT_ACTION Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	a987057c67	config: remove HAVE_VFS_DIRECT_IO_ITER_RW_OFFSET Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	c9e8d0e0b5	config: remove HAVE_PUT_LINK_NAMEIDATA Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	99c143a5a1	config: remove HAVE_FOLLOW_LINK_NAMEIDATA Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	ed048fdc5b	config: remove HAVE_D_REVALIDATE_NAMEIDATA Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	ec6ba977b7	config: remove HAVE_3ARGS_VFS_GETATTR Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	1a64c06ec0	config: remove SHRINK_CONTROL_HAS_NID Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	72be1f4062	config: remove HAVE_VFS_RW_ITERATE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	f3d30f1ce0	config: remove HAVE_USER_NS_COMMON_INUM Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	b545b07b2f	config: remove HAVE_SPLIT_SHRINKER_CALLBACK and HAVE_SINGLE_SHRINKER_CALLBACK Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	d60d4ad809	config: remove HAVE_SET_CACHED_ACL_USABLE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	6840e3b18b	config: remove HAVE_SET_ACL Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	3d37b1d6d4	config: remove HAVE_POSIX_ACL_RELEASE and HAVE_POSIX_ACL_RELEASE_GPL_ONLY Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:50 -07:00
Rob Norris	583e2e25b9	config: remove HAVE_PERCPU_COUNTER_INIT_WITH_GFP Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	f07485c46e	config: remove HAVE_LINUX_BLK_CGROUP_HEADER Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	d4bbe2ff38	config: remove HAVE_IO_SCHEDULE_TIMEOUT Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	714d7666e5	config: remove HAVE_INODE_SET_FLAGS Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	cf006e3496	config: remove HAVE_GENERIC_WRITE_CHECKS_KIOCB Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	7af642af4d	config: remove HAVE_FSYNC_RANGE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	525f06b5f6	config: remove HAVE_FALLOC_FL_ZERO_RANGE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	f70ffacdfc	config: remove HAVE_ENCODE_FH_WITH_INODE Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	8e002ee26e	config: remove HAVE_D_PRUNE_ALIASES Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	92f7ec6075	config: remove HAVE_DIRTY_INODE_WITH_FLAGS Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:49 -07:00
Rob Norris	233bed67a8	config: remove HAVE_1ARG_BIO_END_IO_T Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16479	2024-09-18 11:23:40 -07:00
Alexander Motin	ac04407ffe	Remove extra newline from spa_set_allocator(). zfs_dbgmsg() does not need newline at the end of the message. While there, slightly update/sync FreeBSD __dprintf(). Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16536	2024-09-17 13:15:42 -07:00
Brian Atkinson	a10e552b99	Adding Direct IO Support Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads. O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests. This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will not be synced until the associated TXG is synced. For both O_DIRECT read and write request the offset and request sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are not, then EINVAL is returned unless the direct property is set to always (see below). For O_DIRECT writes: The request also must be block aligned (recordsize) or the write request will take the normal (buffered) write path. In the event that request is block aligned and a cached copy of the buffer in the ARC, then it will be discarded from the ARC forcing all further reads to retrieve the data from disk. For O_DIRECT reads: The only alignment restrictions are PAGE_SIZE alignment. In the event that the requested data is in buffered (in the ARC) it will just be copied from the ARC into the user buffer. For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in the event that file contents are mmap'ed. In this case, all requests that are at least PAGE_SIZE aligned will just fall back to the buffered paths. If the request however is not PAGE_SIZE aligned, EINVAL will be returned as always regardless if the file's contents are mmap'ed. Since O_DIRECT writes go through the normal ZIO pipeline, the following operations are supported just as with normal buffered writes: Checksum Compression Encryption Erasure Coding There is one caveat for the data integrity of O_DIRECT writes that is distinct for each of the OS's supported by ZFS. FreeBSD - FreeBSD is able to place user pages under write protection so any data in the user buffers and written directly down to the VDEV disks is guaranteed to not change. There is no concern with data integrity and O_DIRECT writes. Linux - Linux is not able to place anonymous user pages under write protection. Because of this, if the user decides to manipulate the page contents while the write operation is occurring, data integrity can not be guaranteed. However, there is a module parameter `zfs_vdev_direct_write_verify` that controls the if a O_DIRECT writes that can occur to a top-level VDEV before a checksum verify is run before the contents of the I/O buffer are committed to disk. In the event of a checksum verification failure the write will return EIO. The number of O_DIRECT write checksum verification errors can be observed by doing `zpool status -d`, which will list all verification errors that have occurred on a top-level VDEV. Along with `zpool status`, a ZED event will be issues as `dio_verify` when a checksum verification error occurs. ZVOLs and dedup is not currently supported with Direct I/O. A new dataset property `direct` has been added with the following 3 allowable values: disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request. standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used. always - Treats every write/read IO request as though it passed O_DIRECT and will do O_DIRECT if the alignment restrictions are met otherwise will redirect through the ARC. This property will not allow a request to fail. There is also a module parameter zfs_dio_enabled that can be used to force all reads and writes through the ARC. By setting this module parameter to 0, it mimics as if the direct dataset property is set to disabled. Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Matt Macy <mmacy@FreeBSD.org> Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov> Closes #10018	2024-09-14 13:47:59 -07:00
Tino Reichardt	1713aa7b4d	Remove set but not used variable in ddt.c (#16522 ) module/zfs/ddt.c:2612:6: error: variable 'total' set but not used Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-09-10 12:46:50 -07:00
Alan Somers	308f7c2f14	Fix an uninitialized data access (#16511 ) zfs_acl_node_alloc allocates an uninitialized data buffer, but upstack zfs_acl_chmod only partially initializes it. KMSAN reported that this memory remained uninitialized at the point when it was read by lzjb_compress, which suggests a possible kernel memory disclosure bug. The full KMSAN warning may be found in the PR. https://github.com/openzfs/zfs/pull/16511 Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored by: Axcient Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-09-10 09:08:45 -07:00
Rob Norris	8be2f4c3d2	zio_resume: log when unsuspending the pool (#16485 ) When reviewing logs after a failure, its useful to see where unsuspend/resume was requested. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-09-09 17:21:20 -07:00
Rob Norris	b109925820	spa_prop_get: require caller to supply output nvlist All callers to spa_prop_get() and spa_prop_get_nvlist() supplied their own preallocated nvlist (except ztest), so we can remove the option to have them allocate one if none is supplied. This sidesteps a bug in spa_prop_get(), where the error var wasn't initialised, which could lead to the provided nvlist being freed at the end. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16505	2024-09-06 08:45:58 -07:00
Rob Norris	82ff9aafd6	value strings: pretty printers for flags and enums This adds zfs_valstr, a collection of pretty printers for bitfields and enums. These are useful in debugging, logging and other display contexts where raw values are difficult for the untrained (or even trained!) eye to decipher. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-09-05 13:40:05 -07:00
Don Brady	d4d79451cb	Add DDT prune command Requires the new 'flat' physical data which has the start time for a class entry. The amount to prune can be based on a target percentage of the unique entries or based on the age (i.e., every entry older than N days). Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16277	2024-09-04 14:17:02 -07:00
Rob Norris	4a4f7b019f	zdb: rework dedup accounting for log, quota and prune The simplest thing first: add the FDT and log objects to the list of objects to be considered when checking for leaks. The rest is based on a conceptual change in all of this patch stack: a block on disk with a 'D' bit is not necessarily in the DDT at all (pruned), or in the DDT ZAPs (still on the log). As such, walking the DDT up front is difficult (for all the reasons that walking an unflushed log is difficult) and not really useful, since it's not a reflection of what's on disk anyway. Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #16277	2024-09-04 14:16:42 -07:00
Seth Hoffert	bf8c61f489	Remove unused sysctl node PR #14953 removed vdev-level read cache but accidentally left this sysctl node behind. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Seth Hoffert <seth.hoffert@gmail.com> Closes #16493	2024-09-03 17:52:33 -07:00
Rob Norris	50b32cb925	fm: pass io_flags through events & zed as uint64_t In `4938d01db` (#14086) zio_flag_t was converted from an enum (generally signed 32-bit) to a uint64_t. The corresponding change wasn't made to the error reporting subsystem, limiting the error flags being delivered to zed to 32 bits. This bumps the whole pipeline to use uint64s. A tiny bit of compatibility is added for newer zed working agsinst an older kernel module, because its easy to do and misdetecting scrub/resilver errors and taking action is potentially dangerous. Making it work for new kernel modules against older zed seems to be far more invasive for far less benefit, so I have not. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16469	2024-08-26 17:39:13 -07:00
Jitendra Patidar	73866cf346	Fix issig() to check signal_pending after dequeue SIGSTOP/SIGTSTP When process got SIGSTOP/SIGTSTP, issig() dequeue them and return 0. But process could still have another signal pending after dequeue. So, after dequeue, check and return 1, if signal_pending. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #16464	2024-08-26 17:36:49 -07:00
Mateusz Piotrowski	6be8bf5552	zpool: Provide GUID to zpool-reguid(8) with -g (#16239 ) This commit extends the zpool-reguid(8) command with a -g flag, which allows the user to specify the GUID to set. This change also adds some general tests for zpool-reguid(8). Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-26 09:27:24 -07:00
Rob Norris	2420ee6e12	spl-taskq: fix task counts for delayed and cancelled tasks Dispatched delayed tasks were not added to tasks_total, and cancelled tasks were not removed. This notably could make tasks_total go to UNIT64_MAX, but just generally meant the count could be wrong. So lets not! Sponsored-by: Klara, Inc. Sponsored-by: Syneto Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16473	2024-08-23 10:40:45 -07:00
Low-power	34118eac06	Make mount.zfs(8) calling zfs_mount_at for legacy mounts as well Commit 329e2ffa4bca456e65c3db7f5c5c04931c551b61 has made mount.zfs(8) to call libzfs function 'zfs_mount_at', in order to propagate dataset properties into mount options. This fix however, is limited to a special use case where mount.zfs(8) is used in initrd with option '-o zfsutil'. If either initrd or the user need to use mount.zfs(8) to mount a file system with 'mountpoint' set to 'legacy', '-o zfsutil' can't be used and the original issue #7947 will still happen. Since the existing code already excluded the possibility of calling 'zfs_mount_at' when it was invoked as a helper program from zfs(8), by checking 'ZFS_MOUNT_HELPER' environment variable, it makes no sense to avoid calling 'zfs_mount_at' without '-o zfsutil'. An exception however, is when mount.zfs(8) was invoked with '-o remount' to update the mount options for an existing mount point. In this case call mount(2) directly without modifying the mount options passed from command line. Furthermore, don't run mount.zfs(8) helper for automounting snapshot. The above change to make mount.zfs(8) to call 'zfs_mount_at' apparently caused it to trigger an automount for the snapshot directory. When the helper was invoked as a result of a snapshot automount, an infinite recursion will occur. Since the need of invoking user mode mount(8) for automounting was to overcome that the 'vfs_kern_mount' being GPL-only, just run mount(8) without the mount.zfs(8) helper by adding option '-i'. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: WHR <whr@rivoreo.one> Closes #16393	2024-08-23 10:39:09 -07:00
Rob Norris	a9c94bea9f	zio_compress_data: limit dest length to ABD size Some callers (eg `do_corrective_recv()`) pass in a dest buffer much smaller than the wanted 87.5% of the source buffer, because the incoming abd is larger than the source data and they "know" what the decompressed size with be. However, `abd_borrow_buf()` rightly asserts if we try to borrow more than is available, so these callers fail. Previously when all we had was a dest buffer, we didn't know how big it was, so we couldn't do anything. Now we have a dest abd, with a size, so we can clamp dest size to the abd size. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	f62e6e1f98	compress: change zio_compress API to use ABDs This commit changes the frontend zio_compress_data and zio_decompress_data APIs to take ABD points instead of buffer pointers. All callers are updated to match. Any that already have an appropriate ABD nearby now use it directly, while at the rest we create an one. Internally, the ABDs are passed through to the provider directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	d3c12383c9	compress: change compression providers API to use ABDs This commit changes the provider compress and decompress API to take ABD pointers instead of buffer pointers for both data source and destination. It then updates all providers to match. This doesn't actually change the providers to do chunked compression, just changes the API to allow such an update in the future. Helper macros are added to easily adapt the ABD functions to their buffer-based implementations. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	522816498c	compress: standardise names of compression functions This is mostly to make searching easier. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	e119483a95	compress: remove zio_decompress_data_buf Nothing uses it anymore! Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	5eede0d5fd	compress: rework callers to always use the zio_compress calls This will make future refactoring easier. There are two we can't change for the moment, because zio_compress_data does hole detection & collapsing which zio_decompress_data does not actually know how to handle. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	ba2209ec9e	abd_get_from_buf_struct: wrap existing buf with ABD stored on stack This allows a simple "wrapping" ABD for an existing linear buffer to be allocated on the stack, avoiding an allocation. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2024-08-22 16:22:24 -07:00
Rob Norris	5b9e695392	abd_os: break out platform-specific header parts Removing the platform #ifdefs from shared headers in favour of per-platform headers. Makes abd_t much leaner, among other things. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:37:18 -07:00
Rob Norris	7a5b4355e2	abd_os: split userspace and Linux kernel code The Linux abd_os.c serves double-duty as the userspace scatter abd implementation, by carrying an emulation of kernel scatterlists. This commit lifts common and userspace-specific parts out into a separate abd_os.c for libzpool. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:37:13 -07:00
Rob Norris	2b7d9a7863	zio: no alloc canary in userspace Makes it harder to use memory debuggers like valgrind directly, because they can't see canary overruns. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:37:07 -07:00
Rob Norris	b3f4e4e1ec	abd: remove ABD_FLAG_ZEROS Nothing ever checks it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16253	2024-08-21 13:36:24 -07:00
shodanshok	bbe8512a93	Ignore zfs_arc_shrinker_limit in direct reclaim mode zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse due to excessive memory reclaim. However, when the kernel is in direct reclaim mode (ie: low on memory), limiting ARC reclaim increases OOM risk. This is especially true on system without (or with inadequate) swap. This patch ignores zfs_arc_shrinker_limit when the kernel is in direct reclaim mode, avoiding most OOM. It also restores "echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop (almost) all ARC. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #16313	2024-08-21 10:00:33 -07:00
Ameer Hamza	a2c4e95cfd	linux/zvol_os.c: cleanup limits for non-blk mq case Rob Noris suggested that we could clean up redundant limits for the case of non-blk mq scenario. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-20 17:16:08 -07:00
Ameer Hamza	8e6a9aabb1	linux/zvol_os.c: Fix max_discard_sectors limit for 6.8+ kernel In kernels 6.8 and later, the zvol block device is allocated with qlimits passed during initialization. However, the zvol driver does not set `max_hw_discard_sectors`, which is necessary to properly initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting `max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolve the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16462	2024-08-20 17:14:44 -07:00
Rob Norris	816d2b2bfc	spl-proc: remove old taskq stats These had minimal useful information for the admin, didn't work properly in some places, and knew far too much about taskq internals. With the new stats available, these should never be needed anymore. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:50:45 -07:00
Rob Norris	3f8fd3cae0	spl-taskq: summary stats for all taskqs This adds /proc/spl/kstats/taskq/summary, which attempts to show a useful subset of stats for all taskqs in the system. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:50:41 -07:00
Rob Norris	db40fe4cf6	spl-taskq: per-taskq kstats This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq, one file per taskq, named for the taskq name.instance. These include a small amount of info about the taskq config, the current state of the threads and queues, and various counters for thread and queue activity since the taskq was created. To assist with decrementing queue size counters, the list an entry is on is encoded in spare bits in the entry flags. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:50:35 -07:00
Rob Norris	f0ad031cd9	spl-generic: bring up kstats subsystem before taskq For spl-taskq to use the kstats infrastructure, it has to be available first. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Syneto Closes #16171	2024-08-19 09:49:28 -07:00
Chunwei Chen	06a7b123ac	Skip ro check for snaps when multi-mount Skip ro check for snapshots since they are always ro regardless if ro flag is passed by mount or not. This allows multi-mounting snapshots without requiring to specify ro flag. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #16299	2024-08-19 09:42:17 -07:00
shodanshok	77a797a382	Enable L2 cache of all (MRU+MFU) metadata but MFU data only `l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU data and metadata. However it can be useful to cache as much metadata as possible while, at the same time, restricting data cache to MFU buffers only. This patch allow for such behavior by setting `l2arc_mfuonly` to 2 (or higher). The list of possible values is the following: 0: cache both MRU and MFU for both data and metadata; 1: cache only MFU for both data and metadata; 2: cache both MRU and MFU for metadata, but only MFU for data. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #16343 Closes #16402	2024-08-16 13:34:07 -07:00
Rob Norris	0d2707815d	ddt: lookup and log stats Adds per-DDT stats counting lookups and where they were serviced from (either log or backing zap), number of log entries in memory, and flow rates. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:51 -07:00
Rob Norris	a1902f4950	ddt: block scan until log is flushed, and flush aggressively The dedup log does not have a stable cursor, so its not possible to persist our current scan location within it across pool reloads. Beccause of this, when walking (scanning), we can't treat it like just another source of dedup entries. Instead, when a scan is wanted, we switch to an aggressive flushing mode, pushing out entries older than the scan start txg as fast as we can, before starting the scan proper. Entries after the scan start txg will be handled via other methods; the DDT ZAPs and logs will be written as normal, and blocks not seen yet will be offered to the scan machinery as normal. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:43 -07:00
Rob Norris	cd69ba3d49	ddt: dedup log Adds a log/journal to dedup. At the end of txg, instead of writing the entry directly to the ZAP, instead its adding to an in-memory tree and appended to an on-disk object. The on-disk object is only read at import, to reload the in-memory tree. Lookups first go the the log tree before going to the ZAP, so recently-used entries will remain close by in memory. This vastly reduces overhead from dedup IO, as it will not have to do so many read/update/write cycles on ZAP leaf nodes. A flushing facility is added at end of txg, to push logged entries out to the ZAP. There's actually two separate "logs" (in-memory tree and on-disk object), one active (recieving updated entries) and one flushing (writing out to disk). These are swapped (ie flushing begins) based on memory used by the in-memory log trees and time since we last flushed something. The flushing facility monitors the amount of entries coming in and being flushed out, and calibrates itself to try to flush enough each txg to keep up with the ingest rate without competing too much with other IO. Multiple tuneables are provided to control the flushing facility. All the histograms and stats are update to accomodate the log as a separate entry store. zdb gains knowledge of how to count them and dump them. Documentation included! Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:35 -07:00
Rob Norris	cbb9ef0a4c	ddt: tuneable to override copies= on dedup metadata objects All objects stored in the MOS get copies=3. For a large dedup table, this requires significant extra IO and disk space, when its not really necessary - the dedup table itself isn't needed to read or write data, only to keep data usage down. Losing the dedup table does not render the pool unusable, it just messes up the accounting somewhat. This adds a dmu_ddt_copies tuneable. When set to 0, the existing behaviour is used. When set higher, dedup table blocks (ZAP and log) will have this many copies rather than the usual 3, while indirect blocks will have one more again. This is a tuneable for now mostly for testing. Losing a dedup table can cause blocks to be leaked, and we currently have no facilities to repair that. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:27 -07:00
Rob Norris	592f38900d	ddt: compare keys 64-bits at a time, trying to match ZAP order This yields substantial performance improvements when we only write out some small % of entries at a time, as it will cause entries that will go into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and so touch fewer blocks. Without this, the distribution is an even spread, so we touch a lot more ZAP leaf nodes for any given number of entries. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:03:19 -07:00
Rob Norris	27e9cb5f80	ddt: cleanup the stats & histogram code Both the API and the code were kinda mangled and I was really struggling to follow it. The worst offender was the old ddt_stat_add(); after fixing it up the rest of the changes are mostly knock-on effects and targets of opportunity. Note that the old ddt_stat_add() was safe against overflows - it could produce crazy numbers, but the compiler wouldn't do anything stupid. The assertions in ddt_stat_sub() go a lot of the way to protecting against this; getting in a position where overflows are a problem is definitely a programming error. Also expanding ddt_stat_add() and ddt_histogram_empty() produces less efficient assembly. I'm not bothered about this right now though; these should not be hot functions, and if they are we'll optimise them later. If we have to go back to the old form, we'll comment it like crazy. Finally, I've removed the assertion that the bucket will never be negative, as it will soon be possible to have entries with zero refcounts: an entry for a block that is no longer on the pool, but is on the log waiting to be synced out. It might be better to have a separate bucket for these, since they're still using real space on disk, but ultimately these stats are driving UI, and for now I've chosen to keep them matching how they've looked in the past, as well as match the operators mental model - pool usage is managed elsewhere. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15895	2024-08-16 12:02:56 -07:00
Rob Norris	f4aeb23f52	ddt: add "flat phys" feature Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893	2024-08-16 12:02:39 -07:00
Rob Norris	0ba5f503c5	ddt: slim down ddt_entry_t This slims down the in-memory entry to as small as it can be. The IO-related parts are made into a separate entry, since they're relatively rarely needed. The variable allocation for dde_phys is to support the upcoming flat format. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893	2024-08-16 12:02:31 -07:00
Rob Norris	4d686c3da5	ddt: introduce lightweight entry The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where its difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893	2024-08-16 12:02:22 -07:00
Rob Norris	d17ab631a9	ddt: rework access to phys array slots The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT. This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning. (These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it). Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893	2024-08-16 12:02:02 -07:00
Rob Norris	d63f5d7e50	zdb: rework DDT block count and leak check to just count the blocks The upcoming dedup features break the long held assumption that all blocks on disk with a 'D' dedup bit will always be present in the DDT, or will have the same set of DVA allocations on disk as in the DDT. If the DDT is no longer a complete picture of all the dedup blocks that will be and should be on disk, then it does us no good to walk and prime it up front, since it won't necessarily match up with every block we'll see anyway. Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates. The DDT and BRT are moved ahead of the space accounting. This will become important for the "flat" feature, which may need to count a modified version of the block. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15892	2024-08-16 12:01:41 -07:00
Rob Norris	db2b1fdb79	ddt: add FDT feature and support for legacy and new on-disk formats This is the supporting infrastructure for the upcoming dedup features. Traditionally, dedup objects live directly in the MOS root. While their details vary (checksum, type and class), they are all the same "kind" of thing - a store of dedup entries. The new features are more varied than that, and are better thought of as a set of related stores for the overall state of a dedup table. This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will cause new DDTs to be created as a ZAP in the MOS root, named DDT-<checksum>. The is used as the root object for the normal type/class store objects, but will also be a place for any storage required by new features. This commit adds two new fields to ddt_t, for version and flags. These are intended to describe the structure and features of the overall dedup table, and are stored as-is in the DDT root. In this commit, flags are always zero, but the intent is that they can be used to hang optional logic or state onto for new dedup features. Version is always 1. For a "legacy" dedup table, where no DDT root directory exists, the version will be 0. ddt_configure() is expected to determine the version and flags features currently in operation based on whether or not the fast_dedup feature is enabled, and from what's available on disk. In this way, its possible to support both old and new tables. This also provides a migration path. A legacy setup can be upgraded to FDT by creating the DDT root ZAP, moving the existing objects into it, and setting version and flags appropriately. There's no support for that here, but it would be straightforward to add later and allows the possibility that newer features could be applied to existing dedup tables. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15892	2024-08-16 11:58:59 -07:00
Ameer Hamza	bdf4d6be1d	linux/zvol_os: fix zvol queue limits initialization zvol queue limits initialization depends on `zv_volblocksize`, but it is initialized later, leading to several limits being initialized with incorrect values, including `max_discard_*` limits. This also causes `blkdiscard` command to consistently fail, as `blk_ioctl_discard` reads `bdev_max_discard_sectors()` limits as 0, leading to failure. The fix is straightforward: initialize `zv->zv_volblocksize` early, before setting the queue limits. This PR should fix `zvol/zvol_misc/zvol_misc_trim` failure on recent PRs, as the test case issues `blkdiscard` for a zvol. Additionally, `zvol_misc_trim` was recently enabled in `6c7d41a`, which is why the issue wasn't identified earlier. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16454	2024-08-15 14:29:50 -07:00
Justin Gottula	5807de90a1	Fix null ptr deref when renaming a zvol with snaps and snapdev=visible (#16316 ) If a zvol is renamed, and it has one or more snapshots, and snapdev=visible is true for the zvol, then the rename causes a kernel null pointer dereference error. This has the effect (on Linux, anyway) of killing the z_zvol taskq kthread, with locks still held; which in turn causes a variety of zvol-related operations afterward to hang indefinitely (such as udev workers, among other things). The problem occurs because of an oversight in #15486 (`e36ff84c33`). As documented in dataset_kstats_create, some datasets may not actually have kstats allocated for them; and at least at the present time, this is true for snapshots. In practical terms, this means that for snapshots, dk->dk_kstats will be NULL. The dataset_kstats_rename function introduced in the patch above does not first check whether dk->dk_kstats is NULL before proceeding, unlike e.g. the nearby dataset_kstats_update_* functions. In the very particular circumstance in which a zvol is renamed, AND that zvol has one or more snapshots, AND that zvol also has snapdev=visible, zvol_rename_minors_impl will loop over not just the zvol dataset itself, but each of the zvol's snapshots as well, so that their device nodes will be renamed as well. This results in dataset_kstats_create being called for snapshots, where, as we've established, dk->dk_kstats is NULL. Fix this by simply adding a NULL check before doing anything in dataset_kstats_rename. This still allows the dataset_name kstat value for the zvol to be updated (as was the intent of the original patch), and merely blocks attempts by the code to act upon the zvol's non-kstat-having snapshots. If at some future time, kstats are added for snapshots, then things should work as intended in that case as well. Signed-off-by: Justin Gottula <justin@jgottula.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Alan Somers <asomers@gmail.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-15 14:13:18 -07:00
Tony Hutter	fb432660c3	Linux 6.10 compat: Fix zvol NULL pointer deference zvol_alloc_non_blk_mq()->blk_queue_set_write_cache() needs the disk queue setup to prevent a NULL pointer deference. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16453	2024-08-15 14:05:58 -07:00
Tony Hutter	f2f4ada240	Linux 6.10 compat: fix rpm-kmod and builtin The 6.10 kernel broke our rpm-kmod builds. The 6.10 kernel really wants the source files in the same directory as the object files. This workaround makes rpm-kmod work again. It also updates the builtin kernel codepath to work correctly with 6.10. See kernel commits: b1992c3772e6 kbuild: use $(src) instead of $(srctree)/$(src) for source directory 9a0ebe5011f4 kbuild: use $(obj)/ instead of $(src)/ for common pattern rules Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16439 Closes #16450	2024-08-15 14:00:18 -07:00
Ameer Hamza	963e6c9f3f	Fix incorrect error report on vdev attach/replace Report the correct error message in libzfs when attaching/replacing a vdev with a higher ashift. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16449	2024-08-15 12:39:44 -07:00
Gleb Smirnoff	83f359245a	FreeBSD: fix build without kernel option MAC Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Gleb Smirnoff <glebius@FreeBSD.org> Closes #16446	2024-08-15 09:08:43 -07:00
Jitendra Patidar	d2ccc21552	Fix projid accounting for xattr objects zpool upgraded with 'feature@project_quota' needs re-layout of SA's to fix the SA_ZPL_PROJID at SA_PROJID_OFFSET (128). Its necessary for the correct accounting of object usage against its projid. Old object (created before upgrade) when gets a projid assigned, its SA gets re-layout via sa_add_projid(). If object has xattr dir, SA of xattr dir also gets re-layout. But SA re-layout of xattr objects inside a xattr dir is not done. Fix zfs_setattr_dir() to re-layout SA's on xattr objects, when setting projid on old xattr object (created before upgrade). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #16355 Closes #16356	2024-08-14 17:59:19 -07:00
Paul Dagnelie	244ea5c488	Add missing kstats to dataset kstats Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #16431	2024-08-14 14:18:46 -07:00
Rob Norris	2633075e09	Linux 6.11: avoid passing "end" sentinel to register_sysctl() Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:22 -07:00
Rob Norris	3abffc8781	Linux 6.11: add compat macro for page_mapping() Since the change to folios it has just been a wrapper anyway. Linux has removed their wrapper, so we add one. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:18 -07:00
Rob Norris	f5236fe47a	Linux 6.11: add more queue_limit fields with removed setters These fields are very old, so no detection necessary; we just move them into the limit setup functions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:13 -07:00
Rob Norris	0b741a0351	Linux 6.11: IO stats is now a queue feature flag Apply them with with the rest of the settings. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:08 -07:00
Rob Norris	22619523f6	Linux 6.11: first arg to proc_handler is now const Detect it, and use a macro to make sure we always match the prototype. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:01 -07:00
Rob Norris	e95b732e49	Linux 6.11: enable queue flush through queue limits In 6.11 struct queue_limits gains a 'features' field, where, among other things, flush and write-cache are enabled. Detect it and use it. Along the way, the blk_queue_set_write_cache() compat wrapper gets a little cleanup. Since both flags are alway set together, its now a single bool. Also the very very ancient version that sets q->flush_flags directly couldn't actually turn it off, so I've fixed that. Not that we use it, but still. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:46:41 -07:00
Rob Norris	767b37019f	linux/zvol_os: tidy and document queue limit/config setup It gets hairier again in Linux 6.11, so I want some actual theory of operation laid out for next time. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:45:12 -07:00
Alan Somers	ed0db1cc8b	Make txg_wait_synced conditional in zfsvfs_teardown, for FreeBSD This applies the same change in #9115 to FreeBSD. This was actually the old behavior in FreeBSD 12; it only regressed when FreeBSD support was added to OpenZFS. As far as I can tell, the timeline went like this: * Illumos's zfsvfs_teardown used an unconditional txg_wait_synced * Illumos added the dirty data check [^4] * FreeBSD merged in Illumos's conditional check [^3] * OpenZFS forked from Illumos * OpenZFS removed the dirty data check in #7795 [^5] * @mattmacy forked the OpenZFS repo and began to add FreeBSD support * OpenZFS PR #9115[^1] recreated the same dirty data check that Illumos used, in slightly different form. At this point the OpenZFS repo did not yet have multi-OS support. * Matt Macy merged in FreeBSD support in #8987[^2] , but it was based on slightly outdated OpenZFS code. In my local testing, this vastly improves the reboot speed of a server with a large pool that has 1000 datasets and is resilvering an HDD. [^1]: https://github.com/openzfs/zfs/pull/9115 [^2]: https://github.com/openzfs/zfs/pull/8987 [^3]: `10b9d77bf1` [^4]: `5aaeed5c61` [^5]: https://github.com/openzfs/zfs/pull/7795 Sponsored by: Axcient Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #16268	2024-08-09 14:32:59 -07:00
Rob Norris	b0bf14cdb5	abd: lift ABD zero scan from zio_compress_data() to abd_cmp_zero() It's now the caller's responsibility do special handling for holes if that's something it wants. This also makes zio_compress_data() and zio_decompress_data() properly the inverse of each other. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Jason Lee <jasonlee@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16326	2024-08-09 14:30:26 -07:00
Alexander Motin	3ae05e34e5	Linux: Make zfs_prune() fair on NUMA systems Previous code evicted nr_to_scan items from each NUMA node. This not only multiplied the eviction by the number of nodes, but could exhaust the smaller ones, evicting inodes used by acive workload and requiring their immediate recreation. This patch spreads the requested eviction between all NUMA nodes proportionally to their evictable counts, which should be closer to expected LRU logic. See kernel's super_cache_scan() as a similar logic example. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16397	2024-08-08 15:33:36 -07:00
Alexander Motin	5b9f3b7664	Soften pruning threshold on not evictable metadata Previous code pruned 10% of dnodes once 3/4 of metadata appeared unevictable. On workloads with many millions of dnodes and little other metadata it creates significant load spikes for many seconds straight. This change instead gradually increases pruning as unevictable metadata grow above the 3/4, which may allow it to stabilize at some level. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16401	2024-08-08 15:26:35 -07:00
Alexander Motin	aef452f108	Improve zfs_blkptr_verify() - Skip config lock enter/exit for embedded blocks. They have no DVAs, so there is nothing to check under the lock. - Skip CHECKSUM check and properly check PSIZE for embedded blocks. - Add static branch predictions for unlikely conditions. - Do not verify DVAs for blocks already in ARC. ARC hit already "verified" the first (often the only) DVA, and it does not worth to enter/exit config lock for nothing. Some profiles show me up to 3% of CPU saving from this change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16387	2024-08-08 15:25:10 -07:00
Ameer Hamza	5536c0dee2	Sync AUX label during pool import Spare and l2cache vdev labels are not updated during import. Therefore, if disk paths are updated between pool export and import, the AUX label still shows the old paths. This patch syncs the AUX label during import to show the correct path information. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15817	2024-08-08 15:16:46 -07:00

1 2 3 4 5 ...

4689 Commits