mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-14 06:16:17 +03:00

Author	SHA1	Message	Date
Akash B	b6e8db509d	zpool/zfs: Add '-a\|--all' option to scrub, trim, initialize Add support for the '-a \| --all' option to perform trim, scrub, and initialize operations on all pools. Previously, specifying a pool name was mandatory for these operations. With this enhancement, users can now execute these operations across all pools at once, without needing to manually iterate over each pool from the command line. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Akash B <akash-b@hpe.com> Closes #17524	2025-07-29 14:50:44 -07:00
Rob Norris	00ce064d8f	spa: update blkptr diagram to include vdev padding on encrypted blocks Probably just an oversight in `4d044c4c1d`. SPA_VDEVBITS is always 24, regardless of whether or not the bp is for an encrypted block, and it wouldn't make sense for it to be different anyway. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17564	2025-07-24 09:50:23 -04:00
Rob Norris	9292071565	linux/kmem: remove HAVE_ATOMIC64_T and kmem_alloc_used wrappers Seems like we haven't set it since the SPL was pulled into the main ZFS tree. In removing the define, I've taken the 64-bit version (ie the one that _hasn't_ been running since back then) because it looks like its closer to the intended width by the way its used. Since the macros ar eno longer needed as a selector, pull those too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:08:07 -07:00
Rob Norris	1c483cf3d0	linux/kmem: remove long-obsolete __GFP compat flags Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:07:53 -07:00
Rob Norris	96d20d7d59	linux/kmem: remove PF_FSTRANS and PF_MEMALLOC_NOIO compat Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:07:36 -07:00
shodanshok	a7a144e655	enforce arc_dnode_limit Linux kernel shrinker in the context of null/root memcg does not scan dentry and inode caches added by a task running in non-root memcg. For ZFS this means that dnode cache routinely overflows, evicting valuable meta/data and putting additional memory pressure on the system. This patch restores zfs_prune_aliases as fallback when the kernel shrinker does nothing, enabling zfs to actually free dnodes. Moreover, it (indirectly) calls arc_evict when dnode_size > dnode_limit. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #17487 Closes #17542	2025-07-21 10:32:01 -07:00
Alexander Motin	be1e991a1a	Allow and prefer special vdevs as ZIL Before this change ZIL blocks were allocated only from normal or SLOG vdevs. In typical situation when special vdevs are SSDs and normal are HDDs it could cause weird inversions when data blocks are written to SSDs, but ZIL referencing them to HDDs. This change assumes that special vdevs typically have much better (or at least not worse) latency than normal, and so in absence of SLOGs should store ZIL blocks. It means similar to normal vdevs introduction of special embedded log allocation class and updating the allocation fallback order to: SLOG -> special embedded log -> special -> normal embedded log -> normal. The code tries to guess whether data block is going to be written to normal or special vdev (it can not be done precisely before compression) and prefer indirect writes for blocks written to a special vdev to avoid double-write. For blocks that are going to be written to normal vdev, special vdev by default plays as SLOG, reducing write latency by the cost of higher special vdev wear, but it is tunable via module parameter. This should allow HDD pools with decent SSD as special vdev to work under synchronous workloads without requiring additional SLOG SSD, impractical in many scenarios. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17505	2025-07-18 18:44:14 -07:00
Chunwei Chen	2669b00f13	Define sops->free_inode() to prevent use-after-free during lookup On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can be dereferenced without refcounts and locks. For this reason, dentry and inode must only be freed after RCU grace period. However, zfs currently frees inode in zfs_inode_destroy synchronously and we can't use GPL-only call_rcu() in zfs directly. Fortunately, on Linux 5.2 and after, if we define sops->free_inode(), the kernel will do call_rcu() for us. This issue may be triggered more easily with init_on_free=1 boot parameter: BUG: kernel NULL pointer dereference, address: 0000000000000020 RIP: 0010:selinux_inode_permission+0x10e/0x1c0 Call Trace: ? show_trace_log_lvl+0x1be/0x2d9 ? show_trace_log_lvl+0x1be/0x2d9 ? show_trace_log_lvl+0x1be/0x2d9 ? security_inode_permission+0x37/0x60 ? __die_body.cold+0x8/0xd ? no_context+0x113/0x220 ? exc_page_fault+0x6d/0x130 ? asm_exc_page_fault+0x1e/0x30 ? selinux_inode_permission+0x10e/0x1c0 security_inode_permission+0x37/0x60 link_path_walk.part.0.constprop.0+0xb5/0x360 ? path_init+0x27d/0x3c0 path_lookupat+0x3e/0x1a0 filename_lookup+0xc0/0x1d0 ? __check_object_size.part.0+0x123/0x150 ? strncpy_from_user+0x4e/0x130 ? getname_flags.part.0+0x4b/0x1c0 vfs_statx+0x72/0x120 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120 __do_sys_newlstat+0x39/0x70 ? __x64_sys_ioctl+0x8d/0xd0 do_syscall_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x62/0xc7 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com> Closes #17546	2025-07-18 08:45:13 -07:00
Rob Norris	d323fbf49c	FreeBSD: zfs_putpages: don't undirty pages until after write completes In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (`238eab7dc1`) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17533	2025-07-15 15:58:15 -07:00
Mark Johnston	ee2a2d941a	Revert "FreeBSD: zfs_putpages: don't undirty pages until after write completes" This causes async putpages to leave the pages sbusied for a long time, which hurts concurrency. Revert for now until we have a better approach. This reverts commit `238eab7dc1`. Reported by: Ihor Antonov <ngor@hugpoint.tech> Discussed with: Rob Norris <rob.norris@klarasystems.com> References: freebsd/freebsd-src@738a9a7 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Ported-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17533	2025-07-15 15:58:11 -07:00
Rob Norris	fce18e04d5	libzpool: tunable-based option interface for zdb/ztest Removes the old dlsym() based option setter and adds a new function handle_tunable_option() that can set, get and list all the tunables in the system. And then wire it up to zdb and ztest. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17537	2025-07-15 15:47:03 -07:00
Rob Norris	3a494c6d2a	mod.h: make consistent across all three platforms mod.h only exists to include the platform-specific mod_os.h, so we can get rid of it and just call the platform header mod.h. Then, create a libspl mod.h, and move the relevant items to it so we can start building on it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17537	2025-07-15 15:46:14 -07:00
Paul Dagnelie	a981cb69e4	Implement dynamic gang header sizes ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17004	2025-07-09 14:02:53 -07:00
Paul Dagnelie	e845be28e7	Add no-upgrade featureflag Adds a featureflag that is not enabled during upgrades unless listed explicitly. This is useful for features that could cause issues unless applied carefully; for example, a feature that could make a root pool unbootable if bootloaders don't yet have support for it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17004	2025-07-09 14:01:59 -07:00
Rob Norris	6af8db61b1	metaslab: don't pass whole zio to throttle reserve APIs They only need a couple of fields, and passing the whole thing just invites fiddling around inside it, like modifying flags, which then makes it much harder to understand the zio state from inside zio.c. We move the flag update to just after a successful throttle in zio.c. Rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED Better describes what it means, and makes it look less like IO_IS_ALLOCATING, which means something different. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17508	2025-07-04 23:22:22 -04:00
Rob Norris	92d3b4ee2c	zio: rename `io_reexecute` as `io_post`; use it for the direct IO checksum error flag We're not supposed to modify someone else's io_flags, so we need another way to propagate DIO_CHKSUM_ERR. If we squint, we can see that io_reexecute is really just recording exceptional events that a parent (or its parents) will need to do something about. It just happens that the only things we've had historically are two forms of reexecution: now or later (suspend). So, rename it to io_post, as in, post-IO info/events/actions. And now we have a few spare bits for other conditions. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17507	2025-07-04 23:16:14 -04:00
Alexander Motin	4e92aee233	Relax special_small_blocks restrictions special_small_blocks is applied to blocks after compression, so it makes no sense to demand its values to be power of 2. At most they could be multiple of 512, but that would still buy us nothing, so lets allow them be any within SPA_MAXBLOCKSIZE. Also special_small_blocks does not really need to depend on the set recordsize, enabled pool features or presence of special vdev. At worst in any of those cases it will just do nothing, so we should not complicate users lives by artificial limitations. While there, polish comments for recordsize and volblocksize. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17497	2025-07-02 11:11:37 -07:00
Olivier Certner	dee62e074a	spa: ZIO_TASKQ_ISSUE: Use symbolic priority This allows to change the meaning of priority differences in FreeBSD without requiring code changes in ZFS. This upstreams commit fd141584cf89d7d2 from FreeBSD src. Sponsored-by: The FreeBSD Foundation Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Olivier Certner <olce@FreeBSD.org> Closes #17489	2025-06-30 10:24:23 -04:00
Rob Norris	46a4075100	Linux 6.16: remove writepage and readahead_page Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17443	2025-06-23 15:51:02 -04:00
Rob Norris	238eab7dc1	FreeBSD: zfs_putpages: don't undirty pages until after write completes zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17445	2025-06-12 14:45:18 -07:00
Alexander Motin	e0ef4d2768	Improve block cloning transactions accounting Previous dmu_tx_count_clone() was broken, stating that cloning is similar to free. While they might be from some points, cloning is not net-free. It will likely consume space and memory, and unlike free it will do it no matter whether the destination has the blocks or not (usually not, so previous code did nothing). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17431	2025-06-11 11:59:16 -07:00
Rob Norris	560e3170ef	dsl_dataset: rename dmu_objset_clone* to dsl_dataset_clone* And make its check and sync functions visible, so I can hook them up to zcp_synctask. Rename not strictly necessary, but it definitely looks more like a dsl_dataset thing than a dmu_objset thing, to the extent that those things even have a meaningful distinction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17426	2025-06-10 14:52:43 -07:00
Attila Fülöp	b96f1a4b1f	Linux build: silence objtool warnings After #17401 the Linux build produces some stack related warnings. Silence them with the `STACK_FRAME_NON_STANDARD` macro. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Attila Fülöp <attila@fueloep.org> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17410	2025-06-04 17:40:09 -07:00
Alexander Motin	68817d28c5	Include class name into struct metaslab_class With increasing number of metaslab classes it can be helpful for debugging to know what we are looking at. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17409	2025-06-03 11:12:59 -04:00
Alexander Motin	108562344c	Improve allocation fallback handling Before this change in case of any allocation error ZFS always fallen back to normal class. But with more of different classes available we migth want more sophisticated logic. For example, it makes sense to fall back from dedup first to special class (if it is allowed to put DDT there) and only then to normal, since in a pool with dedup and special classes populated normal class likely has performance characteristics unsuitable for dedup. This change implements general mechanism where fallback order is controlled by the same spa_preferred_class() as the initial class selection. And as first application it implements the mentioned dedup->special->normal fallbacks. I have more plans for it later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17391	2025-05-31 19:12:16 -04:00
Fedor Uporov	e1677d9ee1	ZVOL: Make zvol_prefetch_bytes module parameter platform-independent The module parameter now is represented in FreeBSD sysctls list with name: 'vfs.zfs.vol.prefetch_bytes'. The default value is 131072, same as on Linux side. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17385	2025-05-31 09:58:54 -04:00
Rob Norris	e8e602d987	zio_add_child: collapse into a single function The child locking difference is simple enough to handle with a boolean. The actual work is more involved, and it's easy to forget to change things in both places when experimenting. Just collapse them. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17382	2025-05-30 21:18:10 -04:00
Rob Norris	44e3266894	events: include zio type in IO error reports Usually the IO type can be inferred from the other fields (in particular, priority and flags) sometimes it's not easy to see. This is just another little debug helper. May 27 2025 00:54:54.024110493 ereport.fs.zfs.data class = "ereport.fs.zfs.data" ena = 0x1f5ecfae600801 ... zio_delta = 0x0 zio_type = 0x2 [WRITE] zio_priority = 0x3 [ASYNC_WRITE] zio_objset = 0x0 Document zio_type and zio_priority. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17381	2025-05-30 10:29:29 -04:00
Fedor Uporov	3dfa98d013	ZVOL: Make zvol_inhibit_dev module parameter platform-independent The module parameter now is represented in FreeBSD sysctls list with name: 'vfs.zfs.vol.inhibit_dev'. The default value is '0', same as on Linux side. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17384	2025-05-29 09:37:41 -04:00
Rob Norris	9392be427e	tunables: remove __check_old_set_param workaround This was fully removed from Linux in 4.15, so we won't be seeing it again. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	dd4e2f99f0	tunables: remove unused param get/set aliases Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	589d99171f	tunables: use Linux ullong param ops for u64 Since 3.17 Linux has provided param ops for 64-bit ints, so we don't need to use our own anymore. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	6e7e7ea7ef	tunables: remove support for s64 tunables Nothing uses them now. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	7b183f1918	tunables: remove FreeBSD compat macros for Linux module params Nothing in any FreeBSD code uses them. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	4653e2f7d3	dmu_tx: break tx assign/wait when pool suspends This adjusts dmu_tx_assign/dmu_tx_wait to be interruptable if the pool suspends while they're waiting, rather than just on the initial check before falling back into a wait. Since that's not always wanted, add a DMU_TX_SUSPEND flag to ignore suspend entirely, effectively returning to the previous behaviour. With that, it shouldn't be possible for anything with a standard dmu_tx_assign/wait/abort loop to block under failmode=continue. Also should be a bit tighter than the old behaviour, where a VERIFY0(dmu_tx_assign(DMU_TX_WAIT)) could technically fail if the pool is already suspended and failmode=continue. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17355	2025-05-28 10:28:51 -07:00
Rob Norris	ac2e579521	dmu_tx: make DMU_TX_* flags an enum Mostly for a little more type checking and debugging visibility. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17355	2025-05-28 10:28:46 -07:00
Rob Norris	468d22d60c	txg_wait_synced_flags: add TXG_WAIT_SUSPEND flag to not wait if pool suspended This allows a caller to request a wait for txg sync, with an appropriate error return if the pool is suspended or becomes suspended during the wait. To support this, txg_wait_kick() is added to signal the sync condvar, which wakes up the waiters, causing them to loop and reconsider their wait conditions again. zio_suspend() now calls this to trigger the break if the pool suspends while waiting. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17355	2025-05-28 10:27:46 -07:00
Rob Norris	06fa8f3f69	zfs_cmd: reorganise zfs_cmd_t to match original size `2aa3fbe761` extended zinject_record_t, and in doing so inadvertently extended zfs_cmd_t, which broke compatibility with userspace tools without the change. This fixes that by using some of the unused space in zfs_cmd_t for the extra fields. We also add an assert to trigger a compile error if the size ever changes. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17367	2025-05-27 20:01:06 -04:00
Ameer Hamza	2a91d577b1	Expose dataset encryption status via fast stat path In truenas_pylibzfs, we query list of encrypted datasets several times, which is expensive. This commit exposes a public API zfs_is_encrypted() to get encryption status from fast stat path without having to refresh the properties. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17368	2025-05-26 22:11:03 -04:00
Alexander Motin	d5616ad34a	Increase meta-dnode redundancy in "some" mode Loss of one indirect block of the meta dnode likely means loss of the whole dataset. It is worse than one file that the man page promises, and in my opinion is not much better than "none" mode. This change restores redundancy of the meta-dnode indirect blocks, while same time still corrects expectations in the man page. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17339	2025-05-16 13:23:32 -04:00
Alexander Motin	89a8a91582	ARC: Notify dbuf cache about target size reduction ARC target size might drop significantly under memory pressure, especially if current ARC size was much smaller than the target. Since dbuf cache size is a fraction of the target ARC size, it might need eviction too. Aside of memory from the dbuf eviction itself, it might help ARC by making more buffers evictable. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17314	2025-05-14 10:34:14 -04:00
Alexander Motin	0aa83dce99	Linux: Stop using NR_FILE_PAGES for ARC scaling I've found that QEMU/KVM guest memory accounted as shared also included into NR_FILE_PAGES. But it is actually a non-evictable anonymous memory. Using it as a base for zfs_arc_pc_percent parameter makes ARC to ignore shrinker requests while page cache does not really have anything to evict, ending up in OOM killer killing the QEMU process. Instead use of NR_ACTIVE_FILE + NR_INACTIVE_FILE should represent the part of a page cache that is actually evictable, which should be safer to use as a reference for ARC scaling. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17334	2025-05-14 09:29:02 -04:00
Alexander Motin	734eba251d	Wire O_DIRECT also to Uncached I/O (#17218 ) Before Direct I/O was implemented, I've implemented lighter version I called Uncached I/O. It uses normal DMU/ARC data path with some optimizations, but evicts data from caches as soon as possible and reasonable. Originally I wired it only to a primarycache property, but now completing the integration all the way up to the VFS. While Direct I/O has the lowest possible memory bandwidth usage, it also has a significant number of limitations. It require I/Os to be page aligned, does not allow speculative prefetch, etc. The Uncached I/O does not have those limitations, but instead require additional memory copy, though still one less than regular cached I/O. As such it should fill the gap in between. Considering this I've disabled annoying EINVAL errors on misaligned requests, adding a tunable for those who wants to test their applications. To pass the information between the layers I had to change a number of APIs. But as side effect upper layers can now control not only the caching, but also speculative prefetch. I haven't wired it to VFS yet, since it require looking on some OS specifics. But while there I've implemented speculative prefetch of indirect blocks for Direct I/O, controllable via all the same mechanisms. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Fixes #17027 Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-05-13 14:26:55 -07:00
Rob Norris	b2284aedab	metaslab_alloc: make hint BP and DVA const (#17324 ) Nothing modifies them, and nothing should, so lets try to enforce that. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru>	2025-05-12 10:52:46 -07:00
Alexander Motin	49fbdd4533	Introduce zfs rewrite subcommand (#17246 ) This allows to rewrite content of specified file(s) as-is without modifications, but at a different location, compression, checksum, dedup, copies and other parameter values. It is faster than read plus write, since it does not require data copying to user-space. It is also faster for sync=always datasets, since without data modification it does not require ZIL writing. Also since it is protected by normal range range locks, it can be done under any other load. Also it does not affect file's modification time or other properties. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-05-12 10:22:17 -07:00
Rob Norris	2ee5b51a57	linux/uio: remove "skip" offset for UIO_ITER For UIO_ITER, we are just wrapping a kernel iterator. It will take care of its own offsets if necessary. We don't need to do anything, and if we do try to do anything with it (like advancing the iterator by the skip in zfs_uio_advance) we're just confusing the kernel iterator, ending up at the wrong position or worse, off the end of the memory region. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17298	2025-05-11 12:46:40 -04:00
Fedor Uporov	1a8f5ad3b0	zvol: Enable zvol threading functionality on FreeBSD Make zvol I/O requests processing asynchronous on FreeBSD side in some cases. Clone zvol threading logic and required module parameters from Linux side. Make zvol threadpool creation/destruction logic shared for both Linux and FreeBSD. The IO requests are processed asynchronously in next cases: - volmode=geom: if IO request thread is geom thread or cannot sleep. - volmode=cdev: if IO request passed thru struct cdevsw .d_strategy routine, mean is AIO request. In all other cases the IO requests are processed synchronously. The volthreading zvol property is ignored on FreeBSD side. Sponsored-by: vStack, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: @ImAwsumm Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17169	2025-05-08 15:25:40 -04:00
Alan Somers	f13d760aa8	Delete dead code: dbuf_loan_arcbuf It's been dead ever since `5fa356ea44` Sponsored by: ConnectWise Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17119	2025-05-08 10:34:11 -04:00
Rob Norris	4800181b3b	ioctl: remove FICLONE/FICLONERANGE/FIDEDUPERANGE compat These are only required to support these ioctls on Linux <4.5. Since 4.18 is our cutoff, we don't need this code anymore. Also removing related test things that will never match again. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17308	2025-05-08 10:32:52 -04:00
Paul Dagnelie	246e5883bb	Implement allocation size ranges and use for gang leaves (#17111 ) When forced to resort to ganging, ZFS currently allocates three child blocks, each one third of the size of the original. This is true regardless of whether larger allocations could be made, which would allow us to have fewer gang leaves. This improves performance when fragmentation is high enough to require ganging, but not so high that all the free ranges are only just big enough to hold a third of the recordsize. This is also useful for improving the behavior of a future change to allow larger gang headers. We add the ability for the allocation codepath to allocate a range of sizes instead of a single fixed size. We then use this to pre-allocate the DVAs for the gang children. If those allocations fail, we fall back to the normal write path, which will likely re-gang. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-05-02 15:32:18 -07:00

1 2 3 4 5 ...

2582 Commits