mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-26 12:12:13 +03:00

Author	SHA1	Message	Date
Gleb Smirnoff	83f359245a	FreeBSD: fix build without kernel option MAC Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Gleb Smirnoff <glebius@FreeBSD.org> Closes #16446	2024-08-15 09:08:43 -07:00
Jitendra Patidar	d2ccc21552	Fix projid accounting for xattr objects zpool upgraded with 'feature@project_quota' needs re-layout of SA's to fix the SA_ZPL_PROJID at SA_PROJID_OFFSET (128). Its necessary for the correct accounting of object usage against its projid. Old object (created before upgrade) when gets a projid assigned, its SA gets re-layout via sa_add_projid(). If object has xattr dir, SA of xattr dir also gets re-layout. But SA re-layout of xattr objects inside a xattr dir is not done. Fix zfs_setattr_dir() to re-layout SA's on xattr objects, when setting projid on old xattr object (created before upgrade). Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #16355 Closes #16356	2024-08-14 17:59:19 -07:00
Paul Dagnelie	244ea5c488	Add missing kstats to dataset kstats Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #16431	2024-08-14 14:18:46 -07:00
Rob Norris	2633075e09	Linux 6.11: avoid passing "end" sentinel to register_sysctl() Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:22 -07:00
Rob Norris	3abffc8781	Linux 6.11: add compat macro for page_mapping() Since the change to folios it has just been a wrapper anyway. Linux has removed their wrapper, so we add one. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:18 -07:00
Rob Norris	f5236fe47a	Linux 6.11: add more queue_limit fields with removed setters These fields are very old, so no detection necessary; we just move them into the limit setup functions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:13 -07:00
Rob Norris	0b741a0351	Linux 6.11: IO stats is now a queue feature flag Apply them with with the rest of the settings. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:08 -07:00
Rob Norris	22619523f6	Linux 6.11: first arg to proc_handler is now const Detect it, and use a macro to make sure we always match the prototype. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:47:01 -07:00
Rob Norris	e95b732e49	Linux 6.11: enable queue flush through queue limits In 6.11 struct queue_limits gains a 'features' field, where, among other things, flush and write-cache are enabled. Detect it and use it. Along the way, the blk_queue_set_write_cache() compat wrapper gets a little cleanup. Since both flags are alway set together, its now a single bool. Also the very very ancient version that sets q->flush_flags directly couldn't actually turn it off, so I've fixed that. Not that we use it, but still. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:46:41 -07:00
Rob Norris	767b37019f	linux/zvol_os: tidy and document queue limit/config setup It gets hairier again in Linux 6.11, so I want some actual theory of operation laid out for next time. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #16400	2024-08-13 17:45:12 -07:00
Alan Somers	ed0db1cc8b	Make txg_wait_synced conditional in zfsvfs_teardown, for FreeBSD This applies the same change in #9115 to FreeBSD. This was actually the old behavior in FreeBSD 12; it only regressed when FreeBSD support was added to OpenZFS. As far as I can tell, the timeline went like this: * Illumos's zfsvfs_teardown used an unconditional txg_wait_synced * Illumos added the dirty data check [^4] * FreeBSD merged in Illumos's conditional check [^3] * OpenZFS forked from Illumos * OpenZFS removed the dirty data check in #7795 [^5] * @mattmacy forked the OpenZFS repo and began to add FreeBSD support * OpenZFS PR #9115[^1] recreated the same dirty data check that Illumos used, in slightly different form. At this point the OpenZFS repo did not yet have multi-OS support. * Matt Macy merged in FreeBSD support in #8987[^2] , but it was based on slightly outdated OpenZFS code. In my local testing, this vastly improves the reboot speed of a server with a large pool that has 1000 datasets and is resilvering an HDD. [^1]: https://github.com/openzfs/zfs/pull/9115 [^2]: https://github.com/openzfs/zfs/pull/8987 [^3]: https://github.com/freebsd/freebsd-src/commit/10b9d77bf1ccf2f3affafa6261692cb92cf7e992 [^4]: https://github.com/illumos/illumos-gate/commit/5aaeed5c617553c4cec6328c1f4c19079a5a495a [^5]: https://github.com/openzfs/zfs/pull/7795 Sponsored by: Axcient Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #16268	2024-08-09 14:32:59 -07:00
Rob Norris	b0bf14cdb5	abd: lift ABD zero scan from zio_compress_data() to abd_cmp_zero() It's now the caller's responsibility do special handling for holes if that's something it wants. This also makes zio_compress_data() and zio_decompress_data() properly the inverse of each other. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Jason Lee <jasonlee@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16326	2024-08-09 14:30:26 -07:00
Alexander Motin	3ae05e34e5	Linux: Make zfs_prune() fair on NUMA systems Previous code evicted nr_to_scan items from each NUMA node. This not only multiplied the eviction by the number of nodes, but could exhaust the smaller ones, evicting inodes used by acive workload and requiring their immediate recreation. This patch spreads the requested eviction between all NUMA nodes proportionally to their evictable counts, which should be closer to expected LRU logic. See kernel's super_cache_scan() as a similar logic example. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16397	2024-08-08 15:33:36 -07:00
Alexander Motin	5b9f3b7664	Soften pruning threshold on not evictable metadata Previous code pruned 10% of dnodes once 3/4 of metadata appeared unevictable. On workloads with many millions of dnodes and little other metadata it creates significant load spikes for many seconds straight. This change instead gradually increases pruning as unevictable metadata grow above the 3/4, which may allow it to stabilize at some level. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16401	2024-08-08 15:26:35 -07:00
Alexander Motin	aef452f108	Improve zfs_blkptr_verify() - Skip config lock enter/exit for embedded blocks. They have no DVAs, so there is nothing to check under the lock. - Skip CHECKSUM check and properly check PSIZE for embedded blocks. - Add static branch predictions for unlikely conditions. - Do not verify DVAs for blocks already in ARC. ARC hit already "verified" the first (often the only) DVA, and it does not worth to enter/exit config lock for nothing. Some profiles show me up to 3% of CPU saving from this change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16387	2024-08-08 15:25:10 -07:00
Ameer Hamza	5536c0dee2	Sync AUX label during pool import Spare and l2cache vdev labels are not updated during import. Therefore, if disk paths are updated between pool export and import, the AUX label still shows the old paths. This patch syncs the AUX label during import to show the correct path information. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #15817	2024-08-08 15:16:46 -07:00
Mark Johnston	dea8fabf73	FreeBSD: Fix RLIMIT_FSIZE handling for block cloning ZFS implements copy_file_range(2) using block cloning when possible. This implementation must respect the RLIMIT_FSIZE limit. zfs_clone_range() already checks the limit, so it is safe to remove this check in zfs_freebsd_copy_file_range(). Moreover, the removed check produces false positives: the length passed to copy_file_range(2) may be larger than the input file size; as the man page notes, "for best performance, call copy_file_range() with the largest len value possible." In particular, some existing code passes SSIZE_MAX there. The check in zfs_clone_range() clamps the length to the input file's size before checking, but the removed check uses the caller supplied length, so something like $ echo a > /tmp/foo $ limits -f 1024 cat /tmp/foo > /tmp/bar fails because FreeBSD's cat(1) uses copy_file_range(2) in the manner described above. Reported-by: Philip Paeps <philip@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-08 10:43:02 -07:00
Rob Norris	670147be53	zvol: ensure device minors are properly cleaned up Currently, if a minor is in use when we try to remove it, we'll skip it and never come back to it again. Since the zvol state is hung off the minor in the kernel, this can get us into weird situations if something tries to use it after the removal fails. It's even worse at pool export, as there's now a vestigial zvol state with no pool under it. It's weirder again if the pool is subsequently reimported, as the zvol code (reasonably) assumes the zvol state has been properly setup, when it's actually left over from the previous import of the pool. This commit attempts to tackle that by setting a flag on the zvol if its minor can't be removed, and then checking that flag when a request is made and rejecting it, thus stopping new work coming in. The flag also causes a condvar to be signaled when the last client finishes. For the case where a single minor is being removed (eg changing volmode), it will wait for this signal before proceeding. Meanwhile, when removing all minors, a background task is created for each minor that couldn't be removed on the spot, and those tasks then wake and clean up. Since any new tasks are queued on to the pool's spa_zvol_taskq, spa_export_common() will continue to wait at export until all minors are removed. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14872 Closes #16364	2024-08-06 12:08:14 -07:00
Rob Norris	88aab1d2d0	linux/zvol_os: fix SET_ERROR with negative return codes SET_ERROR is our facility for tracking errors internally. The negation is to match the what the kernel expects from us. Thus, the negation should happen outside of the SET_ERROR. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16364	2024-08-06 12:07:31 -07:00
Rob Norris	6c82951d11	FreeBSD: remove support for FreeBSD < 13.0-RELEASE (#16372 ) This includes the last 12.x release (now EOL) and 13.0 development versions (<1300139). Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-05 16:56:45 -07:00
Alexander Motin	cdd53fea1e	FreeBSD: Add missing memory reclamation accounting Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-05 09:21:29 -07:00
Brian Atkinson	c8184d714b	Block cloning conditionally destroy ARC buffer dmu_buf_will_clone() calls arc_buf_destroy() if there is an associated ARC buffer with the dbuf. However, this can only be done conditionally. If the previous dirty record's dr_data is pointed at db_dbf then destroying it can lead to NULL pointer deference when syncing out the previous dirty record. This updates dmu_buf_fill_clone() to only call arc_buf_destroy() if the previous dirty records dr_data is not pointing to db_buf. The block clone wil still set the dbuf's db_buf and db_data to NULL, but this will not cause any issues as any previous dirty record dr_data will still be pointing at the ARC buffer. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #16337	2024-08-01 18:22:43 -07:00
Tino Reichardt	c092bddfe7	Fix sa.c to build on FreeBSD again. (#16403 ) Fix multiple build errors on FreeBSD. The main reason is, that the variable 'dxattr_obj' is used uninitialized within the start of the 'out label'. Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org>	2024-08-01 13:04:08 -07:00
Jitendra Patidar	d60debbf59	Fix sa_add_projid to lookup and update SA_ZPL_DXATTR (avoid DXATTR loss) (#16288 ) sa_add_projid() gets called via zfs_setattr() for setting project id on old file/dir, which were created before upgrading to project quota feature. This function does lookup for all possible SA and update them all together along with project ID at needed fixed offset. But its missing lookup and update of SA_ZPL_DXATTR, effectively it losses SA_ZPL_DXATTR. Closes #16287 Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>	2024-07-31 18:41:49 -07:00
c1ick	ec580bc520	zfs: add bounds checking to zil_parse (#16308 ) Make sure log record don't stray beyond valid memory region. There is a lack of verification of the space occupied by fixed members of lr_t in the zil_parse. We can create a crafted image to trigger an out of bounds read by following these steps: 1) Do some file operations and reboot to simulate abnormal exit without umount 2) zil_chain.zc_nused: 0x1000 3) First lr_t lr_t.lrc_txtype: 0x0 lr_t.lrc_reclen: 0x1000-0xb8-0x1 lr_t.lrc_txg: 0x0 lr_t.lrc_seq: 0x1 4) Update checksum in zil_chain.zc_eck Fix: Add some checks to make sure the remaining bytes are large enough to hold an log record. Signed-off-by: XDTG <click1799@163.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-31 17:17:04 -07:00
Alexander Motin	d4b5517ef9	Linux: Report reclaimable memory to kernel as such (#16385 ) Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to mark memory allocations that can be freed via shinker calls. It should allow kernel to tune and group such allocations for lower memory fragmentation and better reclamation under pressure. This patch marks as reclaimable most of ARC memory, directly evictable via ZFS shrinker, plus also dnode/znode/sa memory, indirectly evictable via kernel's superblock shrinker. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com>	2024-07-30 11:40:47 -07:00
Rob Norris	d54d0fff39	dnode: allow storage class to be overridden by object type spa_preferred_class() selects a storage class based on (among other things) the DMU object type. This only works for old-style object types that match only one specific kind of thing. For DMU_OTN_ types we need another way to signal the storage class. This commit allows the object type to be overridden in the IO policy for the purposes of choosing a storage class. It then adds the ability to set the storage type on a dnode hold, such that all writes generated under that hold will get it. This method has two shortcomings: - it would be better if we could "name" a set of storage class preferences rather than it being implied by the object type. - it would be better if this info were stored in the dnode on disk. In the absence of those things, this seems like the smallest possible change. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15894	2024-07-29 17:05:41 -07:00
Rob Norris	e26b3771ee	spa_preferred_class: pass the entire zio Rather than picking out specific values out of the properties, just pass the entire zio in, to make it easier in the future to use more of that info to decide on the storage class. I would have rathered just pass io_prop in, but having spa.h include zio.h gets a bit tricky. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15894	2024-07-29 17:05:08 -07:00
Alexander Motin	ed87d456e4	Skip dnode handles use when not needed Neither FreeBSD nor Linux currently implement kmem_cache_set_move(), which means dnode_move() is never called. In such situation use of dnode handles with respective locking to access dnode from dbuf is a waste of time for no benefit. This patch implements optional simplified code for such platforms, saving at least 3 dnode lock/dereference/unlock per dbuf life cycle. Originally I hoped to drop the handles completely to save memory, but they are still used in dnodes allocation code, so left for now. Before this change in CPU profiles of some workloads I saw 4-20% of CPU time spent in zrl_add_impl()/zrl_remove(), which are gone now. Reviewed-by: Rob Wing <rob.wing@klarasystems.com Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16374	2024-07-29 14:48:12 -07:00
Alexander Motin	1a3e32e6a2	Cleanup DB_DNODE() macros usage - Use the macros in few places it was missed. - Reduce scope of DB_DNODE_ENTER/EXIT() and inline some DB_DNODE() uses to make it more obvious what exactly is protected there and make unprotected accesses by mistake more difficult. - Make use of zrl_owner(). Reviewed-by: Rob Wing <rob.wing@klarasystems.com Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16374	2024-07-29 14:47:01 -07:00
Allan Jude	62e7d3c89e	ddt: add support for prefetching tables into the ARC This change adds a new `zpool prefetch -t ddt $pool` command which causes a pool's DDT to be loaded into the ARC. The primary goal is to remove the need to "warm" a pool's cache before deduplication stops slowing write performance. It may also provide a way to reload portions of a DDT if they have been flushed due to inactivity. Sponsored-by: iXsystems, Inc. Sponsored-by: Catalogics, Inc. Sponsored-by: Klara, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Will Andrews <will.andrews@klarasystems.com> Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Will Andrews <will.andrews@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Closes #15890	2024-07-26 09:16:18 -07:00
Rob Norris	7ddc1f737f	zil: add stats for commit failure/fallback (#16315 ) There's no good way to tell when a ZIL commit fails and falls back to a transaction sync, other than perhaps a throughput drop. This adds counters so we can see when it happens and why. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-25 16:53:59 -07:00
Alexander Motin	55427add3c	Several improvements to ARC shrinking (#16197 ) - When receiving memory pressure signal from OS be more strict trying to free some memory. Otherwise kernel may come again and request much more. Return as result how much arc_c was actually reduced due to this request, that may be less than requested. - On Linux when receiving direct reclaim from some file system (that may be ZFS) instead of ignoring request completely, just shrink the ARC, but do not wait for eviction. Waiting there may cause deadlock. Ignoring it as before may put extra pressure on other caches and/or swap, and cause OOM if nothing help. While not waiting may result in more ARC evicted later, and may be too late if OOM killer activate right now, but I hope it to be better than doing nothing at all. - On Linux set arc_no_grow before waiting for reclaim, not after, or it may grow back while we are waiting. - On Linux add new parameter zfs_arc_shrinker_seeks to balance ARC eviction cost, relative to page cache and other subsystems. - Slightly update Linux arc_set_sys_free() math for new kernels. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-25 10:31:14 -07:00
Allan Jude	c7ada64bb6	ddt: dedup table quota enforcement This adds two new pool properties: - dedup_table_size, the total size of all DDTs on the pool; and - dedup_table_quota, the maximum possible size of all DDTs in the pool When set, quota will be enforced by checking when a new entry is about to be created. If the pool is over its dedup quota, the entry won't be created, and the corresponding write will be converted to a regular non-dedup write. Note that existing entries can be updated (ie their refcounts changed), as that reuses the space rather than requiring more. dedup_table_quota can be set to 'auto', which will set it based on the size of the devices backing the "dedup" allocation device. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated. Sponsored-by: iXsystems, Inc. Sponsored-By: Klara Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com> Closes #15889	2024-07-25 09:47:36 -07:00
Tony Hutter	a1be921673	Linux 6.9: Fix UBSAN errors in sa.c (#16380 ) This is a follow-on to `156a64161b` that ignores UBSAN errors in sa.c. Thank you @thwalker3 for the fix. Original-patch-by: @thwalker3 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16278 Closes #16330 Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-23 17:13:04 -07:00
Don Brady	fb6d8cf229	Add some missing vdev properties (#16346 ) Sponsored-by: Klara, Inc. Sponsored-By: Wasabi Technology, Inc. Signed-off-by: Don Brady <don.brady@klarasystems.com> Co-authored-by: Don Brady <don.brady@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-23 16:34:09 -07:00
Chunwei Chen	9dfc5c4a0c	Fix long_free_dirty accounting for small files (#16264 ) For files smaller than recordsize, it's most likely that they don't have L1 blocks. However, current calculation will always return at least 1 L1 block. In this change, we check dnode level to figure out if it has L1 blocks or not, and return 0 if it doesn't. This will reduce the chance of unnecessary throttling when deleting a large number of small files. Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-23 11:34:19 -07:00
Rob Norris	5de3ac2236	vdev_open: clear async fault flag after reopen After `c3f2f1aa2`, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev. In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again. The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well! Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Co-authored-by: Don Brady <don.brady@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@klarasystems.com>	2024-07-17 10:03:41 -07:00
Jason Lee	41902c8e6d	Use kmap_local_page instead of kmap_atomic (#16329 ) Changed zfs_k(un)map_atomic to zfs_k(un)map_local Signed-off-by: Jason Lee <jasonlee@lanl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov>	2024-07-16 17:27:29 -07:00
Rob Norris	7ca7bb7fd7	Linux 5.16: use bdev_nr_bytes() to get device capacity This helper was introduced long ago, in 5.16. Since 6.10, bd_inode no longer exists, but the helper has been updated, so detect it and use it in all versions where it is available. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-15 17:10:06 -07:00
Rob Norris	e951dba48a	Linux 6.10: work harder to avoid kmem_cache_alloc reuse Linux 6.10 change kmem_cache_alloc to be a macro, rather than a function, such that the old #undef for it in spl-kmem-cache.c would remove its definition completely, breaking the build. This inverts the model used before. Rather than always defining the kmem_cache_* macro, then undefining then inside spl-kmem-cache.c, instead we make a special tag to indicate we're currently inside spl-kmem-cache.c, and not defining those in macros in the first place, so we can use the kernel-supplied kmem_cache_* functions to implement spl_kmem_cache_*, as we expect. For all other callers, we create the macros as normal and remove access to the kernel's own conflicting names. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-15 17:10:02 -07:00
Rob Norris	b409892ae5	Linux 6.10: rework queue limits setup Linux has started moving to a model where instead of applying block queue limits through individual modification functions, a complete limits structure is built up and applied atomically, either when the block device or open, or some time afterwards. As of 6.10 this transition appears only partly completed. This commit matches that model within OpenZFS in a way that should work for past and future kernels. We set up a queue limits structure with any limits that have had their modification functions removed. For newer kernels that can have limits applied at block device open (HAVE_BLK_ALLOC_DISK_2ARG), we have a conversion function to turn the OpenZFS queue limits structure into Linux's queue_limits structure, which can then be passed in. For older kernels, we provide an application function that just calls the old functions for each limit in the structure. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-15 17:09:55 -07:00
Mateusz Guzik	a7fc4c85e3	zstd: don't call zstd_mempool_reap if there are no buffers (#16302 ) zfs_zstd_cache_reap_now is issued every second. zstd_mempool_reap checks for both pool existence and buffer count, but that's still 2 func calls which are trivially avoidable. With clang it even avoids pushing the stack pointer (but still suffers the mispredict due to a forward jump, not modified in case someone is using zstd): <+0>: cmpq $0x0,0x0(%rip) # <zfs_zstd_cache_reap_now+8> <+8>: je 0x217de4 <zfs_zstd_cache_reap_now+36> <+10>: push %rbp <+11>: mov %rsp,%rbp <+14>: mov 0x0(%rip),%rdi # <zfs_zstd_cache_reap_now+21> <+21>: call 0x217df0 <zstd_mempool_reap> <+26>: mov 0x0(%rip),%rdi # <zfs_zstd_cache_reap_now+33> <+33>: pop %rbp <+34>: jmp 0x217df0 <zstd_mempool_reap> <+36>: ret Preferably the call would not be made to begin with if zstd is not used, but this retains all the logic confined to zstd code. Sponsored by: Rubicon Communications, LLC ("Netgate") Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-15 14:51:37 -07:00
George Amanakis	c87cb22ba9	head_errlog: fix use-after-free In the commit of the head_errlog feature we introduced a bug in dsl_dataset_promote_sync(): we may dereference origin_head and hds, both dereferencing ddpa after calling promote_sync() on ddpa. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #16272 Closes #16273	2024-07-15 09:05:42 -07:00
Mark Johnston	a10faf5ce6	FreeBSD: Use the new freeuio() helper to free dynamically allocated UIOs (#16300 ) This freeuio() interface was introduced to FreeBSD recently. For now it simply calls free(), so this change has no effect. However, this may not always be true, and in CheriBSD this change is required. Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brooks Davis <brooks.davis@sri.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-11 16:52:51 -07:00
Tony Hutter	156a64161b	Linux 6.9: Fix UBSAN errors in zap_micro.c You can use the UBSAN_SANITIZE_* Kbuild options to exclude certain kernel objects from the UBSAN checks. We previously excluded zap_micro.o with: UBSAN_SANITIZE_zap_micro.o := n For some reason that didn't work for the 6.9 kernel, which wants us to use: UBSAN_SANITIZE_zfs/zap_micro.o := n Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16278 Closes #16330	2024-07-11 16:41:26 -07:00
Mark Johnston	4367312760	zvol: Fix suspend lock leaks (#16270 ) In several functions, we use a flag variable to track whether zv_suspend_lock is held. This flag was not getting reset in a particular case where we need to retry the underlying operation, resulting in a lock leak. Make sure to update the flag where necessary. Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-10 14:27:44 -07:00
Tony Hutter	49f3ce3385	Linux 6.9: Call add_disk() from workqueue to fix zfs_allow_010_pos (#16282 ) The 6.9 kernel behaves differently in how it releases block devices. In the common case it will async release the device only after the return to userspace. This is different from the 6.8 and older kernels which release the block devices synchronously. To get around this, call add_disk() from a workqueue so that the kernel uses a different codepath to release our zvols in the way we expect. This stops zfs_allow_010_pos from hanging. Fixes: #16089 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>	2024-06-28 09:52:03 -07:00
Mateusz Guzik	121a2d3354	FreeBSD: unregister mountroot eventhandler on unload Otherwise if zfs is unloaded and reroot is being used it trips over a stale pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: Rubicon Communications, LLC ("Netgate") Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #16242	2024-06-13 17:49:50 -07:00
bnovkov	20c8bdd85e	FreeBSD: Update use of UMA-related symbols in arc_available_memory Recent UMA changes repurposed the use of UMA_MD_SMALL_ALLOC in a way that breaks arc_available_memory on -CURRENT. This change ensures that arc_available_memory uses the new symbol while maintaining compatibility with older FreeBSD releases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Bojan Novković <bnovkov@FreeBSD.org> Closes #16230	2024-06-06 18:11:00 -07:00

... 2 3 4 5 6 ...

4705 Commits