mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-24 03:08:51 +03:00

Author	SHA1	Message	Date
Rob Norris	7ca7bb7fd7	Linux 5.16: use bdev_nr_bytes() to get device capacity This helper was introduced long ago, in 5.16. Since 6.10, bd_inode no longer exists, but the helper has been updated, so detect it and use it in all versions where it is available. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-15 17:10:06 -07:00
Rob Norris	e951dba48a	Linux 6.10: work harder to avoid kmem_cache_alloc reuse Linux 6.10 changed kmem_cache_alloc to be a macro, rather than a function, such that the old #undef for it in spl-kmem-cache.c would remove its definition completely, breaking the build. This inverts the model used before. Rather than always defining the kmem_cache_* macro, then undefining them inside spl-kmem-cache.c, instead we make a special tag to indicate we're currently inside spl-kmem-cache.c, and avoid defining those macros in the first place, so we can use the kernel-supplied kmem_cache_* functions to implement spl_kmem_cache_*, as we expect. For all other callers, we create the macros as normal and remove access to the kernel's own conflicting names. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-15 17:10:02 -07:00
Rob Norris	b409892ae5	Linux 6.10: rework queue limits setup Linux has started moving to a model where instead of applying block queue limits through individual modification functions, a complete limits structure is built up and applied atomically, either when the block device is opened, or some time afterwards. As of 6.10 this transition appears only partly completed. This commit matches that model within OpenZFS in a way that should work for past and future kernels. We set up a queue limits structure with any limits that have had their modification functions removed. For newer kernels that can have limits applied at block device open (HAVE_BLK_ALLOC_DISK_2ARG), we have a conversion function to turn the OpenZFS queue limits structure into Linux's queue_limits structure, which can then be passed in. For older kernels, we provide an application function that just calls the old functions for each limit in the structure. Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2024-07-15 17:09:55 -07:00
Mateusz Guzik	a7fc4c85e3	zstd: don't call zstd_mempool_reap if there are no buffers (#16302 ) zfs_zstd_cache_reap_now is issued every second. zstd_mempool_reap checks for both pool existence and buffer count, but that's still 2 func calls which are trivially avoidable. With clang it even avoids pushing the stack pointer (but still suffers the mispredict due to a forward jump, not modified in case someone is using zstd): <+0>: cmpq $0x0,0x0(%rip) # <zfs_zstd_cache_reap_now+8> <+8>: je 0x217de4 <zfs_zstd_cache_reap_now+36> <+10>: push %rbp <+11>: mov %rsp,%rbp <+14>: mov 0x0(%rip),%rdi # <zfs_zstd_cache_reap_now+21> <+21>: call 0x217df0 <zstd_mempool_reap> <+26>: mov 0x0(%rip),%rdi # <zfs_zstd_cache_reap_now+33> <+33>: pop %rbp <+34>: jmp 0x217df0 <zstd_mempool_reap> <+36>: ret Preferably the call would not be made to begin with if zstd is not used, but this retains all the logic confined to zstd code. Sponsored by: Rubicon Communications, LLC ("Netgate") Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-15 14:51:37 -07:00
George Amanakis	c87cb22ba9	head_errlog: fix use-after-free In the commit of the head_errlog feature we introduced a bug in dsl_dataset_promote_sync(): we may dereference origin_head and hds, both dereferencing ddpa after calling promote_sync() on ddpa. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #16272 Closes #16273	2024-07-15 09:05:42 -07:00
Mark Johnston	a10faf5ce6	FreeBSD: Use the new freeuio() helper to free dynamically allocated UIOs (#16300) This freeuio() interface was introduced to FreeBSD recently. For now it simply calls free(), so this change has no effect. However, this may not always be true, and in CheriBSD this change is required. Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brooks Davis <brooks.davis@sri.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-11 16:52:51 -07:00
Tony Hutter	156a64161b	Linux 6.9: Fix UBSAN errors in zap_micro.c You can use the UBSAN_SANITIZE_* Kbuild options to exclude certain kernel objects from the UBSAN checks. We previously excluded zap_micro.o with: UBSAN_SANITIZE_zap_micro.o := n For some reason that didn't work for the 6.9 kernel, which wants us to use: UBSAN_SANITIZE_zfs/zap_micro.o := n Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #16278 Closes #16330	2024-07-11 16:41:26 -07:00
Mark Johnston	4367312760	zvol: Fix suspend lock leaks (#16270) In several functions, we use a flag variable to track whether zv_suspend_lock is held. This flag was not getting reset in a particular case where we need to retry the underlying operation, resulting in a lock leak. Make sure to update the flag where necessary. Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-10 14:27:44 -07:00
Tony Hutter	49f3ce3385	Linux 6.9: Call add_disk() from workqueue to fix zfs_allow_010_pos (#16282) The 6.9 kernel behaves differently in how it releases block devices. In the common case it will async release the device only after the return to userspace. This is different from the 6.8 and older kernels which release the block devices synchronously. To get around this, call add_disk() from a workqueue so that the kernel uses a different codepath to release our zvols in the way we expect. This stops zfs_allow_010_pos from hanging. Fixes: #16089 Signed-off-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Rob Norris <rob.norris@klarasystems.com>	2024-06-28 09:52:03 -07:00
Mateusz Guzik	121a2d3354	FreeBSD: unregister mountroot eventhandler on unload Otherwise if zfs is unloaded and reroot is being used it trips over a stale pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: Rubicon Communications, LLC ("Netgate") Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #16242	2024-06-13 17:49:50 -07:00
bnovkov	20c8bdd85e	FreeBSD: Update use of UMA-related symbols in arc_available_memory Recent UMA changes repurposed the use of UMA_MD_SMALL_ALLOC in a way that breaks arc_available_memory on -CURRENT. This change ensures that arc_available_memory uses the new symbol while maintaining compatibility with older FreeBSD releases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Bojan Novković <bnovkov@FreeBSD.org> Closes #16230	2024-06-06 18:11:00 -07:00
Rob Norris	a72751a342	icp: remove redundant FreeBSD check We don't build illumos-crypto for FreeBSD. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:13:59 -07:00
Rob Norris	4e714c0be1	icp: remove unused headers Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:13:51 -07:00
Rob Norris	ae512620d0	icp: remove skein module Nothing calls it through the KCF interface, so this is all unused. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:13:39 -07:00
Rob Norris	f39241aeb3	icp: remove unused SHA2 HMAC mechanisms Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:13:30 -07:00
Rob Norris	10de12e9ed	icp: reorganise SHA2 digest mechanisms sha2_mech_type_t serves double-duty, as the list of MAC providers and also the algo type for direct callers to SHA2Init. Until we disentangle that, reorganise it to make the separation more clear. While we're there, remove the digest mechs we don't use. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:13:23 -07:00
Rob Norris	1291c46ea4	icp: remove digest entry points For whatever reason, we call digest mechanisms directly, not through the KCF digest provider. So we can remove those entry points entirely. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:13:16 -07:00
Rob Norris	94f1e56e41	icp: remove unused KCF_ macros Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:13:06 -07:00
Rob Norris	4ed91dc26e	icp: remove unused incremental cipher methods Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:12:59 -07:00
Rob Norris	57249bcddc	icp: brutally remove unused AES modes Still retaining the structure, for now. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:12:51 -07:00
Rob Norris	4185179190	icp: remove unused blowfish_ctx and des_ctx Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16209	2024-05-31 15:12:31 -07:00
Zhenlei Huang	e2357561b9	FreeBSD: Add const qualifier to members of struct opensolaris_utsname These members have direct references to the global variables exposed by the kernel. They are not going to be changed by this kernel module. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Zhenlei Huang <zlei@FreeBSD.org> Closes #16210	2024-05-30 09:58:20 -07:00
Pawel Jakub Dawidek	01c8efdd59	Simplify issig(). We always call it twice with JUSTLOOKING and then FORREAL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #16225	2024-05-29 10:49:11 -07:00
Brian Behlendorf	6b95031f56	zed: Add deadman-slot_off.sh zedlet Optionally turn off disk's enclosure slot if an I/O is hung triggering the deadman. It's possible for outstanding I/O to a misbehaving SCSI disk to neither promptly complete nor return an error. This can occur due to retry and recovery actions taken by the SCSI layer, driver, or disk. When it occurs the pool will be unresponsive even though there may be sufficient redundancy configured to proceed without this single disk. When a hung I/O is detected by the kmods it will be posted as a deadman event. By default an I/O is considered to be hung after 5 minutes. This value can be changed with the zfs_deadman_ziotime_ms module parameter. If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set the disk's enclosure slot will be powered off causing the outstanding I/O to fail. The ZED will then handle this like a normal disk failure. By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set. As part of this change `zfs_deadman_events_per_second` is added to control the ratelimiting of deadman events independently of delay events. In practice, a single deadman event is sufficient and more aren't particularly useful. Alphabetize the zfs_deadman_* entries in zfs.4. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #16226	2024-05-29 10:46:41 -07:00
Alexander Motin	800d59d577	Some improvements to metaslabs eviction - Add old eviction for special and dedup metaslab classes. Those vdevs may be potentially big and fragmented with large metaslabs, while their asynchronous write pattern is not really different from normal class. It seems an omission to not evict old metaslabs from them. - If we have metaslab preload enabled, which means we are not too low on memory, do not evict active metaslabs even if they are not used for some time. Eviction of active metaslabs means we won't be able to write anything until we load them, that may take some time, that is straight opposite to metaslab preload goals. For small systems the memory saving should be less important after recent reduction in number of allocators and so open metaslabs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16214	2024-05-29 08:53:31 -07:00
Alexander Motin	02c5aa9b09	Destroy ARC buffer in case of fill error In case of error dmu_buf_fill_done() returns the buffer back into DB_UNCACHED state. Since during transition from DB_UNCACHED into DB_FILL state dbuf_noread() allocates an ARC buffer, we must free it here, otherwise it will be leaked. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665 Closes #15802 Closes #16216	2024-05-24 19:11:18 -07:00
George Amanakis	8865dfbcaa	Fix assertion in Persistent L2ARC At the end of l2arc_evict() fix an assertion in the case that l2ad_hand + distance == l2ad_end. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #16202 Closes #16207	2024-05-24 19:02:58 -07:00
Rob N	d0aa9dbccf	Use memset to zero stack allocations containing unions C99 6.7.8.17 says that when an undesignated initialiser is used, only the first element of a union is initialised. If the first element is not the largest within the union, how the remaining space is initialised is up to the compiler. GCC extends the initialiser to the entire union, while Clang treats the remainder as padding, and so initialises according to whatever automatic/implicit initialisation rules are currently active. When Linux is compiled with CONFIG_INIT_STACK_ALL_PATTERN, -ftrivial-auto-var-init=pattern is added to the kernel CFLAGS. This flag sets the policy for automatic/implicit initialisation of variables on the stack. Taken together, this means that when compiling under CONFIG_INIT_STACK_ALL_PATTERN on Clang, the "zero" initialiser will only zero the first element in a union, and the rest will be filled with a pattern. This is significant for aes_ctx_t, which in aes_encrypt_atomic() and aes_decrypt_atomic() is initialised to zero, but then used as a gcm_ctx_t, which is the fifth element in the union, and thus gets pattern initialisation. Later, it's assumed to be zero, resulting in a hang. As confusing and undiscoverable as it is, by the spec, we are at fault when we initialise a structure containing a union with the zero initializer. As such, this commit replaces these uses with an explicit memset(0). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16135 Closes #16206	2024-05-24 19:00:29 -07:00
Rob N	34906f8bbe	zap: reuse zap_leaf_t on dbuf reuse after shrink If a shrink or truncate had recently freed a portion of the ZAP, the dbuf could still be sitting on the dbuf cache waiting for eviction. If it is then allocated for a new leaf before it can be evicted, the zap_leaf_t is still attached as userdata, tripping the VERIFY. Instead, just check for the userdata, and if we find it, reuse it. Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16157. Closes #16204	2024-05-24 18:55:47 -07:00
Brooks Davis	7572e8ca04	Avoid a gcc -Wint-to-pointer-cast warning On 32-bit platforms long long is generally 64 bits. Sufficiently modern versions of gcc (13 in my testing) complain when casting a pointer to an integer of a different width so cast to uintptr_t first to avoid the warning. Fixes: `c183d164aa` Parallel pool import Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@klarasystems.com> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #16203	2024-05-24 18:45:58 -07:00
Pawel Jakub Dawidek	08648cf0da	Allow block cloning to be interrupted by a signal. Even though block cloning is much faster than regular copying, it is not instantaneous - the file might be large and the recordsize small. It would be nice to be able to interrupt it with a signal (e.g., SIGINFO on FreeBSD to see the progress). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #16208	2024-05-24 18:45:09 -07:00
Alexander Motin	efbef9e6cc	FreeBSD: Add zfs_link_create() error handling Originally Solaris didn't expect errors there, but they may happen if we fail to add entry into ZAP. Linux fixed it in #7421, but it was never fully ported to FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13215 Closes #16138	2024-05-16 17:56:55 -07:00
Rob N	e675852bc1	dbuf: separate refcount calls for dbuf and dbuf_user In `92dc4ad83` I updated the dbuf_cache accounting to track the size of userdata associated with dbufs. This adds the size of the dbuf+userdata together in a single call to zfs_refcount_add_many(), but sometimes removes them in separate calls to zfs_refcount_remove_many(), if dbuf and userdata are evicted separately. What I didn't realise is that when refcount tracking is on, zfs_refcount_add_many() and zfs_refcount_remove_many() are expected to be paired, with their second & third args (count & holder) the same on both sides. Splitting the remove part into two calls means the counts don't match up, tripping a panic. This commit fixes that, by always adding and removing the dbuf and userdata counts separately. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reported-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16191	2024-05-15 13:03:41 -07:00
Rob Norris	3c941d1818	zdb/ztest: send dbgmsg output to stderr And, make the output fd an arg to zfs_dbgmsg_print(). This is a change in behaviour, but keeps it consistent with where crash traces go, and it's easy to argue this is what we want anyway; this is information about the task, not the actual output of the task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16181	2024-05-14 09:49:00 -07:00
Rob Norris	fa99d9cd9c	zfs_dbgmsg_print: make FreeBSD and Linux consistent FreeBSD was using fprintf(), which might not be signal-safe. Meanwhile, Linux's locking did not cover the header output. These two quirks are unrelated, but both have the same response: be like the other one. So with this commit, both functions are the same except for the names of their lock and list variables. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16181	2024-05-14 09:48:56 -07:00
Rob Norris	0a543db371	spa_taskq_dispatch_ent: simplify arguments This renames it to spa_taskq_dispatch(), and reduces and simplifies its arguments based on these observations from its two call sites: - arg is always the zio, so it can be typed that way, and we don't need to provide it twice; - ent is always &zio->io_tqent, and zio is always provided, so we can use it directly; - the only flag used is TQ_FRONT, which can just be a bool; - zio != NULL was part of the "use allocator" test, but it never would have got that far, because that arg was only set to NULL in the reexecute path, which is forced to type CLAIM, so the condition would fail at t == WRITE anyway. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16151	2024-05-14 09:40:16 -07:00
Rob Norris	515c4dd213	spa: flatten spa_taskq_dispatch_ent() It is the only user of spa_taskq_dispatch_select(), so might as well just carry it directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16151	2024-05-14 09:40:09 -07:00
Rob Norris	adda768e3e	spa: remove spa_taskq_dispatch_sync() It has no callers anymore. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16151	2024-05-14 09:40:02 -07:00
Rob Norris	cc38691534	zfs_ioc_send: use a dedicated taskq thread for send When stack space is tight, the stream is written to its target on a separate taskq thread to make sure there's enough stack space to complete it. This has always used an IO taskq, but that doesn't really make sense for it, and moving it onto a regular taskq lets us get rid of spa_taskq_dispatch_sync(), which is not used anywhere else. Stream writes may block for a long time depending on what the target is, and we have no way of discovering this, so we can't risk using the system taskq, as there may be many tens of sends in progress. Instead, we create a dedicated taskq thread for each send writer to run on, and clean it up when it's done. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16151	2024-05-14 09:39:26 -07:00
Don Brady	89acef992b	Simplified the scope of the namespace lock If we wait until after we check for no spa references to drop the namespace lock, then we know that spa consumers will need to call spa_lookup() and end up waiting on the spa_namespace_cv until we finish. This narrows the external checks to spa_lookup and we no longer need to worry about the spa_vdev_enter case. Sponsored-By: Klara Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16153	2024-05-14 08:58:15 -07:00
Don Brady	975a13259b	Add support for parallel pool exports Changed spa_export_common() such that it no longer holds the spa_namespace_lock for the entire duration and instead sets spa_export_thread to indicate an import is in progress on the spa. This allows for an export to a different pool to proceed in parallel while an export is still processing potentially long operations like spa_unload_log_sm_flush_all(). Calls like spa_lookup() and spa_vdev_enter() that rely on the spa_namespace_lock to serialize them against a concurrent export, now wait for any in-progress export thread to complete before proceeding. The 'zpool import -a' sub-command also provides multi-threaded support, using a thread pool to submit the exports in parallel. Sponsored-By: Klara Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16153	2024-05-14 08:57:41 -07:00
Brian Behlendorf	abec7dcd30	Linux: disable lockdep for a couple of locks When running a debug kernel with lockdep enabled there are several locks which report false positives. Set MUTEX_NOLOCKDEP/RW_NOLOCKDEP to disable these warnings. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #16188	2024-05-13 15:12:07 -07:00
Alexander Motin	136c053211	ZAP: Fix leaf references on zap_expand_leaf() errors Depending on kind of error zap_expand_leaf() may return with or without valid leaf reference held. Make sure it returns NULL if due to error it has no leaf to return. Make its callers check the returned leaf pointer, and release the leaf if it is not NULL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #12366 Closes #16159	2024-05-10 12:35:20 -07:00
chenqiuhao1997	41ae864b69	Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN. In P2ALIGN, the result would be incorrect when align is unsigned integer and x is larger than max value of the type of align. In that case, -(align) would be a positive integer, which means high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com> Closes #15940	2024-05-10 08:47:21 -07:00
Alexander Motin	3400127a75	Fix ZIL clone records for legacy holes Previous code overengineered cloned range calculation by using BP_GET_LSIZE(). The problem is that legacy holes don't have the logical size, so result will be wrong. But we also don't need to look on every block size, since they all must be identical. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16165	2024-05-09 07:39:57 -07:00
Alexander Motin	af5dbed319	Fix scn_queue races on very old pools Code for pools before version 11 uses dmu_objset_find_dp() to scan for children datasets/clones. It calls enqueue_clones_cb() and enqueue_cb() callbacks in parallel from multiple taskq threads. It ends up bad for scan_ds_queue_insert(), corrupting scn_queue AVL-tree. Fix it by introducing a mutex to protect those two scan_ds_queue_insert() calls. All other calls are done from the sync thread and so serialized. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16162	2024-05-09 07:32:59 -07:00
Daniel Perry	2dff7527d4	Replace usage of schedule_timeout with schedule_timeout_interruptible (#16150) This commit replaces current usages of schedule_timeout() with schedule_timeout_interruptible() in code paths that expect the running task to sleep for a short period of time. When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. By calling schedule_timeout_interruptible() to set the task state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of putting the task to sleep for the specified timeout. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Daniel Perry <dtperry@amazon.com> Closes #16150	2024-05-09 07:30:28 -07:00
Alexander Motin	04bae5ec95	Disable high priority ZIO threads on FreeBSD and Linux High priority threads are handling ZIL writes. While there is no ZIL compression, there is encryption, checksumming and RAIDZ math. We've found that on large systems 1 taskq with 5 threads can be a bottleneck for throughput, IOPS or both. Instead of just bumping number of threads with a risk of overloading CPUs and increasing latency, switch to using TQ_FRONT mechanism to increase sync write requests priority within standard write threads. Do not do it on Illumos, since its TQ_FRONT implementation is inherently unfair. FreeBSD and Linux don't have this problem, so we can do it there. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #16146	2024-05-03 09:53:34 -07:00
Rob N	8f1b7a6fa6	vdev_disk: disable flushes if device does not support it If the underlying device doesn't have a write-back cache, the kernel will just return a successful response. This doesn't hurt anything, but it's extra work on the IO taskqs that are unnecessary. So, detect this when we open the device for the first time. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16148	2024-05-02 15:18:35 -07:00
Alexander Motin	645b833079	Improve write issue taskqs utilization - Reduce number of allocators on small system down to one per 4 CPU cores, keeping maximum at 4 on 16+ core systems. Small systems should not have the lock contention multiple allocators are supposed to solve, while having several metaslabs open and modified each TXG is not free. - Reduce number of write issue taskqs down to one per 16 CPU cores and an integer fraction of number of allocators. On mid-sized systems, where multiple allocators already make sense, too many write issue taskqs may reduce write speed on single-file workloads, since single file is handled by only one taskq to reduce fragmentation. On large systems, that can actually benefit from many taskq's better IOPS, the bottleneck is less important, since in worst case there will be at least 16 cores to handle it. - Distribute dnodes between allocators (and taskqs) in a round-robin fashion instead of relying on sync taskqs to be balanced. The last is not guaranteed and may depend on scheduling. - Remove io_wr_iss_tq from struct zio. io_allocator is enough. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16130	2024-05-01 11:07:20 -07:00

4516 Commits