mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-01-25 10:12:13 +03:00

Author	SHA1	Message	Date
Ameer Hamza	0bcbee6040	Fix snapshot automount race causing duplicate mounts and AVL tree panic Multiple threads racing to automount the same snapshot can both spawn mount helper processes that successfully complete, causing both parent threads to attempt AVL tree registration and triggering a VERIFY() panic in avl_add(). This occurs because the fsconfig/fsmount API lacks the serialization provided by traditional mount() via lock_mount(). The fix adds a per-entry mutex (se_mtx) to zfs_snapentry_t that serializes mount and unmount operations on the same snapshot. The first mount thread creates a pending entry with se_spa=NULL and holds se_mtx during the helper execution. Concurrent mounts find the pending entry and return success without spawning duplicate helpers. Unmount waits on se_mtx if a mount is pending, ensuring proper serialization. This allows different snapshots to mount in parallel while preventing the AVL panic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17943	2025-12-10 10:21:29 -08:00
Ameer Hamza	74bbdda1ef	Fix snapshot automount expiry cancellation deadlock A deadlock occurs when snapshot expiry tasks are cancelled while holding locks. The snapshot expiry task (snapentry_expire) spawns an umount process and waits for it to complete. Concurrently, ARC memory pressure triggers arc_prune which calls zfs_exit_fs(), attempting to cancel the expiry task while holding locks. The umount process spawned by the expiry task blocks trying to acquire locks held by arc_prune, which is blocked waiting for the expiry task to complete. This creates a circular dependency: expiry task waits for umount, umount waits for arc_prune, arc_prune waits for expiry task. Fix by adding non-blocking cancellation support to taskq_cancel_id(). The zfs_exit_fs() path calls zfsctl_snapshot_unmount_delay() to reschedule the unmount, which needs to cancel any existing expiry task. It now uses non-blocking cancellation to avoid waiting while holding locks, breaking the deadlock by returning immediately when the task is already running. The per-entry se_taskqid_lock has been removed, with all taskqid operations now protected by the global zfs_snapshot_lock held as WRITER. Additionally, an se_in_umount flag prevents recursive waits when zfsctl_destroy() is called during unmount. The taskqid is now only cleared by the caller on successful cancellation; running tasks clear their own taskqid upon completion. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17941	2025-12-10 10:21:29 -08:00
Ameer Hamza	663dc86de2	Fix taskq NULL pointer dereference on timer race Remove unsafe timer_pending() check in taskq_cancel_id() that created a race where: - Timer expires and timer_pending() returns FALSE - task_done() frees task with tqent_func = NULL - Timer callback executes and queues freed task - Worker thread crashes executing NULL function Always call timer_delete_sync() unconditionally to ensure timer callback completes before task is freed. Reliably reproducible by injecting mdelay(10) after setting CANCEL flag to widen the race window, combined with frequent task cancellations (e.g., snapshot automount expiry). Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17942	2025-12-10 10:21:29 -08:00
Rob Norris	c9845a1332	Linux: work around use of GPL-only symbol `kasan_flag_enabled` We may not be able to avoid our code referencing the symbol, but we can ensure that a symbol of that name is available to the linker during build, and so not require linking the GPL-exported version. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #18009 Closes #18040	2025-12-10 10:21:29 -08:00
Alex	f8572e2a97	Fix a declaration position of the nth_page. Compilation time bug introduced by 87df5e4 commit. Fix for the compilation error(Linux kernel 6.18.0): "zfs/module/os/linux/zfs/abd_os.c:920:32: error: implicit declaration of function ‘nth_page’; did you mean ‘pte_page’? [-Werror=implicit-function-declaration]". Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: agiUnderground <alex.dev.cv@gmail.com> Closes #18034	2025-12-10 10:21:29 -08:00
bspengler-oss	25d755e108	Fix HIGHMEM/kmap API violation in zfs_uiomove_bvec_impl() Fix another instance where ZFS assumes multiple pages can be mapped at once via zfs_kmap_local(), resulting in crashes and potential memory corruption on HIGHMEM-enabled (typically 32-bit) systems. Reviewed-by: RageLtMan <rageltman@sempervictus> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com> Closes #15668 Closes #18030	2025-12-10 10:21:29 -08:00
bspengler-oss	5e271995d1	Fix interaction of abd_iter_map()/abd_iter_unmap() with HIGHMEM HIGHMEM kmap interfaces operate on only a single page at a time yet ZFS hadn't accounted for this, resulting in crashes and potential memory corruption on HIGHMEM (typically 32-bit) systems. This was caught by PaX's KERNSEAL feature as it makes use of HIGHMEM functionality on x64. On typical 64-bit systems, this issue wouldn't have been observed, as the map interfaces simply fall back to returning an address in lowmem where the contiguous pages can be accessed directly. Joint work with the PaX Team, tested by Mark van Dijk Reviewed-by: RageLtMan <rageltman@sempervictus> Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bspengler-oss <94915855+bspengler-oss@users.noreply.github.com> Closes #15668 Closes #18030	2025-12-10 10:21:29 -08:00
Mark Johnston	a2f768f61f	FreeBSD: Fix a potential null dereference in zfs_freebsd_fsync() In general it's possible for a vnode to not have an associated VM object. This happens in particular with named pipes, which have some distinct VOPs, defined in zfs_fifoops. Thus, this chunk of zfs_freebsd_fsync() needs to check for the FIFO case, like other vm_object_mightbedirty() callers do. (Note that vn_flush_cached_data() calls are predicated on zn_has_cached_data() returning true, and it checks for a NULL v_object pointer already.) Fixes: `ef4058fcdc` Reported-by: Collin Funk <collin.funk1@gmail.com> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #18015	2025-12-10 10:21:29 -08:00
Rob Norris	ac0bc4cc00	spa_misc: add an API for spa_namespace_lock This is useful as debugging support, as it lets namespace lock operations be traced directly. It will also be useful for future work to reduce the use of spa_namespace_lock, traditionally a source of difficult deadlocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17906	2025-11-12 13:06:54 -08:00
rmacklem	583db40030	FreeBSD: Add support for _PC_CASE_INSENSITIVE FreeBSD now has a pathconf name called _PC_CASE_INSENSITIVE used to check if a file system performs case insensitive name lookups. This patch adds support for this name. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #17908	2025-11-12 13:06:36 -08:00
Tony Hutter	a2a34d9212	Linux 6.17 compat: Fix broken projectquota on 6.17 We need to specifically use the FX_XFLAG_* macros in zpl_ioctl_attr() codepaths, and the FS__FL macros in the zpl_ioctl_flags() codepaths. The earlier code just assumes the FS__FL macros for both codepaths. The 6.17 kernel add a bitmask check in copy_fsxattr_from_user() that exposed this error via failing 'projectquota' ZTS tests. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17884 Closes #17869	2025-11-12 13:06:01 -08:00
Alexander Motin	67fc49433f	Cleanup ZIO_FLAG_IO_RETRY vs TRYHARD usage In cases where all issued ZIOs must succeed, and we can't do anything clever about the errors, we should just explicitly set ZIO_FLAG_TRYHARD and let OS to do all the reasonable retries. In other cases, where retries can be different from the original, for example, some ZIOs are allowed to fail due to redundancy, or we can disable aggregation on retrial to get at least some of the data, we can do first pass without TRYHARD, and only if needed retry with ZIO_FLAG_IO_RETRY (which implies TRYHARD semantics). Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17877	2025-11-12 13:05:31 -08:00
Alexander Motin	5847626175	Pass flags to more DMU write/hold functions Over the time many of DMU functions got flags argument to control prefetch, caching, etc. Few functions though left without it, even though closer look shown that many of them do not require prefetch due to their access pattern. This patch adds the flags argument to dmu_write(), dmu_buf_hold_array() and dmu_buf_hold_array_by_bonus(), passing DMU_READ_NO_PREFETCH where applicable. I am going to also pass DMU_UNCACHEDIO to some of them later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17872	2025-11-12 13:04:58 -08:00
Ryan Libby	672fea2a50	FreeBSD zio_crypt.c: initialize uio variables before access In zio_crypt_key_wrap and zio_crypt_key_unwrap, the cuio_s variable was not initialized before the calls to zfs_uio_init, leading to uninitialized access to cuio_s.uio_offset. Initialize it to avoid gcc warnings. Similar issue as fixed in `2bf152021` ("Fix gcc uninitialized warning in FreeBSD zio_crypt.c") Signed-off-by: Ryan Libby <rlibby@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17863	2025-11-12 13:04:42 -08:00
Jean-Sébastien Pédron	6f6e1c90ae	FreeBSD: zfs_getpages: Don't zero freshly allocated pages Initially, `zfs_getpages()` is provided with an array of busy pages by the vnode pager. It then tries to acquire the range lock, but if there is a concurrent `zfs_write()` running and fails to acquire that range lock, it "unbusies" the pages to avoid a deadlock with `zfs_write()`. After that, it grabs the pages again and retries to acquire the range lock, and so on. Once it got the range lock, it filters out valid pages, then copy DMU data to the remaining invalid pages. The problem is that freshly allocated zero'd pages it grabbed itself are marked as valid. Therefore they are skipped by the second part of the function and DMU data is never copied to these pages. This causes mapped pages to contain zeros instead of the expected file content. This was discovered while working on RabbitMQ on FreeBSD. I could reproduce the problem easily with the following commands: git clone https://github.com/rabbitmq/rabbitmq-server.git cd rabbitmq-server/deps/rabbit gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \ ct-amqp_client t=cluster_size_3:leader_transfer_stream_send The testsuite fails because there is a sendfile(2) that can happen concurrently to a write(2) on the same file. This leads to sendfile(2) or read(2) (after the sendfile) sending/returning data with zeros, which causes a function to crash. The patch consists of not setting the `VM_ALLOC_ZERO` flag when `zfs_getpages()` grabs pages again. Then, the last page is zero'd if it is invalid, in case it would be partially filled with the end of the file content. Other pages are either valid (and will be skipped) or they will be entirely overwritten by the file content. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Jean-Sébastien Pédron <dumbbell@FreeBSD.org> Closes #17851	2025-10-21 09:50:43 -07:00
Rob Norris	aeff23939a	Linux 6.18: generic_drop_inode() and generic_delete_inode() renamed Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-21 09:50:43 -07:00
Rob Norris	2778832e22	Linux 6.18: namespace type moved to ns_common The namespace type has moved from the namespace ops struct to the "common" base namespace struct. Detect this and define a macro that does the right thing for both versions. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-21 09:50:43 -07:00
Rob Norris	005c631499	Linux 6.18: replace write_cache_pages() Linux 6.18 removed write_cache_pages() without a usable replacement. Here we implement a minimal zpl_write_cache_pages() that find the dirty pages within the mapping, gets them into the expected state and hands them off to zfs_putpage(), which handles the rest. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-21 09:50:43 -07:00
Rob Norris	04d0f83f4e	Linux 6.18: block_device_operations->getgeo takes struct gendisk* Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-21 09:50:43 -07:00
Rob Norris	49f078997a	Linux 6.18: convert ida_simple_* calls ida_simple_get() and ida_simple_remove() are removed in 6.18. However, since 4.19 they have been simple wrappers around ida_alloc() and ida_free(), so we can just use those directly. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-21 09:50:43 -07:00
Rob Norris	3fb241157f	Linux 6.18: replace nth_page() Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com>	2025-10-21 09:50:43 -07:00
Andrew Walker	084f8d0077	Fix ZFS_READONLY implementation on Linux MS-FSCC 2.6 is the governing document for DOS attribute behavior. It specifies the following: For a file, applications can read the file but cannot write to it or delete it. For a directory, applications cannot delete it, but applications can create and delete files from the directory. Signed-off-by: Andrew Walker <awalker@ixsystems.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Closes #17837	2025-10-21 09:50:43 -07:00
Dag-Erling Smørgrav	964dfc3176	FreeBSD: Correct _PC_MIN_HOLE_SIZE The actual minimum hole size on ZFS is variable, but we always report SPA_MINBLOCKSIZE, which is 512. This may lead applications to believe that they can reliably create holes at 512-byte boundaries and waste resources trying to punch holes that ZFS ends up filling anyway. * In the general case, if the vnode is a regular file, return its current block size, or the record size if the file is smaller than its own block size. If the vnode is a directory, return the dataset record size. If it is neither a regular file nor a directory, return EINVAL. * In the control directory case, always return EINVAL. Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17750	2025-10-21 09:50:43 -07:00
Tony Hutter	e09c86cb1f	zvol: verify IO type is supported ZVOLs don't support all block layer IO request types. Add a check for the IO types we do support. Also, remove references to io_is_secure_erase() since they are not supported on ZVOLs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17803	2025-10-21 09:50:43 -07:00
Tony Hutter	9079f986ae	zvol: Fix blk-mq sync The zvol blk-mq codepaths would erroneously send FLUSH and TRIM commands down the read codepath, rather than write. This fixes the issue, and updates the zvol_misc_fua test to verify that sync writes are actually happening. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #17761 Closes #17765	2025-09-29 16:50:43 -07:00
Rob Norris	15a6b982c5	linux/super: add tunable to request immediate reclaim of unused dentries Traditionally, unused dentries would be cached in the dentry cache until the associated entry is no longer on disk. The cached dentry continues to hold an inode reference, causing the inode to be pinned (see previous commit). Here we implement the dentry op d_delete, which is roughly analogous to the drop_inode superblock op, and add a zfs_delete_dentry tunable to control its behaviour. By default it continues the traditional behaviour, but when the tunable is enabled, we signal that an unused dentry should be freed immediately, releasing its inode reference, and so allowing that inode to be deleted if no longer in use. Sponsored-by: Klara, Inc. Sponsored-by: Fastmail Pty Ltd Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17746	2025-09-17 16:34:14 -07:00
Rob Norris	42b9995f88	linux/super: add tunable to request immediate reclaim of unused inodes Traditionally, unused inodes would be held on the superblock inode cache until the associated on-disk file is removed or the kernel requests reclaim. On filesystems with millions of rarely-used files, this can be a lot of unusable memory. Here we implement the superblock drop_inode method, and add a zfs_delete_inode tunable to control its behaviour. By default it continues the traditional behaviour, but when the tunable is enabled, we signal that the inode should be deleted immediately when the last reference is dropped, rather than cached. This releases the associated data to the dbuf cache and ARC, allowing them to be reclaimed normally. Sponsored-by: Klara, Inc. Sponsored-by: Fastmail Pty Ltd Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17746	2025-09-17 16:34:09 -07:00
Rob Norris	da33cfd436	vdev_disk_close: take disk write lock before destroying it Many IO operations are submitted to the kernel async, and so the zio can complete and followup actions before the submission call returns. If one of the followup actions closes the disk (eg during pool create/import), the initiator may be left holding a lock on the disk at destruction. Instead, take the write lock before finishing up and decoupling the disk state from the vdev proper. The caller will hold until all IO is submitted and locks released. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17719	2025-09-15 12:43:44 -07:00
Alexander Motin	b5d41deca9	FreeBSD: Satisfy ASSERT_VOP_IN_SEQC() zfs_aclset_common() might be called for newly created or not even created vnodes, that triggers assertions on newer FreeBSD versions with DEBUG_VFS_LOCKS included into INVARIANTS. In the first case make sure to call vn_seqc_write_begin()/_end(), in the second just skip the assertion. The similar has to be done for project management IOCTL and file- bases extended attributes, since those are not going through VFS. Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17722	2025-09-12 15:05:18 -07:00
Paul Dagnelie	e2e708241a	Enable zhack to work properly with 4k sector size disks Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17576	2025-09-10 15:01:32 -07:00
Rob Norris	0df91abe82	Linux 6.17: d_set_d_op() is no longer available We only have extremely narrow uses, so move it all into a single function that does only what we need, with and without d_set_d_op(). Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17621	2025-09-09 17:06:55 -07:00
rmacklem	b727163db9	zfs_vnops_os.c: Add support for the _PC_CLONE_BLKSIZE name FreeBSD now has a pathconf name called _PC_CLONE_BLKSIZE which is the block size supported for block cloning for the file system. Since ZFS's block size varies per file, return the largest size likely to be used, or zero if block cloning is not supported. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #17645	2025-09-09 17:06:33 -07:00
Konstantin Belousov	28ff57505b	FreeBSD: satisfy VFS requirements for readdir() zfsctl_root_readdir(): properly set eof. readdir(): set *eofp to 1 on eof. If there were no dirents to copy out, return EINVAL same as UFS. Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Konstantin Belousov <kib@FreeBSD.org> Closes #17655	2025-08-22 09:14:36 -04:00
Dag-Erling Smørgrav	2c877e8453	FreeBSD: Set st_rdev to NODEV, not 0, when not a device Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org> Closes #17649	2025-08-19 12:42:21 -07:00
Rob Norris	8a0e5e8b54	zvol: stop using zvol_state_lock to protect OS-side private data zvol_state_lock is intended to protect access to the global name->zvol lists (zvol_find_by_name()), but has also been used to control access to OS-side private data, accessed through whatever kernel object is used to represent the volume (gendisk, geom, etc). This appears to have been necessary to some degree because the OS-side object is what's used to get a handle on zvol_state_t, so zv_state_lock and zv_suspend_lock can't be used to manage access, but also, with the private object and the zvol_state_t being shutdown and destroyed at the same time in zvol_os_free(), we must ensure that the private object pointer only ever corresponds to a real zvol_state_t, not one in partial destruction. Taking the global lock seems like a convenient way to ensure this. The problem with this is that zvol_state_lock does not actually protect access to the zvol_state_t internals, so we need to take zv_state_lock and/or zv_suspend_lock. If those are contended, this can then cause OS-side operations (eg zvol_open()) to sleep to wait for them while hold zvol_state_lock. This then blocks out all other OS-side operations which want to get the private data, and any ZFS-side control operations that would take the write half of the lock. It's even worse if ZFS-side operations induce OS-side calls back into the zvol (eg creating a zvol triggers a partition probe inside the kernel, and also a userspace access from udev to set up device links). And it gets even works again if anything decides to defer those ops to a task and wait on them, which zvol_remove_minors_impl() will do under high load. However, since the previous commit, we have a guarantee that the private data pointer will always be NULL'd out in zvol_os_remove_minor() _before_ the zvol_state_t is made invalid, but it won't happen until all users are ejected. So, if we make access to the private object pointer atomic, we remove the need to take a global lockout to access it, and so we can remove all acquisitions of zvol_state_lock from the OS side. While here, I've rewritten much of the locking theory comment at the top of zvol.c. It wasn't wrong, but it hadn't been followed exactly, so I've tried to describe the purpose of each lock in a little more detail, and in particular describe where it should and shouldn't be used. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17625	2025-08-19 10:06:34 -07:00
Rob Norris	96f9d271ea	zvol: remove the OS-side minor before freeing the zvol When destroying a zvol, it is not "unpublished" from the system (that is, /dev/zd* node removed) until zvol_os_free(). Under Linux, at the time del_gendisk() and put_disk() are called, the device node may still be have an active hold, from a userspace program or something inside the kernel (a partition probe). As it is currently, this can lead to calls to zvol_open() or zvol_release() while the zvol_state_t is partially or fully freed. zvol_open() has some protection against this by checking that private_data is NULL, but zvol_release does not. This implements a better ordering for all of this by adding a new OS-side method, zvol_os_remove_minor(), which is responsible for fully decoupling the "private" (OS-side) objects from the zvol_state_t. For Linux, that means calling put_disk(), nulling private_data, and freeing zv_zso. This takes the place of zvol_os_clear_private(), which was a nod in that direction but did not do enough, and did not do it early enough. Equivalent changes are made on the FreeBSD side to follow the API change. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17625	2025-08-19 10:06:21 -07:00
Brian Behlendorf	5061f959d1	Retire zfs_autoimport_disable kmod option Back in 2014 the zfs_autoimport_disable module option was added to control whether the kmods should load the pool configs from the cache file on module load. The default value since that time has been for the kernel to not process the cache file. Detecting and importing pools during boot is now controlled outside of the kmod on both Linux and FreeBSD. By all accounts this has been working well and we can remove this dormant code on the kernel side. The spa_config_load() function is has been moved to userspace, it is now only used by libzpool. Additionally, the spa_boot_init() hook which was used by FreeBSD now looks to be used and was removed. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17618	2025-08-14 14:58:58 -07:00
Rob Norris	2fd145b578	zvol: cleanup error handling and passthrough This is trying to get all the uses and non-uses of SET_ERROR correct (being: only call it if we're the originator of an error _within ZFS_), and correctly negating errors going to/from the kernel. And/or both. Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17605	2025-08-08 17:04:01 -07:00
Rob Norris	90a1e13df2	Linux: zfs_sync: remove explicit suspend check Since zil_commit_flags(NOW) will always return error if the pool is suspended, there's no need for a separate suspend check here. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:50 -07:00
Rob Norris	ef4058fcdc	FreeBSD: zfs_putpage: handle page writeback errors Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or to zfs_putpage() itself for a syncing (VM_PAGER_PUT_SYNC) writeback. The only thing we need to do when a page writeback fails is to skip marking the page clean ("undirty"), since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an async writeback on the vnode cache first, so any dirty data has been recorded in the ZIL, ready for the followup zfs_sync()->zil_commit() to find. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:44 -07:00
Rob Norris	3d6ee9a68c	Linux: zfs_putpage: handle page writeback errors Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or to zfs_putpage() itself for a syncing (WB_SYNC_ALL) writeback (kernel housekeeping or sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER). The only thing we need to do when a page writeback fails is to re-mark the page dirty, since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an _async_ (WB_SYNC_NONE) writeback on the file mapping at the start of the sync op (fsync(), msync(), etc). An async op will get an async itx created and logged, ready for the followup zfs_fsync()->zil_commit() to find, while avoiding a zil_commit() call for every page in the range. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:38 -07:00
Rob Norris	99a5f5d1ba	ZIL: pass commit errors back to ITX callbacks ITX callbacks are used to signal that something can be cleaned up after a itx is committed. Presently that's only used when syncing out mapped pages (msync()) to mark dirty pages clean. This extends the callback interface so it can be passed an error, and take a different cleanup action if necessary. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:20 -07:00
Rob Norris	967b15b888	ZIL: allow zil_commit() to fail with error This changes zil_commit() to have an int return, and updates all callers to check it. There are no corresponding internal changes yet; it will always return 0. Since zil_commit() is an indication that the caller _really_ wants the associated data to be durability stored, I've annotated it with the __warn_unused_result__ compiler attribute (via __must_check), to emit a warning if it's ever ussd without doing something with the return code. I hope this will mean we never misuse it in the future. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:09 -07:00
Rob Norris	b270663e8a	linux/zvol_os: fix crash with blk-mq on Linux 4.19 `03987f71e3` (#16069) added a workaround to get the blk-mq hardware context for older kernels that don't cache it in the struct request. However, this workaround appears to be incomplete. In 4.19, the rq data context is optional. If its not initialised, then the cached rq->cpu will be -1, and so using it to index into mq_map causes a crash. Given that the upstream 4.19 is now in extended LTS and rarely seen, RHEL8 4.18+ has long carried "modern" blk-mq support, and the cached hardware context has been available since 5.1, I'm not going to huge lengths to get queue selection correct for the very few people that are likely to feel it. To that end, we simply call raw_smp_processor_id() to get a valid CPU id and use that instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17597	2025-08-08 09:39:14 -07:00
Rob Norris	82d6f7b047	Prefer VERIFY0P(n) over VERIFY3P(n, ==, NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:42 -07:00
Rob Norris	f7bdd84328	Prefer VERIFY0P(n) over VERIFY(n == NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:37 -07:00
Rob Norris	611b95da18	Prefer VERIFY0(n) over VERIFY3S(n, ==, 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:32 -07:00
Rob Norris	5c7df3bcac	Prefer VERIFY0(n) over VERIFY3U(n, ==, 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:25 -07:00
Rob Norris	c39e076f23	Prefer VERIFY0(n) over VERIFY(n == 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:40:59 -07:00
Rob Norris	c3496b5cc6	Linux: zfs_putpage: document (and fix!) confusing sync/commit modes The structure of zfs_putpage() and its callers is tricky to follow. There's a lot more we could do to improve it, but at least now we have some description of one of the trickier bits. Writing this exposed a very subtle bug: most async pages pushed out through zpl_putpages() would go to the ZIL with commit=false, which can yield a less-efficient write policy. So this commit updates that too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17584	2025-08-06 09:55:58 -07:00

1 2 3 4 5 ...

1032 Commits