mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-26 12:12:13 +03:00

Author	SHA1	Message	Date
Gleb Smirnoff	83f359245a	FreeBSD: fix build without kernel option MAC Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Gleb Smirnoff <glebius@FreeBSD.org> Closes #16446	2024-08-15 09:08:43 -07:00
Alan Somers	ed0db1cc8b	Make txg_wait_synced conditional in zfsvfs_teardown, for FreeBSD This applies the same change in #9115 to FreeBSD. This was actually the old behavior in FreeBSD 12; it only regressed when FreeBSD support was added to OpenZFS. As far as I can tell, the timeline went like this: * Illumos's zfsvfs_teardown used an unconditional txg_wait_synced * Illumos added the dirty data check [^4] * FreeBSD merged in Illumos's conditional check [^3] * OpenZFS forked from Illumos * OpenZFS removed the dirty data check in #7795 [^5] * @mattmacy forked the OpenZFS repo and began to add FreeBSD support * OpenZFS PR #9115[^1] recreated the same dirty data check that Illumos used, in slightly different form. At this point the OpenZFS repo did not yet have multi-OS support. * Matt Macy merged in FreeBSD support in #8987[^2] , but it was based on slightly outdated OpenZFS code. In my local testing, this vastly improves the reboot speed of a server with a large pool that has 1000 datasets and is resilvering an HDD. [^1]: https://github.com/openzfs/zfs/pull/9115 [^2]: https://github.com/openzfs/zfs/pull/8987 [^3]: https://github.com/freebsd/freebsd-src/commit/10b9d77bf1ccf2f3affafa6261692cb92cf7e992 [^4]: https://github.com/illumos/illumos-gate/commit/5aaeed5c617553c4cec6328c1f4c19079a5a495a [^5]: https://github.com/openzfs/zfs/pull/7795 Sponsored by: Axcient Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #16268	2024-08-09 14:32:59 -07:00
Mark Johnston	dea8fabf73	FreeBSD: Fix RLIMIT_FSIZE handling for block cloning ZFS implements copy_file_range(2) using block cloning when possible. This implementation must respect the RLIMIT_FSIZE limit. zfs_clone_range() already checks the limit, so it is safe to remove this check in zfs_freebsd_copy_file_range(). Moreover, the removed check produces false positives: the length passed to copy_file_range(2) may be larger than the input file size; as the man page notes, "for best performance, call copy_file_range() with the largest len value possible." In particular, some existing code passes SSIZE_MAX there. The check in zfs_clone_range() clamps the length to the input file's size before checking, but the removed check uses the caller supplied length, so something like $ echo a > /tmp/foo $ limits -f 1024 cat /tmp/foo > /tmp/bar fails because FreeBSD's cat(1) uses copy_file_range(2) in the manner described above. Reported-by: Philip Paeps <philip@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-08 10:43:02 -07:00
Rob Norris	670147be53	zvol: ensure device minors are properly cleaned up Currently, if a minor is in use when we try to remove it, we'll skip it and never come back to it again. Since the zvol state is hung off the minor in the kernel, this can get us into weird situations if something tries to use it after the removal fails. It's even worse at pool export, as there's now a vestigial zvol state with no pool under it. It's weirder again if the pool is subsequently reimported, as the zvol code (reasonably) assumes the zvol state has been properly setup, when it's actually left over from the previous import of the pool. This commit attempts to tackle that by setting a flag on the zvol if its minor can't be removed, and then checking that flag when a request is made and rejecting it, thus stopping new work coming in. The flag also causes a condvar to be signaled when the last client finishes. For the case where a single minor is being removed (eg changing volmode), it will wait for this signal before proceeding. Meanwhile, when removing all minors, a background task is created for each minor that couldn't be removed on the spot, and those tasks then wake and clean up. Since any new tasks are queued on to the pool's spa_zvol_taskq, spa_export_common() will continue to wait at export until all minors are removed. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14872 Closes #16364	2024-08-06 12:08:14 -07:00
Rob Norris	6c82951d11	FreeBSD: remove support for FreeBSD < 13.0-RELEASE (#16372 ) This includes the last 12.x release (now EOL) and 13.0 development versions (<1300139). Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-05 16:56:45 -07:00
Alexander Motin	cdd53fea1e	FreeBSD: Add missing memory reclamation accounting Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-08-05 09:21:29 -07:00
Alexander Motin	d4b5517ef9	Linux: Report reclaimable memory to kernel as such (#16385 ) Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to mark memory allocations that can be freed via shinker calls. It should allow kernel to tune and group such allocations for lower memory fragmentation and better reclamation under pressure. This patch marks as reclaimable most of ARC memory, directly evictable via ZFS shrinker, plus also dnode/znode/sa memory, indirectly evictable via kernel's superblock shrinker. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com>	2024-07-30 11:40:47 -07:00
Alexander Motin	55427add3c	Several improvements to ARC shrinking (#16197 ) - When receiving memory pressure signal from OS be more strict trying to free some memory. Otherwise kernel may come again and request much more. Return as result how much arc_c was actually reduced due to this request, that may be less than requested. - On Linux when receiving direct reclaim from some file system (that may be ZFS) instead of ignoring request completely, just shrink the ARC, but do not wait for eviction. Waiting there may cause deadlock. Ignoring it as before may put extra pressure on other caches and/or swap, and cause OOM if nothing help. While not waiting may result in more ARC evicted later, and may be too late if OOM killer activate right now, but I hope it to be better than doing nothing at all. - On Linux set arc_no_grow before waiting for reclaim, not after, or it may grow back while we are waiting. - On Linux add new parameter zfs_arc_shrinker_seeks to balance ARC eviction cost, relative to page cache and other subsystems. - Slightly update Linux arc_set_sys_free() math for new kernels. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-25 10:31:14 -07:00
Mark Johnston	4367312760	zvol: Fix suspend lock leaks (#16270 ) In several functions, we use a flag variable to track whether zv_suspend_lock is held. This flag was not getting reset in a particular case where we need to retry the underlying operation, resulting in a lock leak. Make sure to update the flag where necessary. Signed-off-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2024-07-10 14:27:44 -07:00
Mateusz Guzik	121a2d3354	FreeBSD: unregister mountroot eventhandler on unload Otherwise if zfs is unloaded and reroot is being used it trips over a stale pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: Rubicon Communications, LLC ("Netgate") Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #16242	2024-06-13 17:49:50 -07:00
bnovkov	20c8bdd85e	FreeBSD: Update use of UMA-related symbols in arc_available_memory Recent UMA changes repurposed the use of UMA_MD_SMALL_ALLOC in a way that breaks arc_available_memory on -CURRENT. This change ensures that arc_available_memory uses the new symbol while maintaining compatibility with older FreeBSD releases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Bojan Novković <bnovkov@FreeBSD.org> Closes #16230	2024-06-06 18:11:00 -07:00
Alexander Motin	efbef9e6cc	FreeBSD: Add zfs_link_create() error handling Originally Solaris didn't expect errors there, but they may happen if we fail to add entry into ZAP. Linux fixed it in #7421, but it was never fully ported to FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13215 Closes #16138	2024-05-16 17:56:55 -07:00
Rob Norris	3c941d1818	zdb/ztest: send dbgmsg output to stderr And, make the output fd an arg to zfs_dbgmsg_print(). This is a change in behaviour, but keeps it consistent with where crash traces go, and it's easy to argue this is what we want anyway; this is information about the task, not the actual output of the task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16181	2024-05-14 09:49:00 -07:00
Rob Norris	fa99d9cd9c	zfs_dbgmsg_print: make FreeBSD and Linux consistent FreeBSD was using fprintf(), which might not be signal-safe. Meanwhile, Linux's locking did not cover the header output. This two quirks are unrelated, but both have the same response: be like the other one. So with this commit, both functions are the same except for the names of their lock and list variables. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16181	2024-05-14 09:48:56 -07:00
chenqiuhao1997	41ae864b69	Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN. In P2ALIGN, the result would be incorrect when align is unsigned integer and x is larger than max value of the type of align. In that case, -(align) would be a positive integer, which means high bits would be zero and finally stay zero after '&' when align is converted to a larger integer type. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Youzhong Yang <yyang@mathworks.com> Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com> Closes #15940	2024-05-10 08:47:21 -07:00
Alan Somers	21bc066ece	Fix updating the zvol_htable when renaming a zvol When renaming a zvol, insert it into zvol_htable using the new name, not the old name. Otherwise some operations won't work. For example, "zfs set volsize" while the zvol is open. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: Alan Somers <asomers@FreeBSD.org> Closes #16127 Closes #16128	2024-04-25 14:24:52 -07:00
Rob Norris	d7605ae77b	zio: rename ZIO_TYPE_IOCTL to ZIO_TYPE_FLUSH The only possible ioctl is a flush, and any other kind of meta-operation introduced in the future is likely to have different semantics (much like trim did). So, lets just call it what it is. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16064	2024-04-11 17:17:23 -07:00
Rob Norris	c9c838aa1f	zio: remove io_cmd and DKIOCFLUSHWRITECACHE There's no other options, so we can just always assume its a flush. Includes some light refactoring where a switch statement was doing control flow that no longer works. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16064	2024-04-11 17:17:11 -07:00
Rob Norris	390b448726	abd: add page iterator The regular ABD iterators yield data buffers, so they have to map and unmap pages into kernel memory. If the caller only wants to count chunks, or can use page pointers directly, then the map/unmap is just unnecessary overhead. This adds adb_iterate_page_func, which yields unmapped struct page instead. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #15533 Closes #15588	2024-03-25 16:50:35 -07:00
Brian Behlendorf	6dccdf501e	BRT: Fix FICLONE/FICLONERANGE shortened copy On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls are expected to either fully clone the specified range or return an error. The range may be for an entire file. While internally ZFS supports cloning partial ranges there's no way to return the length cloned to the caller so we need to make this all or nothing. As part of this change support for the REMAP_FILE_CAN_SHORTEN flag has been added. When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range() will return a shortened range when encountering pending dirty records. When it's clear zfs_clone_range() will block and wait for the records to be written out allowing the blocks to be cloned. Furthermore, the file range lock is held over the region being cloned to prevent it from being modified while cloning. This doesn't quite provide an atomic semantics since if an error is encountered only a portion of the range may be cloned. This will be converted to an error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the caller. However, the destination file range is left in an undefined state. A test case has been added which exercises this functionality by verifying that `cp --reflink=never\|auto\|always` works correctly. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #15728 Closes #15842	2024-02-05 16:44:45 -08:00
Pawel Jakub Dawidek	a4bf6baaeb	Fix file descriptor leak on pool import. Descriptor leak can be easily reproduced by doing: # zpool import tank # sysctl kern.openfiles # zpool export tank; zpool import tank # sysctl kern.openfiles We were leaking four file descriptors on every import. Similar leak most likely existed when using file-based VDEVs. External-issue: https://reviews.freebsd.org/D43529 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #15630	2024-01-23 15:03:48 -08:00
Alexander Motin	9b1677fb5a	dmu: Allow buffer fills to fail When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one. On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance. This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not not before it, that would happen if we just call dbuf_undirty(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665	2023-12-15 09:51:41 -08:00
Mark Johnston	11656234b5	FreeBSD: Ensure that zfs_getattr() initializes the va_rdev field Otherwise the field is left uninitialized, leading to a possible kernel memory disclosure to userspace or to the network. Use the same initialization value we use in zfsctl_common_getattr(). Reported-by: KMSAN Sponsored-by: The FreeBSD Foundation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ed Maste <emaste@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #15639	2023-12-07 08:20:11 -08:00
Alexander Motin	2a27fd4111	ZIL: Assert record sizes in different places This should make sure we have log written without overflows. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15517	2023-11-28 13:35:14 -08:00
rmacklem	acb33ee1c1	FreeBSD: Fix ZFS so that snapshots under .zfs/snapshot are NFS visible Call vfs_exjail_clone() for mounts created under .zfs/snapshot to fill in the mnt_exjail field for the mount. If this is not done, the snapshots under .zfs/snapshot with not be accessible over NFS. This version has the argument name in vfs.h fixed to match that of the name in spl_vfs.c, although it really does not matter. External-issue: https://reviews.freebsd.org/D42672 Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #15563	2023-11-27 16:31:03 -08:00
Alan Somers	126efb5889	FreeBSD: Fix the build on FreeBSD 12 It was broken for several reasons: * VOP_UNLOCK lost an argument in 13.0. So OpenZFS should be using VOP_UNLOCK1, but a few direct calls to VOP_UNLOCK snuck in. * The location of the zlib header moved in 13.0 and 12.1. We can drop support for building on 12.0, which is EoL. * knlist_init lost an argument in 13.0. OpenZFS change `9d0887402b` assumed 13.0 or later. * FreeBSD 13.0 added copy_file_range, and OpenZFS change `67a1b03791` assumed 13.0 or later. Sponsored-by: Axcient Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #15551	2023-11-27 12:58:03 -08:00
Alexander Motin	a490875103	ZIL: Refactor TX_WRITE encryption similar to TX_CLONE_RANGE It should be purely textual change to make the code more readable. Should cause no functional difference. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513	2023-11-27 09:56:30 -08:00
Alexander Motin	27d8c23c58	ZIL: Do not encrypt block pointers in lr_clone_range_t In case of crash cloned blocks need to be claimed on pool import. It is only possible if they (lr_bps) and their count (lr_nbps) are not encrypted but only authenticated, similar to block pointer in lr_write_t. Few other fields can be and are still encrypted. This should fix panic on ZIL claim after crash when block cloning is actively used. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Reviewed-by: Sean Eric Fagan <sef@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15543 Closes #15513	2023-11-27 09:53:32 -08:00
Brooks Davis	cd67bc0ae4	freebsd: remove __FBSDID macro use With FreeBSD's switch to git the $FreeBSD$ string is no longer expanded and they have mostly been removed upstream. Stop using __FBSDID and remove the no-longer needed sys/cdefs.h includes. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brooks Davis <brooks.davis@sri.com> Closes #15527	2023-11-17 14:02:09 -08:00
Alexander Motin	22c8c33a58	Use abd_zero_off() where applicable In several places abd_zero() cleaned ABD filled at the next line. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15514	2023-11-17 13:28:32 -08:00
Rich Ercolani	03e9caaec0	Add a tunable to disable BRT support. Copy the disable parameter that FreeBSD implemented, and extend it to work on Linux as well, until we're sure this is stable. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #15529	2023-11-16 11:35:22 -08:00
Don Brady	5caeef02fa	RAID-Z expansion feature This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks). == Initiating expansion == A new device (disk) can be attached to an existing RAIDZ vdev, by running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes. The `feature@raidz_expansion` on-disk feature flag must be `enabled` to initiate an expansion, and it remains `active` for the life of the pool. In other words, pools with expanded RAIDZ vdevs can not be imported by older releases of the ZFS software. == During expansion == The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device). The expansion progress can be monitored with `zpool status`. Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete). The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off. == After expansion == When the expansion completes, the additional space is available for use, and is reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc). Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion). A RAIDZ vdev can be expanded multiple times. After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to `zfs list`, `df`, `ls -s`, and similar tools. Sponsored-by: The FreeBSD Foundation Sponsored-by: iXsystems, Inc. Sponsored-by: vStack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Authored-by: Matthew Ahrens <mahrens@delphix.com> Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com> Contributions-by: Stuart Maybee <stuart.maybee@comcast.net> Contributions-by: Thorsten Behrens <tbehrens@outlook.com> Contributions-by: Fmstrat <nospam@nowsci.com> Contributions-by: Don Brady <dev.fs.zfs@gmail.com> Signed-off-by: Don Brady <dev.fs.zfs@gmail.com> Closes #15022	2023-11-08 10:19:41 -08:00
Gordon Tetlow	dc45a00eac	Add kern.features.zfs Add a ZFS feature flag to indicate OpenZFS availability. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gordon Tetlow <gordon@freebsd.org> Closes #15484	2023-11-07 13:21:56 -08:00
Alan Somers	e36ff84c33	Update the kstat dataset_name when renaming a zvol Add a dataset_kstats_rename function, and call it when renaming a zvol on FreeBSD and Linux. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored-by: Axcient Closes #15482 Closes #15486	2023-11-07 11:34:50 -08:00
Alexander Motin	799e09f75a	Unify arc_prune_async() code There is no sense to have separate implementations for FreeBSD and Linux. Make Linux code shared as more functional and just register FreeBSD-specific prune callback with arc_add_prune_callback() API. Aside of code cleanup this should fix excessive pruning on FreeBSD: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15456	2023-10-30 16:56:04 -07:00
Alexander Motin	c3773de168	ZIL: Cleanup sync and commit handling ZVOL: - Mark all ZVOL ZIL transactions as sync. Since ZVOLs have only one object, it makes no sense to maintain async queue and on each commit merge it into sync. Single sync queue is just cheaper, while it changes nothing until actual commit request arrives. - Remove zsd_sync_cnt and the zil_async_to_sync() calls since we are no longer switching between sync and async queues. ZFS: - Mark write transactions as sync based only on number of sync opens (z_sync_cnt). We can not randomly jump between sync and async unless we want data corruptions due to writes reordering. - When file first opened with O_SYNC (z_sync_cnt incremented to 1) call zil_async_to_sync() for it to preserve correct ordering between past and future writes. - Drop zfs_fsyncer_key logic. Looks like it was an optimization for workloads heavily intermixing async writes with tons of fsyncs. But first it was broken 8 years ago due to Linux tsd implementation not allowing data storage between syscalls, and second, I doubt it is safe to switch from async to sync so often and without calling zil_async_to_sync(). - Rename sync argument of *_log_write() into commit, now only signalling caller's intent to call zil_commit() soon after. It allows WR_COPIED optimizations without extra other meanings. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15366	2023-10-30 14:51:56 -07:00
Alexander Motin	008baa091f	FreeBSD: Reduce divergence from in-tree sources This includes random small tweaks, primarily a build fixes, required when ZFS is built as part of FreeBSD base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15368	2023-10-09 13:27:18 -07:00
Alexander Motin	342357cd9e	Reduce number of metaslab preload taskq threads. Before this change ZFS created threads for 50% of CPUs for each top- level vdev. Plus it created the same number of threads for embedded log groups (that have only one metaslab and don't need any preload). As result, on system with 80 CPUs and pool of 60 vdevs this resulted in 4800 metaslab preload threads, that is absolutely insane. This patch changes the preload threads to 50% of CPUs in one taskq per pool, so on the mentioned system it will be only 40 threads. Among other things this fixes zdb on the mentioned system and pool on FreeBSD, that failed to create so many threads in one process. Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15319	2023-10-06 09:04:00 -07:00
Mateusz Guzik	ee720ad7bc	Retire z_nr_znodes Added in `ab26409db7` ("Linux 3.1 compat, super_block->s_shrink"), with the only consumer which needed the count getting retired in `066e825221` ("Linux compat: Minimum kernel version 3.10"). The counter gets in the way of not maintaining the list to begin with. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #15274	2023-09-18 16:53:33 -07:00
ednadolski-ix	95f71c019d	Selectable block allocators ZFS historically has had several space allocators that were dynamically selectable. While these have been retained in OpenZFS, only a single allocator has been statically compiled in. This patch compiles all allocators for OpenZFS and provides a module parameter to allow for manual selection between them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com> Closes #15218	2023-09-01 18:00:30 -07:00
Rob N	cae502c175	copy_file_range: fix fallback when source create on same txg In `019dea0a5` we removed the conversion from EAGAIN->EXDEV inside zfs_clone_range(), but forgot to add a test for EAGAIN to the copy_file_range() entry points to trigger fallback to a content copy. This commit fixes that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kay Pedersen <mail@mkwg.de> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #15170 Closes #15172	2023-08-14 17:34:14 -07:00
oromenahar	019dea0a55	zfs_clone_range should return a descriptive error codes Return the more descriptive error codes instead of `EXDEV` when the parameters don't match the requirements of the clone function. Updated the comments in `brt.c` accordingly. The first three errors are just invalid parameters, which zfs can not handle. The fourth error indicates that the block which should be cloned is created and cloned or modified in the same transaction group (`txg`). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15148	2023-08-08 09:37:06 -07:00
Alexander Motin	6c94e64963	Refactor dmu_prefetch(). - Split dmu_prefetch_dnode() from dmu_prefetch() into a separate function. It is quite inconvenient to read the code where len = 0 means dnode prefetch instead indirect/data prefetch. One function doing both has no benefits, since the code paths are independent. - Improve dmu_prefetch() handling of long block ranges. Instead of limiting L0 data length to prefetch for to dmu_prefetch_max, make dmu_prefetch_max limit the actual amount of prefetch at the specified level, and, if there is more, prefetch all the rest at higher indirection level. It should improve random access times within the prefetched range of any length, reducing importance of specific dmu_prefetch_max value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15076	2023-08-07 13:54:41 -07:00
oromenahar	5bdfff5cfc	BRT should return EOPNOTSUPP Return the more descriptive EOPNOTSUPP instead of EXDEV when the storage pool doesn't support block cloning. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Kay Pedersen <mail@mkwg.de> Closes #15097	2023-07-27 11:32:34 -07:00
Alexander Motin	736d5962b4	FreeBSD: Fix build on stable/13 after 1302506. Starting approximately from version 1302506 vn_lock_pair() grown two additional arguments following head. There is a one week hole, but that is closet reference point we have. Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15047	2023-07-13 08:50:34 -07:00
Alexander Motin	70ea484e3e	Finally drop long disabled vdev cache. It was a vdev level read cache, designed to aggregate many small reads by speculatively issuing bigger reads instead and caching the result. But since it has almost no idea about what is going on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers, it was found to make more harm than good, for which reason it was disabled for the past 12 years. These days we have much better instruments to enlarge the I/Os, such as speculative and prescient prefetches, I/O scheduler, I/O aggregation etc. Besides just the dead code removal this removes one extra mutex lock/unlock per write inside vdev_cache_write(), not otherwise disabled and trying to do some work. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14953	2023-06-09 12:40:55 -07:00
Alexander Motin	b3ad3f48d9	Use list_remove_head() where possible. ... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14955	2023-06-09 10:12:52 -07:00
Rob Norris	2b9f8ba673	znode: expose zfs_get_zplprop to libzpool There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header its imported from. Sponsored-By: Klara, Inc. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #14642	2023-06-05 11:53:44 -07:00
Ameer Hamza	14ba8ab97d	Prevent panic during concurrent snapshot rollback and zvol read Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent release of the dnode, avoiding panic when a snapshot is rolled back in parallel during ongoing zvol read operation. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14839	2023-05-09 17:56:35 -07:00
Brian Behlendorf	b5411618f7	Fix checkstyle warning Resolve a missed checkstyle warning. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14799	2023-04-26 11:49:16 -07:00

1 2 3 4 5 ...

339 Commits