mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Richard Yao	d5d10f2aef	Cleanup dead spa_boot code Unused code detected by coverity. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13868	2022-09-13 16:40:10 -07:00
Richard Yao	e5327e7f97	vdev_draid_lookup_map() should not iterate outside draid_maps Coverity reported this as an out-of-bounds read. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13865	2022-09-12 12:51:17 -07:00
Richard Yao	13f2b8fb92	Fix use-after-free in btree code Coverty static analysis found these. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #10989 Closes #13861	2022-09-12 11:22:15 -07:00
Richard Yao	0e4c830bc1	Cleanup: Use OpenSolaris functions to call scheduler In our codebase, `cond_resched() and `schedule()` are Linux kernel functions that have replaced the OpenSolaris `kpreempt()` functions in the codebase to such an extent that `kpreempt()` in zfs_context.h was broken. Nobody noticed because we did not actually use it. The header had defined `kpreempt()` as `yield()`, which works on OpenSolaris and Illumos where `sched_yield()` is a wrapper for `yield()`, but that does not work on any other platform. The FreeBSD platform specific code implemented shims for these, but the shim for `schedule()` forced us to wait, which is different than merely rescheduling to another thread as the original Linux code does, while the shim for `cond_resched()` had the same definition as its kernel kpreempt() shim. After studying this, I have concluded that we should reintroduce the kpreempt() function in platform independent code with the following definitions: - In the Linux kernel: kpreempt(unused) -> cond_resched() - In the FreeBSD kernel: kpreempt(unused) -> kern_yield(PRI_USER) - In userspace: kpreempt(unused) -> sched_yield() In userspace, nothing changes from this cleanup. In the kernels, the function `fm_fini()` will now call `kern_yield(PRI_USER)` on FreeBSD and `cond_resched()` on Linux. This is instead of `pause("schedule", 1)` on FreeBSD and `schedule()` on Linux. This makes our behavior consistent across platforms. Note that Linux's SPL continues to use `cond_resched()` and `schedule()`. However, those functions have been removed from both the FreeBSD code and userspace code. This should have the benefit of making it slightly easier to port the code to new platforms by making how things should be mapped less confusing. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Neal Gompa <ngompa@datto.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13845	2022-09-12 09:55:37 -07:00
Alexander Motin	37f6845c6f	Improve too large physical ashift handling When iterating through children physical ashifts for vdev, prefer ones above the maximum logical ashift, that we can actually use, but within the administrator defined maximum. When selecting top-level vdev ashift, do not set it to the defined maximum in case physical ashift is even higher, but just ignore one. Using the maximum does not prevent misaligned writes, but reduces space efficiency. Since ZFS tries to write data sequentially and aggregates the writes, in many cases large misanigned writes may be not as bad as the space penalty otherwise. Allow internal physical ashifts for vdevs higher than SHIFT_MAX. May be one day allocator or aggregation could benefit from that. Reduce zfs_vdev_max_auto_ashift default from 16 (64KB) to 14 (16KB), so that ZFS may still use bigger ashifts up to SHIFT_MAX (64KB), but only if it really has to or explicitly told to, but not as an "optimization". There are some read-intensive NVMe SSDs that report Preferred Write Alignment of 64KB, and attempt to build RAIDZ2 of those leads to a space inefficiency that can't be justified. Instead these changes make ZFS fall back to logical ashift of 12 (4KB) by default and only warn user that it may be suboptimal for performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #13798	2022-09-08 10:30:53 -07:00
Richard Yao	11df48ab8b	Cleanup Raid-Z Typo fixes Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13834	2022-09-06 09:43:21 -07:00
Umer Saleem	59767479ac	Add DD_FIELD string for snapshots_changed property This commit adds DD_FIELD string used in extensified dsl_dir zap object for snapshots_changed property. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13819	2022-09-02 13:33:50 -07:00
Andriy Gapon	ee9f3bca55	Add zfs.sync.snapshot_rename Only the single snapshot rename is provided. The recursive or more complex rename can be scripted. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #13802	2022-09-02 13:31:19 -07:00
Alexander Motin	f933b3fd4d	Apply arc_shrink_shift to ARC above arc_c_min It makes sense to free memory in smaller chunks when approaching arc_c_min to let other kernel subsystems to free more, since after that point we can't free anything. This also matches behavior on Linux, where to shrinker reported only the size above arc_c_min. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #13794	2022-09-02 13:21:18 -07:00
Brian Behlendorf	9f346abbe8	Revert "Avoid panic with recordsize > 128k, raw sending and no large_blocks" This reverts commit `80a650b7bb`. This change inadvertently introduced a regression in ztest where one of the new ASSERTs is triggered in dsl_scan_visitbp(). Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #12275 Closes #13799	2022-08-25 13:33:32 -07:00
Umer Saleem	a582d52993	Updates for snapshots_changed property Currently, snapshots_changed property is stored in dd_props_zapobj, due to which the property is assumed to be local. This causes a difference in behavior with respect to other readonly properties. This commit stores the snapshots_changed property in dd_object. Source is not set to local in this case, which makes it consistent with other readonly properties. This commit also updates the date string format to include seconds. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13785	2022-08-24 14:20:43 -07:00
George Amanakis	0c4064d9a0	Fix zpool status in case of unloaded keys When scrubbing an encrypted filesystem with unloaded key still report an error in zpool status. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13675 Closes #13717	2022-08-22 17:42:01 -07:00
Paul Dagnelie	17e212652d	Prevent zevent list from consuming all of kernel memory There are a couple changes included here. The first is to introduce a cap on the size the ZED will grow the zevent list to. One million entries is more than enough for most use cases, and if you are overflowing that value, the problem needs to be addressed another way. The value is also tunable, for those who want the limit to be higher or lower. The other change is to add a kernel module parameter that allows snapshot creation/deletion to be exempted from the history logging; for most workloads, having these things logged is valuable, but for some workloads it produces large quantities of log spam and isn't especially helpful. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Issue #13374 Closes #13753	2022-08-22 12:36:22 -07:00
Christian Schwarz	91983265b6	Add comment on acb_zio_dummy Thanks to George Wilson for clarifying this on Slack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #13698	2022-08-08 16:55:13 -07:00
Paul Dagnelie	673aa7e6cf	Don't double-zero buffers in fault management nvlists This is a small cleanup for a trivial problem which happened to be noticed while another issue was being investigated. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #13730	2022-08-04 16:53:47 -07:00
Umer Saleem	9681de4657	Add snapshots_changed as property Make dd_snap_cmtime property persistent across mount and unmount operations by storing in ZAP and restore the value from ZAP on hold into dd_snap_cmtime instead of updating it. Expose dd_snap_cmtime as 'snapshots_changed' property that provides a mechanism to quickly determine whether snapshot list for dataset has changed without having to mount a dataset or iterate the snapshot list. It specifies the time at which a snapshot for a dataset was last created or deleted. This allows us to be more efficient how often we query snapshots. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13635	2022-08-02 16:45:30 -07:00
Tino Reichardt	68aa3379ec	Skip checksum benchmarks on systems with slow cpu The checksum benchmarking on module load may take a really long time on embedded systems with a slow cpu. Avoid all benchmarks >= 1MiB on systems, where EdonR is slower then 300 MiB/s. This limit is currently hardcoded via the define LIMIT_PERF_MBS. This is the new benchmark output of a slow Intel Atom: ``` implementation 1k 4k 16k 64k 256k 1m 4m 16m edonr-generic 209 257 268 259 262 0 0 0 skein-generic 129 150 151 150 150 0 0 0 sha256-generic 50 55 56 56 56 0 0 0 sha512-generic 76 86 88 89 88 0 0 0 blake3-generic 63 62 62 62 61 0 0 0 blake3-sse2 114 292 301 307 309 0 0 0 ``` Reviewed-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13695	2022-08-01 09:51:45 -07:00
Alek P	e8cf3a4f76	Implement a new type of zfs receive: corrective receive (-c) This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file for example). With the provided send stream, corrective receive will read from disk blocks described by the WRITE records. When any of the reads come back with ECKSUM we use the data from the corresponding WRITE record to rewrite the corrupted block. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Alek Pinchuk <apinchuk@axcient.com> Closes #9372	2022-07-28 15:52:46 -07:00
Ameer Hamza	3a1ce49141	Add createtxg sort support for simple snapshot iterator - When iterating snapshots with name only, e.g., "-o name -s name", libzfs uses simple snapshot iterator and results are displayed in alphabetic order. This PR adds support for faster version of createtxg sort by avoiding nvlist parsing for properties. Flags "-o name -s createtxg" will enable createtxg sort while using simple snapshot iterator. - Added support to read createtxg property directly from zfs handle for filesystem, volume and snapshot types instead of parsing nvlist. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13577	2022-07-25 14:04:46 -07:00
ixhamza	fb087146de	Add support for per dataset zil stats and use wmsum counters ZIL kstats are reported in an inclusive way, i.e., same counters are shared to capture all the activities happening in zil. Added support to report zil stats for every datset individually by combining them with already exposed dataset kstats. Wmsum uses per cpu counters and provide less overhead as compared to atomic operations. Updated zil kstats to replace wmsum counters to avoid atomic operations. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #13636	2022-07-20 17:14:06 -07:00
Alexander Motin	33dba8c792	Fix scrub resume from newly created hole It may happen that scan bookmark points to a block that was turned into a part of a big hole. In such case dsl_scan_visitbp() may skip it and dsl_scan_check_resume() will not be called for it. As result new scan suspend won't be possible until the end of the object, that may take hours if the object is a multi-terabyte ZVOL on a slow HDD pool, stretching TXG to all that time, creating all sorts of problems. This patch changes the resume condition to any greater or equal block, so even if we miss the bookmarked block, the next one we find will delete the bookmark, allowing new suspend. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13643	2022-07-20 17:02:36 -07:00
Tino Reichardt	97fd1ea42a	Fix memory allocation for the checksum benchmark Allocation via kmem_cache_alloc() is limited to less then 4m for some architectures. This commit limits the benchmarks with the linear abd cache to 1m on all architectures and adds 4m + 16m benchmarks via non-linear abd_alloc(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13669 Closes #13670	2022-07-20 17:01:32 -07:00
Tino Reichardt	1d3ba0bf01	Replace dead opensolaris.org license link The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13619	2022-07-11 14:16:13 -07:00
Finix1979	cb01da6805	Call nvlist_free before return Fixes a small kernel memory leak which would occur if a pool failed to import because the `DMU_POOL_VDEV_ZAP_MAP` key can't be read from a presumably damaged MOS config. In the case of a missing key there was no leak. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Finix1979 <yancw@info2soft.com> Closes #13629	2022-07-07 11:43:58 -07:00
Alexander Motin	74230a5bc1	Avoid memory copy when verifying raidz/draid parity Before this change for every valid parity column raidz_parity_verify() allocated new buffer and copied there existing data, then recalculated the parity and compared the result with the copy. This patch removes the memory copy, simply swapping original buffer pointers with newly allocated empty ones for parity recalculation and comparison. Original buffers with potentially incorrect parity data are then just freed, while new recalculated ones are used for repair. On a pool of 12 4-wide raidz vdevs, storing 1.5TB of 16MB blocks, this change reduces memory traffic during scrub by 17% and total unhalted CPU time by 25%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13613	2022-07-05 16:27:29 -07:00
Alexander Motin	1ac7d194e5	Avoid memory copies during mirror scrub Issuing several scrub reads for a block we may use the parent ZIO buffer for one of child ZIOs. If that read complete successfully, then we won't need to copy the data explicitly. If block has only one copy (typical for root vdev, which is also a mirror inside), then we never need to copy -- succeed or fail as-is. Previous code also copied data from buffer of every successfully completed child ZIO, but that just does not make any sense. On healthy N-wide mirror this saves all N+1 (or even more in case of ditto blocks) memory copies for each scrubbed block, allowing CPU to focus mostly on check-summing. For other vdev types it should save one memory copy per block copy at root vdev. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13606	2022-07-05 16:26:20 -07:00
George Amanakis	34e5423f83	Fix dnode byteswapping If a dnode has a spill pointer, and we use DN_SLOTS_TO_BONUSLEN() then we will possibly include the spill pointer in the len calculation and it will be byteswapped. Then dnode_byteswap() will carry on and swap the spill pointer again. Fix this by using DN_MAX_BONUS_LEN() instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13002 Closes #13015	2022-06-29 17:06:16 -07:00
наб	dd66857d92	Remaining {=> const} char\|void *tag Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-06-29 14:08:59 -07:00
наб	a926aab902	Enable -Wwrite-strings Also, fix leak from ztest_global_vars_to_zdb_args() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13348	2022-06-29 14:08:54 -07:00
Alexander Motin	827322991f	Fix and disable blocks statistics during scrub Block statistics calculation during scrub I/O issue in case of sorted scrub accounted ditto blocks several times. Embedded blocks on other side were not accounted at all. This change moves the accounting from issue to scan stage, that fixes both problems and also allows to avoid pool-wide locking and the lock contention it created. Since this statistics is quite specific and is not even exposed now anywhere, disable its calculation by default to not waste CPU time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13579	2022-06-28 11:23:31 -07:00
Brian Behlendorf	c175f5ebb2	Fix -Wuse-after-free warning in dbuf_destroy() Move the use of the db pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no reason we can't invert the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_destroy': module/zfs/dbuf.c:2953:17: error: pointer 'db' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:24 -07:00
Brian Behlendorf	9619bcdefb	Fix -Wuse-after-free warning in dbuf_issue_final_prefetch_done() Move the use of the private pointer after it is freed. It's only used as a tag so a dereference would never occur, but there's no harm in inverting the order to resolve the warning. module/zfs/dbuf.c: In function 'dbuf_issue_final_prefetch_done': module/zfs/dbuf.c:3204:17: error: pointer 'private' may be used after 'free' [-Werror=use-after-free] Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:19 -07:00
Brian Behlendorf	ff7e405f83	Fix -Wattribute-warning in dsl layer The memcpy(), memmove(), and memset() functions have been annotated to perform bounds checking when using FORTIFY_SOURCE. A warning is now generted when writing beyond the end of the specified field. Alternately, the new struct_group() macro could be used to create an anonymous union member for use by memcpy(). However, since this is the only place the macro would be helpful it's preferable to restructure the code slights to avoid the need for additional compatibility code when the macro does not exist. https://lore.kernel.org/lkml/20211118183807.1283332-1-keescook@chromium.org/T/ Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:19:12 -07:00
Brian Behlendorf	b0f7dd276c	Fix -Wattribute-warning in zfs_log_xvattr() Restructure the code in zfs_log_xvattr() to use a lr_attr_end structure when accessing lr_attr_t elements located after the variable sized array. This makes the code more understandable and resolves the accessing beyond the end of the field warnings. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13528 Closes #13575	2022-06-27 14:18:57 -07:00
George Amanakis	80a650b7bb	Avoid panic with recordsize > 128k, raw sending and no large_blocks The current codebase does not support raw sending buffers with block size > 128kB when large_blocks is not active. This can happen in the codepath dsl_dataset_sync()->dmu_objset_sync()->zio_nowait() which calls back dmu_objset_write_done()->dsl_dataset_block_born(). If dsl_dataset_sync() completes its run before dsl_dataset_block_born() is called, we will end up not activating some of the necessary flags, while having blocks based on those flags written in the filesystem. A subsequent send will then panic. Fix this by directly deciding in dmu_objset_sync() whether these flags need to be activated later by dsl_dataset_sync(). Instead of panicking due to a NULL pointer dereference in dmu_dump_write() in case of a send, print out an error message. Also during scrub verify there are no contradicting filesystem flags. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12275 Closes #12438	2022-06-27 14:17:25 -07:00
Alexander Motin	1cd72b9c13	Avoid two 64-bit divisions per scanned block Change math to make it like the ARC, using multiplications instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13591	2022-06-27 11:08:21 -07:00
Alexander Motin	c0bf952c84	Several B-tree optimizations - Introduce first element offset within a leaf. It allows to reduce by ~50% average memmove() size when adding/removing elements. If the added/removed element is in the first half of the leaf, we may shift elements before it and adjust the bth_first instead of moving more elements after it. - Use memcpy() instead of memmove() when we know there is no overlap. - Switch from uint64_t to uint32_t. It does not limit anything, but 32-bit arches should appreciate it greatly in hot paths. - Store leaf capacity in struct btree to avoid 64-bit divisions. - Adjust zfs_btree_insert_into_leaf() to always result in balanced leaves after splitting, no matter where the new element was inserted. Not that we care about it much, but it should also allow B-trees with as little as two elements per leaf instead of 4 previously. When scrubbing pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks this reduces amount of time spent in memmove() inside the scan thread from 13.7% to 5.7% and total scrub time by ~15 seconds out of 9 minutes. It should also reduce spacemaps load time, but I haven't measured it. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13582	2022-06-24 13:55:58 -07:00
Alexander Motin	1c0c729ab4	Several sorted scrub optimizations - Reduce size and comparison complexity of q_exts_by_size B-tree. Previous code used two 64-bit divisions and many other operations to compare two B-tree elements. It created enormous overhead. This implementation moves the math to the upper level and stores the score in the B-tree elements themselves. Since all that we need to store in that B-tree is the extent score and offset, those can fit into single 8 byte value instead of 24 bytes of q_exts_by_addr element and can be compared with single operation. - Better decouple secondary tree logic from main range_tree by moving rt_btree_ops and related functions into dsl_scan.c as ext_size_ops. Those functions are very small to worry about the code duplication and range_tree does not need to know details such as rt_btree_compare. - Instead of accounting number of pending bytes per pool, that needs atomic on global variable per block, account the number of non-empty per-vdev queues, that change much more rarely. - When extent scan is interrupted by TXG end, continue it in the next TXG instead of selecting next best extent. It allows to avoid leaving one truncated (and so likely not the best any more) extent each TXG. On top of some other optimizations this saves about 1.5 minutes out of 10 to scrub pool of 12 SSDs, storing 1.5TB of 4KB zvol blocks. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <caputit1@tcnj.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13576	2022-06-24 09:50:37 -07:00
Brian Behlendorf	ad8b9f940c	Scrub mirror children without BPs When scrubbing a raidz/draid pool, which contains a replacing or sparing mirror with multiple online children, only one child will be read. This is not normally a serious concern because the DTL records are used to determine where a good copy of the data is. As long as the data can be read from one child the mirror vdev will use it to repair gaps in any of its children. Furthermore, even if the data which was read is corrupt the raidz code will detect this and issue its own repair I/O to correct the damage in the mirror vdev. However, in the scenario where the DTL is wrong due to silent data corruption (say due to overwriting one child) and the scrub happens to read from a child with good data, then the other damaged mirror child will not be detected nor repaired. While this is possible for both raidz and draid vdevs, it's most pronounced when using draid. This is because by default the zed will sequentially rebuild a draid pool to a distributed spare, and the distributed spare half of the mirror is always preferred since it delivers better performance. This means the damaged half of the mirror will go undetected even after scrubbing. For system administrations this behavior is non-intuitive and in a worst case scenario could result in the only good copy of the data being unknowingly detached from the mirror. This change resolves the issue by reading all replacing/sparing mirror children when scrubbing. When the BP isn't available for verification, then compare the data buffers from each child. They must all be identical, if not there's silent damage and an error is returned to prompt the top-level vdev to issue a repair I/O to rewrite the data on all of the mirror children. Since we can't tell which child was wrong a checksum error is logged against the replacing or sparing mirror vdev. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13555	2022-06-23 10:36:28 -07:00
Tino Reichardt	deb1213098	Fix memory allocation issue for BLAKE3 context The kmem_alloc(sizeof (*ctx), KM_NOSLEEP) call on FreeBSD can't be used in this code segment. Work around this by pre-allocating a percpu context array for later use. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13568	2022-06-21 14:32:09 -07:00
Alexander Motin	dd8671459f	Reduce ZIO io_lock contention on sorted scrub During sorted scrub multiple threads (one per vdev) are issuing many ZIOs same time, all using the same scn->scn_zio_root ZIO as parent. It causes huge lock contention on the single global lock on that ZIO. Improve it by introducing per-queue null ZIOs, children to that one, and using them instead as proxy. For 12 SSD pool storing 1.5TB of 4KB blocks on 80-core system this dramatically reduces lock contention and reduces scrub time from 21 minutes down to 12.5, while actual read stages (not scan) are about 3x faster, reaching 100K blocks per second per vdev. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13553	2022-06-15 14:25:08 -07:00
Allan Jude	4ff7a8fa2f	Replace ZPROP_INVAL with ZPROP_USERPROP where it means a user property Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Klara Inc. Closes #12676	2022-06-14 11:27:53 -07:00
Alexander Motin	87b46d63b2	Improve sorted scan memory accounting Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should count 2x sizeof (range_seg_gap_t) per node. And since average B-tree memory efficiency is about 75%, we should increase it to 3x. Previous code under-counted up to 30% of the memory usage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13537	2022-06-10 10:01:46 -07:00
Tino Reichardt	985c33b132	Introduce BLAKE3 checksums as an OpenZFS feature This commit adds BLAKE3 checksums to OpenZFS, it has similar performance to Edon-R, but without the caveats around the latter. Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3 Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3 Short description of Wikipedia: BLAKE3 is a cryptographic hash function based on Bao and BLAKE2, created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real World Crypto. BLAKE3 is a single algorithm with many desirable features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and BLAKE2, which are algorithm families with multiple variants. BLAKE3 has a binary tree structure, so it supports a practically unlimited degree of parallelism (both SIMD and multithreading) given enough input. The official Rust and C implementations are dual-licensed as public domain (CC0) and the Apache License. Along with adding the BLAKE3 hash into the OpenZFS infrastructure a new benchmarking file called chksum_bench was introduced. When read it reports the speed of the available checksum functions. On Linux: cat /proc/spl/kstat/zfs/chksum_bench On FreeBSD: sysctl kstat.zfs.misc.chksum_bench This is an example output of an i3-1005G1 test system with Debian 11: implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1196 1602 1761 1749 1762 1759 1751 skein-generic 546 591 608 615 619 612 616 sha256-generic 240 300 316 314 304 285 276 sha512-generic 353 441 467 476 472 467 426 blake3-generic 308 313 313 313 312 313 312 blake3-sse2 402 1289 1423 1446 1432 1458 1413 blake3-sse41 427 1470 1625 1704 1679 1607 1629 blake3-avx2 428 1920 3095 3343 3356 3318 3204 blake3-avx512 473 2687 4905 5836 5844 5643 5374 Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1840 2458 2665 2719 2711 2723 2693 skein-generic 870 966 996 992 1003 1005 1009 sha256-generic 415 442 453 455 457 457 457 sha512-generic 608 690 711 718 719 720 721 blake3-generic 301 313 311 309 309 310 310 blake3-sse2 343 1865 2124 2188 2180 2181 2186 blake3-sse41 364 2091 2396 2509 2463 2482 2488 blake3-avx2 365 2590 4399 4971 4915 4802 4764 Output on Debian 5.10.0-9-powerpc64le system: (POWER 9) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 1213 1703 1889 1918 1957 1902 1907 skein-generic 434 492 520 522 511 525 525 sha256-generic 167 183 187 188 188 187 188 sha512-generic 186 216 222 221 225 224 224 blake3-generic 153 152 154 153 151 153 153 blake3-sse2 391 1170 1366 1406 1428 1426 1414 blake3-sse41 352 1049 1212 1174 1262 1258 1259 Output on Debian 5.10.0-11-arm64 system: (Pi400) implementation 1k 4k 16k 64k 256k 1m 4m edonr-generic 487 603 629 639 643 641 641 skein-generic 271 299 303 308 309 309 307 sha256-generic 117 127 128 130 130 129 130 sha512-generic 145 165 170 172 173 174 175 blake3-generic 81 29 71 89 89 89 89 blake3-sse2 112 323 368 379 380 371 374 blake3-sse41 101 315 357 368 369 364 360 Structurally, the new code is mainly split into these parts: - 1x cross platform generic c variant: blake3_generic.c - 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512) - 2x assembly for ARMv8 (NEON converted from SSE2) - 2x assembly for PPC64-LE (POWER8 converted from SSE2) - one file for switching between the implementations Note the PPC64 assembly requires the VSX instruction set and the kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly. Reviewed-by: Felix Dörre <felix@dogcraft.de> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Closes #10058 Closes #12918	2022-06-08 15:55:57 -07:00
Alexander Motin	42cf2ad0e4	Remove wrong assertion in log spacemap It is typical, but not generally true that if log summary has more blocks it must also have unflushed metaslabs. Normally with metaslabs flushed in order it works, but there are known exceptions, such as device removal or metaslab being loaded during its flush attempt. Before `600a02b884` if spa_flush_metaslabs() hit loading metaslab it usually stopped (unless memlimit is also exceeded), but now it may flush more metaslabs, just skipping that particular one. This increased chances of assertion to fire when the skipped metaslab is flushed on next iteration if all other metaslabs in that summary entry are already flushed out of order. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13486 Closes #13513	2022-06-01 09:54:35 -07:00
Allan Jude	2310dba9eb	Fix typo in zil_commit() comment block Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #13518	2022-05-31 15:37:46 -07:00
Kevin Jin	152d6fda54	Fix inflated quiesce time caused by lwb_tx during zil_commit() In current zil_commit() process, transaction lwb_tx is assigned in zil_lwb_write_issue(), and is committed in zil_lwb_flush_vdevs_done(). Thus, during lwb write out process, the txg is held in open or quiesing state, until zil_lwb_flush_vdevs_done() is called. If the zil's zio latency is high, it will cause txg_sync_thread() to starve. The goal here is to defer waiting for zil_lwb_flush_vdevs_done to the 'syncing' txg state. That is, in zil_sync(). In this patch, it achieves the goal without holding transaction. A new function zil_lwb_flush_wait_all() is introduced. It waits for the completion of all the zil_lwb_flush_vdevs_done() by given txg. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12321	2022-05-26 09:36:14 -07:00
Alexander Motin	6aa8c21a2a	More speculative prefetcher improvements - Make prefetch distance adaptive: up to 4MB prefetch doubles for every, hit same as before, but after that it grows by 1/8 every time the prefetch read does not complete in time to satisfy the demand. My tests show that 4MB is sufficient for wide NVMe pool to saturate single reader thread at 2.5GB/s, while new 64MB maximum allows the same thread to reach 1.5GB/s on wide HDD pool. Further distance increase may increase speed even more, but less dramatic and with higher latency. - Allow early reuse of inactive prefetch streams: streams that never saw hits can be reused immediately if there is a demand, while others can be reused after 1s of inactivity, starting with the oldest. After 2s of inactivity streams are deleted to free resources same as before. This allows by several times increase strided read performance on HDD pool in presence of simultaneous random reads, previously filling the zfetch_max_streams limit for seconds and so blocking most of prefetch. - Always issue intermediate indirect block reads with SYNC priority. Each of those reads if delayed for longer may delay up to 1024 other block prefetches, that may be not good for wide pools. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13452	2022-05-25 10:12:52 -07:00
Paul Dagnelie	7829b465a7	Cancel in-progress rebuilds when we finish removal This issue was discovered by zloop runs. When a mirror or other redundant top-level vdev has a disk failure, and the disk is replaced, the rebuild process occurs. A removal can happen while this is in progress. If the removal completes before the rebuild does, the removal process will try to free the vdev that is still in use. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #13498	2022-05-25 09:25:13 -07:00
Alexander Motin	84d0a03f3e	Refactor Log Size Limit Original Log Size Limit implementation blocked all writes in case of limit reached until the TXG is committed and the log is freed. It caused huge delays and following speed spikes in application writes. This implementation instead smoothly throttles writes, using exactly the same mechanism as used for dirty data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: jxdking <lostking2008@hotmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Issue #12284 Closes #13476	2022-05-24 09:46:35 -07:00
Rich Ercolani	f375b23c02	Tiered early abort, zstd edition It turns out that "do LZ4 and zstd-1 both fail" is a great heuristic for "don't even bother trying higher zstd tiers". By way of illustration: $ cat /incompress \| mbuffer \| zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_normal summary: 39.8 GiByte in 3min 40.2sec - average of 185 MiB/s $ echo 3 \| sudo tee /sys/module/zzstd/parameters/zstd_lz4_pass 3 $ cat /incompress \| mbuffer -m 4G \| zfs recv -o compression=zstd-12 evenfaster/lowcomp_1M_zstd12_patched summary: 39.8 GiByte in 48.6sec - average of 839 MiB/s $ sudo zfs list -p -o name,used,lused,ratio evenfaster/lowcomp_1M_zstd12_normal evenfaster/lowcomp_1M_zstd12_patched NAME USED LUSED RATIO evenfaster/lowcomp_1M_zstd12_normal 39549931520 42721221632 1.08 evenfaster/lowcomp_1M_zstd12_patched 39626399744 42721217536 1.07 $ python3 -c "print(39626399744 - 39549931520)" 76468224 $ I'll take 76 MB out of 42 GB for > 4x speedup. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13244	2022-05-24 09:43:22 -07:00
Brian Behlendorf	2cd0f98f4a	Verify BPs in spa_load_verify_cb() and dsl_scan_visitbp() We want `zpool import` to be highly robust and never panic, even when encountering corrupt metadata. This is already handled in the arc_read() code path, which covers most cases, but spa_load_verify_cb() relies on zio_read() and is responsible for verifying the block pointer. During import it is also possible to encounter blocks pointers which contain ZIO_COMPRESS_INHERIT and ZIO_CHECKSUM_INHERIT values. Relax the verification function slightly to allow this. Futhermore, extend dsl_scan_recurse() to verify the block pointer contents of level zero blocks which are not of type DMU_OT_DNODE or DMU_OT_OBJSET. This is handled by arc_read() in the other cases. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13124 Closes #13360	2022-05-20 10:36:14 -07:00
Andrew	00ac77464e	Expose zpool guids through kstats There are times when end-users may wish to have a fast and convenient method to get zpool guid without having to use libzfs. This commit exposes the zpool guid via kstats in similar manner to the zpool state. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #13466	2022-05-18 10:25:33 -07:00
наб	de82164518	linux: spl: generic: ddi_strto: match solaris ddi_strto(9) Recognise initial whitespace, + in both cases, and - also in unsigneds Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13434	2022-05-13 10:15:47 -07:00
Aidan Harris	493b6e5607	Fix functions without a prototype clang-15 emits the following error message for functions without a prototype: fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Aidan Harris <me@aidanharr.is> Closes #13421	2022-05-06 11:57:37 -07:00
Rich Ercolani	7bf06f7262	Corrected edge case in uncompressed ARC->L2ARC handling I genuinely don't know why this didn't come up before, but adding the LZ4 early abort pointed out this flaw, in which we're allocating a buffer of one size, and then telling the compressor that we're handing it buffers of a different size, which may be Very Different - say, allocating 512b and then telling it the inputs are 128k. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13375	2022-05-04 11:59:30 -07:00
Alexander Motin	c55b293287	Improve mg_aliquot math When calculating mg_aliquot alike to #12046 use number of unique data disks in the vdev, not the total number of children vdev. Increase default value of the tunable from 512KB to 1MB to compensate. Before this change each disk in striped pool was getting 512KB of sequential data, in 2-wide mirror -- 1MB, in 3-wide RAIDZ1 -- 768KB. After this change in all the cases each disk should get 1MB. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #13388	2022-05-04 11:33:42 -07:00
Brian Behlendorf	34dbc618f5	Reduce dbuf_find() lock contention Holding a dbuf is a common operation which can become highly contended in dbuf_find() when acquiring the dbuf hash mutex. This is particularly true on Linux when reading/writing volumes since by default up to 32 threads from the zvol_taskq may be taking a hold of the same dbuf. This should also be observable on FreeBSD as long as there are enough processes accessing the volume concurrently. This is further aggregrated by the fact that only the block id will be unique when calculating the dbuf hash for a single volume. The objset id, object id, and level will be the same for data blocks. This has been observed to result in a somehwat less than uniform hash distribution and a longer than expected max hash chain depth (~20) on a large memory system (256 GB) using volumes. This commit improves the siutation by switching the hash mutex to an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS from 2048 to 8192 to further reduce the odds of a hash collision. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13405	2022-05-04 11:17:29 -07:00
Shaan Nobee	411f4a018d	Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases: - No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete. - A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback. The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shaan Nobee <sniper111@gmail.com> Closes #12662 Closes #12790	2022-05-03 13:23:26 -07:00
Jitendra Patidar	159c6fd154	Add missing replay entry in zvol_replay_vector for TX_SETSAXATTR Commit `361a7e8` (log xattr=sa create/remove/update to ZIL) introduced a TX_SETSAXATTR, but missed to add a corresponding entry in zvol_replay_vector. Adding a missing replay entry in zvol_replay_vector. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #13396 Closes #13395	2022-05-02 11:01:26 -07:00
Rich Ercolani	f2330bd156	Default zfs_max_recordsize to 16M Increase the default allowed maximum recordsize from 1M to 16M. As described in the zfs(4) man page, there are significant costs which need to be considered before using very large blocks. However, there are scenarios where they make good sense and it should no longer be necessary to artificially restrict their use behind a module option. Note that for 32-bit platforms we continue to leave this restriction in place due to the limited virtual address space available (256-512MB). On these systems only a handful of blocks could be cached at any one time severely impacting performance and potentially stability. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12830 Closes #13302	2022-04-28 15:12:24 -07:00
Alexander Motin	600a02b884	Improve log spacemap load time Previous flushing algorithm limited only total number of log blocks to the minimum of 256K and 4x number of metaslabs in the pool. As result, system with 1500 disks with 1000 metaslabs each, touching several new metaslabs each TXG could grow spacemap log to huge size without much benefits. We've observed one of such systems importing pool for about 45 minutes. This patch improves the situation from five sides: - By limiting maximum period for each metaslab to be flushed to 1000 TXGs, that effectively limits maximum number of per-TXG spacemap logs to load to the same number. - By making flushing more smooth via accounting number of metaslabs that were touched after the last flush and actually need another flush, not just ms_unflushed_txg bump. - By applying zfs_unflushed_log_block_pct to the number of metaslabs that were touched after the last flush, not all metaslabs in the pool. - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in advance, making log spacemap load process for wide HDD pool CPU-bound, accelerating it by many times. - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing single-threaded by nature log processing time from ~10 to ~5 minutes. As further optimization we could skip bumping ms_unflushed_txg for metaslabs not touched since the last flush, but that would be an incompatible change, requiring new pool feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12789	2022-04-26 10:44:21 -07:00
George Amanakis	0409d33273	Improve zpool status output, list all affected datasets Currently, determining which datasets are affected by corruption is a manual process. The primary difficulty in reporting the list of affected snapshots is that since the error was initially found, the snapshot where the error originally occurred in, may have been deleted. To solve this issue, we add the ID of the head dataset of the original snapshot which the error was detected in, to the stored error report. Then any time a filesystem is deleted, the errors associated with it are deleted as well. Any time a clone promote occurs, we modify reports associated with the original head to refer to the new head. The stored error reports are identified by this head ID, the birth time of the block which the error occurred in, as well as some information about the error itself are also stored. Once this information is stored, we can find the set of datasets affected by an error by walking back the list of snapshots in the given head until we find one with the appropriate birth txg, and then traverse through the snapshots of the clone family, terminating a branch if the block was replaced in a given snapshot. Then we report this information back to libzfs, and to the zpool status command, where it is displayed as follows: pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3 08:27:57 2021 config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 sdb ONLINE 0 0 1.58K errors: Permanent errors have been detected in the following files: test@1:/test.0.0 /test/test.0.0 /test/1clone/test.0.0 A new feature flag is introduced to mark the presence of this change, as well as promotion and backwards compatibility logic. This is an updated version of #9175. Rebase required fixing the tests, updating the ABI of libzfs, updating the man pages, fixing bugs, fixing the error returns, and updating the old on-disk error logs to the new format when activating the feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain <tulsi.jain@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9175 Closes #12812	2022-04-25 17:25:42 -07:00
наб	ad9e767657	linux: module: weld all but spl.ko into zfs.ko Originally it was thought it would be useful to split up the kmods by functionality. This would allow external consumers to only load what was needed. However, in practice we've never had a case where this functionality would be needed, and conversely managing multiple kmods can be awkward. Therefore, this change merges all but the spl.ko kmod in to a single zfs.ko kmod. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13274	2022-04-20 13:28:24 -07:00
Allan Jude	310ab9d261	Improve the inline descriptions of the ARC module parameters These are displayed as the descriptions of the sysctl's on FreeBSD Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #13334	2022-04-20 13:16:25 -07:00
наб	16c3290bbe	module: zfs: vdev_removal: remove unused num_indirect Found with -Wunused-but-set-variable on Clang trunk Fixes: `a1d477c24c` ("OpenZFS 7614, 9064 - zfs device evacuation/removal") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13304	2022-04-13 11:36:47 -07:00
Jorgen Lundman	4d972ab5ae	Prefer ATTR_ in shared codebase over AT_ An earlier commit introduces AT_MODE into the shared kernel sources, instead of the preferred existing ATTR_MODE use. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Jorgen lundman <lundman@lundman.net> Closes #13293	2022-04-05 13:02:17 -07:00
наб	7db823bd61	module: zfs: dsl_bookmark: silence false-positive maybe-uninitialised Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13247 Closes #13258	2022-03-28 10:03:13 -07:00
Brian Behlendorf	460748d4ae	Switch from _Noreturn to __attribute__((noreturn)) Parts of the Linux kernel build system struggle with _Noreturn. This results in the following warnings when building on RHEL 8.5, and likely other environments. Switch to using the __attribute__((noreturn)). warning: objtool: dbuf_free_range()+0x2b8: return with modified stack frame warning: objtool: dbuf_free_range()+0x0: stack state mismatch: cfa1=7+40 cfa2=7+8 ... WARNING: EXPORT symbol "arc_buf_size" [zfs.ko] version generation failed, symbol will not be versioned. WARNING: EXPORT symbol "spa_open" [zfs.ko] version generation failed, symbol will not be versioned. ... Additionally, __thread_exit() has been renamed spl_thread_exit() and made a static inline function. This was needed because the kernel will generate a warning for symbols which are __attribute__((noreturn)) and then exported with EXPORT_SYMBOL. While we could continue to use _Noreturn in user space I've also switched it to __attribute__((noreturn)) purely for consistency throughout the code base. Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13238	2022-03-23 08:51:00 -07:00
наб	045aeabce6	module: zfs: arc: hdr_full_crypt_dest: drop unevaulated-only variable This explodes as -Wunused-variable on GCC 8.5.0, despite it being used, just not in an evaluated context Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13195	2022-03-18 16:53:05 -07:00
наб	d465fc5844	Forbid b{copy,zero,cmp}(). Don't include <strings.h> for <string.h> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:48 -07:00
наб	861166b027	Remove bcopy(), bzero(), bcmp() bcopy() has a confusing argument order and is actually a move, not a copy; they're all deprecated since POSIX.1-2001 and removed in -2008, and we shim them out to mem*() on Linux anyway Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:42 -07:00
наб	dad2b19fff	module: zfs: zio_inject: zio_match_handler: don't << -1 Caught by UBSAN: ZI_NO_DVA is passed explicitly in zio_handle_decrypt_injection() and can be an ENOENT from zio_match_dva() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13146 Closes #13190	2022-03-13 13:18:17 -07:00
Akash B	1282274f33	Add physical device size to SIZE column in 'zpool list -v' Add physical device size/capacity only for physical devices in 'zpool list -v' instead of displaying "-" in the SIZE column. This would make it easier to see the individual device capacity and to determine which spares are large enough to replace which devices. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #12561 Closes #13106	2022-03-08 16:20:41 -08:00
Brian Behlendorf	6df43169b3	Fix ENOSPC when unlinking multiple files from full pool When unlinking multiple files from a pool at 100% capacity, it was possible for ENOSPC to be returned after the first unlink. e.g. rm -f /mnt/fs/test1.0.0 /mnt/fs/test1.1.0 /mnt/fs/test1.2.0 rm: cannot remove '/mnt/fs/test1.1.0': No space left on device rm: cannot remove '/mnt/fs/test1.2.0': No space left on device After waiting for the pending deferred frees from the first unlink to be processed the remaining files can then be unlinked. This is caused by the quota limit in dsl_dir_tempreserve_impl() being temporarily decreased to the allocatable pool capacity less any deferred free space. This is resolved using the existing mechanism of returning ERESTART when over quota as long as we know enough space will shortly be available after processing the pending deferred frees. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13172	2022-03-08 09:16:35 -08:00
Alejandro Colomar	db7f1a91de	Use _Noreturn (C11; GNU89) properly A function that returns with no value is a different thing from a function that doesn't return at all. Those are two orthogonal concepts, commonly confused. pthread_create(3) expects a pointer to a start routine that has a very precise prototype: void (start_routine)(void ); However, other thread functions, such as kernel ones, expect: void (start_routine)(void ); Providing a different one is incorrect, and has only been working because the ABIs happen to produce a compatible function. We should use '_Noreturn void', since it's the natural type, and then provide a '_Noreturn void ' wrapper for pthread functions. For consistency, replace most cases of __NORETURN or __attribute__((noreturn)) by _Noreturn. _Noreturn is understood by -std=gnu89, so it should be safe to use everywhere. Ref: https://github.com/openzfs/zfs/pull/13110#discussion_r808450136 Ref: https://software.codidact.com/posts/285972 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com> Closes #13120	2022-03-04 16:25:22 -08:00
Jitendra Patidar	361a7e8211	log xattr=sa create/remove/update to ZIL As such, there are no specific synchronous semantics defined for the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is done, if sync=always is set on dataset. This provides sync semantics for xattr=on with sync=always set on dataset. For the xattr=sa implementation, it doesn't log to ZIL, so, even with sync=always, xattrs are not guaranteed to be synced before xattr call returns to caller. So, xattr can be lost if system crash happens, before txg carrying xattr transaction is synced. This change adds xattr=sa logging to ZIL on xattr create/remove/update and xattrs are synced to ZIL (zil_commit() done) for sync=always. This makes xattr=sa behavior similar to xattr=on. Implementation notes: The actual logging is fairly straight-forward and does not warrant additional explanation. However, it has been 14 years since we last added new TX types to the ZIL [1], hence this is the first time we do it after the introduction of zpool features. Therefore, here is an overview of the feature activation and deactivation workflow: 1. The feature must be enabled. Otherwise, we don't log the new record type. This ensures compatibility with older software. 2. The feature is activated per-dataset, since the ZIL is per-dataset. 3. If the feature is enabled and dataset is not for zvol, any append to the ZIL chain will activate the feature for the dataset. Likewise for starting a new ZIL chain. 4. A dataset that doesn't have a ZIL chain has the feature deactivated. We ensure (3) by activating on the first zil_commit() after the feature was enabled. Since activating the features requires waiting for txg sync, the first zil_commit() after enabling the feature will be slower than usual. The downside is that this is really a conservative approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL chain, we pay the penalty for feature activation. The upside is that the user is in control of when we pay the penalty, i.e., upon enabling the feature. We ensure (4) by hooking into zil_sync(), where ZIL destroy actually happens. One more piece on feature activation, since it's spread across multiple functions: zil_commit() zil_process_commit_list() if lwb == NULL // first zil_commit since zil_open zil_create() if no log block pointer in ZIL header: if feature enabled and not active: // CASE 1 enable, COALESCE txg wait with dmu_tx that allocated the log block else // log block was allocated earlier than this zil_open if feature enabled and not active: // CASE 2 enable, EXPLICIT txg wait else // already have an in-DRAM LWB if feature enabled and not active: // this happens when we enable the feature after zil_create // CASE 3 enable, EXPLICIT txg wait [1] `da6c28aaf6` Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #8768 Closes #9078	2022-02-22 13:06:43 -08:00
Damian Szuberski	806739f991	Correct compilation errors reported by GCC 10/11 New `zfs_type_t` value `ZFS_TYPE_INVALID` is introduced. Variable initialization is now possible to make GCC happy. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12167 Closes #13103	2022-02-20 19:20:00 -08:00
наб	642827ecda	module: zfs: zcp_get: fix uninitialised warning Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13110	2022-02-18 09:34:56 -08:00
наб	ef70eff198	module: mark arguments used Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13110	2022-02-18 09:34:03 -08:00
George Amanakis	52a36bd41a	Enable encrypted raw sending to pools with greater ashift Raw sending from pool1/encrypted with ashift=9 to pool2/encrypted with ashift=12 results to failure when mounting pool2/encrypted (Input/Output error). Notably, the opposite, raw sending from a greater ashift to a lower one does not fail. This happens because zio_compress_write() falsely checks only ZIO_FLAG_RAW_COMPRESS and not ZIO_FLAG_RAW_ENCRYPT which is also set in encrypted raw send streams. In this case it rounds up the psize and if not equal to the zio->io_size it modifies the block by zeroing out the extra bytes. Because this happens in a SA attr. registration object (type=46), the decryption fails upon mounting the filesystem, and zpool status falsely reports an error. Fix this by checking both ZIO_FLAG_RAW_COMPRESS and ZIO_FLAG_RAW_ENCRYPT before deciding whether to zero-pad a block. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13067 Closes #13074	2022-02-16 11:52:02 -08:00
наб	df7b54f1d9	module: icp: rip out insane crypto_req_handle_t mechanism, inline KM_SLEEP Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:37 -08:00
наб	739afd9475	module: icp: fold away all key formats except CRYPTO_KEY_RAW It's the only one actually used Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:07 -08:00
наб	eb1e09b7ec	module: icp: remove unused CRYPTO_ALWAYS_QUEUE Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:24:19 -08:00
Jorgen Lundman	4759342a5e	Add spa _os() hooks Add hooks for when spa is created, exported, activated and deactivated. Used by macOS to attach iokit, and lock kext as busy (to stop unloads). Userland, Linux, and, FreeBSD have empty stubs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12801	2022-02-15 15:54:25 -08:00
George Amanakis	2fb52853dc	Avoid dirtying the final TXGs when exporting a pool There are two codepaths than can dirty final TXGs: 1) If calling spa_export_common()->spa_unload()-> spa_unload_log_sm_flush_all() after the spa_final_txg is set, then spa_sync()->spa_flush_metaslabs() may end up dirtying the final TXGs. Then we have the following panic: Call Trace: <TASK> dump_stack_lvl+0x46/0x62 spl_panic+0xea/0x102 [spl] dbuf_dirty+0xcd6/0x11b0 [zfs] zap_lockdir_impl+0x321/0x590 [zfs] zap_lockdir+0xed/0x150 [zfs] zap_update+0x69/0x250 [zfs] feature_sync+0x5f/0x190 [zfs] space_map_alloc+0x83/0xc0 [zfs] spa_generate_syncing_log_sm+0x10b/0x2f0 [zfs] spa_flush_metaslabs+0xb2/0x350 [zfs] spa_sync_iterate_to_convergence+0x15a/0x320 [zfs] spa_sync+0x2e0/0x840 [zfs] txg_sync_thread+0x2b1/0x3f0 [zfs] thread_generic_wrapper+0x62/0xa0 [spl] kthread+0x127/0x150 ret_from_fork+0x22/0x30 </TASK> 2) Calling vdev_*_stop_all() for a second time in spa_unload() after spa_export_common() unnecessarily delays the final TXGs beyond what spa_final_txg is set at. Fix this by performing the check and call for spa_unload_log_sm_flush_all() before the spa_final_txg is set in spa_export_common(). Also check if the spa_final_txg has already been set in spa_unload() and skip those calls in this case. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> External-issue: https://www.illumos.org/issues/9081 Closes #13048 Closes #13098	2022-02-15 15:48:59 -08:00
Jorgen Lundman	9a70e97fe1	Rename fallthrough to zfs_fallthrough Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #13097	2022-02-15 08:58:59 -08:00
Rich Ercolani	dec1eef4c5	Silence uninitialized warnings in dsl_dataset.c On newer compilers, dsl_dataset.c now warns (or, on DEBUG, errors) on uninitialized variable usage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13083	2022-02-14 10:04:50 -08:00
Attila Fülöp	68ddc06b61	Receive checks should allow unencrypted child datasets dmu_recv_begin_check() unconditionally sets the DS_HOLD_FLAG_DECRYPT flag before calling dsl_dataset_hold_flags(). If the key on the receiving side isn't loaded or the send stream contains embedded blocks, the receive check fails for a stream which is perfectly valid and could be received without any problem. This seems like a remnant of the initial design, where unencrypted datasets below encrypted ones weren't allowed. Add a condition to set `DS_HOLD_FLAG_DECRYPT` only for encrypted datasets, modify an existing test to detect this regression and add a test for raw replication streams. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Co-authored-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13033 Closes #13076	2022-02-09 14:38:33 -08:00
Tomohiro Kusumi	5f65d008e9	Remove unneeded "extern inline" function declarations All of these externs are already #included as static inline functions via corresponding headers. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #13073	2022-02-08 10:48:57 -08:00
Christian Schwarz	1dccfd7a38	zvol: make calls to platform ops static There's no need to make the platform ops dynamic dispatch. This change replaces the dynamic dispatch with static calls to the platform-specific functions. To avoid name collisions, prefix all platform-specific functions with `zvol_os_`. I actually find `zvol_..._os` slightly nicer to read in the calling code, but having it as a prefix is useful. Advantage: - easier jump-to-definition / grepping - potential benefits to static analysis - better legibility Future work: also prefix remaining `static` functions in zvol_os.c. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #12965	2022-02-07 10:24:38 -08:00
Alexander Motin	f2c5bc150e	Add more control/visibility to spa_load_verify(). Use error thresholds from policy to control whether to scrub data and/or metadata. If threshold is set to UINT64_MAX, then caller probably does not care about result and we may skip that part. By default import neither set the data error threshold nor read the error counter, so skip the data scrub for faster import. Metadata are still scrubbed and fail if even single error found. While there just for symmetry return number of metadata errors in case threshold is not set to zero and we haven't reached it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #13022	2022-02-04 13:06:38 -08:00
Christian Schwarz	2f14adacaa	zfs_set_prop_nvlist: make it easier to spot the call to dsl_props_set Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #12963	2022-02-04 11:52:10 -08:00
Christian Schwarz	db87580076	dsl_dir_tempreserve_impl: remove unused `deferred` variable The following commit moved the users of `deferred` into function dsl_pool_unreserved_space: commit `d2734cce68` Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Date: Fri Dec 16 14:11:29 2016 -0800 OpenZFS 9166 - zfs storage pool checkpoint Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #13056	2022-02-04 10:33:34 -08:00
Pawel Jakub Dawidek	3d244b4881	Fix clearing set-uid and set-gid bits on a file when replying a write POSIX requires that set-uid and set-gid bits to be removed when an unprivileged user writes to a file and ZFS does that during normal operation. The problem arrises when the write is stored in the ZIL and replayed. During replay we have no access to original credentials of the process doing the write, so zfs_write() will be performed with the root credentials. When root is doing the write set-uid and set-gid bits are not removed from the file. To correct that, log a separate TX_SETATTR entry that removed those bits on first write to such file. Idea from: Christian Schwarz Add test for ZIL replay of setuid/setgid clearing. Improve various edge cases when clearing setid bits: - The setid bits can be readded during a single write, so make sure to check for them on every chunk write. - Log TX_SETATTR record at most once per transaction group (if the setid bits are keep coming back). - Move zfs_log_setattr() outside of zp->z_acl_lock. Reviewed-by: Dan McDonald <danmcd@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Christian Schwarz <me@cschwarz.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13027	2022-02-03 14:37:57 -08:00
Damian Szuberski	63652e1546	Add `--enable-asan` and `--enable-ubsan` switches `configure` now accepts `--enable-asan` and `--enable-ubsan` switches which results in passing `-fsanitize=address` and `-fsanitize=undefined`, respectively, to the compiler. Those flags are enabled in GitHub workflows for ZTS and zloop. Errors reported by both instrumentations are corrected, except for: - Memory leak reporting is (temporarily) suppressed. The cost of fixing them is relatively high compared to the gains. - Checksum computing functions in `module/zcommon/zfs_fletcher*` have UBSan errors suppressed. It is completely impractical to enforce 64-byte payload alignment there due to performance impact. - There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom memory allocator is used there rendering that measure unfeasible. - Memory leaks detection has to be suppressed for `cmd/zvol_id`. `zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is incompatible with memory leaks detection. Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12928	2022-02-03 14:35:38 -08:00
George Amanakis	f3b08dfd7f	Report dnodes with faulty bonuslen In files created/modified before `4254acb` there may be a corruption of xattrs which is not reported during scrub and normal send/receive. It manifests only as an error when raw sending/receiving. This happens because currently only the raw receive path checks for discrepancies between the dnode bonus length and the spill pointer flag. In case we encounter a dnode whose bonus length is greater than the predicted one, we should report an error. Modify in this regard dnode_sync() with an assertion at the end, dump_dnode() to error out, dsl_scan_recurse() to report errors during a scrub, and zstream to report a warning when dumping. Also added a test to verify spill blocks are sent correctly in a raw send. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12720 Closes #13014	2022-02-03 14:28:19 -08:00
Ryan Moeller	15aa38690e	Simplify resume token generation * Improve naming. * Reduce indentation. * Avoid boilerplate logic duplication. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #12967	2022-02-01 17:04:08 -08:00
наб	c70bb2f610	Replace *CTASSERT() with _Static_assert() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12993	2022-01-26 11:38:52 -08:00
наб	7ada752a93	Clean up CSTYLEDs 69 CSTYLED BEGINs remain, appx. 30 of which can be removed if cstyle(1) had a useful policy regarding CALL(ARG1, ARG2, ARG3); above 2 lines. As it stands, it spits out both sysctl_os.c: 385: continuation line should be indented by 4 spaces sysctl_os.c: 385: indent by spaces instead of tabs which is very cool Another >10 could be fixed by removing "ulong" &al. handling. I don't foresee anyone actually using it intentionally (does it even exist in modern headers? why did it in the first place?). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12993	2022-01-26 11:38:52 -08:00
Mark Johnston	063daa8350	Fix handling of errors from dmu_write_uio_dbuf() on FreeBSD FreeBSD's implementation of zfs_uio_fault_move() returns EFAULT when a page fault occurs while copying data in or out of user buffers. The VFS treats such errors specially and will retry the I/O operation (which may have made some partial progress). When the FreeBSD and Linux implementations of zfs_write() were merged, the handling of errors from dmu_write_uio_dbuf() changed such that EFAULT is not handled as a partial write. For example, when appending to a file, the z_size field of the znode is not updated after a partial write resulting in EFAULT. Restore the old handling of errors from dmu_write_uio_dbuf() to fix this. This should have no impact on Linux, which has special handling for EFAULT already. Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12964	2022-01-21 11:54:05 -08:00
George Amanakis	63a26454ba	Introduce a flag to skip comparing the local mac when raw sending Raw receiving a snapshot back to the originating dataset is currently impossible because of user accounting being present in the originating dataset. One solution would be resetting user accounting when raw receiving on the receiving dataset. However, to recalculate it we would have to dirty all dnodes, which may not be preferable on big datasets. Instead, we rely on the os_phys flag OBJSET_FLAG_USERACCOUNTING_COMPLETE to indicate that user accounting is incomplete when raw receiving. Thus, on the next mount of the receiving dataset the local mac protecting user accounting is zeroed out. The flag is then cleared when user accounting of the raw received snapshot is calculated. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12981 Closes #10523 Closes #11221 Closes #11294 Closes #12594 Issue #11300	2022-01-21 11:41:17 -08:00
Mark Johnston	6e2a59181e	Avoid memory allocations in the ARC eviction thread When the eviction thread goes to shrink an ARC state, it allocates a set of marker buffers used to hold its place in the state's sublists. This can be problematic in low memory conditions, since 1) the allocation can be substantial, as we allocate NCPU markers; 2) on at least FreeBSD, page reclamation can block in arc_wait_for_eviction() In particular, in stress tests it's possible to hit a deadlock on FreeBSD when the number of free pages is very low, wherein the system is waiting for the page daemon to reclaim memory, the page daemon is waiting for the ARC eviction thread to finish, and the ARC eviction thread is blocked waiting for more memory. Try to reduce the likelihood of such deadlocks by pre-allocating markers for the eviction thread at ARC initialization time. When evicting buffers from an ARC state, check to see if the current thread is the ARC eviction thread, and use the pre-allocated markers for that purpose rather than dynamically allocating them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12985	2022-01-21 10:28:13 -08:00
наб	18168da727	module/*.ko: prune .data, global .rodata Evaluated every variable that lives in .data (and globals in .rodata) in the kernel modules, and constified/eliminated/localised them appropriately. This means that all read-only data is now actually read-only data, and, if possible, at file scope. A lot of previously- global-symbols became inlinable (and inlined!) constants. Probably not in a big Wowee Performance Moment, but hey. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12899	2022-01-14 15:37:55 -08:00
Mark Maybee	da9c6c0333	Remove VERIFY() in vdev_props_set_sync() Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Maybee <mark.maybee@delphix.com> Closes #12951	2022-01-12 16:15:30 -08:00
Rich Ercolani	63f4bfd6ac	lz4: Cherrypick fix for CVE-2021-3520 There should be no risk of us accidentally hitting this since we'd need maliciously malformed data to wind up in the pipeline, or a very unfortunate random bit flip at exactly the right moment. Still since we can handle it we should. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12947	2022-01-12 16:14:36 -08:00
Rich Ercolani	d6c1bbdd65	Updated the lz4 decompressor As an experiment, I stole the lz4 decompressor from upstream lz4 (1.9.3), and landed it. Feedback suggested that keeping the vendor lz4 code isolated and unlinted was probably reasonable, so I lobbed it into its own file. It also seemed reasonable to put the mostly-untouched* code into lz4.c proper, and relegate the integrated and ZFS-specific code to lz4_zfs.c. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12805	2022-01-07 10:36:49 -08:00
Christian Schwarz	a8f27ec6c5	l2arc_write_buffers: remove redundant asserts Probably introduced inadvertently in `b525630` (Native Encryption). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #12935	2022-01-06 14:39:22 -08:00
наб	7c2eb1c875	zvol: remove unused variable Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12917	2022-01-06 11:20:13 -08:00
наб	c25e639f2b	fm: remove unused variables Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12917	2022-01-06 11:20:13 -08:00
Paul Dagnelie	399b98198a	Revert "zfs list: Allow more fields in ZFS_ITER_SIMPLE mode" This reverts commit `f6a0dac84a`. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12938	2022-01-06 11:12:53 -08:00
Brian Behlendorf	3c80e0742a	Verify dRAID empty sectors Verify that all empty sectors are zero filled before using them to calculate parity. Failure to do so can result in incorrect parity columns being generated and written to disk if the contents of an empty sector are non-zero. This was possible because the checksum only protects the data portions of the buffer, not the empty sector padding. This issue has been addressed by updating raidz_parity_verify() to check that all dRAID empty sectors are zero filled. Any sectors which are non-zero will be fixed, repair IO issued, and a checksum error logged. They can then be safely used to verify the parity. This specific type of damage is unlikely to occur since it requires a disk to have silently returned bad data, for an empty sector, while performing a scrub. However, if a pool were to have been damaged in this way, scrubbing the pool with this change applied will repair both the empty sector and parity columns as long as the data checksum is valid. Checksum errors will be reported in the `zpool status` output for any repairs which are made. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12857	2022-01-04 16:46:32 -08:00
наб	14e4e3cb9f	module: zfs: fix unused, remove argsused Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:42:47 -08:00
наб	855e49e881	module: zfs: vdev: shim out vdev_indirect_mapping_verify() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:42:41 -08:00
наб	ce767d69b0	module: zfs: vdev: shim out vdev_indirect_births_verify() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:42:29 -08:00
наб	36542b065d	module: zfs: spa: shim out vdev_count_verify_zaps() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:36:50 -08:00
наб	2c1988e96f	module: zfs: multilist: shim out multilist_d2l() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:36:45 -08:00
наб	89495a427f	module: zfs: dsl: pool: shim out dsl_early_sync_task_verify() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:36:36 -08:00
наб	16a32ce402	module: zfs: dnode: use debug-only in debug mode only Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:36:31 -08:00
Alexander Motin	462217d1c2	Reduce number of arc_prune threads On FreeBSD vnode reclamation is single-threaded, protected by single global lock. Linux seems to be able to use a thread per mount point, but at this time it creates more harm than good. Reduce number of threads to 1, adding tunable in case somebody wants to try more. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12896 Issue #9966	2021-12-22 17:07:13 -08:00
Allan Jude	f6a0dac84a	zfs list: Allow more fields in ZFS_ITER_SIMPLE mode If the fields to be listed and sorted by are constrained to those populated by dsl_dataset_fast_stat(), then zfs list is much faster, as it does not need to open each objset and reads its properties. A previous optimization by Pawel Dawidek (`0cee24064a`) took advantage of this to make listing snapshot names sorted only by name much faster. However, it was limited to `-o name -s name`, this work extends this optimization to work with: - name - guid - createtxg - numclones - inconsistent - redacted - origin and could be further extended to any other properties supported by dsl_dataset_fast_stat() or similar, that do not require extra locking or reading from disk. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #11080	2021-12-16 11:56:22 -08:00
Paul Dagnelie	376027331d	ZFS send/recv with ashift 9->12 leads to data corruption Improve the ability of zfs send to determine if a block is compressed or not by using information contained in the blkptr. Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Reviewed-by: Matthew Ahrens <matthew.ahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12770	2021-12-07 11:27:59 -07:00
Paul Dagnelie	795075e638	Add `const` to nvlist functions to properly expose their real behavior Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12728	2021-12-06 18:19:13 -07:00
Rich Ercolani	df42e20ac6	Corrected a case where we could read uninited ABD memory For my sins, I started running valgrind over ztest to try and fix that pesky intermittent "zloop dies with malloc errors" problem. This one seemed exciting enough to merit cutting a PR for before the rest get polished. Suggested-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12214	2021-12-03 13:13:21 -08:00
Brian Behlendorf	77e2756de0	Linux 5.13 compat: retry zvol_open() when contended Due to a possible lock inversion the zvol open call path on Linux needs to be able to retry in the case where the spa_namespace_lock cannot be acquired. For Linux 5.12 an older kernel this was accomplished by returning -ERESTARTSYS from zvol_open() to request that blkdev_get() drop the bdev->bd_mutex lock, reaquire it, then call the open callback again. However, as of the 5.13 kernel this behavior was removed. Therefore, for 5.12 and older kernels we preserved the existing retry logic, but for 5.13 and newer kernels we retry internally in zvol_open(). This should always succeed except in the case where a pool's vdev are layed on zvols, in which case it may fail. To handle this case vdev_disk_open() has been updated to retry when opening a device when -ERESTARTSYS is returned. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #12301 Closes #12759	2021-12-01 17:07:12 -07:00
Brian Behlendorf	05b3eb6d23	Default to zfs_dmu_offset_next_sync=1 Strict hole reporting was previously disabled by default as a performance optimization. However, this has lead to confusion over the expected behavior and a variety of workarounds being adopted by consumers of ZFS. Change the default behavior to always report holes and force the TXG sync. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12746	2021-11-30 10:38:09 -08:00
Pawel Jakub Dawidek	547df81641	Code cleanups - Allocate ve_search on the stack, so we avoid allocating memory for every I/O even if the VDEV cache is disabled. - Reduce lock scope. - Avoid locking in vdev_cache_read() when the VDEV cache is disabled. - Sort file names properly. - Correct comment. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #12749	2021-11-30 10:32:38 -08:00
Allan Jude	2a673e76a9	Vdev Properties Feature Add properties, similar to pool properties, to each vdev. This makes use of the existing per-vdev ZAP that was added as part of device evacuation/removal. A large number of read-only properties are exposed, many of the members of struct vdev_t, that provide useful statistics. Adds support for read-only "removing" vdev property. Adds the "allocating" property that defaults to "on" and can be set to "off" to prevent future allocations from that top-level vdev. Supports user-defined vdev properties. Includes support for properties.vdev in SYSFS. Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #11711	2021-11-30 07:46:25 -07:00
Rich Ercolani	269b5dadcf	Enable edonr in FreeBSD The code is integrated, builds fine, runs fine, there's not really any reason not to. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12735	2021-11-16 12:40:10 -07:00
George Amanakis	c9d62d1356	Introduce a tunable to exclude special class buffers from L2ARC Special allocation class or dedup vdevs may have roughly the same performance as L2ARC vdevs. Introduce a new tunable to exclude those buffers from being cacheable on L2ARC. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #11761 Closes #12285	2021-11-11 12:52:16 -08:00
Fedor Uporov	49d42425d6	Check l2cache vdevs pending list inside the vdev_inuse() The l2cache device could be added twice because vdev_inuse() does not check spa_l2cache for added devices. Make l2cache vdevs inuse checking logic more closer to spare vdevs. Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #9153 Closes #12689	2021-11-11 11:54:15 -08:00
Brian Behlendorf	c23803be84	Restore dirty dnode detection logic In addition to flushing memory mapped regions when checking holes, commit `de198f2d95` modified the dirty dnode detection logic to check the dn->dn_dirty_records instead of the dn->dn_dirty_link. Relying on the dirty record has not be reliable, switch back to the previous method. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11900 Closes #12745	2021-11-10 16:14:32 -08:00
Fedor Uporov	e39fe05b69	Skip spacemaps reading in case of pool readonly import The only zdb utility require to read metaslab-related data during read-only pool import because of spacemaps validation. Add global variable which will allow zdb read spacemaps in case of readonly import mode. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #9095 Closes #12687	2021-11-09 12:50:39 -08:00
Brian Atkinson	345196be18	Single IO issue for raidz writes with skip sector In order to reduce contention on the vq_lock, optional skip sectors for Raidz writes can be placed into a single IO request. This is done by padding out the linear ABD for a parity column to contain the skip sector and by creating gang ABD to contain the data and skip sector for data columns. The vdev_raidz_map_alloc() function now contains specific functions for both reads and write to allocate the ABD's that will be issued down to the VDEV chldren. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-By: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #12333	2021-11-09 12:51:33 -07:00
Brian Behlendorf	de198f2d95	Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency When using lseek(2) to report data/holes memory mapped regions of the file were ignored. This could result in incorrect results. To handle this zfs_holey_common() was updated to asynchronously writeback any dirty mmap(2) regions prior to reporting holes. Additionally, while not strictly required, the dn_struct_rwlock is now held over the dirty check to prevent the dnode structure from changing. This ensures that a clean dnode can't be dirtied before the data/hole is located. The range lock is now also taken to ensure the call cannot race with zfs_write(). Furthermore, the code was refactored to provide a dnode_is_dirty() helper function which checks the dnode for any dirty records to determine its dirtiness. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11900 Closes #12724	2021-11-07 14:27:44 -07:00
Rich Ercolani	05679465ac	Revert behavior of `59eab109` on not-Linux It turns out that short-circuiting the EFAULT behavior on a short read breaks things on FreeBSD. So until there's a nicer solution, let's just revert the behavior for not-Linux. Reference: https://reviews.freebsd.org/R10:70f51f0e474ffe1fb74cb427423a2fba3637544d Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12698	2021-11-04 07:49:40 -06:00
Fedor Uporov	d5a5ec4693	Remove unused function zvol_set_volblocksize() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #12688	2021-10-26 17:07:53 -07:00
Rich Ercolani	6f57f1e382	Make dsl_scan print the pool name in dbgmsg If you've got multiple scrubs/resilvers going, it's rather helpful to know which pool each scan line refers to. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes: #12674	2021-10-26 17:24:14 -06:00
Allan Jude	65ad5d1165	spa.c: Replace VERIFY(nvlist_(...) == 0) with fnvlist_ (#12678 ) The fnvlist versions of the functions are fatal if they fail, saving each call from having to include checking the result. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Allan Jude <allan@klarasystems.com>	2021-10-26 17:15:38 -06:00
Pawel Jakub Dawidek	a95c82bed8	Remove code duplication Remove code duplication by moving code responsible for partial block zeroing to a separate function: dnode_partial_zero(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #12627	2021-10-18 16:50:33 -07:00
Pawel Jakub Dawidek	afbc617921	Remove FreeBSD's local copy of the dmu_buf_hold_array() function Make the main dmu_buf_hold_array() function non-static. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #12628	2021-10-13 11:01:01 -07:00
Brian Behlendorf	280d0f0ce4	Export minimal zfs_refcount interfaces Lustre makes light use of the zfs_refcount interfaces which isn't a problem when using a non-debug build of OpenZFS. However, when debugging is enabled the required symbols are not exported. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12613	2021-10-11 10:54:39 -07:00
Rich Ercolani	9d1407e8f2	Correct refcount_add in dmu_zfetch refcount_add_many(foo,N) is not the same as for (i=0; i < N; i++) { refcount_add(foo); } Unfortunately, this is only actually true with debug kernels and reference_tracking_enable=1. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12589 Closes #12602	2021-10-08 11:10:34 -07:00
Tony Hutter	2a8430a260	Rescan enclosure sysfs path on import When you create a pool, zfs writes vd->vdev_enc_sysfs_path with the enclosure sysfs path to the fault LEDs, like: vdev_enc_sysfs_path = /sys/class/enclosure/0:0:1:0/SLOT8 However, this enclosure path doesn't get updated on successive imports even if enclosure path to the disk changes. This patch fixes the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #11950 Closes #12095	2021-10-04 12:32:16 -07:00
Attila Fülöp	60b618a967	ZFS: Remove a redundant if condition (#12598 ) Commit `0c03d21ac9` left in a redundant if condition while removing some code. Just remove it. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #12598	2021-10-02 12:50:57 -06:00
Rich Ercolani	59eab1093a	Handle partial reads in zfs_read Currently, dmu_read_uio_dnode can read 64K of a requested 1M in one loop, get EFAULT back from zfs_uiomove() (because the iovec only holds 64k), and return EFAULT, which turns into EAGAIN on the way out. EAGAIN gets interpreted as "I didn't read anything", the caller tries again without consuming the 64k we already read, and we're stuck. This apparently works on newer kernels because the caller which breaks on older Linux kernels by happily passing along a 1M read request and a 64k iovec just requests 64k at a time. With this, we now won't return EFAULT if we got a partial read. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12370 Closes #12509 Closes #12516	2021-09-20 10:30:50 -07:00
George Amanakis	2a49ebbb4d	Avoid panic in case of pool errors and missing L2ARC In case an ARC buffer is allocated only on L2ARC, and there are underlying errors in a pool with the cache device in faulty state, a panic can occur in arc_read_done()->arc_hdr_destroy()-> arc_hdr_l2arc_destroy()->arc_hdr_clear_flags() when trying to free the ARC buffer. Fix this by discarding the buffer's identity in arc_hdr_destroy(), in case the buffer is not empty, before calling arc_hdr_l2hdr_destroy(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12392	2021-09-16 09:40:15 -07:00
Brian Behlendorf	6954c22f35	Use fallthrough macro As of the Linux 5.9 kernel a fallthrough macro has been added which should be used to anotate all intentional fallthrough paths. Once all of the kernel code paths have been updated to use fallthrough the -Wimplicit-fallthrough option will because the default. To avoid warnings in the OpenZFS code base when this happens apply the fallthrough macro. Additional reading: https://lwn.net/Articles/794944/ Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12441	2021-09-14 10:17:54 -06:00
Jorgen Lundman	7443299fe0	Iterate encrypted clones at zvol_create_minor Userland figures out which encryption-root keys are required to load, and issues ZFS_IOC_LOAD_KEY. The tail section of spa_keystore_load_wkey() will call zvol_create_minors() on the encryption-root object. Any clones of the encrypted zvol will not be plumbed. This commits adds additional logic to detect if zvol has clones, and is encrypted, then adds these to the list of zvols to call zvol_create_minors() on. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12471	2021-09-13 13:27:07 -07:00
Arun KV	f82f0279ed	Fixed data integrity issue when underlying disk returns error Errors in zil_lwb_write_done() are not propagated to zil_lwb_flush_vdevs_done() which can result in zil_commit_impl() not returning an error to applications even when zfs was not able to write data to the disk. Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to allow errors to propagate and consolidate the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers). Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Arun KV <arun.kv@datacore.com> Closes #12391 Closes #12443	2021-09-13 13:02:39 -07:00
Brian Behlendorf	b9ec4a15e5	Verify embedded blkptr's in arc_read() The block pointer verification check in arc_read() should also cover embedded block pointers. While highly unlikely, accessing a damaged block pointer can result in panic. To further harden the code extend the existing check to include embedded block pointers and add a comment explaining the rational for this sanity check. Lastly, correct a flaw in zfs_blkptr_verify() so the error count is checked even when checking a untrusted config to verify the non-pool-specific portions of a block pointer. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12535	2021-09-09 19:02:07 -06:00
Jorgen Lundman	5a54a4e051	Upstream: Add snapshot and zvol events For kernel to send snapshot mount/unmount events to zed. For kernel to send symlink creates/removes on zvol plumbing. (/dev/run/dsk/zvol/$pool/$zvol -> /dev/diskX) If zed misses the ENODEV, all errors after are EINVAL. Treat any error as kernel module failure. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12416	2021-09-09 10:44:21 -07:00
Allan Jude	a68e4b5919	Allow sending corrupt snapshots even if metadata is corrupted When zfs_send_corrupt_data is set, use the TRAVERSE_HARD flag, so traverse_visitbp() will not fail with ECKSUM if a blockpointer cannot be read, but rather will continue and send the objects it can. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-By: Klara Inc. Sponsored-By: WHC Online Solutions Inc. Closes #12541	2021-09-09 08:17:31 -06:00
Rich Ercolani	7676ffc51f	arc: Drop an incorrect assert Unfortunately, there was an overzealous assertion that was (in pretty specific circumstances) false, causing failure. This assertion was added in error, so we're removing it. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #9897 Closes #12020 Closes #12246	2021-09-08 14:00:03 -07:00
Paul Dagnelie	c634320e51	Compressed receive with different ashift can result in incorrect PSIZE on disk We round up the psize to the nearest multiple of the asize or to the lsize, whichever is smaller. Once that's done, we allocate a new buffer of the appropriate size, zero the tail, and copy the data into it. This adds a small performance cost to these kinds of writes, but fixes the bookkeeping problems. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12522 Closes #8462	2021-09-08 13:52:28 -07:00
Trevor Bautista	00888c0898	Extend zpool-iostat to account for ZIO_PRIORITY_REBUILD (#12319 ) Previously, zpool-iostat did not display any data regarding rebuild I/Os in either the latency/size histograms (-w/-l/-r) or the queue data (-q). This fix essentially utilizes the existing infrastructure for tracking rebuild queue data and displays this data in the proper places within zpool-iostat's output. Signed-off-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Signed-off-by: Trevor Bautista <tbautista@lanl.gov> Co-authored-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2021-08-26 11:26:49 -07:00
Mark Johnston	3ee9a997a3	Initialize parity blocks before RAID-Z reconstruction benchmarking benchmark_raidz() allocates a row to benchmark parity calculation and reconstruction. In the latter case, the parity blocks are left uninitialized, leading to reports from KMSAN. Initialize parity blocks to 0xAA as we do for the data earlier in the function. This does not affect the selected RAID-Z implementation on any of several systems tested. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12473	2021-08-23 11:10:17 -07:00
Alexander Motin	6b88b4b501	Remove b_pabd/b_rabd allocation from arc_hdr_alloc() When a header is allocated for full overwrite it is a waste of time to allocate b_pabd/b_rabd for it, since arc_write() will free them without ever being touched. If it is a read or a partial overwrite then arc_read() and arc_hdr_decrypt() allocate them explicitly. Reduced memory allocation in user threads also reduces ARC eviction throttling there, proportionally increasing it in ZIO threads, that is not good. To minimize or even avoid it introduce ARC allocation reserve, allowing certain arc_get_data_abd() callers to allocate a bit longer in situations where user threads will already throttle. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12398	2021-08-17 10:15:54 -06:00
Alexander Motin	bb7ad5d326	Optimize arc_l2c_only lists assertions It is very expensive and not informative to call multilist_is_empty() for each arc_change_state() on debug builds to check for impossible. Instead implement special index function for arc_l2c_only->arcs_list, multilists, panicking on any attempt to use it. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12421	2021-08-17 09:55:34 -06:00
Alexander Motin	cfe8e960f1	Fix/improve dbuf hits accounting Instead of clearing stats inside arc_buf_alloc_impl() do it inside arc_hdr_alloc() and arc_release(). It fixes statistics being wiped every time a new dbuf is filled from the ARC. Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits. Since the hits are accounted under hash lock, replace atomics with simple increments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12422	2021-08-17 09:50:31 -06:00
Alexander Motin	7f9d9e6f39	Avoid vq_lock drop in vdev_queue_aggregate() vq_lock is already too congested for two more operations per I/O. Instead of dropping and reacquiring it inside vdev_queue_aggregate() delegate the zio_vdev_io_bypass() and zio_execute() calls for parent I/Os to callers, that drop the lock any way to execute the new I/O. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12297	2021-08-17 09:47:00 -06:00
Alexander Motin	e829a865bf	Use more atomics in refcounts Use atomic_load_64() for zfs_refcount_count() to prevent torn reads on 32-bit platforms. On 64-bit ones it should not change anything. When built with ZFS_DEBUG but running without tracking enabled use atomics instead of mutexes same as for builds without ZFS_DEBUG. Since rc_tracked can't change live we can check it without lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12420	2021-08-17 09:44:34 -06:00
Allan Jude	e945e8d7f4	Restore FreeBSD sysctl processing for arc.min and arc.max Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max to a disallowed value would return an error. Since the switch, it instead only generates WARN_IF_TUNING_IGNORED Keep the ability to set the sysctl's specifically to 0, even though that is less than the minimum, because some tests depend on this. Also lost, was the ability to set vfs.zfs.arc_max to a value less than the default vfs.zfs.arc_min at boot time. Restore this as well. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #12161	2021-08-16 09:35:19 -06:00
Tony Nguyen	6bc61d22c4	Run arc_evict thread at higher priority Run arc_evict thread at higher priority, nice=0, to give it more CPU time which can improve performance for workload with high ARC evict activities. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #12397	2021-08-10 11:36:26 -06:00
Alexander Motin	7eebcd2be6	Avoid small buffer copying on write It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to decide whether to reallocate/copy the buffer, because the answer is OS-specific and depends on the buffer size. Instead of that use abd_size_alloc_linear(), moved into public header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12425	2021-07-27 16:05:47 -07:00
Brian Behlendorf	b72611f0f6	Fix format specifier warnings Commit `5dbf6c5a66` did not address these format specifier warnings since they were introduced by an unrelated commit which had not been rebased on `5dbf6c5a66` when merged. Fix them. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12435	2021-07-27 09:50:00 -07:00
Jorgen Lundman	273730d5b5	macOS can also set va_type Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12357	2021-07-26 16:38:06 -07:00
Alexander Motin	dd3bda39cf	Add comment on metaslab_class_throttle_reserve() locking Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Issue #12314 Closes #12419	2021-07-26 16:30:20 -07:00
George Amanakis	ab8a8f0745	Fixes in persistent L2ARC In l2arc_add_vdev() first decide whether the device is eligible for L2ARC rebuild or whole device trim and then add it to the list of cache devices. Otherwise l2arc_feed_thread() might already start writing on the device invalidating previous content as l2ad_hand = l2ad_start. However l2arc_rebuild_vdev() needs the device present in the cache device list to figure out its l2arc_dev_t. Fix this by moving most of l2arc_rebuild_vdev() in a new function l2arc_rebuild_dev() which does not need to search in the cache device list. In contrast to l2arc_add_vdev() we do not have to worry about l2arc_feed_thread() invalidating previous content when onlining a cache device. The device parameters (l2ad*) are not cleared when offlining the device and writing new buffers will not invalidate all previous content. In worst case only buffers that have not had their log block written to the device will be lost. Retire persist_l2arc_00{4,5,8} tests since they cover code already covered by the remaining ones. Test persist_l2arc_006 is renamed to persist_l2arc_004 and persist_l2arc_007 is renamed to persist_l2arc_005. Fix a typo in persist_l2arc_004, and remove an assertion that is not always true from l2arc_arcstats_pos. Also update an assertion in persist_l2arc_005 and explain why in a comment. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12365	2021-07-26 12:30:24 -07:00
наб	037af3e0d4	Remove NOTE(CONSTCOND) and note.h These were mostly used to annotate do {} while(0)s Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:53 -07:00
наб	2c69ba6444	Normalise /FALLTHR{OUGH,U}/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:39 -07:00
наб	90f1c3c946	Prune /NOTREACHED/ This includes a simplification of mkbusy and format correctness in zhack and ztest Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:26 -07:00
наб	5dbf6c5a66	Replace /PRINTFLIKEn/ with attribute(printf) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:15 -07:00
Mark Johnston	1373709450	Initialize dn_next_type[] in the dnode constructor It seems nothing ensures that this array is zeroed when a dnode is freshly allocated, so in principle it retains the values from the previous allocation. In practice it seems to be the case that the fields should end up zeroed, but we can zero the field anyway for consistency. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-07-26 11:53:47 -07:00
Mark Johnston	3a185275a0	Zero pad bytes following TX_WRITE log data When logging a TX_WRITE record in the case where file data has to be copied from the DMU, we pad the log record size to a multiple of 8 bytes. In this case, any padding bytes should be zeroed, otherwise the contents of uninitialized memory are written to the ZIL. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-07-26 11:53:47 -07:00
Mark Johnston	58714c2817	Zero pad bytes when allocating a ZIL record When allocating a record, we round up the allocation size to a multiple of 8. In this case, any padding bytes should be zeroed, otherwise the contents of uninitialized memory are written to the ZIL. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-07-26 11:53:47 -07:00
Mark Johnston	03363b2f86	Initialize all fields in zfs_log_xvattr() When logging TX_SETATTR, we could otherwise fail to initialize part of the corresponding ZIL record depending on which fields are present in the xvattr. Initialize the creation time and the AV scan timestamp to zero so that uninitialized bytes are not written to the ZIL. This was found using KMSAN. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-07-26 11:53:47 -07:00
Mark Johnston	da27b8bc7f	Initialize "autoreplace" in spa_ld_get_props() spa_prop_find() may fail to find the specified property, in which case it suppresses ENOENT from zap_lookup(). In this case, the return value is left uninitialized, so spa_autoreplace was being initialized using an uninitialized stack variable. This was found using KMSAN. It appears to be a regression from commit `9eb7b46ed0`, which removed the initialization of "autoreplace" from the definition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12383	2021-07-26 11:53:47 -07:00
Rich Ercolani	f1ca7999bb	Fix unfortunate NULL in spa_update_dspace After `1325434b`, we can in certain circumstances end up calling spa_update_dspace with vd->vdev_mg NULL, which ends poorly during vdev removal. So let's not do that further space adjustment when we can't. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12380 Closes #12428	2021-07-26 10:51:30 -07:00
Alexander Motin	1b50749ce9	Optimize allocation throttling Remove mc_lock use from metaslab_class_throttle_(). The math there is based on refcounts and so atomic, so the only race possible there is between zfs_refcount_count() and zfs_refcount_add(). But in most cases metaslab_class_throttle_reserve() is called with the allocator lock held, which covers the race. In cases where the lock is not held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we do not use zfs_refcount_count(). And even if we assume some other non-existing scenario, the worst that may happen from this race is few more I/Os get to allocation earlier, that is not a problem. Move locks and data of different allocators into different cache lines to avoid false sharing. Group spa_alloc_ arrays together into single array of aligned struct spa_alloc spa_allocs. Align struct metaslab_class_allocator. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12314	2021-07-21 06:40:36 -06:00
Kevin Jin	a7bd20e309	Add Module Parameter Regarding Log Size Limit * Add Module Parameters Regarding Log Size Limit zfs_wrlog_data_max The upper limit of TX_WRITE log data. Once it is reached, write operation is blocked, until log data is cleared out after txg sync. It only counts TX_WRITE log with WR_COPIED or WR_NEED_COPY. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12284	2021-07-20 09:40:24 -06:00
Alexander Motin	8172df643b	Minor ARC optimizations Remove unneeded global, practically constant, state pointer variables (arc_anon, arc_mru, etc.), replacing them with macros of real state variables addresses (&ARC_anon, &ARC_mru, etc.). Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special handling in inner loop of ARC reclamation. Respectively change bytes argument of arc_evict_state() from int64_t to uint64_t. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12348	2021-07-20 08:13:21 -06:00
Jorgen Lundman	e04210035e	dmu_redact.c does not call bqueue_destroy Ensure all calls to bqueue_init() has a corresponding call to bqueue_destroy() Reviewed-by: Paul Dagnelie <pcd@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12118	2021-07-20 08:08:45 -06:00
Alexander	23c13c7e80	A few fixes of callback typecasting (for the upcoming ClangCFI) * zio: avoid callback typecasting * zil: avoid zil_itxg_clean() callback typecasting * zpl: decouple zpl_readpage() into two separate callbacks * nvpair: explicitly declare callbacks for xdr_array() * linux/zfs_nvops: don't use external iput() as a callback * zcp_synctask: don't use fnvlist_free() as a callback * zvol: don't use ops->zv_free() as a callback for taskq_dispatch() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12260	2021-07-20 08:03:33 -06:00
Ryan Moeller	de12cd2511	Remove unused fields from zvol_task_t We don't use or need the pool name or value source in the zvol tasks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #12361	2021-07-19 10:02:35 -06:00
Alexander Motin	c1b5869bab	Introduce dsl_dir_diduse_transfer_space() Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling them one after another is a waste of time and even more contention. Doing that twice for each rewritten block within dbuf_write_done() via dsl_dataset_block_kill() and dsl_dataset_block_born() created one of the biggest CPU overheads in case of small blocks rewrite. dsl_dir_diduse_transfer_space() combines functionality of these two functions for cases where it is needed, but without double overhead, practically for the cost of dsl_dir_diduse_space() or even cheaper. While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang detects some aliasing there, repeating dd->dd_dbuf->db_data dereference multiple times, increasing dd_lock scope and contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Author: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12300	2021-07-16 13:39:24 -06:00
Rich Ercolani	1325434b2d	Tinker with slop space accounting with dedup * Tinker with slop space accounting with dedup Do not include the deduplicated space usage in the slop space reservation, it leads to surprising outcomes. * Update spa_dedup_dspace sometimes Sometimes, we get into spa_get_slop_space() with spa_dedup_dspace=~0ULL, AKA "unset", while spa_dspace is correctly set. So call the code to update it before we use it if we hit that case. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12271	2021-07-13 09:47:57 -06:00
Alexander Motin	f7de776da2	Fix ARC ghost states eviction accounting arc_evict_hdr() returns number of evicted bytes in scope of specific state. For ghost states it does not mean the amount of really freed memory, but the logical buffer size. It is correct for the eviction process, but not for waking up threads waiting for ARC size reduction, as added in "Revise ARC shrinker algorithm" commit, causing premature wakeups while ARC is still overflowed, allowing even bigger overflow, plus processing overhead when next allocation will also get blocked, probably also for too short time. To fix that make arc_evict_hdr() also return the amount of really freed memory, which for the ghost states is only the header, and use it to update arc_evict_count instead. Originally I was thinking to not return it at all, since arc_get_data_impl() does not account for the headers, but decided that some slow allocation progress is better than long waits, reaching on my tests up to 100ms. To reduce negative latency effects of long time periods when reclaim thread can free little real memory, start reclamation process earlier, before we actually reached the overflow threshold, when we have to throttle new allocations. We can also do it without taking global arc_evict_lock, reducing the contention. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12279	2021-07-13 09:41:59 -06:00
George Wilson	958826be7a	file reference counts can get corrupted Callers of zfs_file_get and zfs_file_put can corrupt the reference counts for the file structure resulting in a panic or a soft lockup. When zfs send/recv runs, it will add a reference count to the open file, and begin to send or recv the stream. If the file descriptor is closed, then when dmu_recv_stream() or dmu_send() return we will call zfs_file_put to remove the reference we placed on the file structure. Unfortunately, because zfs_file_put() uses the file descriptor to lookup the file structure, it may end up finding that the file descriptor table no longer contains the file struct, thus leaking the file structure. Or it might end up finding a file descriptor for a different file and blindly updating its reference counts. Other failure modes probably exists. This change reworks the zfs_file_[get\|put] interface to not rely on the file descriptor but instead pass the zfs_file_t pointer around. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-76119 Closes #12299	2021-07-10 19:00:37 -06:00
Alexander Motin	97752ba22a	Move gethrtime() calls out of vdev queue lock This dramatically reduces the lock contention on systems with slower (non-TSC) timecounters. With TSC the difference is minimal, but since this lock is pretty congested, any improvement counts. Plus I don't see any reason to do it under the lock other than the latency of the lock itself, which this change actually reduces. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12281	2021-07-06 14:38:00 -07:00
Alexander Motin	490c845efe	Compact dbuf/buf hashes and lock arrays With default dbuf cache size of 1/32 of ARC, it makes no sense to have hash table of the same size (or even bigger on Linux). Reduce it to 1/8 of ARC's one, still leaving some slack, assuming higher I/O rate via dbuf cache than via ARC. Remove padding from ARC hash locks array. The idea behind padding is to avoid false sharing between locks. It would have sense if there would be a limited number of very busy locks. But since we have no limit on the number, using the same memory for more locks we can achieve even lower lock contention with the same false sharing, or we can use less memory for the same contention level. Reduce number of hash locks from 8192 to 2048. The number is still big enough to not cause contention, but reduced memory size improves cache hit rate for mutex_tryenter() in ARC eviction thread, saving about 1% of the thread time. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12289	2021-07-01 09:30:31 -06:00
Jorgen Lundman	c6d1112bf4	Fix abd leak, kmem_free correct size of abd_t Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory. Additionally: * Switch abd_cache arena to FIRSTFIT, which empirically improves perofrmance. * Make abd_chunk_cache more performant and debuggable. * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap. * Don't try to reap non-existent qcaches in abd_cache arena. * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: Sean Doran <smd@use.net> Closes #12295	2021-07-01 09:28:15 -06:00
Jorgen Lundman	eca174527e	Upstream: dmu_zfetch_stream_fini leaks refcount dmu_zfetch_stream_fini() is missing calls to destroy the refcounts, leaking them and the mutex inside. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12294	2021-07-01 09:22:16 -06:00
Kevin Jin	50e09eddd0	Optimize txg_kick() process (#12274 ) Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in original code. Extra parameter "txg" is added for txg_kick(), thus it knows which txg to kick. Also txg_kick() call is moved from dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can know the txg number assigned for txg_kick(). Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12274	2021-07-01 09:20:27 -06:00
Alexander Motin	42afb12da7	Remove refcount from spa_config_() The only reason for spa_config_() to use refcount instead of simple non-atomic (thanks to scl_lock) variable for scl_count is tracking, hard disabled for the last 8 years. Switch to simple int scl_count reduces the lock hold time by avoiding atomic, plus makes structure fit into single cache line, reducing the locks contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12287	2021-07-01 09:16:54 -06:00
Alexander	6a19dea7f6	module/zfs: simplify ddt_stat_add() loop LLVM's Polly (ISL to be precise) is unhappy with the loop from ddt_stat_add(): CC [M] fs/zfs/zfs/ddt.o ../lib/External/isl/isl_schedule_node.c:2470: cannot insert node between set or sequence node and its filter children (building with the custom patch which adds Polly support to Kbuild) The mentioned loop is rather suboptimal. All that we need is to just treat ddt_stat_t as an array of u64 and perform 1:1 addition or substraction. This can be done in simpler for-loop with the determined index and bounds. Compiler will expand d_end - d into a number of ddt_stat_t fields at compile time. This prevents Polly from failing on this file. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12253	2021-06-29 08:26:11 -06:00
Alexander Motin	5b7053a9a5	Avoid 64bit division in multilist index functions The number of sublists in a multilist is relatively small. We dont need 64 bits to calculate an index. 32 bits is sufficient and makes the code more efficient. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12288	2021-06-29 06:59:14 -06:00
Alexander Motin	5e2c8338bf	Help compiller optimize out abd_verify() While abd_verify() does nothing when built without debug, compiler can't optimize it out by itself due to calls to external list_*() and abd_verify_scatter(). This commit makes it explicit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12280	2021-06-25 16:38:31 -07:00
Brian Behlendorf	88a4833039	Update cache file when setting compatibility property Unlike most other properties the 'compatibility' property is stored in the pool config object and not the DMU_OT_POOL_PROPS object. This had the advantage that the compatibility information is available without needing to fully import the pool (it can be read with zdb). However, this means we need to make sure to update both the copy of the config in the MOS and the cache file. This wasn't being done. This commit adds a call to spa_async_request() to ensure the copy of the config in the cache file gets updated as well as the one stored in the pool. This same change is made for the 'comment' property which suffers from the same inconsistency. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Colm Buckley <colm@tuatha.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12261 Closes #12276	2021-06-24 14:30:02 -07:00
jumbi77	86f5e0bbce	zfs_metaslab_mem_limit should be 25 instead of 75 According to current zfs man page zfs_metaslab_mem_limit should be 25 instead of 75. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: jumbi77@users.noreply.github.com Closes #12273	2021-06-24 10:02:54 -07:00
Rich Ercolani	8e739b2c9f	Annotated dprintf as printf-like ZFS loves using %llu for uint64_t, but that requires a cast to not be noisy - which is even done in many, though not all, places. Also a couple places used %u for uint64_t, which were promoted to %llu. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12233	2021-06-22 21:53:45 -07:00
Antonio Russo	a81b812495	Revert Consolidate arc_buf allocation checks This reverts commit `13fac09868`. Per the discussion in #11531, the reverted commit---which intended only to be a cleanup commit---introduced a subtle, unintended change in behavior. Care was taken to partially revert and then reapply `10b3c7f5e4` which would otherwise have caused a conflict. These changes were squashed in to this commit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Suggested-by: @chrisrd Suggested-by: robn@despairlabs.com Signed-off-by: Antonio Russo <aerusso@aerusso.net> Closes #11531 Closes #12227	2021-06-22 21:39:15 -07:00
Alexander Motin	29274c9f6d	Optimize small random numbers generation In all places except two spa_get_random() is used for small values, and the consumers do not require well seeded high quality values. Switch those two exceptions directly to random_get_pseudo_bytes() and optimize spa_get_random(), renaming it to random_in_range(), since it is not related to SPA or ZFS in general. On FreeBSD directly map random_in_range() to new prng32_bounded() KPI added in FreeBSD 13. On Linux and in user-space just reduce the type used to uint32_t to avoid more expensive 64bit division. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12183	2021-06-22 17:35:23 -06:00
Alexander Motin	c4c162c1e8	Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172 ) wmsum was designed exactly for cases like these with many updates and rare reads. It allows to completely avoid atomic operations on congested global variables. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12172	2021-06-16 18:19:34 -06:00
George Amanakis	9ffcaa370a	Avoid deadlock when removing L2ARC devices under I/O In case we have I/O and try to remove an L2ARC device a deadlock might occur. arc_read()->zio_read()->zfs_blkptr_verify() waits for SCL_VDEV to be dropped while holding the hash_lock. However, spa_l2cache_load() holds SCL_ALL and waits for the hash_lock in l2arc_evict(). Fix this by moving zfs_blkptr_verify() to the top top arc_read() before the hash_lock is taken. Verify the block pointer and return a checksum error if damaged rather than halting the system, by using BLK_VERIFY_LOG instead of BLK_VERIFY_HALT. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #12054	2021-06-16 18:17:42 -06:00
Matthew Ahrens	069bf406b4	vdev_draid_min_asize() ignores reserved space vdev_draid_min_asize() returns the minimum size of a child vdev. This is used when determining if a disk is big enough to replace a child. It's also used by zdb to determine how big of a child to make to test replacement. vdev_draid_min_asize() says that the child’s asize has to be at least 1/Nth of the entire draid’s asize, which is the same logic as raidz. However, this contradicts the code in vdev_draid_open(), which calculates the draid’s asize based on a reduced child size: An additional 32MB of scratch space is reserved at the end of each child for use by the dRAID expansion feature So the problem is that you can replace a draid disk with one that’s vdev_draid_min_asize(), but it actually needs to be larger to accommodate the additional 32MB. The replacement is allowed and everything works at first (since the reserved space is at the end, and we don’t try to use it yet), but when you try to close and reopen the pool, vdev_draid_open() calculates a smaller asize for the draid, because of the smaller leaf, which is not allowed. I think the confusion is that vdev_draid_min_asize() is correctly returning the amount of required allocatable space in a leaf, but the actual size of the leaf needs to be at least 32MB more than that. ztest_vdev_attach_detach() assumes that it can attach that size of device, and it actually can (the kernel/libzpool accepts it), but it then later causes zdb to not be able to open the pool. This commit changes vdev_draid_min_asize() to return the required size of the leaf, not the size that draid will make available to the metaslab allocator. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11459 Closes #12221	2021-06-13 10:48:53 -07:00
Alexander Motin	ffdf019cb3	Re-embed multilist_t storage This commit partially reverts changes to multilists in PR 7968 (multi-threaded spa-sync()) and adds some cache line alignments to separate read-only multilists and heavily modified refcount's to different cache lines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-by: iXsystems, Inc. Closes #12158	2021-06-10 10:42:31 -06:00
Alexander Motin	371f88d96f	Remove pool io kstats (#12212 ) This mostly reverts "3537 want pool io kstats" commit of 8 years ago. From one side this code using pool-wide locks became pretty bad for performance, creating significant lock contention in I/O pipeline. From another, there are more efficient ways now to obtain detailed statistics, while this statistics is illumos-specific and much less usable on Linux and FreeBSD, reported only via procfs/sysctls. This commit does not remove KSTAT_TYPE_IO implementation, that may be removed later together with already unused KSTAT_TYPE_INTR and KSTAT_TYPE_TIMER. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12212	2021-06-10 08:27:33 -07:00
Alan Somers	75b4cbf625	libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat `getfsstat(2)` is used to retrieve the list of mounted file systems, which libzfs uses when fetching properties like mountpoint, atime, setuid, etc. The `mode` parameter may be `MNT_NOWAIT`, which uses information in the VFS's cache, or `MNT_WAIT`, which effectively does a `statfs` on every single mounted file system in order to fetch the most up-to-date information. As far as I can tell, the only fields that libzfs cares about are the filesystem's name, mountpoint, fstypename, and mount flags. Those things are always updated on mount and unmount, so they will always be accurate in the VFS's mount cache except in two circumstances: 1) When a file system is busy unmounting 2) When a ZFS file system changes the value of a mount-overridable property like atime or setuid, but doesn't remount the file system. Right now that only happens when the property is changed by an unprivileged user who has delegated authority to change the property but not to mount the dataset. But perhaps libzfs could choose to do it for other reasons in the future. Switching to `MNT_NOWAIT` will greatly improve speed with no downside, as long as we explicitly update the mount cache whenever we change a mount-overridable property. For comparison, Illumos gets this information using the native `getmntany` and `getmntent` functions, which also use cached information. The illumos function that would refresh the cache, `resetmnttab`, is never called by libzfs. And on GNU/Linux, `getmntany` and `getmntent` don't even communicate with the kernel directly. They simply parse the file they are given, which is usually /etc/mtab or /proc/mounts. Perhaps the implementation of /proc/mounts is synchronous, ala MNT_WAIT; I don't know. Sponsored-by: Axcient Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes: #12091	2021-06-08 07:36:43 -06:00
наб	f423411c44	module/zfs: vdev_removal: spa_vdev_remove_thread: remove unused variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2021-06-07 20:58:56 -07:00
наб	6bab6ee838	module/zfs: vdev_indirect: vdev_indirect_repair: remove unused variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2021-06-07 20:58:51 -07:00
наб	f70f3e2fbc	module/zfs: dbuf: dbuf_read_impl: remove unused variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2021-06-07 20:58:47 -07:00
наб	f719d3b160	module/zfs: arc: arc_hdr_realloc_crypt: remove unused variables Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12187	2021-06-07 20:58:42 -07:00
Serapheim Dimitropoulos	86b5f4c121	Livelist logic should handle dedup blkptrs Update the logic to handle the dedup-case of consecutive FREEs in the livelist code. The logic still ensures that all the FREE entries are matched up with a respective ALLOC by keeping a refcount for each FREE blkptr that we encounter and ensuring that this refcount gets to zero by the time we are done processing the livelist. zdb -y no longer panics when encountering double frees Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #11480 Closes #12177	2021-06-07 13:09:07 -06:00
Alexander Motin	ea400129c3	More aggsum optimizations - Avoid atomic_add() when updating as_lower_bound/as_upper_bound. Previous code was excessively strong on 64bit systems while not strong enough on 32bit ones. Instead introduce and use real atomic_load() and atomic_store() operations, just an assignments on 64bit machines, but using proper atomics on 32bit ones to avoid torn reads/writes. - Reduce number of buckets on large systems. Extra buckets not as much improve add speed, as hurt reads. Unlike wmsum for aggsum reads are still important. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12145	2021-06-07 09:02:47 -07:00
Alexander Motin	86706441a8	Introduce write-mostly sums wmsum counters are a reduced version of aggsum counters, optimized for write-mostly scenarios. They do not provide optimized read functions, but instead allow much cheaper add function. The primary usage is infrequently read statistic counters, not requiring exact precision. The Linux implementation is directly mapped into percpu_counter KPI. The FreeBSD implementation is directly mapped into counter(9) KPI. In user-space due to lack of better implementation mapped to aggsum. Unfortunately neither Linux percpu_counter nor FreeBSD counter(9) provide sufficient functionality to completelly replace aggsum, so it still remains to be used for several hot counters. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12114	2021-05-27 14:27:29 -06:00
Alexander Motin	2041d6eecd	Improve scrub maxinflight_bytes math. Previously, ZFS scaled maxinflight_bytes based on total number of disks in the pool. A 3-wide mirror was receiving a queue depth of 3 disks, which it should not, since it reads from all the disks inside. For wide raidz the situation was slightly better, but still a 3-wide raidz1 received a depth of 3 disks instead of 2. The new code counts only unique data disks, i.e. 1 disk for mirrors and non-parity disks for raidz/draid. For draid the math is still imperfect, since vdev_get_nparity() returns number of parity disks per group, not per vdev, but still some better than it was. This should slightly reduce scrub influence on payload for some pool topologies by avoiding excessive queuing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closing #12046	2021-05-27 10:11:39 -06:00
vermavipinkumar	dce1bf99ec	Propagate vdev state due to invalid label corruption Propagate vdev child state to parents on invalid label Add VDEV_AUX_BAD_LABEL to print_import_config() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Co-authored-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com> Signed-off-by: Vipin Kumar Verma <vipin.verma@hpe.com> Closes #12088	2021-05-25 12:32:07 -06:00
Brian Behlendorf	8fb577ae6d	Fix dRAID sequential resilver silent damage handling This change addresses two distinct scenarios which are possible when performing a sequential resilver to a dRAID pool with vdevs that contain silent unknown damage. Which in this circumstance took the form of the devices being intentionally overwritten with zeros. However, it could also result from a device returning incorrect data while a sequential resilver was in progress. Scenario 1) A sequential resilver is performed while all of the dRAID vdevs are ONLINE and there is silent damage present on the vdev being resilvered. In this case, nothing will be repaired by vdev_raidz_io_done_reconstruct_known_missing() because rc->rc_error isn't set on any of the raid columns. To address this vdev_draid_io_start_read() has been updated to always mark the resilvering column as ESTALE for sequential resilver IO. Scenario 2) Multiple columns contain silent damage for the same block and a sequential resilver is performed. In this case it's impossible to generate the correct data from parity unless all of the damaged columns are being sequentially resilvered (and thus only good data is used to generate parity). This is as expected and there's nothing which can be done about it. However, we need to be careful not to make to situation worse. Since we can't verify the data is actually good without a checksum, we must only repair the devices which are being sequentially resilvered. Otherwise, an incorrect repair to a device which previously contained good data could effectively lock in the damage and make reconstruction impossible. A check for this was added to vdev_raidz_io_done_verified() along with a new test case. Lastly, this change updates the redundancy_draid_spare1 and redundancy_draid_spare3 test cases to be more representative of normal dRAID replacement operation. Specifically, what we care about is that the scrub run after a sequential resilver does not find additional blocks which need repair. This would indicate the sequential resilver failed to rebuild a section of one of the devices. Note also the tests were switched to using the verify_pool() function which still checks for checksum errors. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12061	2021-05-20 15:05:26 -07:00
Alexander Motin	7457b024ba	Scale worker threads and taskqs with number of CPUs While use of dynamic taskqs allows to reduce number of idle threads, hardcoded 8 taskqs of each kind is a big overkill for small systems, complicating CPU scheduling, increasing I/O reorder, etc, while providing no real locking benefits, just not needed there. On another side, 12*8 worker threads per kind are able to overload almost any system nowadays. For example, pool of several fast SSDs with SHA256 checksum makes system barely responsive during scrub, or with dedup enabled barely responsive during large file deletion. To address both problems this patch introduces ZTI_SCALE macro, alike to ZTI_BATCH, but with multiple taskqs, depending on number of CPUs, to be used in places where lock scalability is needed, while request ordering is not so much. The code is made to create new taskq for ~6 worker threads (less for small systems, but more for very large) up to 80% of CPU cores (previous 75% was not good for rounding down). Both number of threads and threads per taskq are now tunable in case somebody really wants to use all of system power for ZFS. While obviously some benchmarks show small peak performance reduction (not so big really, especially on systems with SMT, where use of the second threads does not give as much performance as the first ones), they also show dramatic latency reduction and much more smooth user- space operation in case of high CPU usage by ZFS. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11966	2021-05-14 09:13:53 -07:00
Paul Zuchowski	fce29d6aa4	Fix dmu_recv_stream test for resumable Use dsl_dataset_has_resume_receive_state() not dsl_dataset_is_zapified() to check if stream is resumable. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #12034	2021-05-13 21:46:14 -07:00
Brian Behlendorf	6217656da3	Revert "Fix raw sends on encrypted datasets when copying back snapshots" Commit `d1d4769` takes into account the encryption key version to decide if the local_mac could be zeroed out. However, this could lead to failure mounting encrypted datasets created with intermediate versions of ZFS encryption available in master between major releases. In order to prevent this situation revert `d1d4769` pending a more comprehensive fix which addresses the mount failure case. Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11294 Issue #12025 Issue #12300 Closes #12033	2021-05-13 10:00:17 -07:00
наб	38c6d6cedd	module/zfs: remove zfs_zevent_console and zfs_zevent_cols zfs_zevent_console committed multiple printk()s per line without properly continuing them ‒ a single event could easily be fragmented across over thirty lines, making it useless for direct application zfs_zevent_cols exists purely to wrap the output from zfs_zevent_console The niche this was supposed to fill can be better served by something akin to the all-syslog ZEDLET Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #7082 Closes #11996	2021-05-10 11:00:15 -07:00
Brian Behlendorf	93c8e91fe7	Fix dRAID self-healing short columns When dRAID performs a normal read operation only the data columns in the raid map are read from disk. This is enough information to calculate the checksum, verify it, and return the needed data to the application. It's only in the event of a checksum failure that the additional parity and any empty columns must be read since they are required for parity reconstruction. Reading these additional columns is handled by vdev_raidz_read_all() which calls vdev_draid_map_alloc_empty() to expand the raid_map_t and submit IOs for the missing columns. This all works correctly, but it fails to account for any "short" columns. These are data columns which are padded with a empty skip sector at the end. Since that empty sector is not needed for a normal read it's not read when columns is first read from disk. However, like the parity and empty columns the skip sector is needed to perform reconstruction. The fix is to mark any "short" columns as never being read by clearing the rc_tried flag when expanding the raid_map_t. This will cause the entire column to re-read from disk in the event of a checksum failure allowing the self-healing functionality to repair the block. Note that this only effects the self-healing feature because when scrubbing a pool the parity, data, and empty columns are all read initially to verify their contents. Furthermore, only blocks which contain "short" columns would be effected, and only when the memory backing the skip sector wasn't already zeroed out. This change extends the existing redundancy_raidz.ksh test case to verify self-healing (as well as resilver and scrub). Then applies the same test case to dRAID with a slightly modified version of the test script called redundancy_draid.ksh. The unused variable combrec was also removed from both test cases. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12010	2021-05-08 08:57:25 -07:00
Alexander Motin	4fb9e5638b	Simplify/fix dnode_move() for dn_zfetch Previous code tried to keep prefetch streams while moving dnode. But it was at least not updating per-stream zs_fetchback pointers, causing use-after-free on next access. Instead of that I see much easier and cleaner to just drop old prefetch state and start new from scratch. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11936 Closes #11998	2021-05-07 15:07:03 -07:00
Nathaniel Wesley Filardo	056a658dee	vdev_mirror: don't scrub/resilver devices that can't be read This ensures that we don't accumulate checksum errors against offline or unavailable devices but, more importantly, means that we don't needlessly create DTL entries for offline devices that are already up-to-date. Consider a 3-way mirror, with disk A always online (and so always with an empty DTL) and B and C only occasionally online. When A & B resilver with C offline, B's DTL will effectively be appended to C's due to these spurious ZIOs even as the resilver empties B's DTL: * These ZIOs land in vdev_mirror_scrub_done() and flag an error * That flagged error causes vdev_mirror_io_done() to see unexpected_errors, so it issues a ZIO_TYPE_WRITE repair ZIO, which inherits ZIO_FLAG_SCAN_THREAD because zio_vdev_child_io() includes that flag in ZIO_VDEV_CHILD_FLAGS. * That ZIO fails, too, and eventually zio_done() gets its hands on it and calls vdev_stat_update(). * vdev_stat_update() sees the error and this zio... * is not speculative, * is not due to EIO (but rather ENXIO, since the device is closed) * has an ->io_vd != NULL (specifically, the offline leaf device) * is a write * is for a txg != 0 (but rather the read block's physical birth txg) * has ZIO_FLAG_SCAN_THREAD asserted * So: vdev_stat_update() calls vdev_dtl_dirty() on the offline vdev. Then, when A & C resilver with B offline, that story gets replayed and C's DTL will be appended to B's. In fact, one does not need this permanently-broken-mirror scenario to induce badness: breaking a mirror with no DTLs and then scrubbing will create DTLs for all offline devices. These DTLs will persist until the entire mirror is reassembled for the duration of the resilver, which, incidentally, will not consider the devices with good data to be sources of good data in the case of a read failure. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nathaniel Wesley Filardo <nwfilardo@gmail.com> Closes #11930	2021-04-27 17:48:11 -07:00
Mateusz Guzik	309c32c954	Combine zio caches if possible This deduplicates 2 sets of caches which use the same allocation size. Memory savings fluctuate a lot, one sample result is FreeBSD running "make buildworld" saving ~180MB RAM in reduced page count associated with zio caches. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11877	2021-04-17 12:36:04 -07:00
Paul Zuchowski	f2286383d0	Fix crash in zio_done error reporting Fix NULL pointer dereference when reporting checksum error for gang block in zio_done. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #11872 Closes #11896	2021-04-16 11:00:53 -07:00
наб	375bdb2b20	module/zfs/zvol.c: purge unused zvol_volmode_cb_arg Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #11879	2021-04-15 14:55:37 -07:00
Jitendra Patidar	08795ab8d3	ZFS traverse_visitbp optimization to limit prefetch Traversal code, traverse_visitbp() does visit blocks recursively. Indirect (Non L0) Block of size 128k could contain, 1024 block pointers of 128 bytes. In case of full traverse OR incremental traverse, where all blocks were modified, it could traverse large number of blocks pointed by indirect. Traversal code does issue prefetch of blocks traversed below indirect. This could result into large number of async reads queued on vdev queue. So, account for prefetch issued for blocks pointed by indirect and limit max prefetch in one go. Module Param: zfs_traverse_indirect_prefetch_limit: Limit of prefetch while traversing an indirect block. Local counters: prefetched: Local counter to account for number prefetch done. pidx: Index for which next prefetch to be issued. ptidx: Index at which next prefetch to be triggered. Keep "ptidx" somewhere in the middle of blocks prefetched, so that blocks prefetch read gets the enough time window before their demand read is issued. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #11802 Closes #11803	2021-04-15 13:49:27 -07:00
Brian Behlendorf	888700bc6b	ZTS: fix removal_condense_export test case It's been observed in the CI that the required 25% of obsolete bytes in the mapping can be to high a threshold for this test resulting in condensing never being triggered and a test failure. To prevent these failures make the existing zfs_condense_indirect_obsolete_pct tuning available so the obsolete percentage can be reduced from 25% to 5% during this test. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11869	2021-04-11 21:49:13 -07:00
pstef	458f82319a	Balance parentheses in parameter descriptions Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Piotr Paweł Stefaniak <pstef@freebsd.org> Closes #11882	2021-04-11 16:35:07 -07:00
Ryan Moeller	a631283b74	Move zfsdev_state_{init,destroy} to common code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #11833	2021-04-08 21:17:43 -07:00
Ryan Moeller	1dff545278	Eliminate zfsdev_get_state_impl After `3937ab20f` zfsdev_get_state_impl can become zfsdev_get_state. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #11833	2021-04-08 21:17:18 -07:00
Brian Behlendorf	600a1dc54c	Use dsl_scan_setup_check() to setup a scrub When a rebuild completes it will automatically schedule a follow up scrub to verify all of the block checksums. Before setting up the scrub execute the counterpart dsl_scan_setup_check() function to confirm the scrub can be started. Prior to this change we'd only check vdev_rebuild_active() which isn't as comprehensive, and using the check function keeps all of this logic in one place. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11849	2021-04-08 14:33:15 -07:00
Ryan Moeller	e778b0485b	Ratelimit deadman zevents as with delay zevents Just as delay zevents can flood the zevent pipe when a vdev becomes unresponsive, so do the deadman zevents. Ratelimit deadman zevents according to the same tunable as for delay zevents. Enable deadman tests on FreeBSD and add a test for deadman event ratelimiting. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11786	2021-04-07 16:23:57 -07:00
Andrea Gelmini	bf169e9f15	Fix various typos Correct an assortment of typos throughout the code base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11774	2021-04-02 18:52:15 -07:00
Ryan Moeller	032a213e2e	Don't scale zfs_zevent_len_max by CPU count The lower bound for this scaling to too low and the upper bound is too high. Use a fixed default length of 512 instead, which is a reasonable value on any system. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11822	2021-04-01 08:45:04 -07:00
Ryan Moeller	3ba10f9a6a	Atomically check and set dropped zevent count ratelimit_dropped isn't protected by a lock and is expected to be updated atomically. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11822	2021-04-01 08:43:01 -07:00
Matthew Ahrens	2b56a63457	Use a helper function to clarify gang block size For gang blocks, `DVA_GET_ASIZE()` is the total space allocated for the gang DVA including its children BP's. The space allocated at each DVA's vdev/offset is `vdev_psize_to_asize(vd, SPA_GANGBLOCKSIZE)`. This commit makes this relationship more clear by using a helper function, `vdev_gang_header_asize()`, for the space allocated at the gang block's vdev/offset. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11744	2021-03-26 11:19:35 -07:00
Andrea Gelmini	8a915ba1f6	Removed duplicated includes Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11775	2021-03-22 12:34:58 -07:00
Alexander Motin	891568c990	Split dmu_zfetch() speculation and execution parts To make better predictions on parallel workloads dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads same time on completion, that can significantly reorder the requests, making the stream look like random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years it was a big problem for storage servers, handling deeper request queues from their clients, having to either serialize consequential reads to make ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, comparing to almost zero before. While there, I also removed per-stream zs_lock as useless, completely covered by parent zf_lock. Also I reused zs_blocks refcount to track zf_stream linkage of the stream, since I believe previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. First cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11652	2021-03-19 22:56:11 -07:00
Chunwei Chen	296a4a369b	Fix zfs_get_data access to files with wrong generation If TX_WRITE is create on a file, and the file is later deleted and a new directory is created on the same object id, it is possible that when zil_commit happens, zfs_get_data will be called on the new directory. This may result in panic as it tries to do range lock. This patch fixes this issue by record the generation number during zfs_log_write, so zfs_get_data can check if the object is valid. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #10593 Closes #11682	2021-03-19 22:53:31 -07:00
Andrew	66e6d3f128	Fix regression in POSIX mode behavior Commit `235a85657` introduced a regression in evaluation of POSIX modes that require group DENY entries in the internal ZFS ACL. An example of such a POSX mode is 007. When write_implies_delete_child is set, then ACE_WRITE_DATA is added to `wanted_dirperms` in prior to calling zfs_zaccess_common(). This occurs is zfs_zaccess_delete(). Unfortunately, when zfs_zaccess_aces_check hits this particular DENY ACE, zfs_groupmember() is checked to determine whether access should be denied, and since zfs_groupmember() always returns B_TRUE on Linux and so this check is failed, resulting ultimately in EPERM being returned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #11760	2021-03-19 22:50:46 -07:00
Martin Matuška	cd5b812818	Allow setting bootfs property on pools with indirect vdevs The FreeBSD boot loader relies on the bootfs property and is capable of booting from removed (indirect) vdevs. Reviewed-by Eric van Gyzen Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #11763	2021-03-19 22:46:43 -07:00
Serapheim Dimitropoulos	793c958f6f	Initialize metaslab range trees in metaslab_init = Motivation We've noticed several zloop crashes within Delphix generated due to the following sequence of events: - A device gets expanded and new metaslabas are allocated for it. These metaslabs go through `metaslab_init()` but haven't gone through `metaslab_sync_done()` yet. This meas that the only range tree that's actually set is the `ms_allocatable`. All the others are NULL. - A vdev_initialization is issues and `vdev_initialize_thread` starts processing one of these new metaslabs of the expanded vdev. - As part of `vdev_initialize_calculate_progress()` we call into `metaslab_load()` and `metaslab_load_impl()` which in turn tries to dereference the metaslabs trees that are still NULL and therefore we crash. The same failure can come up from the `vdev_trim` code paths. = This Patch We considered the following solutions to deal with this issue: [A] Add logic to `vdev_initialize/trim` to skip those new metaslabs. We decided against this as it would be good to avoid exposing this lower-level detail to higer-level operations. [B] Have `metaslab_load_impl()` return early for new metaslabs and thus never touch those range_trees that are NULL at that time. This seemed more of a work-around for the bug and not a clear-cut solution. [C] Refactor our logic so all metaslabs have their range_trees created at the time of their creatin in `metaslab_init()`. In this patch we decided to go with [C] because: (1) It doesn't expose more metaslab details to higher level operations such as vdev initialize and trim. (2) The current behavior of creating the range trees lazily in `metaslab_sync_done()` is unnecessarily complicated. (3) Always initializing the metaslab range_trees makes other parts of the codebase cleaner. For example, we used to use `ms_freed` as the reference value for knowing whether all the range_trees have been initialized. Now we no longer need to do that check in most places (and in the few that we do we use the `ms_new` boolean field now which is more readable). = Side Changes Probably due to a mismerge we set `ms_loaded` to `B_TRUE` twice in `metasloab_load_impl()`. In this patch we remove the extraneous assignment. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #11737	2021-03-19 22:36:02 -07:00
Matthew Ahrens	330c6c0523	Clean up RAIDZ/DRAID ereport code The RAIDZ and DRAID code is responsible for reporting checksum errors on their child vdevs. Checksum errors represent events where a disk returned data or parity that should have been correct, but was not. In other words, these are instances of silent data corruption. The checksum errors show up in the vdev stats (and thus `zpool status`'s CKSUM column), and in the event log (`zpool events`). Note, this is in contrast with the more common "noisy" errors where a disk goes offline, in which case ZFS knows that the disk is bad and doesn't try to read it, or the device returns an error on the requested read or write operation. RAIDZ/DRAID generate checksum errors via three code paths: 1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are reported on any children whose data was not used during the reconstruction. This is handled in `raidz_reconstruct()`. This is the most common type of RAIDZ/DRAID checksum error. 2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that means that the data has been lost. The zio fails and an error is returned to the consumer (e.g. the read(2) system call). This would happen if, for example, three different disks in a RAIDZ2 group are silently damaged. Since the damage is silent, it isn't possible to know which three disks are damaged, so a checksum error is reported against every child that returned data or parity for this read. (For DRAID, typically only one "group" of children is involved in each io.) This case is handled in `vdev_raidz_cksum_finish()`. This is the next most common type of RAIDZ/DRAID checksum error. 3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in case 2), but there happens to be additional copies of this block due to "ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those copies is good, then RAIDZ/DRAID compares each sector of the data or parity that it retrieved with the good data from the other DVA, and if they differ then it reports a checksum error on this child. This differs from case 2 in that the checksum error is reported on only the subset of children that actually have bad data or parity. This case happens very rarely, since normally only metadata has ditto blocks. If the silent damage is extensive, there will be many instances of case 2, and the pool will likely be unrecoverable. The code for handling case 3 is considerably more complicated than the other cases, for two reasons: 1. It needs to run after the main raidz read logic has completed. The data RAIDZ read needs to be preserved until after the alternate DVA has been read, which necessitates refcounts and callbacks managed by the non-raidz-specific zio layer. 2. It's nontrivial to map the sections of data read by RAIDZ to the correct data. For example, the correct data does not include the parity information, so the parity must be recalculated based on the correct data, and then compared to the parity that was read from the RAIDZ children. Due to the complexity of case 3, the rareness of hitting it, and the minimal benefit it provides above case 2, this commit removes the code for case 3. These types of errors will now be handled the same as case 2, i.e. the checksum error will be reported against all children that returned data or parity. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11735	2021-03-19 16:22:10 -07:00
Matthew Ahrens	46df6e98aa	Remove unused rr_code The `rr_code` field in `raidz_row_t` is unused. This commit removes the field, as well as the code that's used to set it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11736	2021-03-17 21:57:09 -07:00
Don Brady	dd0b5c8559	Reference_tracking_enable should be a module param To make use of zfs_refcount_held tunable it should be a module parameter in open-zfs. Also, since the macros will auto-generate OS specific tunables, removed the existing zfs_refcount_held reference in module/os/freebsd/zfs/sysctl_os.c. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #11753	2021-03-16 14:56:17 -07:00
Mateusz Guzik	5ebe425a5b	Macroify teardown lock handling This will allow platforms to implement it as they see fit, in particular in a different manner than rrm locks. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11153	2021-03-12 15:51:39 -08:00

... 3 4 5 6 7 ...

2917 Commits