mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-01-25 10:12:13 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	973934b965	Increase default zfs_rebuild_vdev_limit to 64MB When testing distributed rebuild performance with more capable hardware it was observed than increasing the zfs_rebuild_vdev_limit to 64M reduced the rebuild time by 17%. Beyond 64MB there was some improvement (~2%) but it was not significant when weighed against the increased memory usage. Memory usage is capped at 1/4 of arc_c_max. Additionally, vr_bytes_inflight_max has been moved so it's updated per-metaslab to allow the size to be adjust while a rebuild is running. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-01-27 10:02:24 -08:00
Brian Behlendorf	c0aea7cf4e	Increase default zfs_scan_vdev_limit to 16MB For HDD based pools the default zfs_scan_vdev_limit of 4M per-vdev can significantly limit the maximum scrub performance. Increasing the default to 16M can double the scrub speed from 80 MB/s per disk to 160 MB/s per disk. This does increase the memory footprint during scrub/resilver but given the performance win this is a reasonable trade off. Memory usage is capped at 1/4 of arc_c_max. Note that number of outstanding I/Os has not changed and is still limited by zfs_vdev_scrub_max_active. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-01-27 10:01:13 -08:00
Alexander Motin	dc5c8006f6	Prefetch on deadlists merge During snapshot deletion ZFS may issue several reads for each deadlist to merge them into next snapshot's or pool's bpobj. Number of the dead lists increases with number of snapshots. On HDD pools it may take significant time during which sync thread is blocked. This patch introduces prescient prefetch of required blocks for up to 128 deadlists ahead. Tests show reduction of time required to delete dataset with 720 snapshots with randomly overwritten file on wide HDD pool from 75-85 to 22-28 seconds. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Issue #14276 Closes #14402	2023-01-25 11:30:24 -08:00
Brian Behlendorf	c85ac731a0	Improve resilver ETAs When resilvering the estimated time remaining is calculated using the average issue rate over the current pass. Where the current pass starts when a scan was started, or restarted, if the pool was exported/imported. For dRAID pools in particular this can result in wildly optimistic estimates since the issue rate will be very high while scanning when non-degraded regions of the pool are scanned. Once repair I/O starts being issued performance drops to a realistic number but the estimated performance is still significantly skewed. To address this we redefine a pass such that it starts after a scanning phase completes so the issue rate is more reflective of recent performance. Additionally, the zfs_scan_report_txgs module option can be set to reset the pass statistics more often. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14410	2023-01-25 11:28:54 -08:00
Alexander Motin	0f740a4f1d	Introduce minimal ZIL block commit delay Despite all optimizations, tests on actual hardware show that FreeBSD kernel can't sleep for less then ~2us. Similar tests on Linux show ~50us delay at least from nanosleep() (haven't tested inside kernel). It means that on very fast log device ZIL may not be able to satisfy zfs_commit_timeout_pct block commit timeout, increasing log latency more than desired. Handle that by introduction of zil_min_commit_timeout parameter, specifying minimal timeout value where additional delays to aggregate writes may be skipped. Also skip delays if the LWB is more than 7/8 full, that often happens if I/O sizes are constant and match one of LWB sizes. Both things are applied only if there were no already outstanding log blocks, that may indicate single-threaded workload, that by definition can not benefit from the commit delays. While there, add short time moving average to zl_last_lwb_latency to make it more stable. Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS increase by 9% instead of expected 5%. For zfs_commit_timeout_pct of 1 there IOPS increase by 5.5% instead of expected 1%. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14418	2023-01-24 09:20:32 -08:00
David Hedberg	37a27b4306	Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole If we receive a DRR_FREEOBJECTS as the first entry in an object range, this might end up producing a hole if the freed objects were the only existing objects in the block. If the txg starts syncing before we've processed any following DRR_OBJECT records, this leads to a possible race where the backing arc_buf_t gets its psize set to 0 in the arc_write_ready() callback while still being referenced from a dirty record in the open txg. To prevent this, we insert a txg_wait_synced call if the first record in the range was a DRR_FREEOBJECTS that actually resulted in one or more freed objects. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Hedberg <david.hedberg@findity.com> Sponsored by: Findity AB Closes #11893 Closes #14358	2023-01-23 13:19:43 -08:00
Richard Yao	73968defdd	Reject streams that set ->drr_payloadlen to unreasonably large values In the zstream code, Coverity reported: "The argument could be controlled by an attacker, who could invoke the function with arbitrary values (for example, a very high or negative buffer size)." It did not report this in the kernel. This is likely because the userspace code stored this in an int before passing it into the allocator, while the kernel code stored it in a uint32_t. However, this did reveal a potentially real problem. On 32-bit systems and systems with only 4GB of physical memory or less in general, it is possible to pass a large enough value that the system will hang. Even worse, on Linux systems, the kernel memory allocator is not able to support allocations up to the maximum 4GB allocation size that this allows. This had already been limited in userspace to 64MB by `ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to protect systems. After some discussion, we settle on 256MB as a hard upper limit. Attempting to receive a stream that requires more memory than that will result in E2BIG being returned to user space. Reported-by: Coverity (CID-1529836) Reported-by: Coverity (CID-1529837) Reported-by: Coverity (CID-1529838) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14285	2023-01-23 13:16:22 -08:00
rob-wing	69f024a56e	Configure zed's diagnosis engine with vdev properties Introduce four new vdev properties: checksum_n checksum_t io_n io_t These properties can be used for configuring the thresholds of zed's diagnosis engine and are interpeted as <N> events in T <seconds>. When this property is set to a non-default value on a top-level vdev, those thresholds will also apply to its leaf vdevs. This behavior can be overridden by explicitly setting the property on the leaf vdev. Note that, these properties do not persist across vdev replacement. For this reason, it is advisable to set the property on the top-level vdev instead of the leaf vdev. The default values for zed's diagnosis engine (10 events, 600 seconds) remains unchanged. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Sponsored-by: Seagate Technology LLC Closes #13805	2023-01-23 13:14:25 -08:00
Richard Yao	f091db9248	free_blocks(): Fix reports from 2016 PVS Studio FreeBSD report In 2016, the authors of PVS Studio ran it on the FreeBSD kernel, which identified a number of bugs / cleanup opportunities in the FreeBSD ZFS kernel code. A few of them persist to the present day: https://reviews.freebsd.org/D5245 Note that the scan was done against freebsd/freebsd-src@46763fd4ca. In particular, we have the following in free_blocks(): \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (174): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (171): error V634: The priority of the '' operation is higher than that of the '<<' operation. It's possible that parentheses should be used in the expression. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\dnode_sync.c (175): error V547: Expression '__left >= __right' is always true. Unsigned type value is always >= 0. A couple of assertions accidentally typecast the arguments they check to unsigned in such a way that the result is always true. Also, parentheses are missing around `1<<epbs` in `(db->db_blkid 1<<epbs)`. This works out to be okay due to multiplication not caring what order of operations we use, but it is better to fix it to be `(db->db_blkid << epbs)`. A few of the function local variables probably never should have been 32-bit in the first place, so we make them 64-bit. We also replace the existing assertions with additional assertions to ensure that 64-bit unsigned arithmetic is safe. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14407	2023-01-23 13:12:37 -08:00
Chunwei Chen	71974946be	Fix reading uninitialized variable in receive_read When zfs_file_read returns error, resid may be uninitialized. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #14404	2023-01-20 11:49:56 -08:00
Richard Yao	856cefcd1c	Cleanup ->dd_space_towrite should be unsigned This is only ever used with unsigned data, so the type itself should be unsigned. Also, PVS Studio's 2016 FreeBSD kernel report correctly identified the following assertion as always being true, so we can drop it: ASSERT3U(dd->dd_space_towrite[i & TXG_MASK], >=, 0); The reason it was always true is because it would do casts to give us unsigned comparisons. This could have been fixed by switching to `ASSERT3S()`, but upon inspection, it turned out that this variable never should have been allowed to be signed in the first place. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14408	2023-01-20 11:10:15 -08:00
Mark Johnston	ebabb93e6c	Micro-optimize dsl_prop_get_dd() Use the saved property index instead of looking it up once per DSL directory when traversing up towards the root. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: The FreeBSD Foundation Closes #14397	2023-01-20 11:01:41 -08:00
Mark Johnston	7c30100c00	Avoid passing an uninitialized index to dsl_prop_known_index Reported-by: KMSAN Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: The FreeBSD Foundation Closes #14397	2023-01-20 11:00:38 -08:00
Richard Yao	2e7f664f04	Cleanup of dead code suggested by Clang Static Analyzer (#14380 ) I recently gained the ability to run Clang's static analyzer on the linux kernel modules via a few hacks. This extended coverage to code that was previously missed since Clang's static analyzer only looked at code that we built in userspace. Running it against the Linux kernel modules built from my local branch produced a total of 72 reports against my local branch. Of those, 50 were reports of logic errors and 22 were reports of dead code. Since we already had cleaned up all of the previous dead code reports, I felt it would be a good next step to clean up these dead code reports. Clang did a further breakdown of the dead code reports into: Dead assignment 15 Dead increment 2 Dead nested assignment 5 The benefit of cleaning these up, especially in the case of dead nested assignment, is that they can expose places where our error handling is incorrect. A number of them were fairly straight forward. However several were not: In vdev_disk_physio_completion(), not only were we not using the return value from the static function vdev_disk_dio_put(), but nothing used it, so I changed it to return void and removed the existing (void) cast in the other area where we call it in addition to no longer storing it to a stack value. In FSE_createDTable(), the function is dead code. Its helper function FSE_freeDTable() is also dead code, as are the CPP definitions in `module/zstd/include/zstd_compat_wrapper.h`. We just delete it all. In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig() returns 0 if there are waiting signals and 1 if there are none. The Linux SPL version literally returns `signal_pending(current) ? 0 : 1)` and FreeBSD implements the same semantics, we can just do `!cv_wait_sig()` in place of `signal_pending(current)` to avoid unnecessarily calling it again. zfs_setattr() on FreeBSD version did not have error handling issue because the code was removed entirely from FreeBSD version. The error is from updating the attribute directory's files. After some thought, I decided to propapage errors on it to userspace. In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the first check in favor of checking three other permissions. I assume this is intentional. In zfs_create_fs(), the return value of zap_update() was not checked despite setting an important version number. I see no backward compatibility reason to permit failures, so we add an assertion to catch failures. Interestingly, Linux is still using ASSERT(error == 0) from OpenSolaris while FreeBSD has switched to the improved ASSERT0(error) from illumos, although illumos has yet to adopt it here. ASSERT(error == 0) was used on Linux while ASSERT0(error) was used on FreeBSD since the entire file needs conversion and that should be the subject of another patch. dnode_move()'s issue was caused by us not having implemented POINTER_IS_VALID() on Linux. We have a stub in `include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be in `include/os/linux/spl/sys/kmem.h` to be consistent with Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and `POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we copy what it did. Whenever a report was in platform-specific code, I checked the FreeBSD version to see if it also applied to FreeBSD, but it was only relevant a few times. Lastly, the patch that enabled Clang's static analyzer to be run on the Linux kernel modules needs more work before it can be put into a PR. I plan to do that in the future as part of the on-going static analysis work that I am doing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14380	2023-01-17 09:57:12 -08:00
Richard Yao	d27c7ba62f	Linux ppc64le ieee128 compat: Do not redefine __asm on external headers There is an external assembly declaration extension in GNU C that glibc uses when building with ieee128 floating point support on ppc64le. Marking that as volatile makes no sense, so the build breaks. It does not make sense to only mark this as volatile on Linux, since if do not want the compiler reordering things on Linux, we do not want the compiler reordering things on any other platform, so we stop treating Linux specially and just manually inline the CPP macro so that we can eliminate it. This should fix the build on ppc64le. Tested-by: @gyakovlev Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14308 Closes #14384	2023-01-13 10:58:58 -08:00
Richard Yao	4ef69de384	Cleanup: Use NULL when doing NULL pointer comparisons The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/null/badzero.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:37 -08:00
Richard Yao	64195fc89f	Cleanup: Remove unneeded semicolons The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/semicolon.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:30 -08:00
Richard Yao	3b2f9c1ec8	Cleanup: Use MIN() macro The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/minmax.cocci There was a third opportunity to use `MIN()`, but that was in `FSE_minTableLog()` in `module/zstd/lib/compress/fse_compress.c`. Upstream zstd has yet to make this change and I did not want to change header includes just for MIN, or do a one off, so I left it alone. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:23 -08:00
Richard Yao	9c8fabffa2	Cleanup: Replace oldstyle struct hack with C99 flexible array members The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/flexible_array.cocci However, unlike the cases where the GNU zero length array extension had been used, coccicheck would not suggest patches for the older style single member arrays. That was good because blindly changing them would break size calculations in most cases. Therefore, this required care to make sure that we did not break size calculations. In the case of `indirect_split_t`, we use `offsetof(indirect_split_t, is_child[is->is_children])` to calculate size. This might be subtly wrong according to an old mailing list thread: https://inbox.sourceware.org/gcc-prs/20021226123454.27019.qmail@sources.redhat.com/T/ That is because the C99 specification should consider the flexible array members to start at the end of a structure, but compilers prefer to put padding at the end. A suggestion was made to allow compilers to allocate padding after the VLA like compilers already did: http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n983.htm However, upon thinking about it, whether or not we allocate end of structure padding does not matter, so using offsetof() to calculate the size of the structure is fine, so long as we do not mix it with sizeof() on structures with no array members. In the case that we mix them and padding causes offsetof(struct_t, vla_member[0]) to differ from sizeof(struct_t), we would be doing unsafe operations if we underallocate via `offsetof()` and then overcopy via sizeof(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 16:00:03 -08:00
Richard Yao	8e7ebf4e2d	Cleanup: Use C99 flexible array members instead of zero length arrays The Linux 5.16.14 kernel's coccicheck caught this. The semantic patch that caught it was: ./scripts/coccinelle/misc/flexible_array.cocci The Linux kernel's documentation makes a good case for why we should not use these: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:41 -08:00
Richard Yao	7384ec65cd	Cleanup: Remove unnecessary explicit casts of pointers from allocators The Linux 5.16.14 kernel's coccicheck caught these. The semantic patch that caught them was: ./scripts/coccinelle/api/alloc/alloc_cast.cocci Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14372	2023-01-12 15:59:12 -08:00
George Amanakis	eee9362a72	Activate filesystem features only in syncing context When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14304 Closes #14252	2023-01-11 18:00:39 -08:00
Mateusz Piotrowski	926715b9fc	Turn default_bs and default_ibs into ZFS_MODULE_PARAMs The default_bs and default_ibs tunables control the default block size and indirect block size. So far, default_bs and default_ibs were tunable only on FreeBSD, e.g., sysctl vfs.zfs.default_ibs Remove the FreeBSD-specific sysctl code and expose default_bs and default_ibs as tunables on both Linux and FreeBSD using ZFS_MODULE_PARAM. One of the use cases for changing the values of those tunables is to lower the indirect block size, which may improve performance of large directories (as discussed during the OpenZFS Leadership Meeting on 2022-08-16). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14293	2023-01-11 09:38:20 -08:00
Mateusz Piotrowski	a4b21eadec	Add tunable to allow changing micro ZAP's max size This change turns `MZAP_MAX_BLKSZ` into a `ZFS_MODULE_PARAM()` called `zap_micro_max_size`. As a result, we can experiment with different micro ZAP sizes to improve directory size scaling. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com> Co-authored-by: Toomas Soome <toomas.soome@klarasystems.com> Signed-off-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14292	2023-01-10 13:41:54 -08:00
Matthew Ahrens	fc45975ec8	Batch enqueue/dequeue for bqueue The Blocking Queue (bqueue) code is used by zfs send/receive to send messages between the various threads. It uses a shared linked list, which is locked whenever we enqueue or dequeue. For workloads which process many blocks per second, the locking on the shared list can be quite expensive. This commit changes the bqueue logic to have 3 linked lists: 1. An enquing list, which is used only by the (single) enquing thread, and thus needs no locks. 2. A shared list, with an associated lock. 3. A dequing list, which is used only by the (single) dequing thread, and thus needs no locks. The entire enquing list can be moved to the shared list in constant time, and the entire shared list can be moved to the dequing list in constant time. These operations only happen when the `fill_fraction` is reached, or on an explicit flush request. Therefore, the lock only needs to be acquired infrequently. The API already allows for dequing to block until an explicit flush, so callers don't need to be changed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14121	2023-01-10 13:39:22 -08:00
Brian Behlendorf	0c8fbe5b6a	ztest: update ztest_dmu_snapshot_create_destroy() ECHRNG is returned when the channel program encounters a runtime error. For example, this can happen when a snapshot doesn't exist. We handle this error the same way as the existing EEXIST and ENOENT error checks. Additionally, improve the internal debug message to include the error describing why a pool couldn't be opened. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14351	2023-01-10 13:27:48 -08:00
Matthew Ahrens	40d7e971ff	ztest fails assertion in zio_write_gang_member_ready() Encrypted blocks can have up to 2 DVA's, as the third DVA is reserved for the salt+IV. However, dmu_write_policy() allows non-encrypted blocks (e.g. DMU_OT_OBJSET) inside encrypted datasets to request and allocate 3 DVA's, since they don't need a salt+IV (they are merely authenicated). However, if such a block becomes a gang block, the gang code incorrectly limits the gang block header to 2 DVA's. This leads to a "NDVAs inversion", where a parent block (the gang block header) has less DVA's than its children (the gang members), causing an assertion failure in zio_write_gang_member_ready(). This commit addresses the problem by only restricting the gang block header to 2 DVA's if the block is actually encrypted (and thus its gang block members can have at most 2 DVA's). Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14250 Closes #14356	2023-01-09 16:43:45 -08:00
Ameer Hamza	5091867ee6	zed: add hotplug support for spare vdevs This commit supports for spare vdev hotplug. The spare vdev associated with all the pools will be marked as "Removed" when the drive is physically detached and will become "Available" when the drive is reattached. Currently, the spare vdev status does not change on the drive removal and the same is the case with reattachment. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14295	2023-01-09 12:43:03 -08:00
Alexander Motin	289f7e6adb	Remove some dead ARC code. (#14340 ) Every ARC buffer holds a reference on the header. It means headers with buffers are never evictable. When we are evicting a header, there can be no more buffers to free. Just assert that. b_evict_lock seems not protecting anything now. Remove it. Buffers checksum should also be freed with the last uncompressed buffer, so it should not be there also when we are evicting the header. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.	2023-01-09 10:45:17 -08:00
Alexander Motin	bacf366fe2	Hide b_freeze_* under ZFS_DEBUG This saves 40 bytes per full ARC header, reducing it on FreeBSD from 240 to 200 bytes on production bits. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14315	2023-01-05 10:15:31 -07:00
Alexander Motin	ed2f7ba08d	Implement uncached prefetch Previously the primarycache property was handled only in the dbuf layer. Since the speculative prefetcher is implemented in the ARC, it had to be disabled for uncacheable buffers. This change gives the ARC knowledge about uncacheable buffers via arc_read() and arc_write(). So when remove_reference() drops the last reference on the ARC header, it can either immediately destroy it, or if it is marked as prefetch, put it into a new arc_uncached state. That state is scanned every second, evicting stale buffers that were not demand read. This change also tracks dbufs that were read from the beginning, but not to the end. It is assumed that such buffers may receive further reads, and so they are stored in dbuf cache. If a following reads reaches the end of the buffer, it is immediately evicted. Otherwise it will follow regular dbuf cache eviction. Since the dbuf layer does not know actual file sizes, this logic is not applied to the final buffer of a dnode. Since uncacheable buffers should no longer stay in the ARC for long, this patch also tries to optimize I/O by allocating ARC physical buffers as linear to allow buffer sharing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14243	2023-01-04 17:29:54 -07:00
Alexander Motin	c935fe2e92	arc_read()/arc_access() refactoring and cleanup ARC code was many times significantly modified over the years, that created significant amount of tangled and potentially broken code. This should make arc_access()/arc_read() code some more readable. - Decouple prefetch status tracking from b_refcnt. It made sense originally, but became highly cryptic over the years. Move all the logic into arc_access(). While there, clean up and comment state transitions in arc_access(). Some transitions were weird IMO. - Unify arc_access() calls to arc_read() instead of sometimes calling it from arc_read_done(). To avoid extra state changes and checks add one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS. - Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with the same callback mechanism to not falsely account them as hits. Count those as "iohits", an intermediate between "hits" and "misses". While there, call read callbacks in original request order, that should be good for fairness and random speculations/allocations/aggregations. - Introduce additional statistic counters for prefetch, accounting predictive vs prescient and hits vs iohits vs misses. - Remove hash_lock argument from functions not needing it. - Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places. - Fix few false positive tests found in the process. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14123	2022-12-22 12:10:24 -08:00
Matthew Ahrens	018f26041d	deadlock between spa_errlog_lock and dp_config_rwlock There is a lock order inversion deadlock between `spa_errlog_lock` and `dp_config_rwlock`: A thread in `spa_delete_dataset_errlog()` is running from a sync task. It is holding the `dp_config_rwlock` for writer (see `dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`. A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock` (see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as reader). Note that this was introduced by #12812. This commit address this by defining the lock ordering to be dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock. spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this order, and then process_error_block() and get_head_and_birth_txg() can verify that the dp_config_rwlock is already held. Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many code paths didn't check if `*count` got to zero, instead continuing to overwrite past the beginning of the userspace buffer at `uaddr`. Tested by having some errors in the pool (via `zinject -t data /path/to/file`), one thread running `zpool iostat 0.001`, and another thread runs `zfs destroy` (in a loop, although it hits the first time). This reproduces the problem easily without the fix, and works with the fix. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14239 Closes #14289	2022-12-22 11:48:49 -08:00
Richard Yao	f3f5263f8a	Zero end of embedded block buffer in dump_write_embedded() This fixes a kernel stack leak. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: Nicholas Sherlock <n.sherlock@gmail.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13778 Closes #14255	2022-12-13 17:31:47 -08:00
Richard Yao	3236c0b891	Cache dbuf_hash() calculation We currently compute a 64-bit hash three times, which consumes 0.8% CPU time on ARC eviction heavy workloads. Caching the 64-bit value in the dbuf allows us to avoid that overhead. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Closes #14251	2022-12-13 17:29:21 -08:00
Allan Jude	dc95911d21	zfs list: Allow more fields in ZFS_ITER_SIMPLE mode If the fields to be listed and sorted by are constrained to those populated by dsl_dataset_fast_stat(), then zfs list is much faster, as it does not need to open each objset and reads its properties. A previous optimization by Pawel Dawidek (`0cee24064a`) took advantage of this to make listing snapshot names sorted only by name much faster. However, it was limited to `-o name -s name`, this work extends this optimization to work with: - name - guid - createtxg - numclones - inconsistent - redacted - origin and could be further extended to any other properties supported by dsl_dataset_fast_stat() or similar, that do not require extra locking or reading from disk. This was committed before (9a9e2e343dfa2af28bf7910de77ae73aa006de62), but was reverted due to a regression when used with an older kernel. If the kernel does not populate zc->zc_objset_stats, we now fallback to getting the properties via the slower interface, to avoid problems with newer userland and older kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14110	2022-12-13 17:27:54 -08:00
Serapheim Dimitropoulos	7bf4c97a36	Bypass metaslab throttle for removal allocations Context: We recently had a scenario where a customer with 2x10TB disks at 95+% fragmentation and capacity, wanted to migrate their disks to a 2x20TB setup. So they added the 2 new disks and submitted the removal of the first 10TB disk. The removal took a lot more than expected (order of more than a week to 2 weeks vs a couple of days) and once it was done it generated a huge indirect mappign table in RAM (~16GB vs expected ~1GB). Root-Cause: The removal code calls `metaslab_alloc_dva()` to allocate a new block for each evacuating block in the removing device and it tries to batch them into 16MB segments. If it can't find such a segment it tries for 8MBs, 4MBs, all the way down to 512 bytes. In our scenario what would happen is that `metaslab_alloc_dva()` from the removal thread pick the new devices initially but wouldn't allocate from them because of throttling in their metaslab allocation queue's depth (see `metaslab_group_allocatable()`) as these devices are new and favored for most types of allocations because of their free space. So then the removal thread would look at the old fragmented disk for allocations and wouldn't find any contiguous space and finally retry with a smaller allocation size until it would to the low KB range. This caused a lot of small mappings to be generated blowing up the size of the indirect table. It also wasted a lot of CPU while the removal was active making everything slow. This patch: Make all allocations coming from the device removal thread bypass the throttle checks. These allocations are not even counted in the metaslab allocation queues anyway so why check them? Side-Fix: Allocations with METASLAB_DONT_THROTTLE in their flags would not be accounted at the throttle queues but they'd still abide by the throttling rules which seems wrong. This patch fixes this by checking for that flag in `metaslab_group_allocatable()`. I did a quick check to see where else this flag is used and it doesn't seem like this change would cause issues. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #14159	2022-12-09 10:48:33 -08:00
Richard Yao	242a5b748c	Fix dereference after null check in enqueue_range If the bp is NULL, we have a hole. However, when we build with assertions, we will dereference bp when `blkid == DMU_SPILL_BLKID`. When this happens on a hole, we will have a NULL pointer dereference. Reported-by: Coverity (CID-1524670) Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14264	2022-12-08 14:15:21 -08:00
Richard Yao	56c6f293c0	Remove duplicate statically allocated variable dsl_dataset_snapshot_sync_impl() declares `static zil_header_t zero_zil __maybe_unused;`, but this is also declared globally. This wastes memory. CodeQL's cpp/local-variable-hides-global-variable check caught this. Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14263	2022-12-08 13:52:42 -08:00
George Wilson	ffd2e15d65	zio can deadlock during device removal When doing a device removal on a pool with gang blocks, the zio pipeline can deadlock when trying to free blocks from a device which is being removed with a stack similar to this: 0xffff8ab9a13a1740 UNINTERRUPTIBLE 4 __schedule+0x2e5 __schedule+0x2e5 schedule+0x33 schedule_preempt_disabled+0xe __mutex_lock.isra.12+0x2a7 __mutex_lock.isra.12+0x2a7 __mutex_lock_slowpath+0x13 mutex_lock+0x2c free_from_removing_vdev+0x61 metaslab_free_impl+0xd6 metaslab_free_dva+0x5e metaslab_free+0x196 zio_free_sync+0xe4 zio_free_gang+0x38 zio_gang_tree_issue+0x42 zio_gang_tree_issue+0xa2 zio_gang_issue+0x6d zio_execute+0x94 zio_execute+0x94 taskq_thread+0x23b kthread+0x120 ret_from_fork+0x1f Since there are gang blocks we have to read the gang members as part of the free. This can be seen with a zio dependency tree that looks like this: sdb> echo 0xffff900c24f8a700 \| zio -rc \| zio ADDRESS TYPE STAGE WAITER 0xffff900c24f8a700 NULL CHECKSUM_VERIFY 0xffff900ddfd31740 0xffff900c24f8c920 FREE GANG_ASSEMBLE - 0xffff900d93d435a0 READ DONE In the illustration above we are processing frees but because of gang block we have to read the constituents blocks. Once we finish the READ in the zio pipeline we will execute the parent. In this case the parent is a FREE but the zio taskq is a READ and we continue to process the pipeline leading to the stack above. In the stack above, we are blocked waiting for the svr_lock so as a result a READ interrupt taskq thread is now consumed. Eventually, all of the READ taskq threads end up blocked and we're unable to complete any read requests. In zio_notify_parent there is an optimization to continue to use the taskq thread to exectue the parent's pipeline. To resolve the deadlock above, we only allow this optimization if the parent's zio type matches the child which just completed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-80130 Closes #14236	2022-12-02 17:46:29 -08:00
George Wilson	d7cf06a25d	nopwrites on dmu_sync-ed blocks can result in a panic After a device has been removed, any nopwrites for blocks on that indirect vdev should be ignored and a new block should be allocated. The original code attempted to handle this but used the wrong block pointer when checking for indirect vdevs and failed to check all DVAs. This change corrects both of these issues and modifies the test case to ensure that it properly tests nopwrites with device removal. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #14235	2022-12-02 17:45:33 -08:00
Rob Wing	7a75f74cec	Bump checksum error counter before reporting to ZED The checksum error counter is incremented after reporting to ZED. This leads ZED to receiving a checksum error report with 0 checksum errors. To avoid this, bump the checksum error counter before reporting to ZED. Sponsored-by: Seagate Technology LLC Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Closes #14190	2022-12-02 17:42:22 -08:00
szubersk	fe975048da	Fix Clang 15 compilation errors - Clang 15 doesn't support `-fno-ipa-sra` anymore. Do a separate check for `-fno-ipa-sra` support by $KERNEL_CC. - Don't enable `-mgeneral-regs-only` for certain module files. Fix #13260 - Scope `GCC diagnostic ignored` statements to GCC only. Clang doesn't need them to compile the code. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #13260 Closes #14150	2022-11-30 13:46:26 -08:00
Richard Yao	97fac0fb70	Fix NULL pointer dereference in dbuf_prefetch_indirect_done() When ZFS is built with assertions, a prefetch is done on a redacted blkptr and `dpa->dpa_dnode` is NULL, we will have a NULL pointer dereference in `dbuf_prefetch_indirect_done()`. Both Coverity and Clang's Static Analyzer caught this. Reported-by: Coverity (CID 1524671) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14210	2022-11-29 10:00:50 -08:00
Richard Yao	8532da5e20	Cleanup: Delete dead code from send_merge_thread() range is always deferenced before it reaches this check, such that the kmem_zalloc() call is never executed. A previously version of this had erronously also pruned the `range->eos_marker = B_TRUE` line, but it must be set whenever we encounter an error or are cancelled early. Coverity incorrectly complained about a potential NULL pointer dereference because of this. Reported-by: Coverity (CID 1524550) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14210	2022-11-29 09:59:53 -08:00
Alexander	b5459dd354	Fix the last two CFI callback prototype mismatches There was the series from me a year ago which fixed most of the callback vs implementation prototype mismatches. It was based on running the CFI-enabled kernel (in permissive mode -- warning instead of panic) and performing a full ZTS cycle, and then fixing all of the problems caught by CFI. Now, Clang 16-dev has new warning flag, -Wcast-function-type-strict, which detect such mismatches at compile-time. It allows to find the remaining issues missed by the first series. There are only two of them left: one for the secpolicy_vnode_setattr() callback and one for taskq_dispatch(). The fix is easy, since they are not used anywhere else. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #14207	2022-11-29 09:56:16 -08:00
Alexander Motin	fd61b2eaba	Remove few pointer dereferences in dbuf_read() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14199	2022-11-29 09:49:02 -08:00
Alexander Motin	4df415aa86	Switch dnode stats to wmsums I've noticed that some of those counters are used in hot paths like dnode_hold_impl(), and results of this change is visible in profiler. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14198	2022-11-29 09:33:45 -08:00
Alexander Motin	f0a76fbec1	Micro-optimize zrl_remove() atomic_dec_32() should be a bit lighter than atomic_dec_32_nv(). Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #14200	2022-11-29 09:26:03 -08:00
Ameer Hamza	e996c502e4	zed: unclean disk attachment faults the vdev If the attached disk already contains a vdev GUID, it means the disk is not clean. In such a scenario, the physical path would be a match that makes the disk faulted when trying to online it. So, we would only want to proceed if either GUID matches with the last attached disk or the disk is in a clean state. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14181	2022-11-29 09:24:10 -08:00

1 2 3 4 5 ...

2844 Commits