mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-13 15:11:45 +03:00

Author	SHA1	Message	Date
Andriy Tkachuk	b403040c4c	draid: fix data corruption after disk clear Currently, when there there are several faulted disks with attached dRAID spares, and one of those disks is cleared from errors (zpool clear), followed by its spare being detached, the data in all the remaining spares that were attached while the cleared disk was in FAULTED state might get corrupted (which can be seen by running scrub). In some cases, when too many disks get cleared at a time, this can result in data corruption/loss. dRAID spare is a virtual device whose blocks are distributed among other disks. Those disks can be also in FAULTED state with attached spares on their own. When a disk gets sequentially resilvered (rebuilt), the changes made by that resilvering won't get captured in the DTL (Dirty Time Log) of other FAULTED disks with the attached spares to which the data is written during the resilvering (as it would normally be done for the changes made by the user if a new file is written or some existing one is deleted). It is because sequential resilvering works on the block level, without touching or looking into metadata, so it doesn't know anything about the old BPs or transactions groups that it is resilvering. So later on, when that disk gets cleared from errors and healing resilvering is trying to sync all the data from its spare onto it, all the changes made on its spare during the resilvering of other disks will be missed because they won't be captured in its DTL. That's why other dRAID spares may get corrupted. Here's another way to explain it that might be helpful. Imagine a scenario: 1. d1 fails and gets resilvered to some spare s1 - OK. 2. d2 fails and gets sequentially resilvered on draid spare s2. Now, in some slices, s2 would map to d1, which is failed. But d1 has s1 spare attached, so the data from that resilvering goes to s1, but not recorded in d1's DTL. 3. Now, d1 gets cleared and its s1 gets detached. All the changes done by the user (writes or deletions) have their txgs captured in d1's DTL, so they will be resilvered by the healing resilver from its spare (s1) - that part works fine. But the data which was written during resilvering of d2 and went to s1 - that one will be missed from d1's DTL and won't get resilvered to it. So here we are: 4. s2 under d2 is corrupted in the slices which map to d1, because d1 doesn't have that data resilvered from s1. Now, if there are more failed disks with draid spares attached which were sequentially resilvered while d1 was failed, d3+s3, d4+s4 and so on - all their spares will be corrupted. Because, in some slices, each of them will map to d1 which will miss their data. Solution: add all known txgs starting from TXG_INITIAL to DTLs of non-writable devices during sequential resilvering so when healing resilver starts on disk clear, it would be able to check and heal blocks from all txgs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #18286 Closes #18294	2026-03-11 14:54:20 -07:00
Andriy Tkachuk	f8457fbdc4	Fix deadlock on dmu_tx_assign() from vdev_rebuild() vdev_rebuild() is always called with spa_config_lock held in RW_WRITER mode. However, when it tries to call dmu_tx_assign() the latter may hang on dmu_tx_wait() waiting for available txg. But that available txg may not happen because txg_sync takes spa_config_lock in order to process the current txg. So we have a deadlock case here: - dmu_tx_assign() waits for txg holding spa_config_lock; - txg_sync waits for spa_config_lock not progressing with txg. Here are the stacks: __schedule+0x24e/0x590 schedule+0x69/0x110 cv_wait_common+0xf8/0x130 [spl] __cv_wait+0x15/0x20 [spl] dmu_tx_wait+0x8e/0x1e0 [zfs] dmu_tx_assign+0x49/0x80 [zfs] vdev_rebuild_initiate+0x39/0xc0 [zfs] vdev_rebuild+0x84/0x90 [zfs] spa_vdev_attach+0x305/0x680 [zfs] zfs_ioc_vdev_attach+0xc7/0xe0 [zfs] cv_wait_common+0xf8/0x130 [spl] __cv_wait+0x15/0x20 [spl] spa_config_enter+0xf9/0x120 [zfs] spa_sync+0x6d/0x5b0 [zfs] txg_sync_thread+0x266/0x2f0 [zfs] The solution is to pass txg returned by spa_vdev_enter(spa) at the top of spa_vdev_attach() to vdev_rebuild() and call dmu_tx_create_assigned(txg) which doesn't wait for txg. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com> Closes #18210 Closes #18258	2026-02-26 11:18:02 -08:00
Rob Norris	eb9098ed47	SPDX: license tags: CDDL-1.0 Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:27 -07:00
Ivan Volosyuk	d4a5a7e3aa	Linux 6.12 compat: Rename range_tree_* to zfs_range_tree_* Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols. Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-02-14 15:37:48 -08:00
Alexander Motin	fa7b2390d4	Do not report bytes skipped by scan as issued. Scan process may skip blocks based on their birth time, DVA, etc. Traditionally those blocks were accounted as issued, that caused reporting of hugely over-inflated numbers, having nothing to do with actual disk I/O. This change utilizes never used field in struct dsl_scan_phys to account such skipped bytes, allowing to report how much data were actually scrubbed/resilvered and what is the actual I/O speed. While formally it is an on-disk format change, it should be compatible both ways, so should not need a feature flag. This should partially address the same issue as `c85ac731a0`, but from a different perspective, complementing it. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Akash B <akash-b@hpe.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15007	2023-06-30 08:47:13 -07:00
Tino Reichardt	1d3ba0bf01	Replace dead opensolaris.org license link The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13619	2022-07-11 14:16:13 -07:00
Andrea Gelmini	bf169e9f15	Fix various typos Correct an assortment of typos throughout the code base. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #11774	2021-04-02 18:52:15 -07:00
Brian Behlendorf	b2255edcc0	Distributed Spare (dRAID) Feature This patch adds a new top-level vdev type called dRAID, which stands for Distributed parity RAID. This pool configuration allows all dRAID vdevs to participate when rebuilding to a distributed hot spare device. This can substantially reduce the total time required to restore full parity to pool with a failed device. A dRAID pool can be created using the new top-level `draid` type. Like `raidz`, the desired redundancy is specified after the type: `draid[1,2,3]`. No additional information is required to create the pool and reasonable default values will be chosen based on the number of child vdevs in the dRAID vdev. zpool create <pool> draid[1,2,3] <vdevs...> Unlike raidz, additional optional dRAID configuration values can be provided as part of the draid type as colon separated values. This allows administrators to fully specify a layout for either performance or capacity reasons. The supported options include: zpool create <pool> \ draid[<parity>][:<data>d][:<children>c][:<spares>s] \ <vdevs...> - draid[parity] - Parity level (default 1) - draid[:<data>d] - Data devices per group (default 8) - draid[:<children>c] - Expected number of child vdevs - draid[:<spares>s] - Distributed hot spares (default 0) Abbreviated example `zpool status` output for a 68 disk dRAID pool with two distributed spares using special allocation classes. ``` pool: tank state: ONLINE config: NAME STATE READ WRITE CKSUM slag7 ONLINE 0 0 0 draid2:8d:68c:2s-0 ONLINE 0 0 0 L0 ONLINE 0 0 0 L1 ONLINE 0 0 0 ... U25 ONLINE 0 0 0 U26 ONLINE 0 0 0 spare-53 ONLINE 0 0 0 U27 ONLINE 0 0 0 draid2-0-0 ONLINE 0 0 0 U28 ONLINE 0 0 0 U29 ONLINE 0 0 0 ... U42 ONLINE 0 0 0 U43 ONLINE 0 0 0 special mirror-1 ONLINE 0 0 0 L5 ONLINE 0 0 0 U5 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 L6 ONLINE 0 0 0 U6 ONLINE 0 0 0 spares draid2-0-0 INUSE currently in use draid2-0-1 AVAIL ``` When adding test coverage for the new dRAID vdev type the following options were added to the ztest command. These options are leverages by zloop.sh to test a wide range of dRAID configurations. -K draid\|raidz\|random - kind of RAID to test -D <value> - dRAID data drives per group -S <value> - dRAID distributed hot spares -R <value> - RAID parity (raidz or dRAID) The zpool_create, zpool_import, redundancy, replacement and fault test groups have all been updated provide test coverage for the dRAID feature. Co-authored-by: Isaac Huang <he.huang@intel.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Co-authored-by: Don Brady <don.brady@delphix.com> Co-authored-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10102	2020-11-13 13:51:51 -08:00
Brian Behlendorf	9a49d3f3d3	Add device rebuild feature The device_rebuild feature enables sequential reconstruction when resilvering. Mirror vdevs can be rebuilt in LBA order which may more quickly restore redundancy depending on the pools average block size, overall fragmentation and the performance characteristics of the devices. However, block checksums cannot be verified as part of the rebuild thus a scrub is automatically started after the sequential resilver completes. The new '-s' option has been added to the `zpool attach` and `zpool replace` command to request sequential reconstruction instead of healing reconstruction when resilvering. zpool attach -s <pool> <existing vdev> <new vdev> zpool replace -s <pool> <old vdev> <new vdev> The `zpool status` output has been updated to report the progress of sequential resilvering in the same way as healing resilvering. The one notable difference is that multiple sequential resilvers may be in progress as long as they're operating on different top-level vdevs. The `zpool wait -t resilver` command was extended to wait on sequential resilvers. From this perspective they are no different than healing resilvers. Sequential resilvers cannot be supported for RAIDZ, but are compatible with the dRAID feature being developed. As part of this change the resilver_restart_* tests were moved in to the functional/replacement directory. Additionally, the replacement tests were renamed and extended to verify both resilvering and rebuilding. Original-patch-by: Isaac Huang <he.huang@intel.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: John Poduska <jpoduska@datto.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10349	2020-07-03 11:05:50 -07:00

9 Commits