mirror of https://git.proxmox.com/git/mirror_zfs.git
synced 2026-05-23 10:54:35 +03:00
draid: fix data corruption after disk clear
Currently, there is a problem when several faulted disks have dRAID spares attached. If one of those disks is cleared of errors (zpool clear) and its spare is then detached, the data in all the remaining spares that were attached while the cleared disk was in the FAULTED state might get corrupted (running scrub reveals this). In some cases, when too many disks are cleared at once, this can result in data corruption or loss.

A dRAID spare is a virtual device whose blocks are distributed among the other disks. Those disks can also be in the FAULTED state with spares of their own attached. When a disk is sequentially resilvered (rebuilt), some of the rebuilt data is written to the spares attached to other FAULTED disks. That data is not captured in the DTLs (Dirty Time Logs) of those FAULTED disks. This differs from changes made by the user, which normally are captured, e.g. when a new file is written or an existing one is deleted. The reason is that sequential resilvering works at the block level, without touching or looking into metadata, so it knows nothing about the old BPs or the transaction groups it is resilvering. Later, when such a disk is cleared of errors and healing resilvering tries to sync all the data from its spare back onto it, all the changes made on its spare during the resilvering of other disks are missed, because they were never captured in its DTL. That is why other dRAID spares may get corrupted.

Here is another way to explain it that might be helpful. Imagine this scenario:

1. d1 fails and gets resilvered to some spare s1 - OK.
2. d2 fails and gets sequentially resilvered onto dRAID spare s2. In some slices, s2 maps to d1, which has failed. But d1 has spare s1 attached, so the data from that resilvering goes to s1 and is not recorded in d1's DTL.
3. Now d1 gets cleared and its s1 gets detached.
   All the changes done by the user (writes or deletions) have their txgs captured in d1's DTL, so the healing resilver copies them from its spare (s1); that part works fine. But the data written to s1 during the resilvering of d2 is missing from d1's DTL, so it won't be resilvered to d1. So here we are:
4. s2 under d2 is corrupted in the slices that map to d1, because d1 does not have that data resilvered from s1.

Now suppose more failed disks with dRAID spares attached (d3+s3, d4+s4 and so on) were sequentially resilvered while d1 was failed. All their spares will be corrupted, because in some slices each of them maps to d1, which will be missing their data.

Solution: during sequential resilvering, add all known txgs starting from TXG_INITIAL to the DTLs of non-writable devices. Then, when healing resilver starts on disk clear, it can check and heal blocks from all txgs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Andriy Tkachuk <andriy.tkachuk@seagate.com>
Closes #18286
Closes #18294
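The scenario above can be replayed in a minimal toy model. This is a sketch under stated assumptions, not OpenZFS code, and all of the following are hypothetical and chosen only for illustration: TOY_TXG_INITIAL, the toy_* names, the single-interval DTL, the "every third slice of s2 lands on d1" mapping, and the txg numbers. The model shows how healing resilver goes wrong. Healing resilver only revisits blocks whose birth txg falls inside d1's DTL, so it misses what d2's sequential rebuild wrote to s1. Adding every txg from the initial one to the DTL of the non-writable device closes that gap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TXG_INITIAL; not the real OpenZFS value. */
#define TOY_TXG_INITIAL 1
#define TOY_NBLOCKS 12

/*
 * Toy DTL: one closed txg interval [lo, hi], empty when lo > hi.
 * Real DTLs are range trees; widening to a single interval is a
 * simplification made for this sketch.
 */
typedef struct {
	uint64_t lo, hi;
} toy_dtl_t;

static void
toy_dtl_add(toy_dtl_t *d, uint64_t lo, uint64_t hi)
{
	assert(lo <= hi);
	if (d->lo > d->hi) {		/* empty: adopt the new range */
		d->lo = lo;
		d->hi = hi;
		return;
	}
	if (lo < d->lo)
		d->lo = lo;
	if (hi > d->hi)
		d->hi = hi;
}

static bool
toy_dtl_contains(const toy_dtl_t *d, uint64_t txg)
{
	return (d->lo <= txg && txg <= d->hi);
}

/*
 * Replays the d1/d2 scenario and returns how many blocks that the
 * sequential rebuild of d2 wrote to s1 (on behalf of faulted d1) are
 * still missing from d1 after d1 is cleared, healed and s1 detached.
 */
static int
toy_missed_after_clear(bool record_all_txgs)
{
	toy_dtl_t d1_dtl = { 1, 0 };		/* starts empty */
	bool on_s1[TOY_NBLOCKS] = { false };
	int missed = 0;

	/*
	 * 1. d1 faults at txg 20 and gets spare s1.  Its DTL starts
	 * recording the txgs it misses (just txg 20 here; user writes
	 * are not modeled).
	 */
	toy_dtl_add(&d1_dtl, 20, 20);

	/*
	 * 2. d2 faults and is sequentially rebuilt onto s2 at txg 30.
	 * In this toy mapping every third slice of s2 lands on d1, which
	 * is not writable, so that data is written to s1 only.
	 */
	for (int b = 0; b < TOY_NBLOCKS; b++)
		on_s1[b] = (b % 3 == 0);

	/*
	 * The rebuild copies raw extents and never sees birth txgs
	 * (block b was born at txg b + 1, before any failure).  Without
	 * the fix nothing is added to d1's DTL; with it, the whole
	 * history from TOY_TXG_INITIAL up to the rebuild txg is.
	 */
	if (record_all_txgs)
		toy_dtl_add(&d1_dtl, TOY_TXG_INITIAL, 30);

	/*
	 * 3. d1 is cleared: healing resilver copies from s1 only blocks
	 * whose birth txg is in d1's DTL, then s1 is detached.
	 */
	for (int b = 0; b < TOY_NBLOCKS; b++) {
		uint64_t birth = (uint64_t)b + 1;

		if (on_s1[b] && !toy_dtl_contains(&d1_dtl, birth))
			missed++;	/* 4. lost with s1: s2 is corrupt here */
	}
	return (missed);
}
```

In this model, toy_missed_after_clear(false) reports 4 missed blocks (slices 0, 3, 6 and 9), while toy_missed_after_clear(true) reports 0. The model also reflects what the description implies: healing resilver then has to consider blocks from every txg, not only those born after d1 failed.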
@@ -1082,6 +1082,7 @@ extern uint64_t spa_guid(spa_t *spa);
extern uint64_t spa_load_guid(spa_t *spa);
extern uint64_t spa_last_synced_txg(spa_t *spa);
extern uint64_t spa_first_txg(spa_t *spa);
extern uint64_t spa_open_txg(spa_t *spa);
extern uint64_t spa_syncing_txg(spa_t *spa);
extern uint64_t spa_final_dirty_txg(spa_t *spa);
extern uint64_t spa_version(spa_t *spa);

@@ -91,6 +91,7 @@ boolean_t vdev_rebuild_active(vdev_t *);
int vdev_rebuild_load(vdev_t *);
void vdev_rebuild(vdev_t *, uint64_t);
void vdev_rebuild_txgs(vdev_t *, uint64_t *, uint64_t *);
void vdev_rebuild_stop_wait(vdev_t *);
void vdev_rebuild_stop_all(spa_t *);
void vdev_rebuild_restart(spa_t *);