Commit Graph

2473 Commits

Author SHA1 Message Date
Martin Matuška
cd5b812818
Allow setting bootfs property on pools with indirect vdevs
The FreeBSD boot loader relies on the bootfs property and is capable
of booting from removed (indirect) vdevs.

Reviewed-by: Eric van Gyzen
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #11763
2021-03-19 22:46:43 -07:00
Serapheim Dimitropoulos
793c958f6f
Initialize metaslab range trees in metaslab_init
= Motivation

We've noticed several zloop crashes within Delphix generated
due to the following sequence of events:

- A device gets expanded and new metaslabs are allocated for
  it. These metaslabs go through `metaslab_init()` but haven't
  gone through `metaslab_sync_done()` yet. This means that the
  only range tree that's actually set is the `ms_allocatable`.
  All the others are NULL.

- A vdev initialization is issued and `vdev_initialize_thread`
  starts processing one of these new metaslabs of the expanded
  vdev.

- As part of `vdev_initialize_calculate_progress()` we call
  into `metaslab_load()` and `metaslab_load_impl()` which
  in turn tries to dereference the metaslabs trees that
  are still NULL and therefore we crash.

The same failure can come up from the `vdev_trim` code paths.

= This Patch

We considered the following solutions to deal with this issue:

[A] Add logic to `vdev_initialize/trim` to skip those new
    metaslabs. We decided against this as it would be good
    to avoid exposing this lower-level detail to higher-level
    operations.

[B] Have `metaslab_load_impl()` return early for new metaslabs
    and thus never touch those range_trees that are NULL at
    that time. This seemed more of a work-around for the bug
    and not a clear-cut solution.

[C] Refactor our logic so all metaslabs have their range_trees
    created at the time of their creation in `metaslab_init()`.

In this patch we decided to go with [C] because:

(1) It doesn't expose more metaslab details to higher level
    operations such as vdev initialize and trim.

(2) The current behavior of creating the range trees lazily
    in `metaslab_sync_done()` is unnecessarily complicated.

(3) Always initializing the metaslab range_trees makes other
    parts of the codebase cleaner. For example, we used to
    use `ms_freed` as the reference value for knowing whether
    all the range_trees have been initialized. Now we no
    longer need to do that check in most places (and in the
    few that we do we use the `ms_new` boolean field now
    which is more readable).
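
A minimal sketch of what [C] looks like, with the `range_tree_create()`
arguments simplified for illustration rather than taken from the
upstream code:

    /*
     * Sketch of [C]: metaslab_init() creates every range tree up
     * front, so vdev initialize/trim can never dereference a NULL
     * tree.  Arguments are simplified.
     */
    static void
    metaslab_init_range_trees(metaslab_t *ms, range_seg_type_t type,
        uint64_t start, uint64_t shift)
    {
        ms->ms_allocatable = range_tree_create(NULL, type, NULL, start, shift);
        ms->ms_freeing = range_tree_create(NULL, type, NULL, start, shift);
        ms->ms_freed = range_tree_create(NULL, type, NULL, start, shift);

        /* Replaces the old "ms_freed != NULL" style of check. */
        ms->ms_new = B_TRUE;    /* cleared in metaslab_sync_done() */
    }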

= Side Changes

Probably due to a mismerge we set `ms_loaded` to `B_TRUE` twice
in `metaslab_load_impl()`. In this patch we remove the extraneous
assignment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #11737
2021-03-19 22:36:02 -07:00
Matthew Ahrens
330c6c0523
Clean up RAIDZ/DRAID ereport code
The RAIDZ and DRAID code is responsible for reporting checksum errors on
their child vdevs.  Checksum errors represent events where a disk
returned data or parity that should have been correct, but was not.  In
other words, these are instances of silent data corruption.  The
checksum errors show up in the vdev stats (and thus `zpool status`'s
CKSUM column), and in the event log (`zpool events`).

Note, this is in contrast with the more common "noisy" errors where a
disk goes offline, in which case ZFS knows that the disk is bad and
doesn't try to read it, or the device returns an error on the requested
read or write operation.

RAIDZ/DRAID generate checksum errors via three code paths:

1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
reported on any children whose data was not used during the
reconstruction.  This is handled in `raidz_reconstruct()`.  This is the
most common type of RAIDZ/DRAID checksum error.

2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
means that the data has been lost.  The zio fails and an error is
returned to the consumer (e.g. the read(2) system call).  This would
happen if, for example, three different disks in a RAIDZ2 group are
silently damaged.  Since the damage is silent, it isn't possible to know
which three disks are damaged, so a checksum error is reported against
every child that returned data or parity for this read.  (For DRAID,
typically only one "group" of children is involved in each io.)  This
case is handled in `vdev_raidz_cksum_finish()`. This is the next most
common type of RAIDZ/DRAID checksum error.

3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
case 2), but there happen to be additional copies of this block due to
"ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those
copies is good, then RAIDZ/DRAID compares each sector of the data or
parity that it retrieved with the good data from the other DVA, and if
they differ then it reports a checksum error on this child.  This
differs from case 2 in that the checksum error is reported on only the
subset of children that actually have bad data or parity.  This case
happens very rarely, since normally only metadata has ditto blocks.  If
the silent damage is extensive, there will be many instances of case 2,
and the pool will likely be unrecoverable.

The code for handling case 3 is considerably more complicated than the
other cases, for two reasons:

1. It needs to run after the main raidz read logic has completed.  The
data that RAIDZ read needs to be preserved until after the alternate DVA has
been read, which necessitates refcounts and callbacks managed by the
non-raidz-specific zio layer.

2. It's nontrivial to map the sections of data read by RAIDZ to the
correct data.  For example, the correct data does not include the parity
information, so the parity must be recalculated based on the correct
data, and then compared to the parity that was read from the RAIDZ
children.

Due to the complexity of case 3, the rareness of hitting it, and the
minimal benefit it provides above case 2, this commit removes the code
for case 3.  These types of errors will now be handled the same as case
2, i.e. the checksum error will be reported against all children that
returned data or parity.
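
A hedged sketch of the surviving case-2 reporting, with field names
following `raidz_row_t` and the report call simplified:

    /*
     * Sketch: reconstruction failed, so report a checksum error
     * against every child that returned data or parity for this
     * read.  The helper call stands in for the real ereport path.
     */
    for (int c = 0; c < rr->rr_cols; c++) {
        raidz_col_t *rc = &rr->rr_col[c];

        if (rc->rc_tried && rc->rc_error == 0)
            vdev_raidz_checksum_error(zio, rc, rc->rc_abd);
    }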

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11735
2021-03-19 16:22:10 -07:00
Matthew Ahrens
46df6e98aa
Remove unused rr_code
The `rr_code` field in `raidz_row_t` is unused.

This commit removes the field, as well as the code that's used to set
it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11736
2021-03-17 21:57:09 -07:00
Don Brady
dd0b5c8559
Reference_tracking_enable should be a module param
To make use of the zfs_refcount_held tunable it should be a module
parameter in OpenZFS.  Also, since the macros auto-generate OS-specific
tunables, remove the existing zfs_refcount_held reference in
module/os/freebsd/zfs/sysctl_os.c.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #11753
2021-03-16 14:56:17 -07:00
Mateusz Guzik
5ebe425a5b Macroify teardown lock handling
This will allow platforms to implement it as they see fit, in particular
in a different manner than rrm locks.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11153
2021-03-12 15:51:39 -08:00
Ryan Moeller
35aa9dc6df
FreeBSD: Fix scope of deadman tunables
A few deadman tunables ended up in the wrong sysctl node.

Move them to vfs.zfs.deadman.*

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11715
2021-03-11 19:23:24 -08:00
Christian Schwarz
93e3658035
zvol: call zil_replaying() during replay
zil_replaying(zil, tx) has the side-effect of informing the ZIL that an
entry has been replayed in the (still open) tx.  The ZIL uses that
information to record the replay progress in the ZIL header when that
tx's txg syncs.

ZPL log entries are not idempotent and are logically dependent, and
thus calling zil_replaying() is necessary for correctness.

For ZVOLs the question of correctness is more nuanced: ZVOL logs only
TX_WRITE and TX_TRUNCATE, both of which are idempotent. Logical
dependencies between two records exist only if the write or discard
request had sync semantics or if the ranges affected by the records
overlap.

Thus, at first glance, it would be correct to restart replay from
the beginning if we crash before replay completes. But this does not
address the following scenario:
Assume one log record per LWB.
The chain on disk is

    HDR -> 1:W(1, "A") -> 2:W(1, "B") -> 3:W(2, "X") -> 4:W(3, "Z")

where N:W(O, C) represents log entry number N which is a TX_WRITE of C
to offset O.
We replay 1, 2 and 3 in one txg, sync that txg, then crash.
Bit flips corrupt 2, 3, and 4.
We come up again and restart replay from the beginning because
we did not call zil_replaying() during replay.
We replay 1 again, then interpret 2's invalid checksum as the end
of the ZIL chain and call replay done.
The replayed zvol content is "AX".

If we had called zil_replaying() the HDR would have pointed to 3
and our resumed replay would not have replayed anything because
3 was corrupted, resulting in zvol content "BX".

If 3 logically depends on 2 then the replay corrupted the ZVOL_OBJ's
contents.

This patch adds the zil_replaying() calls to the replay functions.
Since the callbacks in the replay function need the zilog_t* pointer
so that they can call zil_replaying() we open the ZIL while
replaying in zvol_create_minor(). We also verify that replay has
been done when opening the ZIL on demand for the first modifying
bio.
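
A hedged sketch of the resulting replay-callback shape, simplified
from the real zvol replay code (record validation elided):

    static int
    zvol_replay_write_sketch(zvol_state_t *zv, lr_write_t *lr)
    {
        objset_t *os = zv->zv_objset;
        dmu_tx_t *tx = dmu_tx_create(os);
        int error;

        dmu_tx_hold_write(tx, ZVOL_OBJ, lr->lr_offset, lr->lr_length);
        error = dmu_tx_assign(tx, TXG_WAIT);
        if (error != 0) {
            dmu_tx_abort(tx);
            return (error);
        }
        dmu_write(os, ZVOL_OBJ, lr->lr_offset, lr->lr_length,
            (char *)(lr + 1), tx);
        /* Record replay progress in the same tx. */
        (void) zil_replaying(zv->zv_zilog, tx);
        dmu_tx_commit(tx);
        return (0);
    }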

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11667
2021-03-07 09:49:58 -08:00
Ryan Moeller
4b2e20824b
Intentionally allow ZFS_READONLY in zfs_write
ZFS_READONLY represents the "DOS R/O" attribute.
When that flag is set, we should behave as if write access
were not granted by anything in the ACL.  In particular:
We _must_ allow writes after opening the file r/w, then
setting the DOS R/O attribute, and writing some more.
(Similar to how you can write after fchmod(fd, 0444).)
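
A hedged userspace illustration of the required semantics; the
attribute-setting step is platform-specific and elided:

    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("file", O_RDWR | O_CREAT, 0644);

        (void) write(fd, "a", 1);  /* ok: write access was granted */
        /* ... set the DOS R/O attribute on "file" here ... */
        (void) write(fd, "b", 1);  /* must still succeed on this fd */
        return (close(fd));
    }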

Restore these semantics, which were lost on FreeBSD when refactoring
zfs_write.  To my knowledge Linux does not actually expose this flag,
but we'll eventually need it, so I've added the supporting checks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11693
2021-03-07 09:31:52 -08:00
Jorgen Lundman
8a6d444825
Fix abd_get_offset_struct() may allocate new abd
Even when an abd is supplied to abd_get_offset_struct(), the call
to abd_get_offset_impl() can allocate a different abd. Make sure to
call abd_fini_struct() on the abd that is not used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #11683
2021-03-05 12:22:57 -08:00
nssrikanth
bedbc13daa
Cancel TRIM / initialize on FAULTED non-writeable vdevs
When a device which is actively trimming or initializing becomes
FAULTED, and therefore no longer writable, cancel the active
TRIM or initialization.  When the device is merely taken offline
with `zpool offline` then stop the operation but do not cancel it.
When the device is brought back online the operation will be
resumed if possible.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Vipin Kumar Verma <vipin.verma@hpe.com>
Signed-off-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com>
Closes #11588
2021-03-02 10:27:27 -08:00
Brian Behlendorf
8e43fa12c5
Fix vdev_rebuild_thread deadlock
The metaslab_disable() call may block waiting for a txg sync.
Therefore it's important that vdev_rebuild_thread release the
SCL_CONFIG read lock it is holding before this call.  Failure
to do so can result in the txg_sync thread getting blocked
waiting for this lock which results in a deadlock.
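
A minimal sketch of the ordering fix, using the real spa config lock
interfaces with the surrounding rebuild logic elided:

    spa_config_exit(spa, SCL_CONFIG, FTAG);
    metaslab_disable(msp);    /* may block waiting for a txg sync */
    spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER);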

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Srikanth N S <srikanth.nagasubbaraoseetharaman@hpe.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11647
2021-02-24 10:01:00 -08:00
Brian Behlendorf
75a089ed34
Fix overly broad locking in spa_vdev_config_exit()
Calling vdev_free() only requires that we acquire the spa config
SCL_STATE_ALL locks, not the SCL_ALL locks.  In particular, we
need to avoid taking the SCL_CONFIG lock (included in SCL_ALL) as a
writer since this can lead to a deadlock.  The txg_sync_thread() may
block in spa_txg_history_init_io() when taking the SCL_CONFIG lock
as a reader when it detects there's a pending writer.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11585
2021-02-24 10:00:21 -08:00
Prakash Surya
f01eaed455
Add upper bound for slop space calculation
This change modifies the behavior of how we determine how much slop
space to use in the pool, such that now it has an upper limit. The
default upper limit is 128G, but is configurable via a tunable.
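
A hedged sketch of the resulting clamp; the `spa_max_slop` name
follows the described 128G ceiling, and the exact upstream expression
may differ:

    uint64_t spa_max_slop = 128ULL * 1024 * 1024 * 1024;  /* tunable */

    uint64_t
    spa_slop_sketch(uint64_t poolsize)
    {
        return (MIN(MAX(poolsize >> spa_slop_shift, spa_min_slop),
            spa_max_slop));
    }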

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #11023
2021-02-24 09:52:43 -08:00
Ryan Moeller
5156862960
Wrap bare EINVAL returns with SET_ERROR
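
For context, a minimal before/after; SET_ERROR() makes the point
where an errno originates visible to tracing, which a bare return
bypasses:

    return (EINVAL);              /* before: invisible to tracing */
    return (SET_ERROR(EINVAL));   /* after: traceable origin */
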
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11636
2021-02-24 09:51:10 -08:00
fbynite
11f2e9a491
vdev_ops: don't try to call vdev_op_hold or vdev_op_rele when NULL
This prevents a panic after a SLOG add/removal on the root pool followed
by a zpool scrub.

When a SLOG is removed, a hole takes its place - the vdev_ops for a hole
is vdev_hole_ops, which defines the handler functions of vdev_op_hold
and vdev_op_rele as NULL.

This bug has been reported in illumos and FreeBSD, though with a
different trigger in the FreeBSD report.
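
A hedged sketch of the guard, simplified from the vdev_hold()/
vdev_rele() wrappers:

    /*
     * Sketch: tolerate vdev_ops, such as vdev_hole_ops, that define
     * no hold/rele handlers.
     */
    void
    vdev_hold(vdev_t *vd)
    {
        ASSERT(spa_is_root(vd->vdev_spa));

        if (vd->vdev_ops->vdev_op_hold != NULL)
            vd->vdev_ops->vdev_op_hold(vd);
    }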

Credit for this patch goes to Patrick Mooney <pmooney@pfmooney.com>

Obtained from: illumos-gate commit: c65bd18728f34725
External-issue: https://www.illumos.org/issues/12981
External-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252396
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.fx907@gmail.com>
Closes #11623
2021-02-20 20:19:20 -08:00
Brian Atkinson
c0801bf35a
Cleaning up uio headers
Making uio_impl.h the common header interface between Linux and FreeBSD
so both OS's can share a common header file. This also helps reduce code
duplication for zfs_uio_t for each OS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11622
2021-02-20 20:16:50 -08:00
Ryan Moeller
64e0fe14ff
Restore FreeBSD resource usage accounting
Add zfs_racct_* interfaces for platform-dependent read/write accounting.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11613
2021-02-19 22:34:33 -08:00
Don Brady
03e02e5b56
Checksum errors may not be counted
Fix a regression seen in issue #11545 where checksum errors
were not being counted or showing up in a zpool event.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #11609
2021-02-19 22:33:15 -08:00
Colm
658fb8020f
Add "compatibility" property for zpool feature sets
Add a property allowing sets of features to be specified, for
compatibility with specific versions / releases / external systems.
It influences the behavior of 'zpool upgrade' and 'zpool create'.
Initial man page changes and test cases are included.

Brief synopsis:

zpool create -o compatibility=off|legacy|file[,file...] pool vdev...

compatibility = off : disable compatibility mode (enable all features)
compatibility = legacy : request that no features be enabled
compatibility = file[,file...] : read features from specified files.
Only features present in *all* files will be enabled on the
resulting pool. Filenames may be absolute, or relative to
/etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc
checked first).

Only affects zpool create, zpool upgrade and zpool status.

ABI changes in libzfs:

* New function "zpool_load_compat" to load and parse compat sets.
* Add "zpool_compat_status_t" typedef for compatibility parse status.
* Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum
* Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum

An initial set of base compatibility sets is included in
cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is
modified to install these in $pkgdatadir/compatibility.d and to
create symbolic links to a reasonable set of aliases.

Reviewed-by: ericloewe
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #11468
2021-02-17 21:30:45 -08:00
khng300
fc273894d2
Rename zfs_inode_update to zfs_znode_update_vfs
zfs_znode_update_vfs is a more platform-agnostic name than
zfs_inode_update. Besides that, the function's prototype is moved to
include/sys/zfs_znode.h as the function is also used in common code.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ka Ho Ng <khng300@gmail.com>
Sponsored by: The FreeBSD Foundation
Closes #11580
2021-02-09 11:17:29 -08:00
Antonio Russo
f8ce8aed0c
Set file mode during zfs_write
3d40b65 refactored zfs_vnops.c, which shared much code verbatim between
Linux and BSD.  After a successful write, the suid/sgid bits are reset,
and the mode to be written is stored in newmode.  On Linux, this was
propagated to both the in-memory inode and znode, which is then updated
with sa_update.

3d40b65 accidentally removed the initialization of newmode, which
happened to occur on the same line as the inode update (which has been
moved out of the function).

The uninitialized newmode can be saved to disk, leading to a crash on
stat() of that file, in addition to a merely incorrect file mode.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #11474 
Closes #11576
2021-02-08 09:15:05 -08:00
Christian Schwarz
84268b099b Document monotonicity of dmu_tx_assign() and txg_hold_open()
Expand the comments to make it clear exactly what is guaranteed
by dmu_tx_assign() and txg_hold_open().  Additionally, update
the comment which refers to txg_exit() when it should reference
txg_rele_to_sync().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11521
2021-02-02 10:11:37 -08:00
Matthew Ahrens
2d4bbd14fc
The abd child/parent relationship does not need to be tracked
ABDs currently track their parent/child relationship.  This applies to
`abd_get_offset()` and `abd_borrow_buf()`.  However, nothing depends on
knowing this relationship; it's only used for consistency checks to
verify that we are not destroying an ABD that's still in use.  When we
are creating/destroying ABDs frequently, the performance impact of
maintaining these data structures (in particular the atomic
increment/decrement operations) can be measurable.

This commit removes this verification code on production builds, but
keeps it when ZFS_DEBUG is set.
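
A hedged sketch of the compile-time guard; the field names are real,
while the bookkeeping shown is simplified:

    #ifdef ZFS_DEBUG
        /* Parent/child tracking exists only on debug builds. */
        abd->abd_parent = sabd;
        (void) zfs_refcount_add_many(&sabd->abd_children,
            abd->abd_size, abd);
    #endif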

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11535
2021-01-30 10:04:42 -08:00
Brian Atkinson
2993698eb3
Fixing gang ABD when adding another gang
I originally applied a fix in #11539 to fix a parent's child references
when a gang ABD is freed. However, I did not take into account
abd_gang_add_gang(). We still need to make sure to update the child
references in this function as well. In order to resolve this I removed
the decrementing of the gang ABD's size in abd_free_gang() and moved back
the original placement of zfs_refcount_remove_many() in abd_free().

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11542
2021-01-28 16:54:12 -08:00
George Amanakis
0ae184a6ba
Avoid updating the L2ARC device header unnecessarily
If we do not write any buffers to the cache device and the evict hand
has not advanced, do not update the cache device header.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #11522 
Closes #11537
2021-01-28 09:20:03 -08:00
Brian Atkinson
416015ef54
Removing ABD Parent Child Reference Before Freeing ABD
Move the call to zfs_refcount_remove_many() in abd_free() so it happens
before any of the ABD free variants are called. This is necessary
because abd_free_gang() adjusts the abd_size for the gang ABD. If the
parent's child references are removed after freeing the gang ABD, the
refcount is not adjusted correctly for the parent's children.

I also removed some stray abd_put() in comments and changed
abd_free_gang_abd() -> abd_free_gang().

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11539
2021-01-28 09:15:17 -08:00
Mark Maybee
b2c5904a78
Revert special case code from pre-hashtable nvlist era
Before a hash table was added on top of the nvlist code, there were
cases where the nvlist allocation was changed from fnvlist_alloc()
to nvlist_alloc() to avoid expensive NV_UNIQUE_NAME checks. Now
this is no longer necessary. These changes should be reverted to be
consistent with other code. There are some cases where this change
will also reduce the number of iterations.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Maybee <mark.maybee@delphix.com>
Closes #11464
2021-01-27 21:31:51 -08:00
Alan Somers
cf0977ad72 Parallelize vdev_validate
The runtime of vdev_validate is dominated by the disk accesses in
vdev_label_read_config.  Speed it up by validating all vdevs in
parallel using a taskq.

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:36:51 -08:00
Alan Somers
67874d5487 Read all disk labels concurrently in vdev_label_read_config
This is similar to what we already do in vdev_geom_read_config.

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:36:02 -08:00
Alan Somers
a0e01997ec Parallelize vdev_load
metaslab_init is the slowest part of importing a mature pool, and it
must be repeated hundreds of times for each top-level vdev.  But its
speed is dominated by a few serialized disk accesses.  That can lead to
import times of > 1 hour for pools with many top-level vdevs on spinny
disks.

Speed up the import by using a taskqueue to parallelize vdev_load across
all top-level vdevs.

This also requires adding mutex protection to
metaslab_class_t.mc_histogram.  The mc_histogram fields were
unprotected when that code was first written in "Illumos 4976-4984 -
metaslab improvements" (OpenZFS
f3a7f6610f).  The lock wasn't added until
3dfb57a35e, though it's unclear exactly
which fields it's supposed to protect.  In any case, it wasn't until
vdev_load was parallelized that any code attempted concurrent access to
those fields.
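
A hedged sketch of the fan-out; the `vdev_load_child` wrapper is
hypothetical shorthand, while the `taskq_*` calls are the real
interfaces:

    /* Hypothetical task wrapper; vdev_load() itself is real. */
    static void
    vdev_load_child(void *arg)
    {
        (void) vdev_load(arg);    /* arg is a top-level vdev_t * */
    }

    /* Load all top-level vdevs in parallel, then wait. */
    taskq_t *tq = taskq_create("vdev_load", rvd->vdev_children,
        maxclsyspri, rvd->vdev_children, rvd->vdev_children,
        TASKQ_PREPOPULATE);
    for (uint64_t c = 0; c < rvd->vdev_children; c++)
        (void) taskq_dispatch(tq, vdev_load_child,
            rvd->vdev_child[c], TQ_SLEEP);
    taskq_wait(tq);
    taskq_destroy(tq);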

Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #11470
2021-01-26 19:35:59 -08:00
Matthew Ahrens
62d4287f27
RAIDZ2/3 fails to heal silently corrupted parity w/2+ bad disks
When scrubbing, (non-sequential) resilvering, or correcting a checksum
error using RAIDZ parity, ZFS should heal any incorrect RAIDZ parity by
overwriting it.  For example, if P disks are silently corrupted (P being
the number of failures tolerated; e.g. RAIDZ2 has P=2), `zpool scrub`
should detect and heal all the bad state on these disks, including
parity.  This way if there is a subsequent failure we are fully
protected.

With RAIDZ2 or RAIDZ3, a block can have silent damage to a parity
sector, and also damage (silent or known) to a data sector.  In this
case the parity should be healed but it is not.

The problem can be noticed by scrubbing the pool twice.  Assuming there
was no damage concurrent with the scrubs, the first scrub should fix all
silent damage, and the second scrub should be "clean" (`zpool status`
should not report checksum errors on any disks).  If the bug is
encountered, then the second scrub will repair the silently-damaged
parity that the first scrub failed to repair, and these checksum errors
will be reported after the second scrub.  Since the first scrub repaired
all the damaged data, the bug can not be encountered during the second
scrub, so subsequent scrubs (more than two) are not necessary.

The root cause of the problem is some code that was inadvertently added
to `raidz_parity_verify()` by the DRAID changes.  The incorrect code
causes the parity healing to be aborted if there is damaged data
(`rc_error != 0`) or the data disk is not present (`!rc_tried`).  These
checks are not necessary, because we only call `raidz_parity_verify()`
if we have the correct data (which may have been reconstructed using
parity, and which was verified by the checksum).

This commit fixes the problem by removing the incorrect checks in
`raidz_parity_verify()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11489 
Closes #11510
2021-01-26 16:05:05 -08:00
Will Andrews
f4f50a7048
spa_export_common: refactor common exit points
Create a common exit point for spa_export_common (a very long 
function), which avoids missing steps on failure.  This work
is helpful for the planned forced pool export changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Will Andrews <will@firepipe.net>
Closes #11514
2021-01-25 15:04:11 -08:00
Colm
4a90d4d6fc
Fix two minor lint errors (cppcheck)
Fix two minor errors reported by cppcheck:

In module/zfs/abd.c (abd_get_offset_impl), add non-NULL
assertion to prevent NULL dereference warning.

In module/zfs/arc.c (l2arc_write_buffers), change 'try'
variable to 'pass' to avoid C++ reserved word.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #11507
2021-01-23 15:49:32 -08:00
Alexander Motin
5aa69a57da
Relax special_small_blocks assertion.
Follow-up for commit 624222a: the value was asserted to be
<= SPA_OLD_MAXBLOCKSIZE instead of SPA_MAXBLOCKSIZE, as it should be
after the previous change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11501
2021-01-23 15:45:27 -08:00
Matthew Ahrens
aa755b3549
Set aside a metaslab for ZIL blocks
Mixing ZIL and normal allocations has several problems:

1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed.  This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.

2. When under moderate load, ZIL allocations are of 128KB.  If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more.  The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC.  All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading.  This can cause a significant performance
impact.

3. If the pool is very fragmented, there may be zero free chunks of
128KB or more.  In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.

These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.

This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class).  From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.

Log (ZIL) blocks can be allocated from the following locations.  Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)
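
A hedged sketch of this order; `zil_try_alloc()` is hypothetical
shorthand for `metaslab_alloc()` with its full argument list:

    static int
    zil_alloc_block_sketch(spa_t *spa, blkptr_t *bp, uint64_t size,
        uint64_t txg)
    {
        int error;

        error = zil_try_alloc(spa, spa_log_class(spa), bp, size, txg);
        if (error != 0)
            error = zil_try_alloc(spa, spa_embedded_log_class(spa),
                bp, size, txg);
        if (error != 0)
            error = zil_try_alloc(spa, spa_normal_class(spa),
                bp, size, txg);
        return (error);
    }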

The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.

On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes.  On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11389
2021-01-21 15:12:54 -08:00
Brian Atkinson
d0cd9a5cc6
Extending FreeBSD UIO Struct
In FreeBSD, struct uio was just a typedef of uio_t. In order to
extend this struct without modifying the definition of struct uio,
struct uio has been embedded inside a uio_t struct.

Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
this is a ZFS interface.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #11438
2021-01-20 21:27:30 -08:00
Matthew Ahrens
e2af2acce3
allow callers to allocate and provide the abd_t struct
The `abd_get_offset_*()` routines create an abd_t that references
another abd_t, and don't allocate any pages/buffers of their own.  In
some workloads, these routines may be called frequently, to create many
abd_t's representing small pieces of a single large abd_t.  In
particular, the upcoming RAIDZ Expansion project makes heavy use of
these routines.

This commit adds the ability for the caller to allocate and provide the
abd_t struct to a variant of `abd_get_offset_*()`.  This eliminates the
cost of allocating the abd_t and performing the accounting associated
with it (`abdstat_struct_size`).  The RAIDZ/DRAID code uses this for
the `rc_abd`, which references the zio's abd.  The upcoming RAIDZ
Expansion project will leverage this infrastructure to increase
performance of reads post-expansion by around 50%.

Additionally, some of the interfaces around creating and destroying
abd_t's are cleaned up.  Most significantly, the distinction between
`abd_put()` and `abd_free()` is eliminated; all types of abd_t's are
now disposed of with `abd_free()`.
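
A hedged usage sketch of the new variant; the storage and size names
are illustrative:

    /*
     * The caller owns the abd_t storage (e.g. embedded in a larger
     * per-column struct), so no allocation or abdstat accounting is
     * needed for this short-lived reference.
     */
    abd_t abd_storage;    /* caller-owned, illustrative */
    abd_t *abd = abd_get_offset_struct(&abd_storage, zio->io_abd,
        col_offset, col_size);

    /* ... read or write through abd ... */

    abd_free(abd);    /* single disposal path; abd_put() is gone */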

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #8853 
Closes #11439
2021-01-20 11:24:37 -08:00
Matthew Ahrens
2ac90457f5
record ioctl elapsed time in zpool history
Each zfs ioctl that changes on-disk state (e.g. set property, create
snapshot, destroy filesystem) is recorded in the zpool history, and is
printed by `zpool history -i`.

For performance diagnostic purposes, it would be useful to know how long
each of these ioctls took to run.  This commit adds that functionality,
with a new `ZPOOL_HIST_ELAPSED_NS` member of the history nvlist.

Additionally, the time recorded in this history log is currently the
time that the history record is written to disk.  But in many cases (CLI
args logging and ioctl logging), this happens asynchronously,
potentially many seconds after the operation completed.  This commit
changes the timestamp to reflect when the history event was created,
rather than when it was written to disk.
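
A hedged sketch of the recording; the handler invocation is
illustrative, while `ZPOOL_HIST_ELAPSED_NS` and the nvlist call are
real:

    hrtime_t start = gethrtime();
    error = handler(zc);    /* illustrative ioctl handler call */
    fnvlist_add_uint64(nvl, ZPOOL_HIST_ELAPSED_NS,
        (uint64_t)(gethrtime() - start));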

Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11440
2021-01-11 09:29:25 -08:00
Matthew Ahrens
dc303dcf5b
assertion failed in arc_wait_for_eviction()
If the system is very low on memory (specifically,
`arc_free_memory() < arc_sys_free/2`, i.e. less than 1/16th of RAM
free), `arc_evict_state_impl()` will defer wakeups.  In this case, the
arc_evict_waiter_t's remain on the list, even though `arc_evict_count`
has been incremented past their `aew_count`.

The problem is that `arc_wait_for_eviction()` assumes that if there are
waiters on the list, the count they are waiting for has not yet been
reached.  However, the deferred wakeups may violate this, causing
`ASSERT(last->aew_count > arc_evict_count)` to fail.

This commit resolves the issue by having new waiters use the greater of
`arc_evict_count` and the last `aew_count`.
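
A hedged sketch of the new waiter setup, with names following the
message and `arc_evict_waiter_t`:

    /*
     * A new waiter targets the greater of the global eviction count
     * and the last queued waiter's count, so deferred wakeups can no
     * longer leave the list out of order.
     */
    arc_evict_waiter_t *last = list_tail(&arc_evict_waiters);

    if (last != NULL)
        aw->aew_count = MAX(last->aew_count, arc_evict_count) + 1;
    else
        aw->aew_count = arc_evict_count + 1;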

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11285
Closes #11397
2021-01-07 20:06:32 -08:00
Toomas Soome
40ab927ae8
implicit conversion from 'boolean_t' to 'ds_hold_flags_t'
A build error on illumos with gcc 10 revealed:

In function 'dmu_objset_refresh_ownership':
../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion
from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'}
[-Werror=enum-conversion]
      857 |  dsl_dataset_disown(ds, decrypt, tag);
          |                         ^~~~~~~
cc1: all warnings being treated as errors

libzfs_input_check.c: In function 'zfs_ioc_input_tests':
libzfs_input_check.c:754:28: error: implicit conversion from
'enum dmu_objset_type' to 'enum lzc_dataset_type'
[-Werror=enum-conversion]
  754 |  err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0);
      |                            ^~~~~~~~~~~
cc1: all warnings being treated as errors

The same issues are present in OpenZFS, including the issue with
ds_hold_flags_t, which currently defines exactly one valid value.
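
A hedged sketch of the fix for the first warning, assuming a
zero-valued `DS_HOLD_FLAG_NONE` alongside the existing
`DS_HOLD_FLAG_DECRYPT`:

    /* Pass a real ds_hold_flags_t rather than letting a boolean_t
     * convert implicitly. */
    ds_hold_flags_t flags;

    flags = (decrypt) ? DS_HOLD_FLAG_DECRYPT : DS_HOLD_FLAG_NONE;
    dsl_dataset_disown(ds, flags, tag);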

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #11406
2020-12-27 16:31:02 -08:00
Brian Behlendorf
0c763f76b1
Remove unused check from dmu_tx_count_write()
Individual transactions may not be larger than DMU_MAX_ACCESS.
This is enforced by the assertions in dmu_tx_hold_write() and
dmu_tx_hold_write_by_dnode().  There's an additional check in
dmu_tx_count_write() however it has no effect and only sets a
local err variable.  We could enable this check; however, since
it's already enforced by ASSERTs elsewhere, I opted to remove it
instead.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3731 
Closes #11384
2020-12-21 20:17:13 -08:00
Andy Fiddaman
39372fa25b
Dangling reference from dmu_objset_upgrade
After porting the fix for https://github.com/openzfs/zfs/issues/5295
over to illumos, we started hitting an assertion failure when running
the testsuite:

	assertion failed: rc->rc_count == number, file: .../refcount.c

and the unexpected hold has this stack:

	dsl_dataset_long_hold+0x59 dmu_objset_upgrade+0x73
dmu_objset_id_quota_upgrade+0x15 dmu_objset_own+0x14f

The simplest reproducer for this in illumos is

    zpool create -f -O version=1 testpool c3t0d0; zpool destroy testpool

which is run as part of the zpool_create_tempname test, but I can't get
this to trigger on FreeBSD. This appears to be because the call to
txg_wait_synced() in dmu_objset_upgrade_stop() (which was missing in
illumos) slows down dmu_objset_disown() enough to avoid the condition.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andy Fiddaman <andy@omnios.org>
Closes #11368
2020-12-21 10:13:23 -08:00
Christian Schwarz
49c482fde3
dsl_pool: extend comment on DSL Pool Configuration Lock
Based on a conversation with Matt on the OpenZFS Slack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11370
2020-12-19 18:04:05 -08:00
Brian Behlendorf
1c2358c12a
Linux 5.10 compat: use iov_iter in uio structure
As of the 5.10 kernel the generic splice compatibility code has been
removed.  All filesystems are now responsible for registering a
->splice_read and ->splice_write callback to support this operation.

The good news is the VFS provided generic_file_splice_read() and
iter_file_splice_write() callbacks can be used provided the ->iter_read
and ->iter_write callback support pipes.  However, this is currently
not the case and only iovecs and bvecs (not pipes) are ever attached
to the uio structure.
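
A hedged sketch of the registration this change enables; the symbol
names follow the ZPL and the 5.10 VFS, with the rest of the table
elided:

    const struct file_operations zpl_file_operations = {
        /* ... */
        .read_iter      = zpl_iter_read,
        .write_iter     = zpl_iter_write,
        .splice_read    = generic_file_splice_read,
        .splice_write   = iter_file_splice_write,
        /* ... */
    };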

This commit changes that by allowing full iov_iter structures to be
attached to uios.  Ever since the 4.9 kernel the iov_iter structure
has supported iovecs, kvecs, bvecs, and pipes, so it's desirable to
pass the entire thing when possible.  In conjunction with this the
uio helper functions (i.e. uiomove(), uiocopy(), etc.) have been
updated to understand the new UIO_ITER type.

Note that using the kernel-provided iov_iter interfaces allowed the
existing Linux specific uio handling code to be simplified.  When
there's no longer a need to support kernels older than 4.9, then
it will be possible to remove the iovec and bvec members from the
uio structure and always use an iov_iter.  Until then we need to
maintain all of the existing types for older kernels.

Some additional refactoring and cleanup was included in this change:

- Added checks to configure to detect available iov_iter interfaces.
  Some are available all the way back to the 3.10 kernel and are used
  when available.  In particular, uio_prefaultpages() now always uses
  iov_iter_fault_in_readable() which is available for all supported
  kernels.

- The unused UIO_USERISPACE type has been removed.  It is no longer
  needed now that the uio_seg enum is platform specific.

- Moved zfs_uio.c from the zcommon.ko module to the Linux specific
  platform code for the zfs.ko module.  This gets it out of libzfs
  where it was never needed and keeps this Linux specific code out
  of the common sources.

- Removed the unnecessary O_APPEND handling from zfs_iter_write(); it
  is redundant because O_APPEND is already handled in zfs_write().

Reviewed-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11351
2020-12-18 08:48:26 -08:00
Matthew Ahrens
71e4ce0e52
special device removal space accounting fixes
The space in special devices is not included in spa_dspace (or
dsl_pool_adjustedsize(), or the zfs `available` property).  Therefore
there is always at least as much free space in the normal class, as
there is allocated in the special class(es).  And therefore, there is
always enough free space to remove a special device.

However, the checks for free space when removing special devices did not
take this into account.  This commit corrects that.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11329
2020-12-17 12:11:56 -08:00
Ryan Moeller
1531506d23
Avoid extra work updating ARC kstats and tunables
After e357046 it should not be necessary to periodically update ARC
kstats and tunables.  Tunable updates are applied when modified, and
kstats are updated on demand.

Update kstats in `arc_evict_cb_check()` for `ZFS_DEBUG` builds only.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11237
2020-12-17 11:16:42 -08:00
Matthew Ahrens
be5c6d9653
Only examine best metaslabs on each vdev
On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab.  This has the impact of loading every metaslab (not
a huge deal since we now typically keep all metaslabs loaded), and also
iterating over every metaslab for every failing allocation. If there are
many metaslabs (more than the typical ~200, e.g. due to vdev expansion
or very large vdevs), the CPU cost of this iteration can be very
impactful.  This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.

To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations.  These will now only examine the 100
best metaslabs (as determined by their ms_weight).  If none of these
have a large enough free segment, then the allocation will fail and
we'll fall back on ganging.

To accomplish this, we will now (normally) gang before doing a
`try_hard` allocation.  Non-try_hard allocations will only examine the
100 best metaslabs of each vdev.  In summary, we will first try normal
allocation.  If that fails then we will do a gang allocation.  If that
fails then we will do a "try hard" gang allocation.  If that fails then
we will have a multi-layer gang block.
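
A hedged sketch of the cutoff; the tunable name is illustrative of
the "100 best" limit, not necessarily the upstream spelling:

    /* Give up on normal allocations after scanning the N highest-
     * weight metaslabs; try_hard and ZIL allocations are exempt. */
    int zfs_metaslab_find_max_tries = 100;    /* illustrative */

    if (!try_hard && tries++ >= zfs_metaslab_find_max_tries)
        return (-1ULL);    /* fail; caller falls back to ganging */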

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11327
2020-12-16 14:40:05 -08:00
Alexander Motin
f8020c9363
Make metaslab class rotor and aliquot per-allocator.
Metaslab rotor and aliquot are used to distribute workload between
vdevs while keeping some locality for logically adjacent blocks.  Once
multiple allocators were introduced to separate allocation of different
objects, it does not make much sense for different allocators to write
into different metaslabs of the same metaslab group (vdev) at the same
time, competing for its resources.  This change makes each allocator choose
metaslab group independently, colliding with others only sporadically.

A test with simultaneous writes into 4 files with a recordsize of 4KB
on a striped pool of 30 disks on a system with 40 logical cores shows a
reduction of vdev queue lock contention from 54% to 27% due to better
load distribution.  Unfortunately it won't help ZVOLs much yet, since
only one dataset/ZVOL is synced at a time, and so for the most part
only one allocator is used, but it may improve later.

While there, to reduce the number of pointer dereferences change
per-allocator storage for metaslab classes and groups from several
separate malloc()'s to variable length arrays at the ends of the
original class and group structures.
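
A hedged sketch of the layout change, with other members elided:

    /* Per-allocator state moves into a variable-length array at the
     * tail of the class struct, replacing separately malloc()'d
     * blocks and extra pointer chasing. */
    typedef struct metaslab_class_allocator {
        metaslab_group_t    *mca_rotor;      /* per-allocator rotor */
        uint64_t            mca_aliquot;     /* per-allocator aliquot */
    } metaslab_class_allocator_t;

    struct metaslab_class {
        /* ... existing fields ... */
        metaslab_class_allocator_t    mc_allocator[];
    };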

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11288
2020-12-15 10:55:44 -08:00
Matthew Macy
923d730329
dmu_zfetch: fix memory leak
The last change caused the read completion callback to not be called
if the IO was still in progress. This change restores allocation
of the arc buf callback, but in the callback path checks the new
acb_nobuf field to know when to skip buffer allocation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11324
2020-12-12 16:00:00 -08:00