Commit Graph

305 Commits

Author SHA1 Message Date
Don Brady
7bbd42ef49
Don't allow attach to a raidz child vdev
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15536
Closes #15564
2023-11-27 09:46:38 -08:00
Don Brady
5caeef02fa
RAID-Z expansion feature
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally.  This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).

== Initiating expansion ==

A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`.  The new device will become part of the RAIDZ group.  A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.

The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.

== During expansion ==

The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).

The expansion progress can be monitored with `zpool status`.

Data redundancy is maintained during (and after) the expansion.  If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).

The pool remains accessible during expansion.  Following a reboot or
export/import, the expansion resumes where it left off.

== After expansion ==

When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).

Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks.  New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity).  However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.

Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022
2023-11-08 10:19:41 -08:00
ednadolski-ix
3bd4df3841
Improve ZFS objset sync parallelism
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync().  dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).

There are two problems:

1. Each ZVOL in a pool is a separate dataset/objset having a single
   dnode.  This means the objsets are synchronized serially, which
   leads to a bottleneck of ~330K blocks written per second per pool.

2. In the case of multiple dirty dnodes/files on a dataset/objset on a
   big system they will be sync'd in parallel taskq threads. However,
   it is inefficient to to use 75% of CPU cores of a big system to do
   that, because of (a) bottlenecks on a single write issue taskq, and
   (b) allocation throttling.  In addition, if not for the allocation
   throttling sorting write requests by bookmarks (logical address),
   writes for different files may reach space allocators interleaved,
   leading to unwanted fragmentation.

The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
the pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15197
2023-11-06 10:38:42 -08:00
Rob N
b5e6091885
spa: document spa_thread() and SDC feature gates
spa_thread() and the "System Duty Cycle" scheduling class are from
Illumos and have not yet been adapted to Linux or FreeBSD.

HAVE_SPA_THREAD has long been explicitly undefined and used to mark
spa_thread(), but there's some related taskq code that can never be
invoked without it, which makes some already-tricky code harder to read.

HAVE_SYSDC is introduced in this commit to mark the SDC parts. SDC
requires spa_thread(), but the inverse is not true, so they are
separate.

I don't want to make the call to just remove it because I still harbour
hopes that OpenZFS could become a first-class citizen on Illumos
someday. But hopefully this will at least make the reason it exists a
bit clearer for people without long memories and/or an interest in
history.

For those that are interested in the history, the original FreeBSD port
of ZFS (before ZFS-on-Linux was adopted there) did have a spa_thread(),
but not SDC. The last version of that before it was removed can be read
here:

  22df1ffd81/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c

Meanwhile, more information on the SDC scheduling class is here:

  https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/disp/sysdc.c

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15406
2023-10-23 08:50:55 -07:00
Rob Norris
54b1b1d893 import: require force when cachefile hostid doesn't match on-disk
Previously, if a cachefile is passed to zpool import, the cached config
is mostly offered as-is to ZFS_IOC_POOL_TRYIMPORT->spa_tryimport(), and
the results are taken as the canonical pool config and handed back to
ZFS_IOC_POOL_IMPORT.

In the course of its operation, spa_load() will inspect the pool and
build a new config from what it finds on disk. However, it then
regenerates a new config ready to import, and so rightly sets the hostid
and hostname for the local host in the config it returns.

Because of this, the "require force" checks always decide the pool is
exported and last touched by the local host, even if this is not true,
which is possible in a HA environment when MMP is not enabled. The pool
may be imported on another head, but the import checks still pass here,
so the pool ends up imported on both.

(This doesn't happen when a cachefile isn't used, because the pool
config is discovered in userspace in zpool_find_import(), and that does
find the on-disk hostid and hostname correctly).

Since the systemd zfs-import-cache.service unit uses cachefile imports,
this can lead to a system returning after a crash with a "valid"
cachefile on disk and automatically, quietly, importing a pool that has
already been taken up by a secondary head.

This commit causes the on-disk hostid and hostname to be included in the
ZPOOL_CONFIG_LOAD_INFO item in the returned config, and then changes the
"force" checks for zpool import to use them if present.

This method should give no change in behaviour for old userspace on new
kernels (they won't know to look for the new config items) and for new
userspace on old kernels (the won't find the new config items).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15290
2023-10-06 09:24:44 -07:00
Alexander Motin
342357cd9e
Reduce number of metaslab preload taskq threads.
Before this change ZFS created threads for 50% of CPUs for each top-
level vdev.  Plus it created the same number of threads for embedded
log groups (that have only one metaslab and don't need any preload).
As result, on system with 80 CPUs and pool of 60 vdevs this resulted
in 4800 metaslab preload threads, that is absolutely insane.

This patch changes the preload threads to 50% of CPUs in one taskq
per pool, so on the mentioned system it will be only 40 threads.

Among other things this fixes zdb on the mentioned system and pool
on FreeBSD, that failed to create so many threads in one process.

Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15319
2023-10-06 09:04:00 -07:00
ednadolski-ix
95f71c019d
Selectable block allocators
ZFS historically has had several space allocators that were
dynamically selectable.  While these have been retained in 
OpenZFS, only a single allocator has been statically compiled 
in. This patch compiles all allocators for OpenZFS and provides 
a module parameter to allow for manual selection between them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15218
2023-09-01 18:00:30 -07:00
Alexander Motin
b3ad3f48d9
Use list_remove_head() where possible.
... instead of list_head() + list_remove().  On FreeBSD the list
functions are not inlined, so in addition to more compact code
this also saves another function call.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14955
2023-06-09 10:12:52 -07:00
Akash B
9d618615d1
Fix concurrent resilvers initiated at same time
For draid vdevs it was possible to initiate both the
sequential and healing resilver at same time.

This fixes the following two scenarios.
     1) There's a window where a sequential rebuild can
be started via ZED even if a healing resilver has been
scheduled.
	- This is fixed by adding additional check in
spa_vdev_attach() for any scheduled resilver and return
appropriate error code when a resilver is already in
progress.

     2) It was possible for zpool clear to start a healing
resilver when it wasn't needed at all. This occurs because
during a vdev_open() the device is presumed to be healthy not
until the device is validated by vdev_validate() and it's set
unavailable. However, by this point an async resilver will
have already been requested if the DTL isn't empty.
	- This is fixed by cancelling the SPA_ASYNC_RESILVER
request immediately at the end of vdev_reopen() when a resilver
is unneeded.

Finally, added a testcase in ZTS for verification.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #14881
Closes #14892
2023-05-24 12:28:09 -07:00
George Amanakis
482eeef804 Teach zpool scrub to scrub only blocks in error log
Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain tulsi.jain@delphix.com
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #8995
Closes #12355
2023-05-18 11:59:42 -07:00
Brian Behlendorf
e34e15ed6d
Add the ability to uninitialize
zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.

So let's add zpool initialize -u to clear it.

Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12451 
Closes #14873
2023-05-18 10:02:20 -07:00
Richard Yao
ee7b71dbc9 Fix undefined behavior in spa_sync_props()
8eae2d214c caused Coverity to begin
complaining about "Improper use of negative value" in two places in
spa_sync_props() because Coverity correctly inferred from `prop ==
ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name()
and zpool_prop_get_type() use it an array index, which is undefined
behavior.

Assuming that the system does not panic from an attempt to read invalid
memory, the case statement for ZPOOL_PROP_INVAL will ensure that only
user properties will reach this code when prop is ZPOOL_PROP_INVAL, such
that execution will continue safely. However, if we are unlucky enough
to read invalid memory, then the system will panic.

This issue predates the patch that caused coverity to begin complaining.
Thankfully, our userland tools do not pass nonsense to us, so this bug
should not be triggered unless a future userland tool attempts to set a
property that we do not understand.

Reported-by: Coverity (CID-1561129)
Reported-by: Coverity (CID-1561130)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14860
2023-05-15 10:29:05 -07:00
Matthew Ahrens
3095ca91c2
Verify block pointers before writing them out
If a block pointer is corrupted (but the block containing it checksums
correctly, e.g. due to a bug that overwrites random memory), we can
often detect it before the block is read, with the `zfs_blkptr_verify()`
function, which is used in `arc_read()`, `zio_free()`, etc.

However, such corruption is not typically recoverable.  To recover from
it we would need to detect the memory error before the block pointer is
written to disk.

This PR verifies BP's that are contained in indirect blocks and dnodes
before they are written to disk, in `dbuf_write_ready()`. This way,
we'll get a panic before the on-disk data is corrupted. This will help
us to diagnose what's causing the corruption, as well as being much
easier to recover from.

To minimize performance impact, only checks that can be done without
holding the spa_config_lock are performed.

Additionally, when corruption is detected, the raw words of the block
pointer are logged.  (Note that `dprintf_bp()` is a no-op by default,
but if enabled it is not safe to use with invalid block pointers.)

Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14817
2023-05-08 11:20:23 -07:00
Ameer Hamza
82ac409acc
zpool import -m also removing spare and cache when log device is missing
spa_import() relies on a pool config fetched by spa_try_import() for
spare/cache devices. Import flags are not passed to spa_tryimport(),
which makes it return early due to a missing log device and missing
retrieving the cache device and spare eventually. Passing
ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct
configuration regardless of the missing log device.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14794
2023-05-03 15:10:32 -07:00
Allan Jude
8eae2d214c
Add support for zpool user properties
Usage:

    zpool set org.freebsd:comment="this is my pool" poolname

Tests are based on zfs_set's user property tests.

Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680
2023-04-21 10:20:36 -07:00
rob-wing
3e4ed4213d
Create zap for root vdev
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.

Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2

Notably, this allows to get/set properties on the root vdev:

    % zpool set user:prop=value <pool> root-0

Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):

    % zpool set user:prop=value <pool> mirror-0

This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.

The following tests have been added:

    - zpool get all properties from root vdev
    - zpool set a property on root vdev
    - verify root vdev ZAP is created

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405
2023-04-20 10:07:56 -07:00
Ameer Hamza
719534ca8e
Fix "Detach spare vdev in case if resilvering does not happen"
Spare vdev should detach from the pool when a disk is reinserted.
However, spare detachment depends on the completion of resilvering,
and if resilver does not schedule, the spare vdev keeps attached to
the pool until the next resilvering. When a zfs pool contains
several disks (25+ mirror), resilvering does not always happen when
a disk is reinserted. In this patch, spare vdev is manually detached
from the pool when resilvering does not occur and it has been tested
on both Linux and FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14722
2023-04-19 09:04:32 -07:00
Kevin Jin
65d10bd87c
Fix short-lived txg caused by autotrim
Current autotrim causes short-lived txg through:

1. calling txg_wait_synced() in metaslab_enable()
2. calling txg_wait_open() with should_quiesce = true

This patch addresses all the issues mentioned above.

A new cv, vdev_autotrim_kick_cv is added to kick autotrim activity.
It will be signaled once a txg is synced so that it does not change 
the original autotrim pace. Also because it is a cv, the wait is 
interruptible which speeds up the vdev_autotrim_stop_wait() call.

Finally, combining big zfs_txg_timeout, txg_wait_open() also causes
delay when exporting a pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Issue #8993
Closes #12194
2023-03-28 08:43:41 -07:00
Ameer Hamza
a05263b7aa
Update vdev state for spare vdev
zfsd fetches new pool configuration through ZFS_IOC_POOL_STATS but
it does not get updated nvlist configuration for spare vdev since
the configuration is read by spa_spares->sav_config. In this commit,
updating the vdev state for spare vdev that is consumed by zfsd on
spare disk hotplug.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14653
2023-03-24 10:30:38 -07:00
Richard Yao
d1807f168e nvpair: Constify string functions
After addressing coverity complaints involving `nvpair_name()`, the
compiler started complaining about dropping const. This lead to a rabbit
hole where not only `nvpair_name()` needed to be constified, but also
`nvpair_value_string()`, `fnvpair_value_string()` and a few other static
functions, plus variable pointers throughout the code. The result became
a fairly big change, so it has been split out into its own patch.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14612
2023-03-14 15:25:50 -07:00
Pawel Jakub Dawidek
67a1b03791
Implementation of block cloning for ZFS
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).

The whole design of block cloning is documented in module/zfs/brt.c.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #13392
2023-03-10 11:59:53 -08:00
Richard Yao
9f08b6e31f
Sync thread should avoid holding the spa config write lock when possible
spa_sync() currently grabs the write lock due to an old hack that is
documented by a comment:

    We need the write lock here because, for aux vdevs,
    calling vdev_config_dirty() modifies sav_config.
    This is ugly and will become unnecessary when we
    eliminate the aux vdev wart by integrating all vdevs
    into the root vdev tree.
 
This has lead to deadlocks in rare edge cases from holding the write 
lock. We can reduce incidence of these deadlocks by not grabbing the 
write lock on pools without auxillary vdevs.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14282
2023-02-16 14:10:52 -08:00
Richard Yao
cfb49616cd Cleanup: spa vdev processing should check NULL pointers
The PVS Studio 2016 FreeBSD kernel report stated:

\contrib\opensolaris\uts\common\fs\zfs\spa.c (1341): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1341, 1342.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1355): error V595: The 'spa->spa_l2cache.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1355, 1357.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1398): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1398, 1408.
\sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1583): error V595: The 'oldvdevs' pointer was utilized before it was verified against nullptr. Check lines: 1583, 1595.

In practice, all of these uses were safe because a NULL pointer
implied a 0 vdev count, which kept us from iterating over vdevs.
However, rearranging the code to check the pointer first is not a
terrible micro-optimization and makes it more readable, so let us
do that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14456
2023-02-06 11:09:28 -08:00
Matthew Ahrens
018f26041d
deadlock between spa_errlog_lock and dp_config_rwlock
There is a lock order inversion deadlock between `spa_errlog_lock` and
`dp_config_rwlock`:

A thread in `spa_delete_dataset_errlog()` is running from a sync task.
It is holding the `dp_config_rwlock` for writer (see
`dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.

A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock`
(see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as
reader).

Note that this was introduced by #12812.

This commit address this by defining the lock ordering to be
dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock.
spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this
order, and then process_error_block() and get_head_and_birth_txg() can
verify that the dp_config_rwlock is already held.

Additionally, a buffer overrun in `spa_get_errlog()` is corrected.  Many
code paths didn't check if `*count` got to zero, instead continuing to
overwrite past the beginning of the userspace buffer at `uaddr`.

Tested by having some errors in the pool (via `zinject -t data
/path/to/file`), one thread running `zpool iostat 0.001`, and another
thread runs `zfs destroy` (in a loop, although it hits the first time).
This reproduces the problem easily without the fix, and works with the
fix.

Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14239
Closes #14289
2022-12-22 11:48:49 -08:00
Ameer Hamza
c23738c70e
zed: Prevent special vdev to be replaced by hot spare
Special vdevs should not be replaced by a hot spare.
Log vdevs already support this, extending the
functionality for special vdevs.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14129
2022-11-04 11:33:47 -07:00
Richard Yao
44f71818f8 Silence static analyzer warnings about spa_sync_props()
Both Coverity and Clang's static analyzer complain about reading an
uninitialized intval if the property is not passed as DATA_TYPE_UINT64
in the nvlist. This is impossible becuase spa_prop_validate() already
checked this, but they are unlikely to be the last static analyzers to
complain about this, so lets just refactor the code to suppress the
warnings.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14043
2022-10-19 17:10:52 -07:00
Richard Yao
1bd02680c0 Fix NULL pointer dereference in spa_open_common()
Calling spa_open() will pass a NULL pointer to spa_open_common()'s
config parameter. Under the right circumstances, we will dereference the
config parameter without doing a NULL check.

Clang's static analyzer found this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14044
2022-10-18 15:34:54 -07:00
Richard Yao
6a42939fcd
Cleanup: Address Clang's static analyzer's unused code complaints
These were categorized as the following:

 * Dead assignment		23
 * Dead increment		4
 * Dead initialization		6
 * Dead nested assignment	18

Most of these are harmless, but since actual issues can hide among them,
we correct them.

That said, there were a few return values that were being ignored that
appeared to merit some correction:

 * `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from
   `destroy_batched()`. We handle it by returning -1 if there is an
   error.

 * `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from
   `zfs_for_each()`. We handle it by doing a binary OR of the error
   value from the subsequent `zfs_for_each()` call to the existing
   value. This is how errors are mostly handled inside `zfs_for_each()`.
   The error value here is passed to exit from the zfs command, so doing
   a binary or on it is better than what we did previously.

 * `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from
   `dsl_prop_get_ds()` when the property is not of type string. We
   return an error when it does. There is a small concern that the
   `zfs_get_temporary_prop()` call would handle things, but in the case
   that it does not, we would be pushing an uninitialized numval onto
   the lua stack. It is expected that `dsl_prop_get_ds()` will succeed
   anytime that `zfs_get_temporary_prop()` does, so that not giving it a
   chance to fix things is not a problem.

 * `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used
   `nvlist_add_nvlist()` twice in ways in which errors are expected to
   be impossible, so we switch to `fnvlist_add_nvlist()`.

A few notable ones did not merit use of the return value, so we
suppressed it with `(void)`:

 * `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error
   value from `describe_free()`. A look through the commit history
   revealed that this was intentional.

 * `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the
   returned handle from `arc_hdr_realloc()` because it is already
   referenced in lists.

 * `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly
   saying not to use the error from `vdev_label_init()` because whatever
   causes the error could be the reason why a detach is being done.

Unfortunately, I am not presently able to analyze the kernel modules
with Clang's static analyzer, so I could have missed some cases of this.
In cases where reports were present in code that is duplicated between
Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version
too.

After this commit is merged, regressions like dee8934 should become
extremely obvious with Clang's static analyzer since a regression would
appear in the results as the only instance of unused code. That assumes
that Coverity does not catch the issue first.

My local branch with fixes from all of my outstanding non-draft pull
requests shows 118 reports from Clang's static anlayzer after this
patch. That is down by 51 from 169.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13986
2022-10-14 13:37:54 -07:00
Richard Yao
ab8d9c1783 Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.

Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.

After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:

Parameters that were originally 64-bit on Illumos:

 * dbuf_cache_max_bytes
 * dbuf_metadata_cache_max_bytes
 * l2arc_feed_min_ms
 * l2arc_feed_secs
 * l2arc_headroom
 * l2arc_headroom_boost
 * l2arc_write_boost
 * l2arc_write_max
 * metaslab_aliquot
 * metaslab_force_ganging
 * zfetch_array_rd_sz
 * zfs_arc_max
 * zfs_arc_meta_limit
 * zfs_arc_meta_min
 * zfs_arc_min
 * zfs_async_block_max_blocks
 * zfs_condense_max_obsolete_bytes
 * zfs_condense_min_mapping_bytes
 * zfs_deadman_checktime_ms
 * zfs_deadman_synctime_ms
 * zfs_initialize_chunk_size
 * zfs_initialize_value
 * zfs_lua_max_instrlimit
 * zfs_lua_max_memlimit
 * zil_slog_bulk

Parameters that were originally 32-bit on Illumos:

 * zfs_per_txg_dirty_frees_percent

Parameters that were originally `ssize_t` on Illumos:

 * zfs_immediate_write_sz

Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.

Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:

 * l2arc_rebuild_blocks_min_l2size
 * zfs_key_max_salt_uses
 * zfs_max_log_walking
 * zfs_max_logsm_summary_length
 * zfs_metaslab_max_size_cache_sec
 * zfs_min_metaslabs_to_flush
 * zfs_multihost_interval
 * zfs_unflushed_log_block_max
 * zfs_unflushed_log_block_min
 * zfs_unflushed_log_block_pct
 * zfs_unflushed_max_mem_amt
 * zfs_unflushed_max_mem_ppm

New parameters that do not exist in Illumos:

 * l2arc_trim_ahead
 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_arc_sys_free
 * zfs_deadman_ziotime_ms
 * zfs_delete_blocks
 * zfs_history_output_max
 * zfs_livelist_max_entries
 * zfs_max_async_dedup_frees
 * zfs_max_nvlist_src_size
 * zfs_rebuild_max_segment
 * zfs_rebuild_vdev_limit
 * zfs_unflushed_log_txg_max
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift
 * zfs_vnops_read_chunk_size
 * zvol_max_discard_blocks

Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.

A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:

 * zfs_delete_blocks
 * zfs_key_max_salt_uses
 * zvol_max_discard_blocks

The documentation for a few parameters was found to be incorrect:

 * zfs_deadman_checktime_ms - incorrectly documented as int
 * zfs_delete_blocks - not documented as Linux only
 * zfs_history_output_max - incorrectly documented as int
 * zfs_vnops_read_chunk_size - incorrectly documented as long
 * zvol_max_discard_blocks - incorrectly documented as ulong

The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.

In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:

 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_per_txg_dirty_frees_percent
 * zfs_unflushed_log_block_pct
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift

Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.

Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-13 10:03:29 -07:00
Ameer Hamza
55c12724d3
zed: mark disks as REMOVED when they are removed
ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13797
2022-09-28 09:48:46 -07:00
Richard Yao
88b199c24e
Cleanup spa_export_common()
Coverity complains about a possible NULL pointer dereference. This is
impossible, but it suspects it because we do a NULL check against
`spa->spa_root_vdev`. This NULL check was never necessary and makes the
code harder to understand, so we drop it.

In particular, we dereference `spa->spa_root_vdev` when `new_state !=
POOL_STATE_UNINITIALIZED && !hardforce`. The first is only true when
spa_reset is called, which only occurs under fault injection.  The
second is true unless `zpool export -F $POOLNAME` is used.  Therefore,
we effectively *always* dereference the pointer. In the cases where we
do not, there is no reason to think it is unsafe.  Therefore this change
is safe to make.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13905
2022-09-27 16:45:51 -07:00
Richard Yao
fdc2d30371
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.

There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.

Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.

Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.

Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.

The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:

	* metaslab_load_pct
	* zfs_sync_taskq_batch_pct
	* zfs_zil_clean_taskq_nthr_pct
	* zfs_zil_clean_taskq_minalloc
	* zfs_zil_clean_taskq_maxalloc
	* zfs_arc_prune_task_threads

Also, negative values in those parameters was found to be harmless.

The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:

	* zfs_metaslab_switch_threshold
	* zfs_pd_bytes_max
	* zfs_livelist_min_percent_shared

zfs_multihost_history was made static to be consistent with other
parameters.

A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.

Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.

Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:

	* zfs_arc_meta_adjust_restarts
	* zfs_override_estimate_recordsize

They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.

dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.

If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.

This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.

Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models.  As a result, we change the existing parameters
that are uint32_t to use uint_t.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-27 16:42:41 -07:00
Richard Yao
d5d10f2aef
Cleanup dead spa_boot code
Unused code detected by coverity.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13868
2022-09-13 16:40:10 -07:00
Alek P
e8cf3a4f76
Implement a new type of zfs receive: corrective receive (-c)
This type of recv is used to heal corrupted data when a replica
of the data already exists (in the form of a send file for example).
With the provided send stream, corrective receive will read from
disk blocks described by the WRITE records. When any of the reads
come back with ECKSUM we use the data from the corresponding WRITE
record to rewrite the corrupted block.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Closes #9372
2022-07-28 15:52:46 -07:00
Tino Reichardt
1d3ba0bf01
Replace dead opensolaris.org license link
The commit replaces all findings of the link:
http://www.opensolaris.org/os/licensing with this one:
https://opensource.org/licenses/CDDL-1.0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13619
2022-07-11 14:16:13 -07:00
Finix1979
cb01da6805
Call nvlist_free before return
Fixes a small kernel memory leak which would occur if a pool failed
to import because the `DMU_POOL_VDEV_ZAP_MAP` key can't be read from 
a presumably damaged MOS config.  In the case of a missing key there 
was no leak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Finix1979 <yancw@info2soft.com>
Closes #13629
2022-07-07 11:43:58 -07:00
наб
a926aab902 Enable -Wwrite-strings
Also, fix leak from ztest_global_vars_to_zdb_args()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13348
2022-06-29 14:08:54 -07:00
Brian Behlendorf
2cd0f98f4a
Verify BPs in spa_load_verify_cb() and dsl_scan_visitbp()
We want `zpool import` to be highly robust and never panic, even
when encountering corrupt metadata.  This is already handled in the
arc_read() code path, which covers most cases, but spa_load_verify_cb()
relies on zio_read() and is responsible for verifying the block pointer.

During import it is also possible to encounter blocks pointers which
contain ZIO_COMPRESS_INHERIT and ZIO_CHECKSUM_INHERIT values.  Relax
the verification function slightly to allow this.

Futhermore, extend dsl_scan_recurse() to verify the block pointer
contents of level zero blocks which are not of type DMU_OT_DNODE or
DMU_OT_OBJSET.  This is handled by arc_read() in the other cases.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13124 
Closes #13360
2022-05-20 10:36:14 -07:00
Alexander Motin
600a02b884
Improve log spacemap load time
Previous flushing algorithm limited only total number of log blocks to
the minimum of 256K and 4x number of metaslabs in the pool.  As result,
system with 1500 disks with 1000 metaslabs each, touching several new
metaslabs each TXG could grow spacemap log to huge size without much
benefits.  We've observed one of such systems importing pool for about
45 minutes.

This patch improves the situation from five sides:
 - By limiting maximum period for each metaslab to be flushed to 1000
TXGs, that effectively limits maximum number of per-TXG spacemap logs
to load to the same number.
 - By making flushing more smooth via accounting number of metaslabs
that were touched after the last flush and actually need another flush,
not just ms_unflushed_txg bump.
 - By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
 - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making log spacemap load process for wide HDD pool CPU-bound,
accelerating it by many times.
 - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
single-threaded by nature log processing time from ~10 to ~5 minutes.

As further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring new pool feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12789
2022-04-26 10:44:21 -07:00
Brian Behlendorf
460748d4ae
Switch from _Noreturn to __attribute__((noreturn))
Parts of the Linux kernel build system struggle with _Noreturn.  This
results in the following warnings when building on RHEL 8.5, and likely
other environments.  Switch to using the __attribute__((noreturn)).

  warning: objtool: dbuf_free_range()+0x2b8:
    return with modified stack frame
  warning: objtool: dbuf_free_range()+0x0:
    stack state mismatch: cfa1=7+40 cfa2=7+8
  ...
  WARNING: EXPORT symbol "arc_buf_size" [zfs.ko] version generation
    failed, symbol will not be versioned.
  WARNING: EXPORT symbol "spa_open" [zfs.ko] version generation
    failed, symbol will not be versioned.
  ...

Additionally, __thread_exit() has been renamed spl_thread_exit() and
made a static inline function.  This was needed because the kernel
will generate a warning for symbols which are __attribute__((noreturn))
and then exported with EXPORT_SYMBOL.

While we could continue to use _Noreturn in user space I've also
switched it to __attribute__((noreturn)) purely for consistency
throughout the code base.

Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13238
2022-03-23 08:51:00 -07:00
наб
861166b027 Remove bcopy(), bzero(), bcmp()
bcopy() has a confusing argument order and is actually a move, not a
copy; they're all deprecated since POSIX.1-2001 and removed in -2008,
and we shim them out to mem*() on Linux anyway

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12996
2022-03-15 15:13:42 -07:00
Alejandro Colomar
db7f1a91de
Use _Noreturn (C11; GNU89) properly
A function that returns with no value is a different thing from a
function that doesn't return at all.  Those are two orthogonal
concepts, commonly confused.

pthread_create(3) expects a pointer to a start routine that has a
very precise prototype:

    void *(*start_routine)(void *);

However, other thread functions, such as kernel ones, expect:

    void (*start_routine)(void *);

Providing a different one is incorrect, and has only been working
because the ABIs happen to produce a compatible function.

We should use '_Noreturn void', since it's the natural type, and
then provide a '_Noreturn void *' wrapper for pthread functions.

For consistency, replace most cases of __NORETURN or
__attribute__((noreturn)) by _Noreturn.  _Noreturn is understood
by -std=gnu89, so it should be safe to use everywhere.

Ref: https://github.com/openzfs/zfs/pull/13110#discussion_r808450136
Ref: https://software.codidact.com/posts/285972
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Closes #13120
2022-03-04 16:25:22 -08:00
Jorgen Lundman
4759342a5e
Add spa _os() hooks
Add hooks for when spa is created, exported, activated and
deactivated. Used by macOS to attach iokit, and lock
kext as busy (to stop unloads).

Userland, Linux, and, FreeBSD have empty stubs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12801
2022-02-15 15:54:25 -08:00
George Amanakis
2fb52853dc
Avoid dirtying the final TXGs when exporting a pool
There are two codepaths than can dirty final TXGs:

1) If calling spa_export_common()->spa_unload()->
   spa_unload_log_sm_flush_all() after the spa_final_txg is set, then
   spa_sync()->spa_flush_metaslabs() may end up dirtying the final
   TXGs. Then we have the following panic:
   Call Trace:
    <TASK>
    dump_stack_lvl+0x46/0x62
    spl_panic+0xea/0x102 [spl]
    dbuf_dirty+0xcd6/0x11b0 [zfs]
    zap_lockdir_impl+0x321/0x590 [zfs]
    zap_lockdir+0xed/0x150 [zfs]
    zap_update+0x69/0x250 [zfs]
    feature_sync+0x5f/0x190 [zfs]
    space_map_alloc+0x83/0xc0 [zfs]
    spa_generate_syncing_log_sm+0x10b/0x2f0 [zfs]
    spa_flush_metaslabs+0xb2/0x350 [zfs]
    spa_sync_iterate_to_convergence+0x15a/0x320 [zfs]
    spa_sync+0x2e0/0x840 [zfs]
    txg_sync_thread+0x2b1/0x3f0 [zfs]
    thread_generic_wrapper+0x62/0xa0 [spl]
    kthread+0x127/0x150
    ret_from_fork+0x22/0x30
    </TASK>

2) Calling vdev_*_stop_all() for a second time in spa_unload() after
   spa_export_common() unnecessarily delays the final TXGs beyond what
   spa_final_txg is set at.

Fix this by performing the check and call for
spa_unload_log_sm_flush_all() before the spa_final_txg is set in
spa_export_common(). Also check if the spa_final_txg has already been
set in spa_unload() and skip those calls in this case.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
External-issue: https://www.illumos.org/issues/9081
Closes #13048 
Closes #13098
2022-02-15 15:48:59 -08:00
Jorgen Lundman
9a70e97fe1
Rename fallthrough to zfs_fallthrough
Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #13097
2022-02-15 08:58:59 -08:00
Alexander Motin
f2c5bc150e
Add more control/visibility to spa_load_verify().
Use error thresholds from policy to control whether to scrub data
and/or metadata.  If threshold is set to UINT64_MAX, then caller
probably does not care about result and we may skip that part.

By default import neither set the data error threshold nor read
the error counter, so skip the data scrub for faster import.
Metadata are still scrubbed and fail if even single error found.

While there just for symmetry return number of metadata errors in
case threshold is not set to zero and we haven't reached it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #13022
2022-02-04 13:06:38 -08:00
наб
7ada752a93 Clean up CSTYLEDs
69 CSTYLED BEGINs remain, appx. 30 of which can be removed if cstyle(1)
had a useful policy regarding
  CALL(ARG1,
  	ARG2,
  	ARG3);
above 2 lines. As it stands, it spits out *both*
  sysctl_os.c: 385: continuation line should be indented by 4 spaces
  sysctl_os.c: 385: indent by spaces instead of tabs
which is very cool

Another >10 could be fixed by removing "ulong" &al. handling.
I don't foresee anyone actually using it intentionally
(does it even exist in modern headers? why did it in the first place?).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12993
2022-01-26 11:38:52 -08:00
наб
18168da727
module/*.ko: prune .data, global .rodata
Evaluated every variable that lives in .data (and globals in .rodata)
in the kernel modules, and constified/eliminated/localised them
appropriately. This means that all read-only data is now actually
read-only data, and, if possible, at file scope. A lot of previously-
global-symbols became inlinable (and inlined!) constants. Probably
not in a big Wowee Performance Moment, but hey.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12899
2022-01-14 15:37:55 -08:00
наб
14e4e3cb9f module: zfs: fix unused, remove argsused
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12844
2021-12-23 09:42:47 -08:00
наб
36542b065d module: zfs: spa: shim out vdev_count_verify_zaps()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12844
2021-12-23 09:36:50 -08:00