- mmp_concurrent_import: added test case to verify concurrent
import correctness. The pool may only be imported once.
- mmp_exported_import: an activity check is now required for pools
which were cleanly exported if the system and pool hostids don't
match.
- mmp_inactive_import: an activity check is now required for any
pool which wasn't cleanly exported, even if the system and pool
hostids match.
- mmp_on_uberblocks: updated expected uberblocks to take into account
the value MMP_INTERVAL_DEFAULT is set to.
- mmp_reset_interval: reduce the number of iterations from 10 to 3.
This is sufficient to verify functionality and significantly speeds
up the test.
- mmp_on_uberblocks: adjust the thresholds and increase the runtime
to avoid false positives observed in CI.
- Update tests to use 'zhack action idle' instead of ztest to improve
the reliability of the tests.
- Add additional log_note messages to test cases which have multiple
verification steps to make it clear which portion of a test failed
when reviewing the logs.
- Replace default_setup/cleanup_noexit calls with 'zpool create' and
'zpool destroy' calls to avoid additional unnecessary dataset
creation work.
- Update activity/noactivity check helper functions to use the
ZFS_LOAD_INFO_DEBUG information now available from 'zpool import'
to determine if this activity check ran and why. This is more
reliable in the CI than measuring the runtime.
- Removed all mmp tests from the zts-report.py exceptions list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
As part of SPA_LOAD_IMPORT add an additional activity check to
detect simultaneous imports from different hosts. This check is
only required when the timing is such that there's no activity
for the read-only tryimport check to detect. This extra
safety check operates as follows:
1. Repeats the following MMP check 10 times:
a. Write out an MMP uberblock with the best txg and a random
sequence id to all primary pool vdevs.
b. Verify a minimum number of good writes such that even if
the pool appears degraded on the remote host it will see
at least one of the updated MMP uberblocks.
c. Wait for the MMP interval; this leaves a window for other
racing hosts to make similar modifications which can be
detected.
d. Call vdev_uberblock_load() to determine the best uberblock
to use, this should be the MMP uberblock just written.
e. Verify the txg and random sequence number match the MMP
uberblock written in 1a.
2. Restore the original MMP uberblocks. This allows the check
to be performed again if the pool fails to import for an
unrelated reason.
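The steps above can be sketched as a small in-memory simulation. This is
only an illustration, not the code added to SPA_LOAD_IMPORT: every ex_*
name, the four-vdev array, and the race hook are hypothetical, step 1b
(counting good writes) is omitted because every simulated write succeeds,
and the "best uberblock" comparison is reduced to txg and then sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define	EX_CHECK_ROUNDS	10	/* step 1: repeat the check 10 times */
#define	EX_NUM_VDEVS	4	/* hypothetical number of primary vdevs */

/* Hypothetical, simplified uberblock: only the fields the check compares. */
typedef struct ex_uberblock {
	uint64_t txg;
	uint16_t mmp_seq;
} ex_uberblock_t;

/* Hypothetical hook standing in for a racing host during the wait (1c). */
typedef void (*ex_race_fn)(ex_uberblock_t *labels, int round);

/* Simplified stand-in for vdev_uberblock_load(): best txg, then best seq. */
static ex_uberblock_t
ex_best_uberblock(const ex_uberblock_t *labels)
{
	ex_uberblock_t best = labels[0];

	for (int i = 1; i < EX_NUM_VDEVS; i++) {
		if (labels[i].txg > best.txg ||
		    (labels[i].txg == best.txg &&
		    labels[i].mmp_seq > best.mmp_seq))
			best = labels[i];
	}
	return (best);
}

/* Returns true if no foreign write was seen in any of the rounds. */
static bool
ex_import_activity_check(ex_uberblock_t *labels, ex_race_fn race)
{
	ex_uberblock_t saved[EX_NUM_VDEVS];
	bool clean = true;

	assert(labels != NULL);
	memcpy(saved, labels, sizeof (saved));
	uint64_t best_txg = ex_best_uberblock(labels).txg;

	for (int round = 0; round < EX_CHECK_ROUNDS && clean; round++) {
		/* 1a. Write the best txg with a random sequence id. */
		ex_uberblock_t mine = { best_txg, (uint16_t)(rand() & 0xffff) };
		for (int i = 0; i < EX_NUM_VDEVS; i++)
			labels[i] = mine;

		/* 1c. The wait window in which another host may write. */
		if (race != NULL)
			race(labels, round);

		/* 1d/1e. Reload the best uberblock and compare with ours. */
		ex_uberblock_t best = ex_best_uberblock(labels);
		if (best.txg != mine.txg || best.mmp_seq != mine.mmp_seq)
			clean = false;
	}

	/* 2. Restore the original uberblocks so the check can run again. */
	memcpy(labels, saved, sizeof (saved));
	return (clean);
}

/* Example racer: a remote host syncs a newer txg during round 3. */
static void
ex_remote_writer(ex_uberblock_t *labels, int round)
{
	if (round == 3)
		labels[0].txg += 1;
}
```

Calling ex_import_activity_check() with a NULL hook models a quiet pool;
passing ex_remote_writer models a second host that becomes visible partway
through the ten rounds.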
This change also includes some refactoring and minor improvements.
- Never try loading earlier txgs during import when the import
fails with EREMOTEIO or EINTR. These errors don't indicate
the txg is damaged but instead that it's either in use on a
remote host or the import was interactively cancelled. No
rewind is also performed for EBADD which can result from a
stale trusted config when doing a verbatim import.
- Refactor the code for consistent logging of the multihost
activity check using spa_load_note() and console messages
indicating when the activity check was triggered and the result.
- Added MMP_*_MASK and MMP_SEQ_CLEAR() macros to allow easier
modification of the sequence number in an uberblock.
- Added ZFS_LOAD_INFO_DEBUG environment variable which can be
set to dump to stdout the spa_load_info nvlist returned
during import. This is used by the updated mmp test cases
to determine if an activity check was run and its result.
- Standardize the mmp messages similarly to make it easier to
find all the relevant mmp lines in the debug log.
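As an illustration of what mask-style macros such as MMP_*_MASK and
MMP_SEQ_CLEAR() make easier, here is a minimal sketch. The EX_* names and
the field position (a 16-bit sequence at bits 32..47 of a 64-bit config
word) are assumptions made for this example and are not taken from the
change itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration: 16-bit sequence at bits 32..47. */
#define	EX_MMP_SEQ_SHIFT	32
#define	EX_MMP_SEQ_MASK		(UINT64_C(0xFFFF) << EX_MMP_SEQ_SHIFT)

static_assert(EX_MMP_SEQ_MASK == UINT64_C(0x0000FFFF00000000),
    "mask must cover exactly the assumed sequence field");

/* Extract the sequence field. */
#define	EX_MMP_SEQ(cfg) \
	(((cfg) & EX_MMP_SEQ_MASK) >> EX_MMP_SEQ_SHIFT)

/* Zero the sequence field, leaving every other bit untouched. */
#define	EX_MMP_SEQ_CLEAR(cfg) \
	((cfg) & ~EX_MMP_SEQ_MASK)

/* Replace the sequence field: clear it first, then OR in the new value. */
#define	EX_MMP_SEQ_SET(cfg, seq) \
	(EX_MMP_SEQ_CLEAR(cfg) | \
	(((uint64_t)(seq) << EX_MMP_SEQ_SHIFT) & EX_MMP_SEQ_MASK))
```

With a clear macro available, swapping in a random sequence number is a
single expression and cannot disturb the neighbouring fields.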
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Tryimport adds a unique prefix to the pool name to avoid name
collisions. This makes it awkward to log user-friendly info
during a tryimport. Add a spa_load_name() function which can
be used to report the unmodified pool name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Move the "Starting import" log message into the import block so
it's matched with the "Finished importing" debug message.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
For cleanly exported pools there exists a small window where
both systems may determine it's safe to import the pool and skip
the activity check. Only allow the check to be skipped when the
last imported hostid matches the system's hostid and the pool was
cleanly exported.
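The rule can be written as a single predicate. This is a sketch only: the
function name and parameters are hypothetical, and the real import path
considers more state than these two inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: the activity check may be skipped only when BOTH
 * conditions hold. A hostid match alone, or a clean export alone, is not
 * enough.
 */
static bool
ex_activity_check_required(uint64_t pool_hostid, uint64_t system_hostid,
    bool cleanly_exported)
{
	/* Simplification for this sketch: a hostid is assumed to be set. */
	assert(system_hostid != 0);

	if (cleanly_exported && pool_hostid == system_hostid)
		return (false);
	return (true);
}
```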
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
They aren't used outside these very small blocks, and their initial
values are never used at all.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
Free issue threads might block waiting for synchronous DDT, BRT or
GANG header reads. So unlike other taskqs using ZTI_SCALE to scale
with number of CPUs, here we also need some amount of threads to
potentially saturate pool reads. I am not sure we always want the
96 threads we had before ZTI_SCALE introduction at #11966 on small
systems, but let's make it at least 32.
While here, make free taskqs configurable, similar to read and
write ones.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17903
This allows changing the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.
This upstreams commit fd141584cf89d7d2 from FreeBSD src.
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes #17489
I've noticed that after some dedup tests a system reboot ends in an
assertion about the ms_defer tree not being free. It seems to be caused
by DDT flushing still freeing some blocks while ZFS is trying to reach
a final steady state, because spa_final_txg, while set by
spa_export_common() on pool export, is not set when spa_unload()
is called by spa_evict_all() on system reboot/shutdown. Setting
spa_final_txg in spa_unload() fixes this issue.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17395
It's possible for two spares to get attached to a single failed vdev.
This happens when you have a failed disk that is spared, and then you
replace the failed disk with a new disk, but during the resilver
the new disk fails, and ZED kicks in a spare for the failed new
disk. This commit checks for that condition and disallows it.
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #16547
Closes: #17231
(cherry picked from commit f40ab9e399)
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17187
This helps avoid confusion with the similarly-named
txg_wait_synced().
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
People have noted there's no way to remove a pool userprop, only zero
it. Turns out vdev userprops had a method, by setting empty-string. So this
makes pool userprops follow the same behaviour.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16887
In l2arc_evict(), the config lock may be acquired in reverse order
(e.g., first the config lock (writer), then a hash lock) unlike in
arc_read() during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17071
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.
Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Similar to what we saw in #16569, we need to consider that a
replacing vdev should not be considered as fully contributing
to the redundancy of a raidz vdev even though current IO has
enough redundancy.
When a failed vdev_probe() is faulting a disk, it now checks
if that disk is required, and if so it suspends the pool until
the admin can return the missing disks.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16864
cstyle can handle these cases now, so we don't need to disable it.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16840
In some cases like dsl_dataset_hold_obj() it is possible to handle
those errors, so failure to hold dataset should be better than
kernel panic. Some other places where these errors are still not
handled but asserted should be less dangerous just as unreachable.
We have a user report about pool corruption leading to assertions
on these errors. Hopefully this will make behavior a bit nicer.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16836
Some users might want to scrub only new data because they would like
to know if the new write wasn't corrupted. This PR adds the possibility
to scrub only newly written data.
This introduces a new `last_scrubbed_txg` property, indicating the
transaction group (TXG) up to which the most recent scrub operation
has checked and repaired the dataset, so users can run scrub only
from the last saved point. We use a scn_max_txg and scn_min_txg
which are already built into scrub, to accomplish that.
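A minimal sketch of how a txg window restricts a scrub to new data. The
struct, the function names, and the exact boundary handling (exclusive
lower bound, inclusive upper bound) are assumptions for illustration and
are not copied from the scan code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the scn_min_txg / scn_max_txg pair. */
typedef struct ex_scrub_range {
	uint64_t min_txg;	/* blocks born at or before this are skipped */
	uint64_t max_txg;	/* blocks born after this are skipped */
} ex_scrub_range_t;

/* Build the window for a scrub that resumes from the last saved point. */
static ex_scrub_range_t
ex_incremental_range(uint64_t last_scrubbed_txg, uint64_t current_txg)
{
	assert(last_scrubbed_txg <= current_txg);
	ex_scrub_range_t r = { last_scrubbed_txg, current_txg };
	return (r);
}

/* Visit a block only if it was born inside (min_txg, max_txg]. */
static bool
ex_scrub_should_visit(const ex_scrub_range_t *r, uint64_t birth_txg)
{
	return (birth_txg > r->min_txg && birth_txg <= r->max_txg);
}
```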
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Closes #16301
Without doing that there is a race window on export when a history
log write by a completed rebuild dirties a transaction beyond the
final one, triggering an assertion.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16714
Closes #16782
This patch is preparatory work for the long name feature. It changes all
users of zap_attribute_t to allocate it from kmem instead of the stack. It
also makes the zap_attribute_t and zap_name_t structures variable length.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15921
All callers to spa_prop_get() and spa_prop_get_nvlist() supplied their
own preallocated nvlist (except ztest), so we can remove the option to
have them allocate one if none is supplied.
This sidesteps a bug in spa_prop_get(), where the error var wasn't
initialised, which could lead to the provided nvlist being freed at the
end.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16505
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.
This change also adds some general tests for zpool-reguid(8).
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Report the correct error message in libzfs when attaching/replacing a
vdev with a higher ashift.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16449
Spare and l2cache vdev labels are not updated during import. Therefore,
if disk paths are updated between pool export and import, the AUX label
still shows the old paths. This patch syncs the AUX label
during import to show the correct path information.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #15817
This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.
Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Closes #15890
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool
When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.
dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
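The enforcement rule reads naturally as a small decision function. This is
a sketch under stated assumptions: the enum, the function, the use of 0 to
mean "no quota set", and treating a table exactly at its quota as full are
all choices made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum ex_write_kind {
	EX_WRITE_DEDUP,		/* create or update a DDT entry */
	EX_WRITE_REGULAR	/* fall back to a normal non-dedup write */
} ex_write_kind_t;

static ex_write_kind_t
ex_choose_write(bool entry_exists, uint64_t ddt_size, uint64_t quota)
{
	/* Updating an existing entry reuses its space: always allowed. */
	if (entry_exists)
		return (EX_WRITE_DEDUP);

	/* A new entry would grow the table: refuse once the quota is hit. */
	if (quota != 0 && ddt_size >= quota)
		return (EX_WRITE_REGULAR);

	return (EX_WRITE_DEDUP);
}
```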
Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes #15889
- Add old eviction for special and dedup metaslab classes. Those
vdevs may be potentially big and fragmented with large metaslabs,
while their asynchronous write pattern is not really different
from normal class. It seems an omission to not evict old metaslabs
from them.
- If we have metaslab preload enabled, which means we are not too
low on memory, do not evict active metaslabs even if they are not
used for some time. Eviction of active metaslabs means we won't
be able to write anything until we load them, that may take some
time, that is straight opposite to metaslab preload goals. For
small systems the memory saving should be less important after
recent reduction in number of allocators and so open metaslabs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16214
On 32-bit platforms long long is generally 64-bits. Sufficiently modern
versions of gcc (13 in my testing) complain when casting a pointer to
an integer of a different width so cast to uintptr_t first to avoid the
warning.
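A minimal sketch of the cast pattern, using hypothetical helper names. The
intermediate uintptr_t has the same width as a pointer, so the only width
change happens between two integer types, which compilers accept without
complaint.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof (unsigned long long) >= sizeof (uintptr_t),
    "the wide integer must be able to hold a pointer value");

/* Pointer -> uintptr_t (same width) -> wider integer. */
static unsigned long long
ex_ptr_to_ull(const void *p)
{
	return ((unsigned long long)(uintptr_t)p);
}

/* Wider integer -> uintptr_t -> pointer, the reverse of the above. */
static void *
ex_ull_to_ptr(unsigned long long v)
{
	return ((void *)(uintptr_t)v);
}
```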
Fixes: c183d164aa Parallel pool import
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #16203
This renames it to spa_taskq_dispatch(), and reduces and simplifies its
arguments based on these observations from its two call sites:
- arg is always the zio, so it can be typed that way, and we don't need
to provide it twice;
- ent is always &zio->io_tqent, and zio is always provided, so we can
use it directly;
- the only flag used is TQ_FRONT, which can just be a bool;
- zio != NULL was part of the "use allocator" test, but it never would
have got that far, because that arg was only set to NULL in the
reexecute path, which is forced to type CLAIM, so the condition would
fail at t == WRITE anyway.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16151
It is the only user of spa_taskq_dispatch_select(), so might as well
just carry it directly.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16151
It has no callers anymore.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16151
If we wait until after we check for no spa references to drop the
namespace lock, then we know that spa consumers will need to call
spa_lookup() and end up waiting on the spa_namespace_cv until we
finish. This narrows the external checks to spa_lookup and we no
longer need to worry about the spa_vdev_enter case.
Sponsored-By: Klara Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16153
Changed spa_export_common() such that it no longer holds the
spa_namespace_lock for the entire duration and instead sets
spa_export_thread to indicate an export is in progress on the
spa. This allows for an export of a different pool to proceed
in parallel while an export is still processing potentially
long operations like spa_unload_log_sm_flush_all().
Calls like spa_lookup() and spa_vdev_enter() that rely on
the spa_namespace_lock to serialize them against a concurrent
export, now wait for any in-progress export thread to complete
before proceeding.
The 'zpool export -a' sub-command also provides multi-threaded
support, using a thread pool to submit the exports in parallel.
Sponsored-By: Klara Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16153
High priority threads are handling ZIL writes. While there is no
ZIL compression, there is encryption, checksumming and RAIDZ math.
We've found that on large systems 1 taskq with 5 threads can be
a bottleneck for throughput, IOPS or both. Instead of just bumping
number of threads with a risk of overloading CPUs and increasing
latency, switch to using TQ_FRONT mechanism to increase sync write
requests priority within standard write threads. Do not do it on
Illumos, since its TQ_FRONT implementation is inherently unfair.
FreeBSD and Linux don't have this problem, so we can do it there.
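The idea of raising priority inside a shared queue, rather than adding a
separate high-priority queue, can be sketched with a tiny ring buffer. This
is an illustration only: ex_taskq_t and its functions are hypothetical and
do not reflect how any platform implements TQ_FRONT.

```c
#include <assert.h>
#include <stdbool.h>

#define	EX_QCAP	16

/* Hypothetical fixed-size task queue holding integer task ids. */
typedef struct ex_taskq {
	int items[EX_QCAP];
	int head;	/* index of the next task to run */
	int count;	/* number of queued tasks */
} ex_taskq_t;

/* Queue a task at the back, or at the front when 'front' is set. */
static bool
ex_dispatch(ex_taskq_t *tq, int task, bool front)
{
	if (tq->count == EX_QCAP)
		return (false);
	if (front) {
		tq->head = (tq->head + EX_QCAP - 1) % EX_QCAP;
		tq->items[tq->head] = task;
	} else {
		tq->items[(tq->head + tq->count) % EX_QCAP] = task;
	}
	tq->count++;
	return (true);
}

/* Remove and return the task at the head of the queue. */
static int
ex_next(ex_taskq_t *tq)
{
	assert(tq->count > 0);
	int task = tq->items[tq->head];
	tq->head = (tq->head + 1) % EX_QCAP;
	tq->count--;
	return (task);
}
```

A sync write dispatched with front set runs before async work that was
already queued, while still sharing the same worker threads.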
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #16146
- Reduce number of allocators on small system down to one per 4
CPU cores, keeping maximum at 4 on 16+ core systems. Small systems
should not have the lock contention multiple allocators are supposed
to solve, while having several metaslabs open and modified each
TXG is not free.
- Reduce number of write issue taskqs down to one per 16 CPU
cores and an integer fraction of number of allocators. On mid-
sized systems, where multiple allocators already make sense, too
many write issue taskqs may reduce write speed on single-file
workloads, since single file is handled by only one taskq to
reduce fragmentation. On large systems, that can actually benefit
from many taskq's better IOPS, the bottleneck is less important,
since in worst case there will be at least 16 cores to handle it.
- Distribute dnodes between allocators (and taskqs) in a round-
robin fashion instead of relying on sync taskqs to be balanced.
The last is not guaranteed and may depend on scheduling.
- Remove io_wr_iss_tq from struct zio. io_allocator is enough.
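Two of the points above, sizing the allocator count and handing out
allocators round-robin, can be sketched as follows. The function names and
the cursor are hypothetical; the write issue taskq count is left out
because its exact rounding is not spelled out here.

```c
#include <assert.h>

/* One allocator per 4 CPU cores, at least 1 and at most 4. */
static unsigned
ex_allocator_count(unsigned ncpus)
{
	unsigned n = ncpus / 4;

	if (n < 1)
		n = 1;
	if (n > 4)
		n = 4;
	return (n);
}

/*
 * Round-robin assignment: each new dnode takes the next allocator in
 * turn, independent of which taskq thread happens to be scheduled.
 */
static unsigned
ex_next_allocator(unsigned *cursor, unsigned nalloc)
{
	assert(nalloc > 0);
	unsigned a = *cursor % nalloc;
	*cursor = (*cursor + 1) % nalloc;
	return (a);
}
```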
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16130
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.
Also allow zpool clear if there is no evidence that another host
is using the pool.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15839
There is no reason for these module parameters to be read-only.
When modified they simply apply on the next pool import/creation,
which is useful for testing different values.
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16118
This commit allows spa_load() to drop the spa_namespace_lock so
that imports can happen concurrently. Prior to dropping the
spa_namespace_lock, the import logic will set the spa_load_thread
value to track the thread which is doing the import.
Consumers of spa_lookup() retain the same behavior by blocking
when either a thread is holding the spa_namespace_lock or the
spa_load_thread value is set. This will ensure that critical
concurrent operations cannot take place while a pool is being
imported.
The zpool command is also enhanced to provide multi-threaded support
when invoking zpool import -a.
Lastly, zinject provides a mechanism to insert artificial delays
when importing a pool and new zfs tests are added to verify parallel
import functionality.
Contributions-by: Don Brady <don.brady@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #16093
The only possible ioctl is a flush, and any other kind of meta-operation
introduced in the future is likely to have different semantics (much
like trim did). So, let's just call it what it is.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16064
Currently, zpool add allows users to add top-level vdevs that have
different ashifts but doing so prevents users from being able to
perform a top-level vdev removal. Often times consumers may not realize
that they have mismatched ashifts until the top-level removal fails.
This feature adds ashift validation to the zpool add command and will
fail the operation if the sector size of the specified vdev does not
match the existing pool. This behavior can be disabled by using the -f
flag. In addition, new flags have been added to provide fine-grained
control to disable specific checks. These flags
are:
--allow-in-use
--allow-ashift-mismatch
--allow-replication-mismatch
The force flag will disable all of these checks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mmaybee@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #15509
There exist a couple of macros that are used to update the blkptr birth
times but they can often be confusing. For example, the
BP_PHYSICAL_BIRTH() macro will provide either the physical birth time
if it is set or else return back the logical birth time. The
complement to this macro is BP_SET_BIRTH() which will set the logical
birth time and set the physical birth time if they are not the same.
Consumers may get confused when they are trying to get the physical
birth time and use the BP_PHYSICAL_BIRTH() macro only to find out that
the logical birth time is what is actually returned.
This change cleans up these macros and makes them symmetrical. The same
functionality is preserved but the name is changed. Instead of calling
BP_PHYSICAL_BIRTH(), consumers can now call BP_GET_BIRTH(). In
addition to cleaning up these naming conventions, two new sets of
macros are introduced -- BP_[SET|GET]_LOGICAL_BIRTH() and
BP_[SET|GET]_PHYSICAL_BIRTH(). These new macros allow the consumer to
get and set the specific birth time.
As part of the cleanup, the unused GRID macros have been removed and
that portion of the blkptr is currently unused.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #15962
For FreeBSD sysctls, we don't want the extra newline, since the
sysctl(8) utility will format strings appropriately.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15719
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by
sbuf_new_for_sysctl(). In particular, this code triggers an assertion
failure in sbuf_clear().
Simplify by just using sysctl_handle_string() for both reading and
setting the tunable.
Fixes: 6930ecbb7 ("spa: make read/write queues configurable")
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #15719
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there are very few ways to change behaviour without patching.
This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.
This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kinds of systems.
Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. It's expected that anyone using this is going
to have the source code there as well.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15675
Detail the import progress of log spacemaps as they can take a very
long time. Also grab the spa_note() messages, as they provide
insight into what is happening.
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Closes #15539
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15536
Closes #15564
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
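The arithmetic behind the last paragraph can be worked through with a
simplified model. The helper names are hypothetical, padding and partial
stripes are ignored, and the way reported space is derived from raw space
is one reading of the "assumed parity ratio", not a description of the
actual accounting code.

```c
#include <assert.h>
#include <stdint.h>

/* Raw sectors consumed to store 'data' sectors at a given width/parity. */
static uint64_t
ex_raw_sectors(uint64_t data, unsigned width, unsigned parity)
{
	assert(width > parity);
	return (data * width / (width - parity));
}

/* Space reported when raw usage is scaled by the assumed (old) ratio. */
static uint64_t
ex_reported_sectors(uint64_t raw, unsigned assumed_width, unsigned parity)
{
	assert(assumed_width > parity);
	return (raw * (assumed_width - parity) / assumed_width);
}
```

For 60 data sectors on RAIDZ2, a 5-wide layout consumes 100 raw sectors
(3 data to 2 parity) and a 6-wide layout consumes 90 (4 data to 2 parity).
Scaling the 90 raw sectors by the unchanged 5-wide ratio gives 54, which
is the "slightly less than expected" figure for newly written blocks in
this model.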
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #15022