This adds zfs_valstr, a collection of pretty printers for bitfields and
enums. These are useful in debugging, logging and other display contexts
where raw values are difficult for the untrained (or even trained!) eye
to decipher.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Requires the new 'flat' physical data which has the start
time for a class entry.
The amount to prune can be based on a target percentage of
the unique entries or based on the age (i.e., every entry
older than N days).
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#16277
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.
This change also adds some general tests for zpool-reguid(8).
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
This is the supporting infrastructure for the upcoming dedup features.
Traditionally, dedup objects live directly in the MOS root. While their
details vary (checksum, type and class), they are all the same "kind" of
thing - a store of dedup entries.
The new features are more varied than that, and are better thought of as
a set of related stores for the overall state of a dedup table.
This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will
cause new DDTs to be created as a ZAP in the MOS root, named
DDT-<checksum>. The is used as the root object for the normal type/class
store objects, but will also be a place for any storage required by new
features.
This commit adds two new fields to ddt_t, for version and flags. These
are intended to describe the structure and features of the overall dedup
table, and are stored as-is in the DDT root. In this commit, flags are
always zero, but the intent is that they can be used to hang optional
logic or state onto for new dedup features. Version is always 1.
For a "legacy" dedup table, where no DDT root directory exists, the
version will be 0.
ddt_configure() is expected to determine the version and flags features
currently in operation based on whether or not the fast_dedup feature is
enabled, and from what's available on disk. In this way, its possible to
support both old and new tables.
This also provides a migration path. A legacy setup can be upgraded to
FDT by creating the DDT root ZAP, moving the existing objects into it,
and setting version and flags appropriately. There's no support for that
here, but it would be straightforward to add later and allows the
possibility that newer features could be applied to existing dedup
tables.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes#15892
Report the correct error message in libzfs when attaching/replacing a
vdev with a higher ashift.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#16449
When importing multiple pools, the nvlist of properties given with "-o"
is shared amongst the several threads. So no thread should modify it.
Previously, in the course of validating the cachefile property, the
zpool_valid_proplist function would temporarily modify the value, and
then change it back. Now it will operate on a clone of the value.
Sponsored by: Axcient
Fixes#16405
Signed-off-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
This fixes things so mirrored special vdevs report themselves as
"class=special" rather than "class=normal".
This happens due to the way the vdev nvlists are constructed:
mirrored special devices - The 'mirror' vdev has allocation bias as
"special" and it's leaf vdevs are "normal"
single or RAID0 special devices - Leaf vdevs have allocation bias as
"special".
This commit adds in code to check if a leaf's parent is a "special"
vdev to see if it should also report "special".
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#16217
This commit adds support for zpool status command to displpay status
of ZFS pools in JSON format using '-j' option. Status information is
collected in nvlist which is later dumped on stdout in JSON format.
Existing options for zpool status work with '-j' flag. man page for
zpool status is updated accordingly.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#16217
This commit adds support for zpool get command to output the list of
properties for ZFS Pools and VDEVS in JSON format using '-j' option.
Man page for zpool get is updated to include '-j' option.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#16217
This commit adds support for JSON output for zfs version and zfs get
commands. '-j' flag can be used to get output in JSON format.
Information is collected in nvlist objects which is later printed in
JSON format. Existing options that work for zfs get and zfs version
also work with '-j' flag.
man pages for zfs get and zfs version are updated accordingly.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#16217
This change adds a new `zpool prefetch -t ddt $pool` command which
causes a pool's DDT to be loaded into the ARC. The primary goal is to
remove the need to "warm" a pool's cache before deduplication stops
slowing write performance. It may also provide a way to reload portions
of a DDT if they have been flushed due to inactivity.
Sponsored-by: iXsystems, Inc.
Sponsored-by: Catalogics, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Fred Weigel <fred.weigel@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Will Andrews <will.andrews@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Closes#15890
This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool
When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.
dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.
Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Sean Eric Fagan <sean.fagan@klarasystems.com>
Closes#15889
Sponsored-by: Klara, Inc.
Sponsored-By: Wasabi Technology, Inc.
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Current output:
> receiving correctivefull stream of a into b
New output:
> receiving corrective full stream of a into b
Signed-off-by: glibg10b <56197853+glibg10b@users.noreply.github.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
C99 6.7.8.17 says that when an undesignated initialiser is used, only
the first element of a union is initialised. If the first element is not
the largest within the union, how the remaining space is initialised is
up to the compiler.
GCC extends the initialiser to the entire union, while Clang treats the
remainder as padding, and so initialises according to whatever
automatic/implicit initialisation rules are currently active.
When Linux is compiled with CONFIG_INIT_STACK_ALL_PATTERN,
-ftrivial-auto-var-init=pattern is added to the kernel CFLAGS. This flag
sets the policy for automatic/implicit initialisation of variables on
the stack.
Taken together, this means that when compiling under
CONFIG_INIT_STACK_ALL_PATTERN on Clang, the "zero" initialiser will only
zero the first element in a union, and the rest will be filled with a
pattern. This is significant for aes_ctx_t, which in
aes_encrypt_atomic() and aes_decrypt_atomic() is initialised to zero,
but then used as a gcm_ctx_t, which is the fifth element in the union,
and thus gets pattern initialisation. Later, it's assumed to be zero,
resulting in a hang.
As confusing and undiscoverable as it is, by the spec, we are at fault
when we initialise a structure containing a union with the zero
initializer. As such, this commit replaces these uses with an explicit
memset(0).
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16135Closes#16206
If it's going to be used directly by zdb/ztest, then it sort of doesn't
make sense to carry it with the assert code.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#16181
Ever since a10d50f999, ZFS has mounted file systems in parallel when
importing a pool. It uses a fixed size of 512 for the thread pool. But
since c183d164aa, it has also imported pools in parallel. So the total
number of threads at one time is 513 * npools + 1. That can easily
exceed the system's limit on the number of threads per process, which
will cause one or more pools to be unable to allocate any worker
threads, forcing them to fallback to slow serial mounting . To
forestall that, manage the threadpool size in /sbin/zpool, not libzfs.
Use the same size (512), but divided by the number of pools.
This is a backwards-incompatible change to the libzfs abi.
Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Closes#16178
During parallel zpool import, /sbin/zpool will create a separate thread
pool for each pool, used to mount that pool's datasets. If the total
thread count exceed's the system's limit on threads per process, then
tpool_dispatch may fail. If it does, directly execute the mount
operation instead.
Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Closes#16178Fixes#16172
In P2ALIGN, the result would be incorrect when align is unsigned
integer and x is larger than max value of the type of align.
In that case, -(align) would be a positive integer, which means
high bits would be zero and finally stay zero after '&' when
align is converted to a larger integer type.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com>
Closes#15940
Someone came to me and pointed out that you could pretty
readily cause the refreservation calculation to exceed
2**64, given the 2**17 multiplier in it, and produce
refreservations wildly less than the actual volsize in cases where
it should have failed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#15996
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason Lee <jasonlee@lanl.gov>
Closes#16074
In `zpool status -t`, scrub date/time is reported using the C locale,
while trim time is reported using the current one. This is inconsistent.
This patch fixes that.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Maxim Filimonov <che@bein.link>
Closes#15878Closes#15879
Currently, zpool add allows users to add top-level vdevs that have
different ashifts but doing so prevents users from being able to
perform a top-level vdev removal. Often times consumers may not realize
that they have mismatched ashifts until the top-level removal fails.
This feature adds ashift validation to the zpool add command and will
fail the operation if the sector size of the specified vdev does not
match the existing pool. This behavior can be disabled by using the -f
flag. In addition, new flags have been added to provide fine-grained
control to disable specific checks. These flags
are:
--allow-in-use
--allow-ashift-mismatch
--allow-replicaton-mismatch
The force flag will disable all of these checks.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Maybee <mmaybee@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes#15509
On Linux block devices used for vdevs will by partitioned. The block
device must be large enough for an 64M partition starting at offset
of 2048 sectors (part1), and a second 64M reserved partition at the
end of the device (part9).
This commit adds a capacity check when creating the GPT label to
immediately detect a device which is too small. With the existing
code this would be caught slightly latter when attempting to use
the partition. Catching it sooner let's us print a more useful error.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#15898
Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay). However, no action is taken by ZED,
like is done for checksum or I/O errors. This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
VDEV_PROP_SLOW_IO_N
VDEV_PROP_SLOW_IO_T
If multiple VDEVs in a pool are undergoing slow I/Os, then it skips
the zpool_vdev_degrade().
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes#15469
Drop the no_memory() call from zpool_in_use() when reading the
label fails and instead return the error to the caller. This
prevents a misleading "internal error: out of memory" error
when the label can't be read. This will result in is_spare()
returning B_FALSE instead of aborting, which is already safely
handled.
Furthermore, on Linux it's possible for EREMOTEIO to returned
by an NVMe device if the device has been low-level formatted
and not rescanned. In this case we want to fallback to the
legacy scanning method and read any of the labels we can.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #13538Closes#15747
Add `zpool` flags to control the slot power to drives. This assumes
your SAS or NVMe enclosure supports slot power control via sysfs.
The new `--power` flag is added to `zpool offline|online|clear`:
zpool offline --power <pool> <device> Turn off device slot power
zpool online --power <pool> <device> Turn on device slot power
zpool clear --power <pool> [device] Turn on device slot power
If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power'
option is automatically implied for `zpool online` and `zpool clear`
and does not need to be passed.
zpool status also gets a --power option to print the slot power status.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mart Frauenlob <AllKind@fastest.cc>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15662
Check for the existence of execvpe(3) and only provide the FreeBSD
compat version if required.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#15609
With FreeBSD's switch to git the $FreeBSD$ string is no longer expanded
and they have mostly been removed upstream. Stop using __FBSDID and
remove the no-longer needed sys/cdefs.h includes.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#15527
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).
== Initiating expansion ==
A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`. The new device will become part of the RAIDZ group. A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.
The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.
== During expansion ==
The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).
The expansion progress can be monitored with `zpool status`.
Data redundancy is maintained during (and after) the expansion. If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).
The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
== After expansion ==
When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).
Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).
A RAIDZ vdev can be expanded multiple times.
After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks. New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes#15022
There have been rare cases where the VDEV_ENC_SYSFS_PATH value that zed
gets passed is stale. To mitigate this, dynamically check the sysfs
path at the time of zed event processing, and use the dynamic value if
possible. Note that there will be other times when we can not
dynamically detect the sysfs path (like if a disk disappears) and have
to rely on the old value for things like turning on the fault LED. That
is to say, we can't just blindly use the dynamic path in every case.
Also:
- Add enclosure sysfs entry when running 'zpool add'
- Fix 'slot' and 'enc' zpool.d scripts for nvme
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15462
Currently, zvol threading can be switched through the zvol_request_sync
module parameter system-wide. By making it a zvol property, zvol
threading can be switched per zvol.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#15409
ZFS prefetch is currently governed by the zfs_prefetch_disable
tunable. However, this is a module-wide settings - if a specific
dataset benefits from prefetch, while others have issue with it,
an optimal solution does not exists.
This commit introduce the "prefetch" tri-state property, which enable
granular control (at dataset/volume level) for prefetching.
This patch does not remove the zfs_prefetch_disable, which remains
a system-wide switch for enable/disable prefetch. However, to avoid
duplication, it would be preferable to deprecate and then remove
the module tunable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Co-authored-by: Gionatan Danti <g.danti@assyoma.it>
Closes#15237Closes#15436
This commit adds '-u' flag for zfs set operation. With this flag,
mountpoint, sharenfs and sharesmb properties can be updated
without actually mounting or sharing the dataset.
Previously, if dataset was unmounted, and mountpoint property was
updated, dataset was not mounted after the update. This behavior
is changed in #15240. We mount the dataset whenever mountpoint
property is updated, regardless if it's mounted or not.
To provide the user with option to keep the dataset unmounted and
still update the mountpoint without mounting the dataset, '-u'
flag can be used.
If any of mountpoint, sharenfs or sharesmb properties are updated
with '-u' flag, the property is set to desired value but the
operation to (re/un)mount and/or (re/un)share the dataset is not
performed and dataset remains as it was before.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15322
Have libzfs call a special `zfs_prepare_disk` script before a disk is
included into the pool. The user can edit this script to add things
like a disk firmware update or a disk health check. Use of the script
is totally optional. See the zfs_prepare_disk manpage for full details.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#15243
When failmode=continue is set and the pool suspends, both 'zpool status'
and the 'zfs/pool/state' kstat ignore it and report the normal vdev tree
state. There's no clear indicator that the pool is suspended. This is
unlike suspend in failmode=wait, or suspend due to MMP check failure,
which both report "SUSPENDED" explicitly.
This commit changes it so SUSPENDED is reported for failmode=continue
the same as for other modes.
Rationale:
The historical behaviour of failmode=continue is roughly, "press on as
though all is well". To this end, the fact that the pool had suspended
was not shown, to maintain the façade that all is well.
Its unclear why hiding this information was considered appropriate. One
possibility is that it was expected that a true pool fault would always
be reported as DEGRADED or FAULTED, and that the pool could not suspend
without these happening.
That is not necessarily true, as vdev health and suspend state are only
loosely connected, such that a pool in (apparent) good health can be
suspended for good reasons, and of course a degraded pool does not lead
to suspension. Even if that expectation were true, there's still a
difference in urgency - a degraded pool may not need to be attended to
for hours, while a suspended pool is most often unusable until an
operator intervenes.
An operator that has set failmode=continue has presumably done so
because their workload is one that can continue to operate in a useful
way when the pool suspends. In this case the operator still needs a
clear indicator that there is a problem that needs attending to.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes#15297
For sharesmb and sharenfs properties, the status of setting the
property is tied with whether we succeed to share the dataset or
not. In case sharing the dataset is not successful, this is
treated as overall failure of setting the property. In this case,
if we check the property after the failure, it is set to on.
This commit updates this behavior and the status of setting the
share properties is not returned as failure, when we fail to
share the dataset.
For sharenfs property, if access list is provided, the syntax
errors in access list/host adresses are not validated until after
setting the property during postfix phase while trying to
share the dataset. This is not correct, since the property has
already been set when we reach there.
Syntax errors in access list/host addresses are validated while
validating the property list, before setting the property and
failure is returned to user in this case when there are errors
in access list.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15240
There are some inconsistencies in the handling of mountpoint
property. This commit updates the behavior and makes it
consistent.
If mountpoint property is set when dataset is unmounted, this
would update the mountpoint property. The mountpoint could be
valid or invalid in this case. Setting the mountpoint property
would result in success in this case. Dataset would still be
unmounted here.
On the other hand, if dataset is mounted and mountpoint
property is updated to something invalid where mount cannot be
successful, for example, setting the mountpoint inside a readonly
directory. This would unmount the dataset, set the mountpoint
property to requested value and tries to mount the dataset. The
mount operation returns error and this error is treated as
overall failure of setting the property while the property is
actually set.
To make the behavior consistent in case dataset is mounted or
unmounted, we should try to mount the dataset whenever mountpoint
property is updated. This would result in mounting the datasets
if canmount property is set to on, regardless if the dataset was
previously unmounted.
The failure in mount operation while setting the mountpoint
property should not be treated as failure, since the property is
actually set now to user requested value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15240
For zpool import and zpool split, zpool_enable_datasets is called
to mount and share all datasets in a pool. If there is an error
while mounting or sharing any dataset in the pool, the status of
import or split is reported as failure. However, the changes do
show up in zpool list.
This commit updates the error reporting in zpool import and zpool
split path. More descriptive messages are shown to user in case
there is an error during mount or share. Errors in mount or share
do not effect the overall status of zpool import and zpool split.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes#15216
Currently redaction bookmarks and their associated redaction lists
have a relatively low limit of 36 redaction snapshots. This is imposed
by the number of snapshot GUIDs that fit in the bonus buffer of the
redaction list object. While this is more than enough for most use
cases, there are some limited cases where larger numbers would be
useful to support.
We tweak the redaction list creation code to use a spill block if
the number of redaction snapshots is above the amount that would fit
in the bonus buffer. We also make a small change to allow spill blocks
to be use for types of data besides SA. In order to fully leverage
this logic, we also change the redaction code to use vmem_alloc, to
handle extremely large allocations if needed. Finally, small tweaks
were made to the zfs commands and the test suite.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#15018
POSIX timers target the process, not the thread (as does SIGINFO),
so we need to block it in the main thread which will die if interrupted.
Ref: https://101010.pl/@ed1conf@bsd.network/110731819189629373
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#15113
When the vdev properties features was merged an extra check
was added in `spa_vdev_remove_top_check()` which checked
whether the vdev that we want to remove is already being
removed and if so return an EALREADY error.
```
static int
spa_vdev_remove_top_check(vdev_t *vd)
{
... <snip> ...
/*
* This device is already being removed
*/
if (vd->vdev_removing)
return (SET_ERROR(EALREADY));
```
Before that change we'd still fail with an error but it
was a more generic one - here is the check that failed
later in the same function:
```
/*
* There can not be a removal in progress.
*/
if (spa->spa_removing_phys.sr_state == DSS_SCANNING)
return (SET_ERROR(EBUSY));
```
Changing the error code returned from that function changed
the behavior of the removal's library interface exposed to
the userland - `spa_vdev_remove()` now returns `EZFS_UNKNOWN`
instead of `EZFS_EBUSY` that was returning before.
This patch adds logic to make `spa_vdev_remove()` mindful
of the new EALREADY code and propagating `EZFS_EBUSY`
reverting to the previously established semantics of that
function.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#15013Closes#15129
The previous code was checking zfs_is_namespace_prop() only for the
last property on the list. If one was not "namespace", then remount
wasn't called. To fix that move zfs_is_namespace_prop() inside the
loop and remount if at least one of properties was "namespace".
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#15000
Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain tulsi.jain@delphix.com
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#8995Closes#12355
zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.
So let's add zpool initialize -u to clear it.
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12451Closes#14873
When the special_small_blocks property is being set during a pool
create it enforces a limit of 128KiB even if the pool's record size
is larger.
If the recordsize property is being set during a pool create, then
use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes#13815Closes#14811
Usage:
zpool set org.freebsd:comment="this is my pool" poolname
Tests are based on zfs_set's user property tests.
Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes#11680
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.
Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2
Notably, this allows to get/set properties on the root vdev:
% zpool set user:prop=value <pool> root-0
Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):
% zpool set user:prop=value <pool> mirror-0
This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.
The following tests have been added:
- zpool get all properties from root vdev
- zpool set a property on root vdev
- verify root vdev ZAP is created
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes#14405