The same change has already been done for domount(). On macOS platform
we need to have access to zhp to handle devdisks and snapshots.
Also, symmetry is pleasing.
In addition, the code in zpool_disable_datasets which sorts the
mountpoints did not sort the related handle, which meant that the
mountpoint, and the handle that it is paired with, was lost.
You'd get a random handle with the mountpoint.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12296
In absence of LTO, and dynamic libatomic, la.so ends up in the needs
section of every toolchain executable; some consider this an issue.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12345Closes#12359
* Tinker with slop space accounting with dedup
Do not include the deduplicated space usage in the slop space
reservation, it leads to surprising outcomes.
* Update spa_dedup_dspace sometimes
Sometimes, we get into spa_get_slop_space() with
spa_dedup_dspace=~0ULL, AKA "unset", while spa_dspace is correctly set.
So call the code to update it before we use it if we hit that case.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12271
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12279
- Remove the "SPL Version" line, the repositories have been merged
since the 0.8 release and we no longer need to ask about this.
- Simply ask for the kernel version / patch level and add a hint
about how to get this information on Linux and FreeBSD.
- Remove "Status: Triage Needed" from the template, in practice
we really haven't been using this label so let's step setting it.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #12340
Callers of zfs_file_get and zfs_file_put can corrupt the reference
counts for the file structure resulting in a panic or a soft lockup.
When zfs send/recv runs, it will add a reference count to the
open file, and begin to send or recv the stream. If the file descriptor
is closed, then when dmu_recv_stream() or dmu_send() return we will
call zfs_file_put to remove the reference we placed on the file
structure. Unfortunately, because zfs_file_put() uses the file
descriptor to lookup the file structure, it may end up finding that
the file descriptor table no longer contains the file struct, thus
leaking the file structure. Or it might end up finding a file
descriptor for a different file and blindly updating its reference
counts. Other failure modes probably exists.
This change reworks the zfs_file_[get|put] interface to not rely
on the file descriptor but instead pass the zfs_file_t pointer around.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-issue: DLPX-76119
Closes#12299
Missed a couple of strcpy() in earlier commit, this is only used with
--enable-debug.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12311
Could have gone either way with this one, either adding it to
macOS/Windows SPL, or returning it to "classic" usage with strrchr().
Since the new special way isn't really used, and only used once,
we have this commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12312
Many FreeBSD disk drivers support "unmapped" I/O mode, when data
buffer represented not with a virtually contiguous KVA-mapped address
range, but with a list of physical memory pages. Originally it was
designed to do I/O from buffers without KVA mapping (unmapped). But
moving virtual addresses out of equation allows us to operate even
non-contiguous data buffers with one condition: all buffer discon-
tinuities must be aligned to memory page borders.
Doing I/O to capable GEOM device this patch traverses through non-
linear ABD buffers, validating the chunks borders. If the condition
is met, it supplies GEOM with the list of original physical memory
pages instead of copying the data into temporary contiguous buffer.
On capable hardware on pools with ashift=12 and default ABD chunk of
4KB it should handle all the I/O without additional memory copying.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#12320
It makes no sense to set it below PAGE_SIZE, since it increases all
overheads and makes returning memory to OS problematic. It makes no
sense to set it above PAGE_SIZE, since such allocations and especially
frees are too expensive and cause KVA fragmentation to benefit from
fewer chunks. After that it makes no sense to keep more complicated
math here.
What may have sense though is just a tunable border between linear and
scatter ABDs, previously also controlled by this tunable. Retain that
functionality by taking abd_scatter_min_size tunable from Linux, just
with different default value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#12328
This dramatically reduces the lock contention on systems with slower
(non-TSC) timecounters. With TSC the difference is minimal, but since
this lock is pretty congested, any improvement counts. Plus I don't
see any reason to do it under the lock other than the latency of the
lock itself, which this change actually reduces.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12281
This is a potentially arguable change, because it removes some
compatibility cruft that certain systems or people may have come to rely
on (either a very long time ago, or unwisely in recent times).
On the other hand, it's been literally over a decade since OpenZFS
switched to the strategy of using opaque numbered /dev/zd* device nodes,
with the canonical zvol access path being a directory tree of symlinks
created by udev rules inside /dev/zvol/*. (See #102.) Even at the time,
the /dev/* scheme was labeled as being for "compatibility".
This commit removes the second tree of symlinks located directly at
/dev/*, under the assumption that anybody with any sense has been using
the intended /dev/zvol/* path for a very very long time now.
(The more I think about this, the more I anticipate that some large
fraction of people will have been blissfully unaware that the intention
has been for them to use the /dev/zvol/* tree all along, and they will
have come to rely upon the /dev/* tree simply because it's been there
this whole time despite being a compat thing.)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12303
Currently, there are several places in zvol_id where the program logic
returns particular errno values, or even particular ioctl return values,
as the program exit status, rather than a straightforward system of
explicit zero on success and explicit nonzero value(s) on failure.
This is problematic for multiple reasons. One particularly interesting
problem that can arise, is that if any of these values happens to have
all 8 least significant bits unset (i.e., it is a positive or negative
multiple of 256), then although the C program sees a nonzero int value
(presumed to be a failure exit status), the actual exit status as seen
by the system is only the bottom 8 bits of that integer: zero.
This can happen in practice, and I have encountered it myself. In a
particularly weird situation, the zvol_open code in the zfs kernel
module was behaving in such a manner that it caused the open() syscall
to fail and for errno to be set to a kernel-private value (ERESTARTSYS,
which happens to be defined as 512). It turns out that 512 is evenly
divisible by 256; or, in other words, its least significant 8 bits are
all-zero. So even though zvol_id believed it was returning a nonzero
(failure) exit status of 512, the system modulo'd that value by 256,
resulting in the actual exit status visible by other programs being 0!
This actually-zero (non-failure) exit status caused problems: udev
believed that the program was operating successfully, when in fact it
was attempting to indicate failure via a nonzero exit status integer.
Combined with another problem, this led to the creation of nonsense
symlinks for zvol dev nodes by udev.
Let's get rid of all this problematic logic, and simply return
EXIT_SUCCESS (0) is everything went fine, and EXIT_FAILURE (1) if
anything went wrong.
Additionally, let's clarify some of the variable names (error is similar
to errno, etc) and clean up the overall program flow a bit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
The zvol_id program is invoked by udev, via a PROGRAM key in the
60-zvol.rules.in rule file, to determine the "pretty" /dev/zvol/*
symlink paths paths that should be generated for each opaquely named
/dev/zd* dev node.
The udev rule uses the PROGRAM key, followed by a SYMLINK+= assignment
containing the %c substitution, to collect the program's stdout and then
"paste" it directly into the name of the symlink(s) to be created.
Unfortunately, as currently written, zvol_id outputs both its intended
output (a single string representing the symlink path that should be
created to refer to the name of the dataset whose /dev/zd* path is
given) AND its error messages (if any) to stdout.
When processing PROGRAM keys (and others, such as IMPORT{program}), udev
uses only the data written to stdout for functional purposes. Any data
written to stderr is used solely for the purposes of logging (if udev's
log_level is set to debug).
The unintended consequence of this is as follows: if zvol_id encounters
an error condition; and then udev fails to halt processing of the
current rule (either because zvol_id didn't return a nonzero exit
status, or because the PROGRAM key in the rule wasn't written properly
to result in a "non-match" condition that would stop the current rule on
a nonzero exit); then udev will create a space-delimited list of symlink
names derived directly from the words of the error message string!
I've observed this exact behavior on my own system, in a situation where
the open() syscall on /dev/zd* dev nodes was failing sporadically (for
reasons that aren't especially relevant here). Because the open() call
failed, zvol_id printed "Unable to open device file: /dev/zd736\n" to
stdout and then exited.
The udev rule finished with SYMLINK+="zvol/%c %c". Assuming a volume
name like pool/foo/bar, this would ordinarily expand to
SYMLINK+="zvol/pool/foo/bar pool/foo/bar"
and would cause symlinks to be created like this:
/dev/zvol/pool/foo/bar -> /dev/zd736
/dev/pool/foo/bar -> /dev/zd736
But because of the combination of error messages being printed to
stdout, and the udev syntax freely accepting a space-delimited sequence
of names in this context, the error message string
"Unable to open device file: /dev/zd736\n"
in reality expanded to
SYMLINK+="zvol/Unable to open device file: /dev/zd736"
which caused the following symlinks to actually be created:
/dev/zvol/Unable -> /dev/zd736
/dev/to -> /dev/zd736
/dev/open -> /dev/zd736
/dev/device -> /dev/zd736
/dev/file: -> /dev/zd736
/dev//dev/zd736 -> /dev/zd736
(And, because multiple zvols had open() syscall errors, multiple zvols
attempted to claim several of those symlink names, resulting in numerous
udev errors and timeouts and general chaos.)
This commit rectifies all this silliness by simply printing error
messages to stderr, as Dennis Ritchie originally intended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
Assignment syntax (=) can be used for the PROGRAM key. But the PROGRAM
key is really a match key, not an assign key. The internal logic used by
udev to decide whether a PROGRAM key "matched" or not (which determines
whether the remainder of the rule is evaluated) depends on whether the
operator was OP_MATCH (==) or OP_NOMATCH (!=). [1]
The man page claims that '"=", ":=", and "+=" have the same effect as
"=="' for PROGRAM keys. And, after a brief perusal, the udev source code
does seem to confirm that operators other than OP_MATCH (==) or
OP_NOMATCH (!=) are implicitly converted to OP_MATCH (==). [2] But it's
not entirely clear that this is definitely the case: anecdotal testing
seems to indicate that when OP_ASSIGN (=) is used, the program's exit
status is disregarded and the remainder of the rule is processed
regardless of whether it was, in fact, a successful exit.
The bottom line here is that, if zvol_id hits some snag and returns a
nonzero exit status, then we almost certainly do NOT want to continue on
with the rule and use whatever the stdout contents may have been to
mindlessly create /dev/zvol/* symlinks. Therefore, let's be extra-sure
and use the match (==) operator explicitly, to eliminate any possibility
that udev might do the wrong thing, and ensure that a nonzero exit
status will definitely short-circuit the rest of the rule, bypassing the
SYMLINK+= assignments.
[1]
udev,
file src/udev/udev-rules.c,
func udev_rule_apply_token_to_event,
switch case TK_M_PROGRAM if r != 0 (nonzero exit status):
return token->op == OP_NOMATCH;
switch case TK_M_PROGRAM if r == 0 (zero exit status):
return token->op == OP_MATCH;
func retval 0 => key is considered to have matched
func retval 1 => key is considered to have NOT matched
[2]
udev,
file src/udev/udev-rules.c,
func parse_token,
at func start:
bool is_match = IN_SET(op, OP_MATCH, OP_NOMATCH);
in else-if case streq(key, "PROGRAM"):
if (!is_match) op = OP_MATCH;
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
The $tempnode substitution is so old that it's not even mentioned in the
man page anymore. It is still technically supported by udev, but with
plenty of "deprecated" comments surrounding it.
The preferred modern equivalent of $tempnode is $devnode (or
alternatively, %N).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
This file is old as dirt. It's entirely possible that commas were
optional in udev back at that time. But they're definitely supposed to
be there nowadays.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
This field is used only by illumos mdb. On other platforms it only
increases the struct size from 32 to 40 bytes. For struct vdev_queue
including 13 instances of avl_tree_t size means active cache lines.
Keep the padding in user-space for now to not break the ABI.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12290
With default dbuf cache size of 1/32 of ARC, it makes no sense to have
hash table of the same size (or even bigger on Linux). Reduce it to
1/8 of ARC's one, still leaving some slack, assuming higher I/O rate
via dbuf cache than via ARC.
Remove padding from ARC hash locks array. The idea behind padding
is to avoid false sharing between locks. It would have sense if
there would be a limited number of very busy locks. But since we
have no limit on the number, using the same memory for more locks we
can achieve even lower lock contention with the same false sharing,
or we can use less memory for the same contention level.
Reduce number of hash locks from 8192 to 2048. The number is still
big enough to not cause contention, but reduced memory size improves
cache hit rate for mutex_tryenter() in ARC eviction thread, saving
about 1% of the thread time.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12289
Fix a leak of abd_t that manifested mostly when using
raidzN with at least as many columns as N (e.g. a
four-disk raidz2 but not a three-disk raidz2).
Sufficiently heavy raidz use would eventually run a system
out of memory.
Additionally:
* Switch abd_cache arena to FIRSTFIT, which empirically
improves perofrmance.
* Make abd_chunk_cache more performant and debuggable.
* Allocate the abd_zero_buf from abd_chunk_cache rather
than the heap.
* Don't try to reap non-existent qcaches in abd_cache arena.
* KM_PUSHPAGE->KM_SLEEP when allocating chunks from their
own arena
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: Sean Doran <smd@use.net>
Closes#12295
dmu_zfetch_stream_fini() is missing calls to destroy the refcounts,
leaking them and the mutex inside.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12294
Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in
original code. Extra parameter "txg" is added for txg_kick(), thus it
knows which txg to kick. Also txg_kick() call is moved from
dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can
know the txg number assigned for txg_kick().
Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is
also cleaned up.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Closes#12274
The only reason for spa_config_*() to use refcount instead of simple
non-atomic (thanks to scl_lock) variable for scl_count is tracking,
hard disabled for the last 8 years. Switch to simple int scl_count
reduces the lock hold time by avoiding atomic, plus makes structure
fit into single cache line, reducing the locks contention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12287
This enables ZED to auto-online vdevs that are not wholedisk managed by
ZFS.
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Increase the Linux-Maximum version in the META file to 5.13.
All of the required compatibility patches have been merged
and the 5.13 kernel has been officially released.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit 6fc3099 broke the quoting when invoking the mail program, revert
that change.
Signed-off-by: Laurențiu Nicola <lnicola@dend.ro>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
LLVM's Polly (ISL to be precise) is unhappy with the loop from
ddt_stat_add():
CC [M] fs/zfs/zfs/ddt.o
../lib/External/isl/isl_schedule_node.c:2470: cannot insert node
between set or sequence node and its filter children
(building with the custom patch which adds Polly support to Kbuild)
The mentioned loop is rather suboptimal. All that we need is to just
treat ddt_stat_t as an array of u64 and perform 1:1 addition or
substraction. This can be done in simpler for-loop with the
determined index and bounds. Compiler will expand d_end - d into
a number of ddt_stat_t fields at compile time.
This prevents Polly from failing on this file.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes#12253
The number of sublists in a multilist is relatively small. We dont need
64 bits to calculate an index. 32 bits is sufficient and makes the
code more efficient.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12288
plymouth --command splits the command on spaces which means
that zfs-load-key was getting the filesystem name enclosed
in single quotes (since 13c59bb76) and failing. This commit
fixes it by piping the password directly to the command
similar to how it's done in other scripts (initramfs,
dracut without plymouth).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michal Vasilek <michal@vasilek.cz>
Related-to: #9193
Related-to: #9202Closes#12147
The stock zstd code expects some helpers from ASAN if present.
This works fine in userland, but in kernel, KASAN also gets detected,
and lacks those helpers. So let's make some empty substitutes for
that case.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12232
While abd_verify() does nothing when built without debug, compiler
can't optimize it out by itself due to calls to external list_*()
and abd_verify_scatter(). This commit makes it explicit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12280
prng32_bounded() is available to kernel only on FreeBSD 13+.
Call inline random_get_pseudo_bytes() with correct pointer type.
To be consistent, apply to Linux as well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes#12282
Unlike most other properties the 'compatibility' property is stored
in the pool config object and not the DMU_OT_POOL_PROPS object.
This had the advantage that the compatibility information is available
without needing to fully import the pool (it can be read with zdb).
However, this means we need to make sure to update both the copy of
the config in the MOS and the cache file. This wasn't being done.
This commit adds a call to spa_async_request() to ensure the copy of
the config in the cache file gets updated as well as the one stored
in the pool. This same change is made for the 'comment' property
which suffers from the same inconsistency.
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Colm Buckley <colm@tuatha.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12261Closes#12276
A couple flags weren't being copied in the case where we're doing size
estimation on a resume.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes: #12266
According to current zfs man page zfs_metaslab_mem_limit should be
25 instead of 75.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: jumbi77@users.noreply.github.comCloses#12273
zstreamdump was replaced with "zstream dump"; let's stop using the
old name, compat symlink or no.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12277
Libera have made a webchat client available. This change builds on #12127.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jonathon Fernyhough <jonathon@m2x.dev>
Closes#12251
Compiling with gcc 11.1.0 produces three new warnings.
Change the code slightly to avoid them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#12130Closes#12188Closes#12237
The receive-o-x_props_override test case reliably fails on the
FreeBSD main builders (but not on Linux), until the root cause is
understood add this test to the FreeBSD exception list.
On Linux the alloc_class_012_pos test case may occasionally fail.
This is a known false positive which has also been added to the
Linux exception list until the test can be made entirely reliable.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12272
ZFS loves using %llu for uint64_t, but that requires a cast to not
be noisy - which is even done in many, though not all, places.
Also a couple places used %u for uint64_t, which were promoted
to %llu.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12233
This reverts commit 13fac09868.
Per the discussion in #11531, the reverted commit---which intended only
to be a cleanup commit---introduced a subtle, unintended change in
behavior.
Care was taken to partially revert and then reapply 10b3c7f5e4
which would otherwise have caused a conflict. These changes were
squashed in to this commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Suggested-by: @chrisrd
Suggested-by: robn@despairlabs.com
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes#11531Closes#12227
In all places except two spa_get_random() is used for small values,
and the consumers do not require well seeded high quality values.
Switch those two exceptions directly to random_get_pseudo_bytes()
and optimize spa_get_random(), renaming it to random_in_range(),
since it is not related to SPA or ZFS in general.
On FreeBSD directly map random_in_range() to new prng32_bounded() KPI
added in FreeBSD 13. On Linux and in user-space just reduce the type
used to uint32_t to avoid more expensive 64bit division.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12183
Increase the Linux-Maximum version in the META file to 5.12.
All of the required compatibility patches have been merged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
wmsum was designed exactly for cases like these with many updates
and rare reads. It allows to completely avoid atomic operations on
congested global variables.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12172
In case we have I/O and try to remove an L2ARC device a deadlock might
occur. arc_read()->zio_read()->zfs_blkptr_verify() waits for SCL_VDEV
to be dropped while holding the hash_lock. However, spa_l2cache_load()
holds SCL_ALL and waits for the hash_lock in l2arc_evict().
Fix this by moving zfs_blkptr_verify() to the top top arc_read() before
the hash_lock is taken. Verify the block pointer and return a checksum
error if damaged rather than halting the system, by using
BLK_VERIFY_LOG instead of BLK_VERIFY_HALT.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#12054
It turns out that symlinks are heavily used on Linux in /dev/disk.
So let's allow importing from them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12238
Turns out $ZPOOL_IMPORT_OPTS expands in a shell-like fashion,
yielding 'import' '-aN' '-o' 'cachefile=none' for an unset variable,
and 'import' '-aN' '-o' 'cachefile=none' 'word1' 'word2' for a
white-spaced one, but ${ZPOOL_IMPORT_OPTS} expands like "${Z_I_O}"
would in a shell, yielding 'import' '-aN' '-o' 'cachefile=none' ''
(empty) and 'import' '-aN' '-o' 'cachefile=none' 'word1 word2' (spaced)
Fixes eec5ba113e "dracut: 90zfs: respect
zfs_force=1 on systemd systems"
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes: #12231
vdev_draid_min_asize() returns the minimum size of a child vdev. This
is used when determining if a disk is big enough to replace a child.
It's also used by zdb to determine how big of a child to make to test
replacement.
vdev_draid_min_asize() says that the child’s asize has to be at least
1/Nth of the entire draid’s asize, which is the same logic as raidz.
However, this contradicts the code in vdev_draid_open(), which
calculates the draid’s asize based on a reduced child size:
An additional 32MB of scratch space is reserved at the end of each
child for use by the dRAID expansion feature
So the problem is that you can replace a draid disk with one that’s
vdev_draid_min_asize(), but it actually needs to be larger to accommodate
the additional 32MB. The replacement is allowed and everything works at
first (since the reserved space is at the end, and we don’t try to use
it yet), but when you try to close and reopen the pool,
vdev_draid_open() calculates a smaller asize for the draid, because of
the smaller leaf, which is not allowed.
I think the confusion is that vdev_draid_min_asize() is correctly
returning the amount of required *allocatable* space in a leaf, but the
actual *size* of the leaf needs to be at least 32MB more than that.
ztest_vdev_attach_detach() assumes that it can attach that size of
device, and it actually can (the kernel/libzpool accepts it), but it
then later causes zdb to not be able to open the pool.
This commit changes vdev_draid_min_asize() to return the required size
of the leaf, not the size that draid will make available to the metaslab
allocator.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#11459Closes#12221
In zfs_znode_alloc we always hash inodes. If the
znode is unlinked, we do not need to hash it. This
fixes the problem where zfs_suspend_fs is doing zrele
(iput) in an async fashion, and zfs_resume_fs unlinked
drain processing will try to hash an inode that could
still be hashed, resulting in a panic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#9741Closes#11223Closes#11648Closes#12210