The only reason for spa_config_*() to use refcount instead of simple
non-atomic (thanks to scl_lock) variable for scl_count is tracking,
hard disabled for the last 8 years. Switch to simple int scl_count
reduces the lock hold time by avoiding atomic, plus makes structure
fit into single cache line, reducing the locks contention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12287
When scrubbing a raidz/draid pool, which contains a replacing or
sparing mirror with multiple online children, only one child will
be read. This is not normally a serious concern because the DTL
records are used to determine where a good copy of the data is.
As long as the data can be read from one child the mirror vdev
will use it to repair gaps in any of its children. Furthermore,
even if the data which was read is corrupt the raidz code will
detect this and issue its own repair I/O to correct the damage
in the mirror vdev.
However, in the scenario where the DTL is wrong due to silent
data corruption (say due to overwriting one child) and the scrub
happens to read from a child with good data, then the other damaged
mirror child will not be detected nor repaired.
While this is possible for both raidz and draid vdevs, it's most
pronounced when using draid. This is because by default the zed
will sequentially rebuild a draid pool to a distributed spare,
and the distributed spare half of the mirror is always preferred
since it delivers better performance. This means the damaged
half of the mirror will go undetected even after scrubbing.
For system administrations this behavior is non-intuitive and in
a worst case scenario could result in the only good copy of the
data being unknowingly detached from the mirror.
This change resolves the issue by reading all replacing/sparing
mirror children when scrubbing. When the BP isn't available for
verification, then compare the data buffers from each child. They
must all be identical, if not there's silent damage and an error
is returned to prompt the top-level vdev to issue a repair I/O to
rewrite the data on all of the mirror children. Since we can't
tell which child was wrong a checksum error is logged against the
replacing or sparing mirror vdev.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13555
Since we use two B-trees q_exts_by_size and q_exts_by_addr, we should
count 2x sizeof (range_seg_gap_t) per node. And since average B-tree
memory efficiency is about 75%, we should increase it to 3x.
Previous code under-counted up to 30% of the memory usage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13537
I genuinely don't know why this didn't come up before,
but adding the LZ4 early abort pointed out this flaw,
in which we're allocating a buffer of one size, and
then telling the compressor that we're handing it buffers
of a different size, which may be Very Different - say,
allocating 512b and then telling it the inputs are 128k.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#13375
It is typical, but not generally true that if log summary has more
blocks it must also have unflushed metaslabs. Normally with metaslabs
flushed in order it works, but there are known exceptions, such as
device removal or metaslab being loaded during its flush attempt.
Before 600a02b884 if spa_flush_metaslabs() hit loading metaslab it
usually stopped (unless memlimit is also exceeded), but now it may
flush more metaslabs, just skipping that particular one. This
increased chances of assertion to fire when the skipped metaslab is
flushed on next iteration if all other metaslabs in that summary
entry are already flushed out of order.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#13486Closes#13513
As of the Linux 5.19 kernel the readpage() address space operation
has been replaced by read_folio().
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure
erase functionality from the blkdev_issue_discard() function.
The blkdev_issue_secure_erase() must now be issued to issue
a secure erase.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
Linux 5.19 commit torvalds/linux@44abff2c0 removed the
blk_queue_secure_erase() helper function. The preferred
interface is to now use the bdev_max_secure_erase_sectors()
function to check for discard support.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
Linux 5.19 commit torvalds/linux@70200574cc removed the
blk_queue_discard() helper function. The preferred interface
is to now use the bdev_max_discard_sectors() function to check
for discard support.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
As for the Linux 5.18 kernel bio_alloc() expects a block_device struct
as an argument. This removes the need for the bio_set_dev() compatibility
code for 5.18 and newer kernels.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
Refcount creation for abd_zero_scatter->abd_children is redundant in
abd_alloc_zero_scatter, as it has been done in abd_init_struct.
In addition, abd_children is undefined when ZFS_DEBUG is disabled, the
reference of abd_children in abd_alloc_zero_scatter breaks build of
libzpool when ZFS_DEBUG is disabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Ping Huang <huangping@smartx.com>
Closes#13429
clang-15 emits the following error message for functions without
a prototype:
fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error:
a function declaration without a prototype is deprecated
in all versions of C [-Werror,-Wstrict-prototypes]
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aidan Harris <me@aidanharr.is>
Closes#13421
Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic()
inline helpers very indirectly include a reference to the GPL'd
array mmu_feature_keys[] and fails to build. Workaround this by
using copy_from_user() and throwing EFAULT for any calls to
__copy_from_user_inatomic(). This is a workaround until a fix
for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458
"powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes#11958Closes#12590Closes#13367
On some architectures ZERO_PAGE is unavailable because it references
a GPL exported symbol of empty_zero_page. Originally e08b993 removed
the call to PAGE_ZERO(0) for assignment to the abd_zero_page. However,
a simple check can be done to avoid a kernel allocation and free for
the abd_zero_page if ZERO_PAGE is available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#13199
This adds supports for hole-punching facilities in the FreeBSD kernel
starting from __FreeBSD_version 1400032.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ka Ho Ng <khng@FreeBSD.org>
Sponsored-by: The FreeBSD Foundation
Closes#12458
Holding a dbuf is a common operation which can become highly contended
in dbuf_find() when acquiring the dbuf hash mutex. This is particularly
true on Linux when reading/writing volumes since by default up to 32
threads from the zvol_taskq may be taking a hold of the same dbuf.
This should also be observable on FreeBSD as long as there are enough
processes accessing the volume concurrently.
This is further aggregrated by the fact that only the block id will
be unique when calculating the dbuf hash for a single volume. The
objset id, object id, and level will be the same for data blocks.
This has been observed to result in a somehwat less than uniform hash
distribution and a longer than expected max hash chain depth (~20)
on a large memory system (256 GB) using volumes.
This commit improves the siutation by switching the hash mutex to
an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS
from 2048 to 8192 to further reduce the odds of a hash collision.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13405
Clang 13.0.0 added support for `Wunused-but-set-parameter` and
`-Wunused-but-set-variable` which correctly detects two unused
variables in zstd resulting in a build failure. This commit
annotates these instances accordingly.
https://releases.llvm.org/13.0.1/tools/clang/docs/ReleaseNotes.html#id6
In FSE_createCTable(), malloc() is intentionally defined as NULL when
compiled in the kernel so the variable is unused.
zstd/lib/compress/fse_compress.c:307:12: error: variable 'size'
set but not used [-Werror,-Wunused-but-set-variable]
Additionally, in ZSTD_seqDecompressedSize() the assert is compiled
out similarly resulting in an unused variable.
zstd/lib/compress/zstd_compress_superblock.c:412:12: error: variable
'litLengthSum' set but not used [-Werror,-Wunused-but-set-variable]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The only zdb utility require to read metaslab-related data during
read-only pool import because of spacemaps validation. Add global
variable which will allow zdb read spacemaps in case of readonly
import mode.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes#9095Closes#12687
When using a Linux kernel which predates the iov_iter interface the
O_APPEND flag should be applied in zpl_aio_write() via the call to
generic_write_checks(). The updated pos variable was incorrectly
ignored resulting in the current offset being used.
This issue should only realistically impact the RHEL/CentOS 7.x
kernels which are based on Linux 3.10.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13370Closes#13377
In hypothetical case of non-linear ABD with single segment, multiple
to page size but not aligned to it, vdev_geom_fill_unmap_cb() could
fill one page less into bio_ma array.
I am not sure it is expoitable, but better to be safe than sorry.
Reported-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
(cherry picked from commit 5352f85cddce44e82fb1c4caec3b333e3666d7fd)
It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do
have differing semantics in some ways - in particular,
one requires KEEP_SIZE, and the other does not.
Also added a zero-range test to catch this, corrected a flaw
that made the punch-hole test succeed vacuously, and a typo
in file_write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#13329Closes#13338
As of the 5.17 kernel the GENHD_FL_EXT_DEVT flag has been removed
and the GENHD_FL_NO_PART_SCAN flag renamed GENHD_FL_NO_PART. Update
zvol_alloc() to set GENHD_FL_NO_PART for the newer kernels which
is sufficient. The behavior for prior kernels remains unchanged.
1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT")
46e7eac6 ("block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART")
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13294Closes#13297
FreeBSD's memory management system uses its own error numbers and gets
confused when these VOPs return EIO.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#13311
For legacy reasons, a couple of VOPs have to return error numbers that
don't come from the usual errno namespace. To handle the cases where
ZFS_ENTER or ZFS_VERIFY_ZP fail, we need to be able to override the
default error return value of EIO. Extend the macros to permit this.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes#13311
->readpages was removed and replaced by ->readahead. Define
zpl_readahead for kernels that don't have ->readpages.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Riccardo Schirone <rschirone91@gmail.com>
Closes#13278
blkdev.h includes genhd.h since dawn of upstream git, so this is
globally safe
Upstream-commit: 322cbb50de711814c42fb088f6d31901502c711a ("block:
remove genhd.h")
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13251
bio_alloc(gfp_t gfp_mask, unsigned short nr_iovecs)
became
bio_alloc(struct block_device *bdev, unsigned short nr_vecs,
unsigned int opf, gfp_t gfp_mask)
passing NULL/0 continues previous behaviour
Upstream-commit: 07888c665b405b1cd3577ddebfeb74f4717a84c4 ("block:
pass a block_device and opf to bio_alloc")
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13251
NDF_ONLY_PNBUF has been removed from FreeBSD in favor of NDFREE_PNBUF.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#13277
Lustre makes light use of the zfs_refcount interfaces which
isn't a problem when using a non-debug build of OpenZFS. However,
when debugging is enabled the required symbols are not exported.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12613
Strict hole reporting was previously disabled by default as a
performance optimization. However, this has lead to confusion
over the expected behavior and a variety of workarounds being
adopted by consumers of ZFS. Change the default behavior to
always report holes and force the TXG sync.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Upstream-commit: 05b3eb6d23
Ref: #13261Closes#12746
This PR changes ZFS ACL checks to evaluate
fsuid / fsgid rather than euid / egid to avoid
accidentally granting elevated permissions to
NFS clients.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Andrew Walker <awalker@ixsystems.com>
Co-authored-by: Ryan Moeller <freqlabs@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#13221
At shutdown time, we drain all of the zevents and set the
ZEVENT_SHUTDOWN flag. On FreeBSD, we may end up calling
zfs_zevent_destroy() after the zevent_lock has been destroyed while
the sysevent thread is winding down; we observe ESHUTDOWN, then back
out.
Events have already been drained, so just inline the kmem_free call in
sysevent_worker() to avoid the race, and document the assumption that
zfs_zevent_destroy doesn't do anything else useful at that point.
This fixes a panic that can occur at module unload time.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes#13220
This is a direct commit to zfs-2.1-release to fix release builds that
error out on an unused variable. The issue is avoided on master by a
huge series of commits that change how the ASSERT macros work, but that
is not feasible to backport.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#13194Closes#13196