d_alias may need to be converted to du.d_alias
depending on the kernel version.
d_alias is currently in only one place in the code which
changes
"hlist_for_each_entry(dentry, &inode->i_dentry, d_alias)"
to
"hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias)"
as neccesary.
This effectively results in a double macro expansion
for code that uses the zfs headers but already has its
own macro for just d_alias (lustre in this case).
Remove the conditional code for hlist_for_each_entry
and have a macro for "d_alias -> du.d_alias" instead.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gian-Carlo DeFazio <defazio1@llnl.gov>
Closes#14377
The bi_rw member of struct bio was renamed to bi_opf in Linux 6.2.
As well, Linux's implementation of bio_set_op_attrs(...) has been
removed.
The HAVE_BIO_BI_OPF macro already appears to be defined, but the
removal of the bio_set_op_attrs(...) implementation makes the build
fall back on the locally-defined implementation, which isn't updated
for the bio->bi_opf change. This commit adds that update.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#14324Closes#14331
Linux 6.2 renamed the get_acl() operation to get_inode_acl() in
the inode_operations struct. This should fix Issue #14323.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#14323Closes#14331
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14243
ARC code was many times significantly modified over the years, that
created significant amount of tangled and potentially broken code.
This should make arc_access()/arc_read() code some more readable.
- Decouple prefetch status tracking from b_refcnt. It made sense
originally, but became highly cryptic over the years. Move all the
logic into arc_access(). While there, clean up and comment state
transitions in arc_access(). Some transitions were weird IMO.
- Unify arc_access() calls to arc_read() instead of sometimes calling
it from arc_read_done(). To avoid extra state changes and checks add
one more b_refcnt for ARC_FLAG_IO_IN_PROGRESS.
- Reimplement ARC_FLAG_WAIT in case of ARC_FLAG_IO_IN_PROGRESS with
the same callback mechanism to not falsely account them as hits. Count
those as "iohits", an intermediate between "hits" and "misses". While
there, call read callbacks in original request order, that should be
good for fairness and random speculations/allocations/aggregations.
- Introduce additional statistic counters for prefetch, accounting
predictive vs prescient and hits vs iohits vs misses.
- Remove hash_lock argument from functions not needing it.
- Remove ARC_FLAG_PREDICTIVE_PREFETCH, since it should be opposite
to ARC_FLAG_PRESCIENT_PREFETCH if ARC_FLAG_PREFETCH is set. We may
wish to add ARC_FLAG_PRESCIENT_PREFETCH to few more places.
- Fix few false positive tests found in the process.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14123
We do a simple ifdef to avoid calling enable_kernel_spe()/
disable_kernel_spe() on PowerPC.
Reported-by: Rich Ercolani <Rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14233Closes#14244
Linux defaults to setting "failfast" on BIOs, so that the OS will not
retry IOs that fail, and instead report the error to ZFS.
In some cases, such as errors reported by the HBA driver, not
the device itself, we would wish to retry rather than generating
vdev errors in ZFS. This new property allows that.
This introduces a per vdev option to disable the failfast option.
This also introduces a global module parameter to define the failfast
mask value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Seagate Technology LLC
Submitted-by: Klara, Inc.
Closes#14056
Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping
infrastructure to support idmapped mounts of filesystems mounted
with an idmapping". Update the OpenZFS accordingly to improve the
idmapped mount support.
This pull request contains the following changes:
- xattr setter functions are fixed to take mnt_ns argument. Without
this, cp -p would fail for an idmapped mount in a user namespace.
- idmap_util is enhanced/fixed for its use in a user ns context.
- One test case added to test idmapped mount in a user ns.
Reviewed-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#14097
Nothing consumes this definition so stop defining it.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14128
Check __riscv_xlen == 64 rather than _LP64 and define _LP64 if missing.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14128
`snprintf()` is meant to protect against buffer overflows, but operating
on the buffer using its return value, possibly by calling it again, can
cause a buffer overflow, because it will return how many characters it
would have written if it had enough space even when it did not. In a
number of places, we repeatedly call snprintf() by successively
incrementing a buffer offset and decrementing a buffer length, by its
return value. This is a potentially unsafe usage of `snprintf()`
whenever the buffer length is reached. CodeQL complained about this.
To fix this, we introduce `kmem_scnprintf()`, which will return 0 when
the buffer is zero or the number of written characters, minus 1 to
exclude the NULL character, when the buffer was too small. In all other
cases, it behaves like snprintf(). The name is inspired by the Linux and
XNU kernels' `scnprintf()`. The implementation was written before I
thought to look at `scnprintf()` and had a good name for it, but it
turned out to have identical semantics to the Linux kernel version.
That lead to the name, `kmem_scnprintf()`.
CodeQL only catches this issue in loops, so repeated use of snprintf()
outside of a loop was not caught. As a result, a thorough audit of the
codebase was done to examine all instances of `snprintf()` usage for
potential problems and a few were caught. Fixes for them are included in
this patch.
Unfortunately, ZED is one of the places where `snprintf()` is
potentially used incorrectly. Since using `kmem_scnprintf()` in it would
require changing how it is linked, we modify its usage to make it safe,
no matter what buffer length is used. In addition, there was a bug in
the use of the return value where the NULL format character was not
being written by pwrite(). That has been fixed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14098
The previous version reported all the right info, but the VERIFY3 name
made a little more confusing when looking for the matching location in
the source code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob N ★ <robn@despairlabs.com>
Closes#14099
Implement support for Linux's RENAME_* flags (for renameat2). Aside from
being quite useful for userspace (providing race-free ways to exchange
paths and implement mv --no-clobber), they are used by overlayfs and are
thus required in order to use overlayfs-on-ZFS.
In order for us to represent the new renameat2(2) flags in the ZIL, we
create two new transaction types for the two flags which need
transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT).
RENAME_NOREPLACE does not need any ZIL support because we know that if
the operation succeeded before creating the ZIL entry, there was no file
to be clobbered and thus it can be treated as a regular TX_RENAME.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#12209Closes#14070
This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support
for ZoL, but the changes here allow for far nicer fallbacks than the
previous implementation (the source and target are re-linked in case of
the final link failing).
In addition, a small cleanup was done for the "target exists but is a
different type" codepath so that it's more understandable.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#12209Closes#14070
This allows for much cleaner VERIFY-level assertions.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#14070
Open files, which aren't present in the snapshot, which is being
roll-backed to, need to disappear from the visible VFS image of
the dataset.
Kernel provides d_drop function to drop invalid entry from
the dcache, but inode can be referenced by dentry multiple dentries.
The introduced zpl_d_drop_aliases function walks and invalidates
all aliases of an inode.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#9600Closes#14070
Follow up for 4938d01d which changed zio_flag from enum to uint64_t.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#14100
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes#14086
The use of __noreturn__ in 55d7afa4ad on
spl_panic() caused objtool warnings on Linux when the kernel is built
with CONFIG_STACK_VALIDATION=y. This patch works around that by
restricting the application of __noreturn__ to builds for static
analyzers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14068
Adds support for idmapped mounts. Supported as of Linux 5.12 this
functionality allows user and group IDs to be remapped without changing
their state on disk. This can be useful for portable home directories
and a variety of container related use cases.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes#12923Closes#13671
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13984Closes#14004
On older kernels, the definition for `module_param_call()` typecasts
function pointers to `(void *)`, which triggers -Werror, causing the
check to return false when it should return true.
Fixing this breaks the build process on some older kernels because they
define a `__check_old_set_param()` function in their headers that checks
for a non-constified `->set()`. We workaround that through the c
preprocessor by defining `__check_old_set_param(set)` to `(set)`, which
prevents the build failures.
However, it is now apparent that all kernels that we support have
adopted the GRSecurity change, so there is no need to have an explicit
autotools check for it anymore. We therefore remove the autotools check,
while adding the workaround to our headers for the build time
non-constified `->set()` check done by older kernel headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13984Closes#14004
Both Clang's Static Analyzer and Synopsys' Coverity would ignore
assertions. Following Clang's advice, we annotate our assertions:
https://clang-analyzer.llvm.org/annotations.html#custom_assertions
This makes both Clang's Static Analyzer and Coverity properly identify
assertions. This change reduced Clang's reported defects from 246 to
180. It also reduced the false positives reported by Coverityi by 10,
while enabling Coverity to find 9 more defects that previously were
false negatives.
A couple examples of this would be CID-1524417 and CID-1524423. After
submitting a build to coverity with the modified assertions, CID-1524417
disappeared while the report for CID-1524423 no longer claimed that the
assertion tripped.
Coincidentally, it turns out that it is possible to more accurately
annotate our headers than the Coverity modelling file permits in the
case of format strings. Since we can do that and this patch annotates
headers whenever `__coverity_panic__()` would have been used in the
model file, we drop all models that use `__coverity_panic__()` from the
model file.
Upon seeing the success in eliminating false positives involving
assertions, it occurred to me that we could also modify our headers to
eliminate coverity's false positives involving byte swaps. We now have
coverity specific byteswap macros, that do nothing, to disable
Coverity's false positives when we do byte swaps. This allowed us to
also drop the byteswap definitions from the model file.
Lastly, a model file update has been done beyond the mentioned
deletions:
* The definitions of `umem_alloc_aligned()`, `umem_alloc()` andi
`umem_zalloc()` were originally implemented in a way that was
intended to inform coverity that when KM_SLEEP has been passed these
functions, they do not return NULL. A small error in how this was
done was found, so we correct it.
* Definitions for umem_cache_alloc() and umem_cache_free() have been
added.
In practice, no false positives were avoided by making these changes,
but in the interest of correctness from future coverity builds, we make
them anyway.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13902
ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event. This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#13797
Coverity found a bug in `zfs_secpolicy_create_clone()` where it is
possible for us to pass an unterminated string when `zfs_get_parent()`
returns an error. Upon inspection, it is clear that using `strlcpy()`
would have avoided this issue.
Looking at the codebase, there are a number of other uses of `strncpy()`
that are unsafe and even when it is used safely, switching to
`strlcpy()` would make the code more readable. Therefore, we switch all
instances where we use `strncpy()` to use `strlcpy()`.
Unfortunately, we do not portably have access to `strlcpy()` in
tests/zfs-tests/cmd/zfs_diff-socket.c because it does not link to
libspl. Modifying the appropriate Makefile.am to try to link to it
resulted in an error from the naming choice used in the file. Trying to
disable the check on the file did not work on FreeBSD because Clang
ignores `#undef` when a definition is provided by `-Dstrncpy(...)=...`.
We workaround that by explictly including the C file from libspl into
the test. This makes things build correctly everywhere.
We add a deprecation warning to `config/Rules.am` and suppress it on the
remaining `strncpy()` usage. `strlcpy()` is not portably avaliable in
tests/zfs-tests/cmd/zfs_diff-socket.c, so we use `snprintf()` there as a
substitute.
This patch does not tackle the related problem of `strcpy()`, which is
even less safe. Thankfully, a quick inspection found that it is used far
more correctly than strncpy() was used. A quick inspection did not find
any problems with `strcpy()` usage outside of zhack, but it should be
said that I only checked around 90% of them.
Lastly, some of the fields in kstat_t varied in size by 1 depending on
whether they were in userspace or in the kernel. The origin of this
discrepancy appears to be 04a479f706 where
it was made for no apparent reason. It conflicts with the comment on
KSTAT_STRLEN, so we shrink the kernel field sizes to match the userspace
field sizes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13876
There were never any users and it so happens the operation is not even
supported by rrm locks -- the macros were wrong for Linux and FreeBSD
when not using it's RMS locks.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#13906
Provides the missing full barrier variant to the membar primitive set.
While not used right now, this is probably going to change down the
road.
Name taken from Solaris, to follow the existing routines.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#13907
Add needed cpu feature tests for powerpc architecture.
Overview:
zfs_altivec_available() - needed by RAID-Z
zfs_vsx_available() - needed by BLAKE3
zfs_isa207_available() - needed by SHA2
Part 1 - Userspace
- use getauxval() for Linux and elf_aux_info() for FreeBSD
- direct including <sys/auxv.h> fails with double definitions
- so we self define the needed functions and definitions
Part 2 - Kernel space FreeBSD
- use exported cpu_features of <powerpc/cpu.h>
Part 3 - Kernel space Linux
- use cpu_has_feature() function of <asm/cpufeature.h>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13725
Replace ZFS_ENTER and ZFS_VERIFY_ZP, which have hidden returns, with
functions that return error code. The reason we want to do this is
because hidden returns are not obvious and had caused some missing fail
path unwinding.
This patch changes the common, linux, and freebsd parts. Also fixes
fail path unwinding in zfs_fsync, zpl_fsync, zpl_xattr_{list,get,set}, and
zfs_lookup().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#13831
We inherited membar_consumer() and membar_producer() from OpenSolaris,
but we had replaced membar_consumer() with Linux's smp_rmb() in
zfs_ioctl.c. The FreeBSD SPL consequently implemented a shim for the
Linux-only smp_rmb().
We reinstate membar_consumer() in platform independent code and fix the
FreeBSD SPL to implement membar_consumer() in a way analogous to Linux.
Reviewed-by: Konstantin Belousov <kib@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13843
In our codebase, `cond_resched() and `schedule()` are Linux kernel
functions that have replaced the OpenSolaris `kpreempt()` functions in
the codebase to such an extent that `kpreempt()` in zfs_context.h was
broken. Nobody noticed because we did not actually use it. The header
had defined `kpreempt()` as `yield()`, which works on OpenSolaris and
Illumos where `sched_yield()` is a wrapper for `yield()`, but that does
not work on any other platform.
The FreeBSD platform specific code implemented shims for these, but the
shim for `schedule()` forced us to wait, which is different than merely
rescheduling to another thread as the original Linux code does, while
the shim for `cond_resched()` had the same definition as its kernel
kpreempt() shim.
After studying this, I have concluded that we should reintroduce the
kpreempt() function in platform independent code with the following
definitions:
- In the Linux kernel:
kpreempt(unused) -> cond_resched()
- In the FreeBSD kernel:
kpreempt(unused) -> kern_yield(PRI_USER)
- In userspace:
kpreempt(unused) -> sched_yield()
In userspace, nothing changes from this cleanup. In the kernels, the
function `fm_fini()` will now call `kern_yield(PRI_USER)` on FreeBSD and
`cond_resched()` on Linux. This is instead of `pause("schedule", 1)` on
FreeBSD and `schedule()` on Linux. This makes our behavior consistent
across platforms.
Note that Linux's SPL continues to use `cond_resched()` and
`schedule()`. However, those functions have been removed from both the
FreeBSD code and userspace code.
This should have the benefit of making it slightly easier to port the
code to new platforms by making how things should be mapped less
confusing.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#13845
Some ARM BSPs run the Android kernel, which has
a modified xattr_handler->get() function signature.
This adds support to compile against these kernels.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Walter Huf <hufman@gmail.com>
Closes#13824
The 6.0 kernel added a printf-style var-arg for args > 0 to the
register_shrinker function, in order to add names to shrinkers, in
commit e33c267ab70de4249d22d7eab1cc7d68a889bac2. This enables the
shrinkers to have friendly names exposed in /sys/kernel/debug/shrinker/.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#13748
When building modules (as well as the kernel) with ARCH=um, the options
-Dsetjmp=kernel_setjmp and -Dlongjmp=kernel_longjmp are passed to the C
preprocessor for C files. This causes the setjmp and longjmp used in
module/lua/ldo.c to be kernel_setjmp and kernel_longjmp respectively in
the object file. However, the setjmp and longjmp that is intended to be
called is defined in an architecture dependent assembly file under the
directory module/lua/setjmp. Since it is an assembly and not a C file,
the preprocessor define is not given and the names do not change. This
becomes an issue when modpost is trying to create the Module.symvers
and sees no defined symbol for kernel_setjmp and kernel_longjmp. To fix
this, if the macro CONFIG_UML is defined, then setjmp and longjmp
macros are undefined.
When building with ARCH=um for x86 sub-architectures, CONFIG_X86 is not
defined. Instead, CONFIG_UML_X86 is defined. Despite this, the UML x86
sub-architecture can use the same object files as the x86 architectures
because the x86 sub-architecture UML kernel is running with the same
instruction set as CONFIG_X86. So the modules/Kbuild build file is
updated to add the same object files that CONFIG_X86 would add when
CONFIG_UML_X86 is defined.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Glenn Washburn <development@efficientek.com>
Closes#13547
This allows ZFS datasets to be delegated to a user/mount namespace
Within that namespace, only the delegated datasets are visible
Works very similarly to Zones/Jailes on other ZFS OSes
As a user:
```
$ unshare -Um
$ zfs list
no datasets available
$ echo $$
1234
```
As root:
```
# zfs list
NAME ZONED MOUNTPOINT
containers off /containers
containers/host off /containers/host
containers/host/child off /containers/host/child
containers/host/child/gchild off /containers/host/child/gchild
containers/unpriv on /unpriv
containers/unpriv/child on /unpriv/child
containers/unpriv/child/gchild on /unpriv/child/gchild
# zfs zone /proc/1234/ns/user containers/unpriv
```
Back to the user namespace:
```
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
containers 129M 47.8G 24K /containers
containers/unpriv 128M 47.8G 24K /unpriv
containers/unpriv/child 128M 47.8G 128M /unpriv/child
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Buddy <https://buddy.works>
Closes#12263
When read and writing the UID/GID, we always want the value
relative to the root user namespace, the kernel will take care
of remapping this to the user namespace for us.
Calling from_kuid(user_ns, uid) with a unmapped uid will return -1
as that uid is outside of the scope of that namespace, and will result
in the files inside the namespace all being owned by 'nobody' and not
being allowed to call chmod or chown on them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#12263
Add support for the kernel's block multiqueue (blk-mq) interface in
the zvol block driver. blk-mq creates multiple request queues on
different CPUs rather than having a single request queue. This can
improve zvol performance with multithreaded reads/writes.
This implementation uses the blk-mq interfaces on 4.13 or newer
kernels. Building against older kernels will fall back to the
older BIO interfaces.
Note that you must set the `zvol_use_blk_mq` module param to
enable the blk-mq API. It is disabled by default.
In addition, this commit lets the zvol blk-mq layer process whole
`struct request` IOs at a time, rather than breaking them down
into their individual BIOs. This reduces dbuf lock contention
and overhead versus the legacy zvol submit_bio() codepath.
sequential dd to one zvol, 8k volblocksize, no O_DIRECT:
legacy submit_bio() 292MB/s write 453MB/s read
this commit 453MB/s write 885MB/s read
It also introduces a new `zvol_blk_mq_chunks_per_thread` module
parameter. This parameter represents how many volblocksize'd chunks
to process per each zvol thread. It can be used to tune your zvols
for better read vs write performance (higher values favor write,
lower favor read).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#13148
Issue #12483
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes#10058Closes#12918
As of the Linux 5.19 kernel the asm/fpu/internal.h header was
entirely removed. It has been effectively empty since the 5.16
kernel and provides no required functionality.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13529
As of the Linux 5.19 kernel the disk_*_io_acct() helper functions
have been replaced by the bdev_*_io_acct() functions.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
Linux 5.19 commit torvalds/linux@44abff2c0 removed the
blk_queue_secure_erase() helper function. The preferred
interface is to now use the bdev_max_secure_erase_sectors()
function to check for discard support.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
Linux 5.19 commit torvalds/linux@70200574cc removed the
blk_queue_discard() helper function. The preferred interface
is to now use the bdev_max_discard_sectors() function to check
for discard support.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13515
Still descend, but only once: we get a lot of mileage out of nodist_
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13316
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.
This commit deals with 2 cases:
- No page writeback is active. A WB_SYNC_ALL page writeback starts
and even completes. But when it's about to check if the PG_writeback
bit has been cleared, another writeback with WB_SYNC_NONE starts.
The sync page writeback ends up waiting for the non-sync page
writeback to complete.
- A page writeback with WB_SYNC_NONE is already active when a
WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
waiting for the WB_SYNC_NONE writeback.
The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes#12662Closes#12790
Originally it was thought it would be useful to split up the kmods
by functionality. This would allow external consumers to only load
what was needed. However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward. Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13274
Parts of the Linux kernel build system struggle with _Noreturn. This
results in the following warnings when building on RHEL 8.5, and likely
other environments. Switch to using the __attribute__((noreturn)).
warning: objtool: dbuf_free_range()+0x2b8:
return with modified stack frame
warning: objtool: dbuf_free_range()+0x0:
stack state mismatch: cfa1=7+40 cfa2=7+8
...
WARNING: EXPORT symbol "arc_buf_size" [zfs.ko] version generation
failed, symbol will not be versioned.
WARNING: EXPORT symbol "spa_open" [zfs.ko] version generation
failed, symbol will not be versioned.
...
Additionally, __thread_exit() has been renamed spl_thread_exit() and
made a static inline function. This was needed because the kernel
will generate a warning for symbols which are __attribute__((noreturn))
and then exported with EXPORT_SYMBOL.
While we could continue to use _Noreturn in user space I've also
switched it to __attribute__((noreturn)) purely for consistency
throughout the code base.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13238
Commit 3b52ccd introduced a flaw where FSR and FSAVE are not restored
when using a Linux 5.16 kernel. These instructions are only used when
XSAVE is not supported by the processor meaning only some systems will
encounter this issue.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13210Closes#13236
This PR changes ZFS ACL checks to evaluate
fsuid / fsgid rather than euid / egid to avoid
accidentally granting elevated permissions to
NFS clients.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Andrew Walker <awalker@ixsystems.com>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#13221
Cleanup the kernel SIMD code by removing kernel dependencies.
- Replace XSTATE_XSAVE with our own XSAVE implementation for all
kernels not exporting kernel_fpu{begin,end}(), see #13059
- Replace union fpregs_state by a uint8_t * buffer and get the size
of the buffer from the hardware via the CPUID instruction
- Replace kernels xgetbv() by our own implementation which was
already there for userspace.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#13102
aarch64 is a different architecture than arm. Some
compilers might choke when both __arm__ and __aarch64__
are defined.
This change separates the checks for arm and for
aarch64 in the isa_defs.h header files.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Windel Bouwman <windel@windel.nl>
Closes#10335Closes#13151
A function that returns with no value is a different thing from a
function that doesn't return at all. Those are two orthogonal
concepts, commonly confused.
pthread_create(3) expects a pointer to a start routine that has a
very precise prototype:
void *(*start_routine)(void *);
However, other thread functions, such as kernel ones, expect:
void (*start_routine)(void *);
Providing a different one is incorrect, and has only been working
because the ABIs happen to produce a compatible function.
We should use '_Noreturn void', since it's the natural type, and
then provide a '_Noreturn void *' wrapper for pthread functions.
For consistency, replace most cases of __NORETURN or
__attribute__((noreturn)) by _Noreturn. _Noreturn is understood
by -std=gnu89, so it should be safe to use everywhere.
Ref: https://github.com/openzfs/zfs/pull/13110#discussion_r808450136
Ref: https://software.codidact.com/posts/285972
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Closes#13120
Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#13097
Observed when building on CentOS 8 Stream. Remove the `out`
label at the end of the function and instead return.
linux/simd_x86.h: In function 'kfpu_begin':
linux/simd_x86.h:337:1: error: label at end of compound statement
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#13089
To follow a change in illumos taskq
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#12802
Linux 5.16 moved XSTATE_XSAVE and XSTATE_XRESTORE out of our reach,
so add our own XSAVE{,OPT,S} code and use it for Linux 5.16.
Please note that this differs from previous behavior in that it
won't handle exceptions created by XSAVE an XRSTOR. This is sensible
for three reasons.
- Exceptions during XSAVE and XRSTOR can only occur if the feature
is not supported or enabled or the memory operand isn't aligned
on a 64 byte boundary. If this happens something else went
terribly wrong, and it may be better to stop execution.
- Previously we just printed a warning and didn't handle the fault,
this is arguable for the above reason.
- All other *SAVE instruction also don't handle exceptions, so this
at least aligns behavior.
Finally add a test to catch such a regression in the future.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#13042Closes#13059
69 CSTYLED BEGINs remain, appx. 30 of which can be removed if cstyle(1)
had a useful policy regarding
CALL(ARG1,
ARG2,
ARG3);
above 2 lines. As it stands, it spits out *both*
sysctl_os.c: 385: continuation line should be indented by 4 spaces
sysctl_os.c: 385: indent by spaces instead of tabs
which is very cool
Another >10 could be fixed by removing "ulong" &al. handling.
I don't foresee anyone actually using it intentionally
(does it even exist in modern headers? why did it in the first place?).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12993
This led to these two warning types:
debug.h:139:67: warning: the address of ‘ARC_anon’
will always evaluate as ‘true’ [-Waddress]
139 | #define ASSERT3P(x, y, z)
((void) sizeof (!!(x)), (void) sizeof (!!(z)))
| ^
arc.c:1591:2: note: in expansion of macro ‘ASSERT3P’
1591 | ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
| ^~~~~~~~
and
arc.h:66:44: warning: ‘<<’ in boolean context,
did you mean ‘<’? [-Wint-in-bool-context]
66 | #define HDR_GET_LSIZE(hdr)
((hdr)->b_lsize << SPA_MINBLOCKSHIFT)
debug.h:138:46: note: in definition of macro ‘ASSERT3U’
138 | #define ASSERT3U(x, y, z)
((void) sizeof (!!(x)), (void) sizeof (!!(z)))
| ^
arc.c:1760:12: note: in expansion of macro ‘HDR_GET_LSIZE’
1760 | ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);
| ^~~~~~~~~~~~~
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#13009
Linux decided to rename this for some reason. At some point, we
should probably invert this mapping, but for now...
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12975
sizeof(bitfield.member) is invalid, and this shows up in some FreeBSD
build configurations: work around this by !!ing ‒
this makes the sizeof target the ! result type (_Bool), instead
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Fixes: 42aaf0e ("libspl: ASSERT*: mark arguments as used")
Closes#12984Closes#12986
Evaluated every variable that lives in .data (and globals in .rodata)
in the kernel modules, and constified/eliminated/localised them
appropriately. This means that all read-only data is now actually
read-only data, and, if possible, at file scope. A lot of previously-
global-symbols became inlinable (and inlined!) constants. Probably
not in a big Wowee Performance Moment, but hey.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12899
Linux 5.16 moved these functions into this new header in commit
1b4fb8545f2b00f2844c4b7619d64d98440a477c. This change adds code to look
for the presence of this header, and include it so that the code using
xgetbv & xsetbv will compile again.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#12800
Commit https://github.com/torvalds/linux/commit/2e9bc346 moved
the elevator.h header under the block/ directory as part of some
refactoring. This turns out not to be a problem since there's
no longer anything we need from the header. This has been the
case for some time, this change removes the elevator.h include
and replaces it with a major.h include.
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12725
When using lseek(2) to report data/holes memory mapped regions of
the file were ignored. This could result in incorrect results.
To handle this zfs_holey_common() was updated to asynchronously
writeback any dirty mmap(2) regions prior to reporting holes.
Additionally, while not strictly required, the dn_struct_rwlock is
now held over the dirty check to prevent the dnode structure from
changing. This ensures that a clean dnode can't be dirtied before
the data/hole is located. The range lock is now also taken to
ensure the call cannot race with zfs_write().
Furthermore, the code was refactored to provide a dnode_is_dirty()
helper function which checks the dnode for any dirty records to
determine its dirtiness.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #11900Closes#12724
As of the Linux 5.9 kernel a fallthrough macro has been added which
should be used to anotate all intentional fallthrough paths. Once
all of the kernel code paths have been updated to use fallthrough
the -Wimplicit-fallthrough option will because the default. To
avoid warnings in the OpenZFS code base when this happens apply
the fallthrough macro.
Additional reading: https://lwn.net/Articles/794944/
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12441
Kernel commits
332f606b32b6 ovl: enable RCU'd ->get_acl()
0cad6246621b vfs: add rcu argument to ->get_acl() callback
Added compatibility code to detect the new ->get_acl() interface
and correctly handle the case where the new rcu argument is set.
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12548
Kernel commits
39f75da7bcc8 ("isystem: trim/fixup stdarg.h and other headers")
c0891ac15f04 ("isystem: ship and use stdarg.h")
564f963eabd1 ("isystem: delete global -isystem compile option")
(for now can be found in linux-next.git tree, will land into the
Linus' tree during the ongoing 5.15 cycle with one of akpm merges)
removed the -isystem flag and disallowed the inclusion of any
compiler header files. They also introduced a minimal
<linux/stdarg.h> as a replacement for <stdarg.h>.
include/os/linux/spl/sys/cmn_err.h in the ZFS source tree includes
<stdarg.h> unconditionally. Introduce a test for <linux/stdarg.h>
and include it instead of the compiler's one to prevent module
build breakage.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Closes#12531
The 5.15 kernel moved the backing_dev_info structure out of
the request queue structure which causes a build failure.
Rather than look in the new location for the BDI we instead
detect this upstream refactoring by the existance of either
the blk_queue_update_readahead() or disk_update_readahead()
functions. In either case, there's no longer any reason to
manually set the ra_pages value since it will be overridden
with a reasonable default (2x the block size) when
blk_queue_io_opt() is called.
Therefore, we update the compatibility wrapper to do nothing
for 5.9 and newer kernels. While it's tempting to do the
same for older kernels we want to keep the compatibility
code to preserve the existing behavior. Removing it would
effectively increase the default readahead to 128k.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12532
Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.
statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.
We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#8507
Instead of clearing stats inside arc_buf_alloc_impl() do it inside
arc_hdr_alloc() and arc_release(). It fixes statistics being wiped
every time a new dbuf is filled from the ARC.
Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits.
Since the hits are accounted under hash lock, replace atomics with
simple increments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12422
Move HAVE_LARGE_STACKS definitions to header and set when appropriate.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Kevin Bowling <kbowling@FreeBSD.org>
Closes#12350
prng32_bounded() is available to kernel only on FreeBSD 13+.
Call inline random_get_pseudo_bytes() with correct pointer type.
To be consistent, apply to Linux as well.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes#12282
In all places except two spa_get_random() is used for small values,
and the consumers do not require well seeded high quality values.
Switch those two exceptions directly to random_get_pseudo_bytes()
and optimize spa_get_random(), renaming it to random_in_range(),
since it is not related to SPA or ZFS in general.
On FreeBSD directly map random_in_range() to new prng32_bounded() KPI
added in FreeBSD 13. On Linux and in user-space just reduce the type
used to uint32_t to avoid more expensive 64bit division.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12183
This mostly reverts "3537 want pool io kstats" commit of 8 years ago.
From one side this code using pool-wide locks became pretty bad for
performance, creating significant lock contention in I/O pipeline.
From another, there are more efficient ways now to obtain detailed
statistics, while this statistics is illumos-specific and much less
usable on Linux and FreeBSD, reported only via procfs/sysctls.
This commit does not remove KSTAT_TYPE_IO implementation, that may
be removed later together with already unused KSTAT_TYPE_INTR and
KSTAT_TYPE_TIMER.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12212
- Avoid atomic_add() when updating as_lower_bound/as_upper_bound.
Previous code was excessively strong on 64bit systems while not
strong enough on 32bit ones. Instead introduce and use real
atomic_load() and atomic_store() operations, just an assignments
on 64bit machines, but using proper atomics on 32bit ones to avoid
torn reads/writes.
- Reduce number of buckets on large systems. Extra buckets not as
much improve add speed, as hurt reads. Unlike wmsum for aggsum
reads are still important.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12145
wmsum counters are a reduced version of aggsum counters, optimized for
write-mostly scenarios. They do not provide optimized read functions,
but instead allow much cheaper add function. The primary usage is
infrequently read statistic counters, not requiring exact precision.
The Linux implementation is directly mapped into percpu_counter KPI.
The FreeBSD implementation is directly mapped into counter(9) KPI.
In user-space due to lack of better implementation mapped to aggsum.
Unfortunately neither Linux percpu_counter nor FreeBSD counter(9)
provide sufficient functionality to completelly replace aggsum, so
it still remains to be used for several hot counters.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12114
Just like #12087, the set_acl signature changed with all the bolted-on
*userns parameters, which disabled set_acl usage, and caused #12076.
Turn zpl_set_acl into zpl_set_acl and zpl_set_acl_impl, and add a
new configure test for the new version.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#12076Closes#12093
Linux kernel commit 0f00b82e5413571ed225ddbccad6882d7ea60bc7 removes the
revalidate_disk() handler from struct block_device_operations. This
caused a regression, and this commit eliminates the call to it and the
assignment in the block_device_operations static handler assignment
code, when configure identifies that the kernel doesn't support that
API handler.
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11967Closes#11977
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11972
zfs_zevent_console committed multiple printk()s per line without
properly continuing them ‒ a single event could easily be fragmented
across over thirty lines, making it useless for direct application
zfs_zevent_cols exists purely to wrap the output from zfs_zevent_console
The niche this was supposed to fill can be better served by something
akin to the all-syslog ZEDLET
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#7082Closes#11996
Afterward, git grep ZoL matches:
* README.md: * [ZoL Site](https://zfsonlinux.org)
- Correct
* etc/default/zfs.in:# ZoL userland configuration.
- Changing this would induce a needless upgrade-check,
if the user has modified the configuration;
this can be updated the next time the defaults change
* module/zfs/dmu_send.c: * ZoL < 0.7 does not handle [...]
- Before 0.7 is ZoL, so fair enough
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Issue #11956
This change adds SIGSTOP and SIGTSTP handling to the issig function;
this mirrors its behavior on Solaris. This way, long running kernel
tasks can be stopped with the appropriate signals. Note that doing
so with ctrl-z on the command line doesn't return control of the tty
to the shell, because tty handling is done separately from stopping
the process. That can be future work, if people feel that it is a
necessary addition.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Issue #810
Issue #10843Closes#11801
Correct an assortment of typos throughout the code base.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#11774
It used to be required to pass a enum km_type to kmap_atomic() and
kunmap_atomic(), however this is no longer necessary and the wrappers
zfs_k(un)map_atomic removed these. This is confusing in the ABD code as
the struct abd_iter member iter_km no longer exists and the wrapper
macros simply compile them out.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11768
In Linux 5.12, the filesystem API was modified to support ipmapped
mounts by adding a "struct user_namespace *" parameter to a number
functions and VFS handlers. This change adds the needed autoconf
macros to detect the new interfaces and updates the code appropriately.
This change does not add support for idmapped mounts, instead it
preserves the existing behavior by passing the initial user namespace
where needed. A subsequent commit will be required to add support
for idmapped mounted.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11712
zhold() wraps igrab() on Linux, and igrab() may fail when the inode
is in the process of being deleted. This means zhold() must only be
called when a reference exists and therefore it cannot be deleted.
This is the case for all existing consumers so add a VERIFY and a
comment explaining this requirement.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adam Moss <c@yotes.com>
Closes#11704
This will allow platforms to implement it as they see fit, in particular
in a different manner than rrm locks.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11153
They are expected to fail only in corner cases.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11153
Add branch hints and constify the intermediate evaluations of
left/right params in VERIFY3*().
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Adam Moss <c@yotes.com>
Closes#11708
The bio_*_acct functions became GPL exports, which causes the
kernel modules to refuse to compile. This replaces code with
alternate function calls to the disk_*_io_acct interfaces, which
are not GPL exports. This change was added in kernel commit
99dfc43ecbf67f12a06512918aaba61d55863efc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11639
Making uio_impl.h the common header interface between Linux and FreeBSD
so both OS's can share a common header file. This also helps reduce code
duplication for zfs_uio_t for each OS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11622
zfs_znode_update_vfs is a more platform-agnostic name than
zfs_inode_update. Besides that, the function's prototype is moved to
include/sys/zfs_znode.h as the function is also used in common code.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ka Ho Ng <khng300@gmail.com>
Sponsored by: The FreeBSD Foundation
Closes#11580
This compatibility code is no longer needed. For it a while
iov_iter_init_compat() was used by zfs_uio_prefaultpages() but
this code should have been dropped as part of commit 83b91ae1.
Take care of that oversight and remove it.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11543
In FreeBSD the struct uio was just a typedef to uio_t. In order to
extend this struct, outside of the definition for the struct uio, the
struct uio has been embedded inside of a uio_t struct.
Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
this is a ZFS interface.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11438
As of 5.11 the blk_register_region() and blk_unregister_region()
functions have been retired. This isn't a problem since add_disk()
has implicitly allocated minor numbers for a very long time.
Reviewed-by: Rafael Kitover <rkitover@gmail.com>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11387Closes#11390
The generic IO accounting functions have been removed in favor of the
bio_start_io_acct() and bio_end_io_acct() functions which provide a
better interface. These new functions were introduced in the 5.8
kernels but it wasn't until the 5.11 kernel that the previous generic
IO accounting interfaces were removed.
This commit updates the blk_generic_*_io_acct() wrappers to provide
and interface similar to the updated kernel interface. It's slightly
different because for older kernels we need to pass the request queue
as well as the bio.
Reviewed-by: Rafael Kitover <rkitover@gmail.com>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11387Closes#11390
The lookup_bdev() function has been updated to require a dev_t
be passed as the second argument. This is actually pretty nice
since the major number stored in the dev_t was the only part we
were interested in. This allows to us avoid handling the bdev
entirely. The vdev_lookup_bdev() wrapper was updated to emulate
the behavior of the new lookup_bdev() for all supported kernels.
Reviewed-by: Rafael Kitover <rkitover@gmail.com>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11387Closes#11390
The CentOS stream 4.18.0-257 kernel appears to have backported
the Linux 5.9 change to make_request_fn and the associated API.
To maintain weak modules compatibility the original symbol was
retained and the new interface blk_alloc_queue_rh() was added.
Unfortunately, blk_alloc_queue() was replaced in the blkdev.h
header by blk_alloc_queue_bh() so there doesn't seem to be a way
to build new kmods against the old interfces. Even though they
appear to still be available for weak module binding.
To accommodate this a configure check is added for the new _rh()
variant of the function and used if available. If compatibility
code gets added to the kernel for the original blk_alloc_queue()
interface this should be fine. OpenZFS will simply continue to
prefer the new interface and only fallback to blk_alloc_queue()
when blk_alloc_queue_rh() isn't available.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11374
As of the 5.10 kernel the generic splice compatibility code has been
removed. All filesystems are now responsible for registering a
->splice_read and ->splice_write callback to support this operation.
The good news is the VFS provided generic_file_splice_read() and
iter_file_splice_write() callbacks can be used provided the ->iter_read
and ->iter_write callback support pipes. However, this is currently
not the case and only iovecs and bvecs (not pipes) are ever attached
to the uio structure.
This commit changes that by allowing full iov_iter structures to be
attached to uios. Ever since the 4.9 kernel the iov_iter structure
has supported iovecs, kvecs, bvevs, and pipes so it's desirable to
pass the entire thing when possible. In conjunction with this the
uio helper functions (i.e uiomove(), uiocopy(), etc) have been
updated to understand the new UIO_ITER type.
Note that using the kernel provided uio_iter interfaces allowed the
existing Linux specific uio handling code to be simplified. When
there's no longer a need to support kernel's older than 4.9, then
it will be possible to remove the iovec and bvec members from the
uio structure and always use a uio_iter. Until then we need to
maintain all of the existing types for older kernels.
Some additional refactoring and cleanup was included in this change:
- Added checks to configure to detect available iov_iter interfaces.
Some are available all the way back to the 3.10 kernel and are used
when available. In particular, uio_prefaultpages() now always uses
iov_iter_fault_in_readable() which is available for all supported
kernels.
- The unused UIO_USERISPACE type has been removed. It is no longer
needed now that the uio_seg enum is platform specific.
- Moved zfs_uio.c from the zcommon.ko module to the Linux specific
platform code for the zfs.ko module. This gets it out of libzfs
where it was never needed and keeps this Linux specific code out
of the common sources.
- Removed unnecessary O_APPEND handling from zfs_iter_write(), this
is redundant and O_APPEND is already handled in zfs_write();
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11351
There is a tunable to select the fletcher 4 checksum implementation on
Linux but it was not present in FreeBSD.
Implement the sysctl handler for FreeBSD and use ZFS_MODULE_PARAM_CALL
to provide the tunable on both platforms.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11270
ZFS currently doesn't react to hotplugging cpu or memory into the
system in any way. This patch changes that by adding logic to the ARC
that allows the system to take advantage of new memory that is added
for caching purposes. It also adds logic to the taskq infrastructure
to support dynamically expanding the number of threads allocated to a
taskq.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#11212
There has been a panic affecting some system configurations where the
thread FPU context is disturbed during the fletcher 4 benchmarks,
leading to a panic at boot.
module_init() registers zcommon_init to run in the last subsystem
(SI_SUB_LAST). Running it as soon as interrupts have been configured
(SI_SUB_INT_CONFIG_HOOKS) makes sure we have finished the benchmarks
before we start doing other things.
While it's not clear *how* the FPU context was being disturbed, this
does seem to avoid it.
Add a module_init_early() macro to run zcommon_init() at this earlier
point on FreeBSD. On Linux this is defined as module_init().
Authored by: Konstantin Belousov <kib@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#11302
The ZFS_ENTER/ZFS_EXIT/ZFS_VERFY_ZP macros should not be used
in the Linux zpl_*.c source files. They return a positive error
value which is correct for the common code, but not for the Linux
specific kernel code which expects a negative return value. The
ZPL_ENTER/ZPL_EXIT/ZPL_VERFY_ZP macros should be used instead.
Furthermore, the ZPL_EXIT macro has been updated to not call the
zfs_exit_fs() function. This prevents a possible deadlock which
can occur when a snapshot is automatically unmounted because the
zpl_show_devname() must never wait on in progress automatic
snapshot unmounts.
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11169Closes#11201
The field is yet another leftover from unsupported zfs_znode_move.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11186
Kernel 5.10 removed check_disk_change() in favor of callers using
the faster bdev_check_media_change() instead, and explicitly forcing
bdev revalidation when they desire that behavior. To preserve prior
behavior, I have wrapped this into a zfs_check_media_change() macro
that calls an inline function for the new API that mimics the old
behavior when check_disk_change() doesn't exist, and just calls
check_disk_change() if it exists.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#11085
Move zfs_get_data() in to platform-independent code. The only
platform-specific aspect of it is the way we release an inode
(Linux) / vnode_t (FreeBSD). I am not aware of a platform that
could be supported by ZFS that couldn't implement zfs_rele_async
itself. It's sibling zvol_get_data already is platform-independent.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes#10979
Current CPU_SEQID users don't care about possibly changing CPU ID, but
enclose it within kpreempt disable/enable in order to fend off warnings
from Linux's CONFIG_DEBUG_PREEMPT.
There is no need to do it. The expected way to get CPU ID while allowing
for migration is to use raw_smp_processor_id.
In order to make this future-proof this patch keeps CPU_SEQID as is and
introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers
explicitly want this behavior.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes#11142
The zfs_holey() and zfs_access() functions can be made common
to both FreeBSD and Linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11125
The original xuio zero copy functionality has always been unused
on Linux and FreeBSD. Remove this disabled code to avoid any
confusion and improve readability.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11124
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD. With a little refactoring they can be
moved to the common code which is what is done by this commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11078
This change updates the documentation to refer to the project
as OpenZFS instead ZFS on Linux. Web links have been updated
to refer to https://github.com/openzfs/zfs. The extraneous
zfsonlinux.org web links in the ZED and SPL sources have been
dropped.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11007
In C, const indicates to the reader that mutation will not occur.
It can also serve as a hint about ownership.
Add const in a few places where it makes sense.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#10997
The procfs_list interface is required by several kstats. Implement
this functionality for FreeBSD to provide access to these kstats.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10890
== Motivation and Context
The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.
In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.
== Description
This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes#10932
nvlist does allow us to support different data types and systems.
To encapsulate user data to/from nvlist, the libzfsbootenv library is
provided.
Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes#10774
There are a number of places where cv_?_sig is used simply for
accounting purposes but the surrounding code has no ability to
cope with actually receiving a signal. On FreeBSD it is possible
to send signals to individual kernel threads so this could
enable undesirable behavior.
This patch adds routines on Linux that will do the same idle
accounting as _sig without making the task interruptible. On
FreeBSD cv_*_idle are all aliases for cv_*
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10843
Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' property set to
'legacy' or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even
mounted back.
This introduces layering violation, as we need to update
'f_mntfromname' field in statfs structure related to mountpoint (for
the dataset we are renaming and all its children).
In my opinion it is worth it, as it allow to update FreeBSD in even
cleaner way - in ZFS-only configuration root file system is ZFS file
system with 'mountpoint' property set to 'legacy'. If root dataset is
named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone
it (system/oldrootfs), update FreeBSD and if it doesn't boot we can
boot back from system/oldrootfs and rename it back to system/rootfs
while it is mounted as /. Before it was not possible, because
unmounting / was not possible.
Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Matt Macy <mmacy@freebsd.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10839
Commit dcdc12e added compatibility code to treat NR_SLAB_RECLAIMABLE_B
as if it were the same as NR_SLAB_RECLAIMABLE. However, the new value
is in bytes while the old value was in pages which means they are not
interchangeable.
The only place the reclaimable slab size is used is as a component of
the calculation done by arc_free_memory(). This function returns the
amount of memory the ARC considers to be free or reclaimable at little
cost. Rather than switch to a new interface to get this value it has
been removed it from the calculation. It is normally a minor component
compared to the number of inactive or free pages, and removing it
aligns the behavior with the FreeBSD version of arc_free_memory().
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Coleman Kane <ckane@colemankane.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10834
struct task_struct is needed for lockdep_off() in mutex.h
This has popped up after e616cb8daadf (in linux-5.7-rc7).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#10741
`KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented
caches are `KMC_KVMEM`.
KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we
don't use kmalloc to back the SPL caches, instead we use kvmalloc
(KMC_KVMEM). The flag, module parameter, /proc entries, and associated
code are removed.
KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to
vmalloc(). The flag, /proc entries, and associated code are removed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10673
The make_request_fn and associated API was replaced recently in a
Linux 5.9 merge, to replace its functionality with a new submit_bio
member in struct block_device_operations.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#10696
This change appears to primarily be a name change for the enum. Had
to update the test logic so that it works so long as either one of
these is present (favoring the newer one). Additionally, as this is
newer, it only shows up in node_page_item, so this commit doesn't
test zone_page_item for the same enum.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes#10696
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes#10650
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes#10650
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes#10650
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes#10650
Remove dead code to make the implementation easier to understand.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Ahrens <matt@delphix.com>
Closes#10650
3442c2a02d added new `arc_wait_for_eviction` tracepoint, which fails to
compile, when tracepoints are enabled.
The tracepoint definition begins with `DEFINE_ARC_WAIT_FOR_EVICTION_EVENT`
and is a multi-line definition, so this fixes the backslash
and parenthesis accordingly.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#10669
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10600
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#10630
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10621
By default, `spl_kmem_cache_expire` is `KMC_EXPIRE_MEM`, meaning that
objects will be removed from kmem cache magazines by
`spl_kmem_cache_reap_now()`.
There is also a module parameter to change this to `KMC_EXPIRE_AGE`,
which establishes a maximum lifetime for objects to stay in the
magazine. This setting has rarely, if ever, been used, and is not
regularly tested.
This commit removes the code for `KMC_EXPIRE_AGE`, and associated module
parameters.
Additionally, the unused module parameter
`spl_kmem_cache_obj_per_slab_min` is removed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10608
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure. This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function. This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.
The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it. Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.
The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer. Therefore, additionally waking it from the
shrinker calbacks has little benefit.
The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second. Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times. Therefore, this mechanism consumes
significant amounts of CPU time.
The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
1. Return free objects from the magazine layer to the slab layer
2. Return entirely-free slabs to the page layer (i.e. free pages)
These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches. The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.
These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine. So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.
We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected. Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.
This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10576
On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.
Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10511
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit." Two types of users are allowed to change
the limit:
1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`. This works properly.
2. A user with elevated system privileges (e.g. root). This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.
The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process. This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.
This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged. Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.
This problem only impacts Linux. Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8226Closes#10545
OS-specific code (e.g. under `module/os/linux`) does not need to share
its code structure with any other operating systems. In particular, the
ARC and kmem code need not be similar to the code in illumos, because we
won't be syncing this OS-specific code between operating systems. For
example, if/when illumos support is added to the common repo, we would
add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of
this code.
Therefore, we can simplify the code in the OS-specific ARC and kmem
routines.
These changes do not impact system behavior, they are purely code
cleanup. The changes are:
Arenas are not used on Linux or FreeBSD (they are always `NULL`), so
`heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along
with code that uses them.
In `arc_available_memory()`:
* `desfree` is unused, remove it
* rename `freemem` to avoid conflict with pre-existing `#define`
* remove checks related to arenas
* use units of bytes, rather than converting from bytes to pages and
then back to bytes
`SPL_KMEM_CACHE_REAP` is unused, remove it.
`skc_reap` is unused, remove it.
The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove
it.
`vmem_size()` and associated type and macros are unused, remove them.
In `arc_memory_throttle()`, use a less confusing variable name to store
the result of `arc_free_memory()`.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10499
The kernel headers are installed for DKMS on linux, so don't install
them unless we're building on linux.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10506
The SPL provides a wrapper for the kernel's shrinker callbacks, which
enables the ZFS code to interface with multiple versions of the shrinker
API's from different kernel versions. Specifically, Linux kernels 3.0 -
3.11 has a single "combined" callback, and Linux kernels 3.12 and later
have two "split" callbacks. The SPL provides a wrapper function so that
the ZFS code only needs to implement one version of the callbacks.
Currently the SPL's wrappers are designed such that the ZFS code
implements the older, "combined" callback. There are a few downsides to
this approach:
* The general design within ZFS is for the latest Linux kernel to be
considered the "first class" API.
* The newer, "split" callback API is easier to understand, because each
callback has one purpose.
* The current wrappers do not completely abstract out the differing
API's, so ZFS code needs `#ifdef` code to handle the differing return
values required for different kernel versions.
This commit addresses these drawbacks by having the ZFS code provide the
latest, "split" callbacks, and the SPL provides a wrapping function for
the older, "combined" API.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10502
A previous commit enabled the tracking of object allocations
in Linux-backed caches from the SPL layer for debuggability.
The commit is: 9a170fc6fe54f1e852b6c39630fe5ef2bbd97c16
Unfortunately, it also introduced minor performance regressions
that were highlighted by the ZFS perf test-suite. Within Delphix
we found that the regression would be from -1%, all the way up
to -8% for some workloads.
This commit brings performance back up to par by creating a
separate counter for those caches and making it a percpu in
order to avoid lock-contention.
The initial performance testing was done by myself, and the
final round was conducted by @tonynguien who was also the one
that discovered the regression and highlighted the culprit.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#10397
Reduce the usage of EXTRA_DIST. If files are conditionally included in
_SOURCES, _HEADERS etc, automake is smart enough to dist all files that
could possibly be included, but this does not apply to EXTRA_DIST,
resulting in make dist depending on the configuration.
Add some files that were missing altogether in various Makefile's.
The changes to disted files in this commit (excluding deleted files):
+./cmd/zed/agents/README.md
+./etc/init.d/README.md
+./lib/libspl/os/freebsd/getexecname.c
+./lib/libspl/os/freebsd/gethostid.c
+./lib/libspl/os/freebsd/getmntany.c
+./lib/libspl/os/freebsd/mnttab.c
-./lib/libzfs/libzfs_core.pc
-./lib/libzfs/libzfs.pc
+./lib/libzfs/os/freebsd/libzfs_compat.c
+./lib/libzfs/os/freebsd/libzfs_fsshare.c
+./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
+./lib/libzfs/os/freebsd/libzfs_zmount.c
+./lib/libzutil/os/freebsd/zutil_compat.c
+./lib/libzutil/os/freebsd/zutil_device_path_os.c
+./lib/libzutil/os/freebsd/zutil_import_os.c
+./module/lua/README.zfs
+./module/os/linux/spl/README.md
+./tests/README.md
+./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
+./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
+./tests/zfs-tests/tests/functional/inheritance/README.config
+./tests/zfs-tests/tests/functional/inheritance/README.state
+./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
+./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10501