Commit Graph

4713 Commits

Author SHA1 Message Date
Rob Norris
21cba06bef
config: fix dequeue_signal check for kernels <4.20
Before 4.20, kernel_siginfo_t was just called siginfo_t. This was
causing the kthread_dequeue_signal_3arg_task check, which uses
kernel_siginfo_t, to fail on older kernels.

In d6b8c17f1, we started checking for the "new" three-arg
dequeue_signal() by testing for the "old" version. Because that test is
explicitly using kernel_siginfo_t, it would fail, leading to the build
trying to use the new three-arg version, which would then not compile.

This commit fixes that by avoiding checking for the old 3-arg
dequeue_signal entirely. Instead, we check for the new one, as well as
the 4-arg form, and we use the old form as a fallback. This way, we
never have to test for it explicitly, and once we're building
HAVE_SIGINFO will make sure we get the right kernel_siginfo_t for it, so
everything works out nice.

Original-patch-by: Finix <yancw@info2soft.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16666
2024-10-20 19:50:13 -07:00
Umer Saleem
27e8f56102
Fix inconsistent mount options for ZFS root
While mounting ZFS root during boot on Linux distributions from initrd,
mount from busybox is effectively used which executes mount system call
directly. This skips the ZFS helper mount.zfs, which checks and enables
the mount options as specified in dataset properties. As a result,
datasets mounted during boot from initrd do not have correct mount
options as specified in ZFS dataset properties.

There has been an attempt to use mount.zfs in zfs initrd script,
responsible for mounting the ZFS root filesystem (PR#13305). This was
later reverted (PR#14908) after discovering that using mount.zfs breaks
mounting of snapshots on root (/) and other child datasets of root have
the same issue (Issue#9461).

This happens because switching from busybox mount to mount.zfs correctly
parses the mount options but also adds 'mntpoint=/root' to the mount
options, which is then prepended to the snapshot mountpoint in
'.zfs/snapshot'. '/root' is the directory on Debian with initramfs-tools
where root filesystem is mounted before pivot_root. When Linux runtime
is reached, trying to access the snapshots on root results in
automounting the snapshot on '/root/.zfs/*', which fails.

This commit attempts to fix the automounting of snapshots on root, while
using mount.zfs in initrd script. Since the mountpoint of dataset is
stored in vfs_mntpoint field, we can check if current mountpoint of
dataset and vfs_mntpoint are same or not. If they are not same, reset
the vfs_mntpoint field with current mountpoint. This fixes the
mountpoints of root dataset and children in respective vfs_mntpoint
fields when we try to access the snapshots of root dataset or its
children. With correct mountpoint for root dataset and children stored
in vfs_mntpoint, all snapshots of root dataset are mounted correctly
and become accessible.

This fix will come into play only if current process, that is trying to
access the snapshots is not in chroot context. The Linux kernel API
that is used to convert struct path into char format (d_path), returns
the complete path for given struct path. It works in chroot environment
as well and returns the correct path from original filesystem root.

However d_path fails to return the complete path if any directory from
original root filesystem is mounted using --bind flag or --rbind flag
in chroot environment. In this case, if we try to access the snapshot
from outside the chroot environment, d_path returns the path correctly,
i.e. it returns the correct path to the directory that is mounted with
--bind flag. However inside the chroot environment, it only returns the
path inside chroot.

For now, there is not a better way in my understanding that gives the
complete path in char format and handles the case where directories from
root filesystem are mounted with --bind or --rbind on another path which
user will later chroot into. So this fix gets enabled if current
process trying to access the snapshot is not in chroot context.

With the snapshots issue fixed for root filesystem, using mount.zfs in
ZFS initrd script, mounts the datasets with correct mount options.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16646
2024-10-17 09:09:39 -04:00
Brian Behlendorf
c642e985e5
Revert "Temporarily disable Direct IO by default"
This partially reverts commit 41210597.  Now that b4e4cbeb2 has
been merged Direct IO can be enabled by default for Linux, but
for FreeBSD there still remains a potentially insufficient range
locking in zfs_getpages() which needs to be resolved.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16629
2024-10-12 13:51:35 -07:00
Brian Atkinson
b4e4cbeb20
Always validate checksums for Direct I/O reads
This fixes an oversight in the Direct I/O PR. There is nothing that
stops a process from manipulating the contents of a buffer for a
Direct I/O read while the I/O is in flight. This can lead checksum
verify failures. However, the disk contents are still correct, and this
would lead to false reporting of checksum validation failures.

To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, then the I/O request will be
reissued though the ARC. This allows for actual validation to happen and
removes any possibility of the buffer being manipulated after the I/O
has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported though zpool status -d in
the DIO column. Also the zevent has been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes
2. dio_verify_rd -> Checksum verification failure for reads.
This allows for determining what I/O operation was the culprit for the
checksum verification failure. All DIO errors are reported only on the
top-level VDEV.

Even though FreeBSD can write protect pages (stable pages) it still has
the same issue as Linux with Direct I/O reads.

This commit updates the following:
1. Propogates checksum failures for reads all the way up to the
   top-level VDEV.
2. Reports errors through zpool status -d as DIO.
3. Has two zevents for checksum verify errors with Direct I/O. One for
   read and one for write.
4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and
   handle ABD buffer contents validation the same as Linux.
5. Updated manipulate_user_buffer.c to also manipulate a buffer while a
   Direct I/O read is taking place.
6. Adds a new ZTS test case dio_read_verify that stress tests the new
   code.
7. Updated man pages.
8. Added an IMPLY statement to zio_checksum_verify() to make sure that
   Direct I/O reads are not issued as speculative.
9. Removed self healing through mirror, raidz, and dRAID VDEVs for
   Direct I/O reads.

This issue was first observed when installing a Windows 11 VM on a ZFS
dataset with the dataset property direct set to always. The zpool
devices would report checksum failures, but running a subsequent zpool
scrub would not repair any data and report no errors.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16598
2024-10-09 12:28:08 -07:00
JKDingwall
0b4dcbe5b4
Fix generation of kernel uevents for snapshot rename on linux
`zvol_rename_minors()` needs to be given the full path not just the
snapshot name.  Use code removed in a0bd735ad as a guide
to providing the necessary values.

Add ZTS check for /dev changes after snapshot rename.  After
renaming a snapshot with 'snapdev=visible' ensure that the /dev
entries are updated to reflect the rename.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: James Dingwall <james@dingwall.me.uk>
Closes #14223 
Closes #16600
2024-10-06 14:36:33 -07:00
Alexander Motin
4ebe674d91
ARC: Cache arc_c value during arc_evict()
Since arc_evict() run can take some time, arc_c change during it
may result in undesired shift in ARC states balance. Primarily in
case of arc_c reduction it may cause eviction from MFU data state
despite its being below the target already.  Instead we should
evict as originally planned and if needed do another round after.

Reviewed-by: Theera K. <tkittich@hotmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16576
Closes #16605
2024-10-04 10:56:43 -07:00
Pavel Snajdr
0d77e738e6
Defer resilver only when progress is above a threshold
Restart a resilver from scratch, if the current one in progress is
below a new tunable, zfs_resilver_defer_percent (defaulting to 10%).

The original rationale for deferring additional resilvers, when there is
already one in progress, was to help achieving data redundancy sooner
for the data that gets scanned at the end of the resilver.

But in case the admin wants to attach multiple disks to a single vdev,
it wasn't immediately obvious the admin is supposed to run
`zpool resilver` afterwards to reset the deferred resilvers and start
a new one from scratch.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #15810
2024-10-04 10:41:17 -07:00
Rob Norris
224393a321
feature: large_microzap
In a4b21eadec we added the zap_micro_max_size tuneable to raise the size
at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block)
ZAPs. Before this, a microZAP was limited to 128KiB, which was the old
largest block size. The side effect of raising the max size past 128KiB
is that it be stored in a large block, requiring the large_blocks
feature.

Unfortunately, this means that a backup stream created without the
--large-block (-L) flag to zfs send would split the microZAP block into
smaller blocks and send those, as is normal behaviour for large blocks.
This would be received correctly, but since microZAPs are limited to the
first block in the object by definition, the entries in the later blocks
would be inaccessible. For directory ZAPs, this gives the appearance of
files being lost.

This commit adds a feature flag, large_microzap, that must be enabled
for microZAPs to grow beyond 128KiB, and which will be activated the
first time that occurs. This feature is later checked when generating
the stream and if active, the send operation will abort unless
--large-block has also been requested.

Changing the limit still requires zap_micro_max_size to be changed. The
state of this flag effectively sets the upper value for this tuneable,
that is, if the feature is disabled, the tuneable will be clamped to
128KiB.

A stream flag is also added to ensure that the receiver also activates
its own feature flag upon receiving the stream. This is not strictly
necessary to _use_ the received microZAP, since it doesn't care how
large its block is, but it is required to send the microZAP object on,
otherwise the original problem occurs again.

Because it's difficult to reliably distinguish a microZAP from a fatZAP
from outside the ZAP code, and because it seems unlikely that most
users are affected (a fairly niche tuneable combined with what should be
an uncommon use of send), and for the sake of expediency, this change
activates the feature the first time a microZAP grows to use a large
block, and is never deactivated after that. This can be improved in the
future.

This commit changes nothing for existing pools that already have large
microZAPs. The feature will not be retroactively applied, but will be
activated the next time a microZAP grows past the limit.

Don't use large_blocks feature for enable/disable tests.  The
large_microzap depends on large_blocks, so it gets enabled as a
dependency, breaking the test. Instead use feature "longname", which has
the exact same feature characteristics.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16593
2024-10-02 20:47:11 -07:00
Brian Behlendorf
412105977c
Temporarily disable Direct IO by default
While some remaining issues are resolved with the recently merged
Direct IO functionality disable it by default.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16597
2024-10-02 18:24:29 -07:00
Brian Behlendorf
d34d4f97a8
snapdir: add 'disabled' value to make .zfs inaccessible
In some environments, just making the .zfs control dir hidden from sight
might not be enough. In particular, the following scenarios might
warrant not allowing access at all:
- old snapshots with wrong permissions/ownership
- old snapshots with exploitable setuid/setgid binaries
- old snapshots with sensitive contents

Introducing a new 'disabled' value that not only hides the control dir,
but prevents access to its contents by returning ENOENT solves all of
the above.

The new property value takes advantage of 'iuv' semantics ("ignore
unknown value") to automatically fall back to the old default value when
a pool is accessed by an older version of ZFS that doesn't yet know
about 'disabled' semantics.

I think that technically the zfs_dirlook change is enough to prevent
access, but preventing lookups and dir entries in an already opened .zfs
handle might also be a good idea to prevent races when modifying the
property at runtime.

Add zfs_snapshot_no_setuid parameter to control whether automatically
mounted snapshots have the setuid mount option set or not.

this could be considered a partial fix for one of the scenarios
mentioned in desired.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Co-authored-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #3963
Closes #16587
2024-10-02 09:12:02 -07:00
rilysh
86737c5927
Avoid computing strlen() inside loops
Compiling with -O0 (no proper optimizations), strlen() call
in loops for comparing the size, isn't being called/initialized
before the actual loop gets started, which causes n-numbers of
strlen() calls (as long as the string is). Keeping the length
before entering in the loop is a good idea.

On some places, even with -O2, both GCC and Clang can't
recognize this pattern, which seem to happen in an array
of char pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: rilysh <nightquick@proton.me>
Closes #16584
2024-10-02 09:10:06 -07:00
Rob Norris
0cf14bf4b5 Linux 6.12: PG_error flag was removed
torvalds/linux@09022bc196 removes the flag, and the corresponding
SetPageError() and ClearPageError() macros, with no replacement offered.

Going back through the upstream history, use of this flag has been
gradually removed over the last year as part of the long tail of
converting everything to folios. Interesting tidbit comments from
torvalds/linux@29e9412b25 and torvalds/linux@420e05d0de suggest that
this flag has not been used meaningfully since page writeback failures
started being recorded in errseq_t instead (the whole "fsyncgate" thing,
~2017, around torvalds/linux@8ed1e46aaf).

Given that, it's possible that since perhaps Linux 4.13 we haven't been
getting anything by setting the flag. I don't know if that's true and/or
if there's something we should be doing instead, but my gut feel is that
its probably fine we only use the page cache as a proxy to allow mmap()
to work, rather than backing IO with it.

As such, I'm expecting that removing this will do no harm, but I'm
leaving it in for older kernels to maintain status quo, and if there is
an overall better way, that is left for a future change.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16582
2024-10-01 13:54:05 -07:00
Rob Norris
d6b8c17f1d Linux 6.12: support 3arg dequeue_signal() without task param
See torvalds/linux@a2b80ce87a. It claims the task arg is always
`current`, and so it is with us, so this is a safe change to make. The
only spanner is that we also support the older pre-5.17 3-arg
dequeue_signal() which had different meaning, so we have to check the
types to get the right one.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16582
2024-10-01 13:53:50 -07:00
Sanjeev Bagewadi
20232ecfaa Support for longnames for files/directories (Linux part)
This patch adds the ability for zfs to support file/dir name up to 1023
bytes. This number is chosen so we can support up to 255 4-byte
characters. This new feature is represented by the new feature flag
feature@longname.

A new dataset property "longname" is also introduced to toggle longname
support for each dataset individually. This property can be disabled,
even if it contains longname files. In such case, new file cannot be
created with longname but existing longname files can still be looked
up.

Note that, to my knowledge native Linux filesystems don't support name
longer than 255 bytes. So there might be programs not able to work with
longname.

Note that NFS server may needs to use exportfs_get_name to reconnect
dentries, and the buffer being passed is limit to NAME_MAX+1 (256). So
NFS may not work when longname is enabled.

Note, FreeBSD vfs layer imposes a limit of 255 name lengh, so even
though we add code to support it here, it won't actually work.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15921
2024-10-01 13:40:27 -07:00
Sanjeev Bagewadi
3cf2bfa570 Allocate zap_attribute_t from kmem instead of stack
This patch is preparatory work for long name feature. It changes all
users of zap_attribute_t to allocate it from kmem instead of stack. It
also make zap_attribute_t and zap_name_t structure variable length.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15921
2024-10-01 13:39:08 -07:00
Don Brady
141368a4b6
Restrict raidz faulted vdev count
Specifically, a child in a replacing vdev won't count when assessing
the dtl during a vdev_fault()

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16569
2024-10-01 09:12:11 -07:00
Rob Norris
c84a37ae93
lua: add flex array field to TString type
Linux 6.10+ with CONFIG_FORTIFY_SOURCE notices memcpy() accessing past
the end of TString, because it has no indication that there there may be
an additional allocation there.

There's no appropriate upstream change for this (ancient) version of
Lua, so this is the narrowest change I could come up with to add a flex
array field to the end of TString to satisfy the check. It's loosely
based on changes from lua/lua@ca41b43f and lua/lua@9514abc2.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16541
Closes #16583
2024-09-30 10:30:03 -07:00
Rob Norris
6f50f8e16b
zfs_log: add flex array fields to log record structs
ZIL log record structs (lr_XX_t) are frequently allocated with extra
space after the struct to carry variable-sized "payload" items.

Linux 6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime
bounds checking on memcpy() calls. Because these types had no indicator
that they might use more space than their simple definition,
__fortify_memcpy_chk will frequently complain about overruns eg:

    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)

To fix this, this commit adds flex array fields to all lr_XX_t structs
that require them, and then uses those fields to access that
end-of-struct area rather than more complicated casts and pointer
addition.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16501
Closes #16539
2024-09-27 09:18:11 -07:00
tstabrawa
b052035990
Avoid BUG in migrate_folio_extra
Linux page migration code won't wait for writeback to complete unless
it needs to call release_folio.  Call SetPagePrivate wherever
PageUptodate is set and define .release_folio, to cause
fallback_migrate_folio to wait for us.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tstabrawa <59430211+tstabrawa@users.noreply.github.com>
Closes #15140
Closes #16568
2024-09-26 08:57:09 -07:00
Alexander Motin
48d1be254f
Properly release key in spa_keystore_dsl_key_hold_dd()
Since dsl_crypto_key_open() references the key, 0d23f5e2e4 should
have called dsl_crypto_key_rele() to drop it first instead of
calling dsl_crypto_key_free() directly.  The final result should
actually be the same, but without triggering dck_holds assertion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16567
2024-09-25 07:40:17 -07:00
Alexander Motin
832f66b218
FreeBSD: Sync taskq_cancel_id() returns with Linux
Couple places in the code depend on 0 returned only if the task was
actually cancelled.  Doing otherwise could lead to extra references
being dropped.  The race could be small, but I believe CI hit it
from time to time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16565
2024-09-24 16:29:18 -07:00
w0xel
ccc420acd5
Add missing guard defines for simd_stat
This adds the HAVE_KERNEL_NEON and HAVE_KERNEL_FPU_INTERNAL
guards to simd_stat.c defaulted to 0 to make it build again.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Shengqi Chen <harry-chen@outlook.com>
Signed-off-by: Sebastian Wuerl <s.wuerl@mailbox.org>
Closes #16558
2024-09-24 09:07:26 -07:00
Theera K.
d40d40913d
Evicting too many bytes from MFU metadata
Without updating 'm' we evict from MFU metadata all that we wanted
to evict from all metadata, including already evicted MRU metadata
('m' is the total amount of metadata we had at the beginning,
and 'w' is the total amount of metadata we want to have). 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Theera K. <tkittich@hotmail.com>
Closes #16521
Closes #16546
2024-09-23 22:12:56 -07:00
Rob Norris
78e9e987e1 linux: log a scary warning when used with an experimental kernel
Since the person using the kernel may not be the person who built it,
show a warning at module load too, in case they aren't aware that it
might be weird.

Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15986
2024-09-23 10:44:54 -07:00
George Melikov
e419a63bf4
xattr dataset prop: change defaults to sa
It's the main recommendation to set xattr=sa
even in man pages, so let's set it by default.

xattr=sa don't use feature flag, so in the worst
case we'll have non-readable xattrs by other
non-openzfs platforms.

Non-overridden default `xattr` prop of existing pools
will automatically use `sa` after this commit too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #15147
2024-09-23 09:50:48 -07:00
Rich Ercolani
1d84c9eb66
Fix /proc/spl/kstat/simd on x86
Evidently while reworking it on aarch64, I broke it on x86 and
didn't notice.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #16556
2024-09-22 13:11:19 -07:00
Rob Norris
80645d6582
FreeBSD: restore zfs_znode_update_vfs()
I accidentally removed this in c22d56e3e, and didn't notice because it
doesn't fail the build, but does fail to load into the kernel because it
can't link it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16554
2024-09-21 10:03:54 -07:00
Brian Behlendorf
f9d4f1b480
Add SIMD metadata in /proc on Linux follow up
This change accidentally broke the FreeBSD build due to
a conflict between the simd_stat_init()/simd_stat_fini()
macros on FreeBSD and the extern function prototype.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16552
2024-09-20 15:48:12 -07:00
Rich Ercolani
5d01243964
Add SIMD metadata in /proc on Linux
Too many times, people's performance problems have amounted to
"somehow your SIMD support isn't working", and determining that
at runtime is difficult to describe to people.

This adds a /proc/spl/kstat/zfs/simd node, which exposes
metadata about which instructions ZFS thinks it can use,
on AArch64 and x86_64 Linux, to make investigating things
like this much easier.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #16530
2024-09-20 08:16:44 -07:00
George Melikov
01852ffbf8 arc_hdr_authenticate: make explicit error
On compression we could be more explicit here for cases
where we can not recompress the data.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9416
2024-09-19 17:25:02 -07:00
George Melikov
b32d48a625 ZLE compression: don't use BPE_PAYLOAD_SIZE
ZLE compressor needs additional bytes to process
d_len argument efficiently.
Don't use BPE_PAYLOAD_SIZE as d_len with it
before we rework zle compressor somehow.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9416
2024-09-19 17:24:51 -07:00
George Melikov
522f2629c8 zio_compress: introduce max size threshold
Now default compression is lz4, which can stop
compression process by itself on incompressible data.
If there are additional size checks -
we will only make our compressratio worse.

New usable compression thresholds are:
- less than BPE_PAYLOAD_SIZE (embedded_data feature);
- at least one saved sector.

Old 12.5% threshold is left to minimize affect
on existing user expectations of CPU utilization.

If data wasn't compressed - it will be saved as
ZIO_COMPRESS_OFF, so if we really need to recompress
data without ashift info and check anything -
we can just compress it with zero threshold.
So, we don't need a new feature flag here!

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9416
2024-09-19 17:23:58 -07:00
Rob Norris
e8ede2ba78 zfs_debug: specific variant for userspace
Just nice and simple, with room to grow.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
2024-09-19 15:49:50 -07:00
Rob Norris
c22d56e3ed zfs_znode: lift common code to a single shared file
For now, userspace has no znode implementation. Some of the property and
path handling code is used there though and is the same on all
platforms, so we only need a single copy of it.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
2024-09-19 15:49:45 -07:00
Rob Norris
8fc0beb66b arc_os: split userspace and Linux kernel code
The Linux arc_os.c carries userspace and kernel code, with very little
overlap between the two. This lifts the userspace parts out into a
separate arc_os.c for libzpool and removes it from the Linux side.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
2024-09-19 15:48:54 -07:00
Rob Norris
b7e43d6e7f linux/abd_os: remove kernel version check for compound page support
All kernels we support have compound pages that work the way we would
like. However, this code is new and this knowledge was hard won, so I'd
like to leave the description and option there for a little while, even
if it can only be disabled with a recompile.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16545
2024-09-19 15:45:05 -07:00
Rob Norris
a83762b3f4 linux: remove kernel version checks for unsupported kernels
Following 2b069768a (#16479), anything gated on a kernel version before
4.18 can be always included/excluded.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16545
2024-09-19 15:43:44 -07:00
Shengqi Chen
a877b39624 cityhash: replace invocations with specialized versions when possible
So that we can get actual benefit from last commit.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #16131
Closes #16483
2024-09-19 15:19:17 -07:00
Shengqi Chen
0ae4460c61 zcommon: add specialized versions of cityhash4
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callers invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

And each actual argument (even if passing 0) contributes evenly
to the number of multiplication instructions generated:

- x86: 9, 12, 15 ,18
- arm-v7a: 6, 8, 10, 12
- riscv / power: 12, 18, 20, 24
- mipsel: 9, 12, 15, 19

On 64-bit architectures, the tendencies are similar. But both stack
sizes and instruction counts are significantly smaller thus negligible.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #16131
Closes #16483
2024-09-19 15:18:59 -07:00
Shengqi Chen
1c35206124 dmu_objset: replace dnode_hash impl with cityhash4
As mentioned in PR #16131, replacing CRC-based hash with cityhash4
could slightly improve the performance by eliminating memory access.
Replacing algorightm is safe since the hash result is not persisted.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #16131
Closes #16483
2024-09-19 15:18:12 -07:00
Rob Norris
f245541e24 zfs_file: implement zfs_file_deallocate for FreeBSD 14
FreeBSD 14 gained a `VOP_DEALLOCATE` VFS operation and a `fspacectl`
syscall to use it. At minimum, these zero the given region, and if the
underlying filesystem supports it, can make the region sparse. We can
use this to get TRIM-like behaviour for file vdevs.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16496
2024-09-18 11:35:48 -07:00
Rob Norris
fa330646b9 zfs_file: rename zfs_file_fallocate to zfs_file_deallocate
We only use it on a specific way: to punch a hole in (make sparse) a
region of a file, in order to implement TRIM-like behaviour.

So, call the op "deallocate", and move the Linux-style mode flags down
into the Linux implementation, since they're an implementation detail.

FreeBSD gets a no-op stub (for the moment).

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16496
2024-09-18 11:35:04 -07:00
Rob Norris
5df65ca9c1 config: remove HAVE_GET_USER_PAGES_*
get_user_pages_unlocked() had stabilised by 4.9.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00
Rob Norris
c57d268a78 config: remove HAVE_HAS_CAPABILITY
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00
Rob Norris
a298801426 config: remove HAVE_BIO_SET_DEV
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00
Rob Norris
0a61e51736 config: rework ZFS_GENHD_FL_*
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00
Rob Norris
df9795f2d7 config: remove HAVE_GENERIC_READLINK
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00
Rob Norris
f6661d1153 linux/zvol_os: convert END_IO macro to inline function
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00
Rob Norris
dcb8e5ec7c config: remove HAVE_BLK_MQ
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:50 -07:00
Rob Norris
de10132c34 config: remove HAVE_TMPFILE
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:50 -07:00