Commit Graph

407 Commits

Author SHA1 Message Date
Rob Norris
6a28491f8e config: remove HAVE_3ARGS_BDI_SETUP_AND_REGISTER
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:50 -07:00
Rob Norris
d60d4ad809 config: remove HAVE_SET_CACHED_ACL_USABLE
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:50 -07:00
Rob Norris
6840e3b18b config: remove HAVE_SET_ACL
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:50 -07:00
Rob Norris
3d37b1d6d4 config: remove HAVE_POSIX_ACL_RELEASE and HAVE_POSIX_ACL_RELEASE_GPL_ONLY
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:50 -07:00
Rob Norris
67b0c883df config: remove HAVE___POSIX_ACL_CHMOD
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:49 -07:00
Rob Norris
583e2e25b9 config: remove HAVE_PERCPU_COUNTER_INIT_WITH_GFP
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:49 -07:00
Rob Norris
1b522c4583 config: remove HAVE_KERNEL_TIMER_LIST_FLAGS
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:49 -07:00
Rob Norris
eb230c789a config: remove HAVE_KERNEL_STRSCPY
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:49 -07:00
Rob Norris
6a1d8a9cf0 config: remove HAVE_FILE_INODE
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:49 -07:00
Rob Norris
257e40f9d9 config: remove HAVE_FILE_DENTRY
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:49 -07:00
Rob Norris
dc6af4a4b5 config: remove HAVE_D_MAKE_ROOT
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:49 -07:00
Rob Norris
efc293e371 config: remove HAVE_DENTRY_D_U_ALIASES
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:48 -07:00
Rob Norris
147c82bd5e config: remove HAVE_CLEAR_INODE and HAVE_EVICT_INODE
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:48 -07:00
Rob Norris
609559e5b9 config: remove HAVE_BIO_BVEC_ITER
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:45 -07:00
Rob Norris
233bed67a8 config: remove HAVE_1ARG_BIO_END_IO_T
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:40 -07:00
Pavel Snajdr
90af1e83e8
Linux 6.10 compat: Fix tracepoints definitions
__string field definition includes the source variable for a value
of the string when the TP hits; in 6.10+ kernels, __assign_str()
uses that to copy a value from src to the string, with older
kernels, __assign_str still accepted src as a second parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Tony Hutter <hutter2@llnl.gov>
Closes #16475 
Closes #16515
2024-09-17 13:38:02 -07:00
Brian Atkinson
a10e552b99
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated  TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).

For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.

For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.

For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.

Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
          any data in the user buffers and written directly down to the
	  VDEV disks is guaranteed to not change. There is no concern
	  with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
        protection. Because of this, if the user decides to manipulate
	the page contents while the write operation is occurring, data
	integrity can not be guaranteed. However, there is a module
	parameter `zfs_vdev_direct_write_verify` that controls the
	if a O_DIRECT writes that can occur to a top-level VDEV before
	a checksum verify is run before the contents of the I/O buffer
        are committed to disk. In the event of a checksum verification
	failure the write will return EIO. The number of O_DIRECT write
	checksum verification errors can be observed by doing
	`zpool status -d`, which will list all verification errors that
	have occurred on a top-level VDEV. Along with `zpool status`, a
	ZED event will be issues as `dio_verify` when a checksum
	verification error occurs.

ZVOLs and dedup is not currently supported with Direct I/O.

A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
	   the request as a buffered IO request.
standard - Follows the alignment restrictions  outlined above for
	   write/read IO requests when the O_DIRECT flag is used.
always   - Treats every write/read IO request as though it passed
           O_DIRECT and will do O_DIRECT if the alignment restrictions
	   are met otherwise will redirect through the ARC. This
	   property will not allow a request to fail.

There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the  direct dataset property is set to
disabled.

Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 13:47:59 -07:00
Rob Norris
5b9e695392 abd_os: break out platform-specific header parts
Removing the platform #ifdefs from shared headers in favour of
per-platform headers. Makes abd_t much leaner, among other things.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:37:18 -07:00
Rob Norris
db40fe4cf6 spl-taskq: per-taskq kstats
This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq,
one file per taskq, named for the taskq name.instance.

These include a small amount of info about the taskq config, the current
state of the threads and queues, and various counters for thread and
queue activity since the taskq was created.

To assist with decrementing queue size counters, the list an entry is on
is encoded in spare bits in the entry flags.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes #16171
2024-08-19 09:50:35 -07:00
Rob Norris
3abffc8781 Linux 6.11: add compat macro for page_mapping()
Since the change to folios it has just been a wrapper anyway. Linux has
removed their wrapper, so we add one.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400
2024-08-13 17:47:18 -07:00
Rob Norris
7e98d30f46 Linux 6.11: get backing_dev_info through queue gendisk
It's no longer available directly on the request queue, but its easy to
get from the attached disk.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400
2024-08-13 17:46:49 -07:00
Rob Norris
e95b732e49 Linux 6.11: enable queue flush through queue limits
In 6.11 struct queue_limits gains a 'features' field, where, among other
things, flush and write-cache are enabled. Detect it and use it.

Along the way, the blk_queue_set_write_cache() compat wrapper gets a
little cleanup. Since both flags are alway set together, its now a
single bool. Also the very very ancient version that sets q->flush_flags
directly couldn't actually turn it off, so I've fixed that. Not that we
use it, but still.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16400
2024-08-13 17:46:41 -07:00
Rob Norris
6c82951d11
FreeBSD: remove support for FreeBSD < 13.0-RELEASE (#16372)
This includes the last 12.x release (now EOL) and 13.0 development
versions (<1300139).

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-08-05 16:56:45 -07:00
Alexander Motin
d4b5517ef9
Linux: Report reclaimable memory to kernel as such (#16385)
Linux provides SLAB_RECLAIM_ACCOUNT and __GFP_RECLAIMABLE flags to
mark memory allocations that can be freed via shinker calls.  It
should allow kernel to tune and group such allocations for lower
memory fragmentation and better reclamation under pressure.

This patch marks as reclaimable most of ARC memory, directly
evictable via ZFS shrinker, plus also dnode/znode/sa memory,
indirectly evictable via kernel's superblock shrinker.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
2024-07-30 11:40:47 -07:00
Jason Lee
41902c8e6d
Use kmap_local_page instead of kmap_atomic (#16329)
Changed zfs_k(un)map_atomic to zfs_k(un)map_local

Signed-off-by: Jason Lee <jasonlee@lanl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
2024-07-16 17:27:29 -07:00
Rob Norris
e951dba48a Linux 6.10: work harder to avoid kmem_cache_alloc reuse
Linux 6.10 change kmem_cache_alloc to be a macro, rather than a
function, such that the old #undef for it in spl-kmem-cache.c would
remove its definition completely, breaking the build.

This inverts the model used before. Rather than always defining the
kmem_cache_* macro, then undefining then inside spl-kmem-cache.c,
instead we make a special tag to indicate we're currently inside
spl-kmem-cache.c, and not defining those in macros in the first place,
so we can use the kernel-supplied kmem_cache_* functions to implement
spl_kmem_cache_*, as we expect.

For all other callers, we create the macros as normal and remove access
to the kernel's own conflicting names.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2024-07-15 17:10:02 -07:00
Daniel Berlin
f7d8b13336
Fix missing semicolon in trace_dbuf.h (#16281)
On fedora 40, on the 6.9.4 kernel (in updates-testing), assign_str
expands to a "do {<stuff> } while(0)" loop.  Without this semicolon,
the while(0) is unterminated, causing a cascade of useless errors.
With this semicolon, it compiles fine.  It also compiles fine on 6.8.11
(the previous kernel).  I have not tested earlier kernels than that, but
at worst it should add a pointless semicolon.

All other instances in the source tree are already terminated with
semicolons.

Signed-off-by: Daniel Berlin <dberlin@dberlin.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-07-12 17:44:10 -07:00
Mark Johnston
f72e081fbf
FreeBSD: Use a statement expression to implement SET_ERROR() (#16284)
This way we can avoid making assumptions about the SDT probe
implementation.  No functional change intended.

Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-07-08 17:59:08 -07:00
Zhenlei Huang
e2357561b9
FreeBSD: Add const qualifier to members of struct opensolaris_utsname
These members have directly references to the global variables
exposed by the kernel. They are not going to be changed by this
kernel module.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Zhenlei Huang <zlei@FreeBSD.org>
Closes #16210
2024-05-30 09:58:20 -07:00
Pawel Jakub Dawidek
01c8efdd59
Simplify issig().
We always call it twice with JUSTLOOKING and then FORREAL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #16225
2024-05-29 10:49:11 -07:00
Rob N
708be0f415
Linux 6.7 compat: detect if kernel defines intptr_t
Since Linux 6.7 the kernel has defined intptr_t. Clang has
-Wtypedef-redefinition by default, which causes the build to fail
because we also have a typedef for intptr_t.

Since its better to use the kernel's if it exists, detect it and skip
our own.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16201
2024-05-24 18:54:24 -07:00
chenqiuhao1997
41ae864b69
Replace P2ALIGN with P2ALIGN_TYPED and delete P2ALIGN.
In P2ALIGN, the result would be incorrect when align is unsigned
integer and x is larger than max value of the type of align.
In that case, -(align) would be a positive integer, which means
high bits would be zero and finally stay zero after '&' when
align is converted to a larger integer type.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Qiuhao Chen <chenqiuhao1997@gmail.com>
Closes #15940
2024-05-10 08:47:21 -07:00
Rob N
8f1b7a6fa6
vdev_disk: disable flushes if device does not support it
If the underlying device doesn't have a write-back cache, the kernel
will just return a successful response. This doesn't hurt anything, but
it's extra work on the IO taskqs that are unnecessary. So, detect this
when we open the device for the first time.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16148
2024-05-02 15:18:35 -07:00
Rob N
b181b2e604
bdev_discard_supported: understand discard_granularity=0
Kernel documentation for the discard_granularity property says:

    A discard_granularity of 0 means that the device does not support
    discard functionality.

Some older kernels had drivers (notably loop, but also some USB-SATA
adapters) that would set the QUEUE_FLAG_DISCARD capability flag, but
have discard_granularity=0. Since 5.10 (torvalds/linux@b35fd7422c) the
discard entry point blkdev_issue_discard() has had a check for this,
which would immediately reject the call with EOPNOTSUPP, and throw a
scary diagnostic message into the log. See #16068.

Since 6.8, the block layer sets a non-zero default for
discard_granularity (torvalds/linux@3c407dc723), and a future kernel
will remove the check entirely[1].

As such, there's no good reason for us to enable discard when
discard_granularity=0. The kernel will never let the request go in
anyway; better that we just disable it so we can report it properly to
the user.

1. https://patchwork.kernel.org/project/linux-block/patch/20240312144826.1045212-2-hch@lst.de/

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16068
Closes #16082
2024-04-12 09:00:20 -07:00
Rob Norris
b613709c46 dkio: remove kernel dkio.h compatibility header
Without DKIOCFLUSHWRITECACHE, we no longer need the compat header. Note
that we're keeping the userspace SPL compat header, which is used by
libefi.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16064
2024-04-11 17:17:18 -07:00
Rob Norris
c9c838aa1f zio: remove io_cmd and DKIOCFLUSHWRITECACHE
There's no other options, so we can just always assume its a flush.

Includes some light refactoring where a switch statement was doing
control flow that no longer works.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16064
2024-04-11 17:17:11 -07:00
Rich Ercolani
e5e2a5a3b8
Add custom debug printing for your asserts
Being able to print custom debug information on assert trip
seems useful.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #15792
2024-04-10 13:30:25 -07:00
Rob N
a9a4290173
xdr: header cleanup
#16047 notes that include/os/freebsd/spl/rpc/xdr.h carried an
(apparently) incompatible license. While looking into it, it seems that
this file is actually unnecessary these days - FreeBSD's kernel XDR has
XDR_CONTROL, xdrmem_control and XDR_GET_BYTES_AVAIL, while userspace has
XDR_CONTROL and xdrmem_control, and our implementation of
XDR_GET_BYTES_AVAIL for libspl works nicely with it. So this removes
that file outright.

To keep the includes in nvpair.c tidy, I've made a few small adjustments
to the Linux headers. By definition, rpc/types.h provides bool_t and is
included before rpc/xdr.h, so I've created rpc/types.h for Linux. This
isn't necessary for userspace; both FreeBSD native and tirpc on Linux
already have these headers set up correctly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16047 
Closes #16051
2024-04-03 15:13:27 -07:00
Rob Norris
06a196020e vdev_disk: rewrite BIO filling machinery to avoid split pages
This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.

The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`, and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.

Leaving aside the fact that page count method is wrong (see below), not
limiting to the device segment max means that the device driver will
need to split the BIO in half. This is alone is not necessarily a
problem, but it interacts with another issue to cause a much larger
problem.

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
In can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This is also
further contributes to the problem caused by the incorrect segment limit
calculation, as its much easier to go past the device limit, and so
require a split.

Again, this is not a problem on its own.

The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.

In practice, this is fine, as most ABDs are scattered, that is a list of
complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:

     -- 4K -- -- 4K -- -- 4K -- -- 4K --
    |        |        |        |        |
          :## ######## ######## ######:    [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

    [1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).

---

In ideal circumstances, these shortcomings don't cause any particular
problems. Its when they start to interact with other ZFS features that
things get interesting.

Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.

Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:

    [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up either having an
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.

If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.

Each gang header is a separate 512-byte memory allocation from a slab,
that needs to be written down to disk. When the gang header is added to
the BIO, its a single 512-byte segment.

Pulling all this together, consider a large aggregated write of gang
blocks. This results a BIO containing lots of 512-byte segments. Given
our tendency to overfill the BIO, a split is likely, and most possible
split points will yield a pair of BIOs that are misaligned. Drivers that
care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.

`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.

We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point with creating a misaligned BIO.  If the
pages in the BIO are not usable (as in any of the above situations), the
ABD is linearised, and then checked again. This is the same technique
used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size
and allocator quirks.

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
2024-03-25 16:51:14 -07:00
Rob Norris
df04efe321 linux 5.4 compat: page_size()
Before 5.4 we have to do a little math.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
2024-03-25 16:48:15 -07:00
Rob N
90ff732358
freebsd: fix missing headers in distribution tarball
arc_os.h and freebsd_event.h aren't included in release tarballs, so the
build fails on FreeBSD. This fixes it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #15963
2024-03-20 10:08:50 -07:00
Alexander Motin
e0bd8118d0
Linux: Cleanup taskq threads spawn/exit
This changes taskq_thread_should_stop() to limit maximum exit rate
for idle threads to one per 5 seconds.  I believe the previous one
was broken, not allowing any thread exits for tasks arriving more
than one at a time and so completing while others are running.

Also while there:
 - Remove taskq_thread_spawn() calls on task allocation errors.
 - Remove extra taskq_thread_should_stop() call.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15873
2024-02-13 11:15:16 -08:00
Brian Behlendorf
6dccdf501e
BRT: Fix FICLONE/FICLONERANGE shortened copy
On Linux the ioctl_ficlonerange() and ioctl_ficlone() system calls
are expected to either fully clone the specified range or return an
error.  The range may be for an entire file.  While internally ZFS
supports cloning partial ranges there's no way to return the length
cloned to the caller so we need to make this all or nothing.

As part of this change support for the REMAP_FILE_CAN_SHORTEN flag
has been added.  When REMAP_FILE_CAN_SHORTEN is set zfs_clone_range()
will return a shortened range when encountering pending dirty records.
When it's clear zfs_clone_range() will block and wait for the records
to be written out allowing the blocks to be cloned.

Furthermore, the file range lock is held over the region being cloned
to prevent it from being modified while cloning.  This doesn't quite
provide an atomic semantics since if an error is encountered only a
portion of the range may be cloned.  This will be converted to an
error if REMAP_FILE_CAN_SHORTEN was not provided and returned to the
caller.  However, the destination file range is left in an undefined
state.

A test case has been added which exercises this functionality by
verifying that `cp --reflink=never|auto|always` works correctly.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #15728
Closes #15842
2024-02-05 16:44:45 -08:00
Rob Norris
2e6b3c4d94 Linux 6.8 compat: handle mnt_idmap user_namespace change
struct mnt_idmap no longer has a struct user_namespace within it. Work
around this by creating a temporary with the copy of the map we need
taken from the idmap.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
2024-01-29 11:36:07 -08:00
Rob Norris
84980ee0e6 Linux 6.8 compat: implement strlcpy fallback
Linux has removed strlcpy in favour of strscpy. This implements a
fallback implementation of strlcpy for this case.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15805
2024-01-29 11:36:07 -08:00
MigeljanImeri
78e8c1f844
Remove list_size struct member from list implementation
Removed the list_size struct member as it was only used in a single
assertion, as mentioned in PR #15478.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: MigeljanImeri <imerimigel@gmail.com>
Closes #15812
2024-01-26 14:46:42 -08:00
Alexander Motin
9e0d12e310
ZIL: Update Linux tracing after #15635
While picking parts from #14909 I've missed Linux tracing specific
ones, that went unnoticed in default configurations, but breaks the
build in some.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15730
2024-01-08 16:49:39 -08:00
Shengqi Chen
7b89149c9f
Linux 6.2 compat: add check for kernel_neon_* availability
This patch adds check for `kernel_neon_*` symbols on arm and arm64
platforms to address the following issues:

1. Linux 6.2+ on arm64 has exported them with `EXPORT_SYMBOL_GPL`, so
   license compatibility must be checked before use.
2. On both arm and arm64, the definitions of these symbols are guarded
   by `CONFIG_KERNEL_MODE_NEON`, but their declarations are still
   present. Checking in configuration phase only leads to MODPOST
   errors (undefined references).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #15711 
Closes #14555 
Closes: #15401
2024-01-08 16:05:24 -08:00
Rob N
6930ecbb75
spa: make read/write queues configurable
We are finding that as customers get larger and faster machines
(hundreds of cores, large NVMe-backed pools) they keep hitting
relatively low performance ceilings. Our profiling work almost always
finds that they're running into bottlenecks on the SPA IO taskqs.
Unfortunately there's often little we can advise at that point, because
there's very few ways to change behaviour without patching.

This commit adds two load-time parameters `zio_taskq_read` and
`zio_taskq_write` that can configure the READ and WRITE IO taskqs
directly.

This achieves two goals: it gives operators (and those that support
them) a way to tune things without requiring a custom build of OpenZFS,
which is often not possible, and it lets us easily try different config
variations in a variety of environments to inform the development of
better defaults for these kind of systems.

Because tuning the IO taskqs really requires a fairly deep understanding
of how IO in ZFS works, and generally isn't needed without a pretty
serious workload and an ability to identify bottlenecks, only minimal
documentation is provided. Its expected that anyone using this is going
to have the source code there as well.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #15675
2023-12-20 14:17:14 -08:00
Rob Norris
957dc1037a Linux 6.7 compat: rework shrinker setup for heap allocations
6.7 changes the shrinker API such that shrinkers must be allocated
dynamically by the kernel. To accomodate this, this commit reworks
spl_register_shrinker() to do something similar against earlier kernels.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://github.com/sponsors/robn
Closes #15681
2023-12-20 11:47:55 -08:00