Commit Graph

262 Commits

Author SHA1 Message Date
ednadolski-ix
3bd4df3841
Improve ZFS objset sync parallelism
As part of transaction group commit, dsl_pool_sync() sequentially calls
dsl_dataset_sync() for each dirty dataset, which subsequently calls
dmu_objset_sync().  dmu_objset_sync() in turn uses up to 75% of CPU
cores to run sync_dnodes_task() in taskq threads to sync the dirty
dnodes (files).

There are two problems:

1. Each ZVOL in a pool is a separate dataset/objset having a single
   dnode.  This means the objsets are synchronized serially, which
   leads to a bottleneck of ~330K blocks written per second per pool.

2. In the case of multiple dirty dnodes/files on a dataset/objset on a
   big system they will be sync'd in parallel taskq threads. However,
   it is inefficient to to use 75% of CPU cores of a big system to do
   that, because of (a) bottlenecks on a single write issue taskq, and
   (b) allocation throttling.  In addition, if not for the allocation
   throttling sorting write requests by bookmarks (logical address),
   writes for different files may reach space allocators interleaved,
   leading to unwanted fragmentation.

The solution to both problems is to always sync no more and (if
possible) no fewer dnodes at the same time than there are allocators
the pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Edmund Nadolski <edmund.nadolski@ixsystems.com>
Closes #15197
2023-11-06 10:38:42 -08:00
Alexander Motin
799e09f75a
Unify arc_prune_async() code
There is no sense to have separate implementations for FreeBSD and
Linux.  Make Linux code shared as more functional and just register
FreeBSD-specific prune callback with arc_add_prune_callback() API.

Aside of code cleanup this should fix excessive pruning on FreeBSD:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274698

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15456
2023-10-30 16:56:04 -07:00
Thomas Bertschinger
97a0b5be50
Add mutex_enter_interruptible() for interruptible sleeping IOCTLs
Many long-running ZFS ioctls lock the spa_namespace_lock, forcing
concurrent ioctls to sleep for the mutex. Previously, the only
option is to call mutex_enter() which sleeps uninterruptibly. This
is a usability issue for sysadmins, for example, if the admin runs
`zpool status` while a slow `zpool import` is ongoing, the admin's
shell will be locked in uninterruptible sleep for a long time.

This patch resolves this admin usability issue by introducing
mutex_enter_interruptible() which sleeps interruptibly while waiting
to acquire a lock. It is implemented for both Linux and FreeBSD.

The ZFS_IOC_POOL_CONFIGS ioctl, used by `zpool status`, is changed to
use this new macro so that the command can be interrupted if it is
issued during a concurrent `zpool import` (or other long-running
operation).

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Thomas Bertschinger <bertschinger@lanl.gov>
Closes #15360
2023-10-26 09:17:40 -07:00
Tony Hutter
7c9b6fed16 zvol: Remove broken blk-mq optimization
This fix removes a dubious optimization in zfs_uiomove_bvec_rq()
that saved the iterator contents of a rq_for_each_segment().  This
optimization allowed restoring the "saved state" from a previous
rq_for_each_segment() call on the same uio so that you wouldn't
need to iterate though each bvec on every zfs_uiomove_bvec_rq() call.
However, if the kernel is manipulating the requests/bios/bvecs under
the covers between zfs_uiomove_bvec_rq() calls, then it could result
in corruption from using the "saved state".  This optimization
results in an unbootable system after installing an OS on a zvol
with blk-mq enabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #15351
2023-10-24 14:37:52 -07:00
Alexander Motin
96b9cf42e0
ARC: Remove b_bufcnt/b_ebufcnt from ARC headers
In most cases we do not care about exact number of buffers linked
to the header, we just need to know if it is zero, non-zero or one.
That can easily be checked just looking on b_buf pointer or in some
cases derefencing it.

b_ebufcnt is read only once, and in that case we already traverse
the list as part of arc_buf_remove(), so second traverse should not
be expensive.

This reduces L1 ARC header size by 8 bytes and full crypto header by
16 bytes, down to 176 and 232 bytes on FreeBSD respectively.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15350
2023-10-06 08:56:17 -07:00
Chunwei Chen
249d759caf
Fix invalid pointer access in trace_dbuf.h
In dnode_destroy, dn_objset is invalidated. However, it will later call
into dbuf_destroy, in which DTRACE_SET_STATE will try to access spa_name
via dn_objset causing illegal pointer access.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15333
2023-10-02 16:58:01 -07:00
Coleman Kane
01d00dfa9e Linux 6.6 compat: generic_fillattr has a new u32 request_mask added at arg2
In commit 0d72b92883c651a11059d93335f33d65c6eb653b, a new u32 argument
for the request_mask was added to generic_fillattr. This is the same
request_mask for statx that's present in the most recent API implemented
by zpl_getattr_impl. This commit conditionally adds it to the
zpl_generic_fillattr(...) macro, as well as the zfs_getattr_fast(...)
implementation, when configure determines it's present in the kernel's
generic_fillattr(...).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15263
2023-09-21 18:38:40 -07:00
Coleman Kane
b37f29341b Linux 6.6 compat: use inode_get/set_ctime*(...)
In Linux commit 13bc24457850583a2e7203ded05b7209ab4bc5ef, direct access
to the i_ctime member of struct inode was removed. The new approach is
to use accessor methods that exclusively handle passing the timestamp
around by value. This change adds new tests for each of these functions
and introduces zpl_* equivalents in include/os/linux/zfs/sys/zpl.h. In
where the inode_get/set_ctime*() functions exist, these zpl_* calls will
be mapped to the new functions. On older kernels, these macros just wrap
direct-access calls. The code that operated on an address of ip->i_ctime
to call ZFS_TIME_DECODE() now will take a local copy using
zpl_inode_get_ctime(), and then pass the address of the local copy when
performing the ZFS_TIME_DECODE() call, in all cases, rather than
directly accessing the member.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15263
Closes #15257
2023-09-21 18:38:31 -07:00
Dag-Erling Smørgrav
506fe78c48 Add VERIFY0P() and ASSERT0P() macros.
These macros are similar to VERIFY0() and ASSERT0() but are intended
for pointers, and therefore use uintptr_t instead of int64_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org>
Closes #15225
2023-09-19 17:21:55 -07:00
Dag-Erling Smørgrav
01d9283af3 Clean up existing VERIFY*() macros.
Chiefly:

- Remove unnecessary parentheses around variable names.
- Remove spaces between the type and variable in casts.
- Make the panic message for VERIFY0() reflect how the macro is used.
- Use %p to format pointers, except in Linux kernel code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Dag-Erling Smørgrav <des@FreeBSD.org>
Closes #15225
2023-09-19 17:21:01 -07:00
Mateusz Guzik
ee720ad7bc
Retire z_nr_znodes
Added in ab26409db7 ("Linux 3.1 compat, super_block->s_shrink"), with
the only consumer which needed the count getting retired in 066e825221
("Linux compat: Minimum kernel version 3.10").

The counter gets in the way of not maintaining the list to begin with.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #15274
2023-09-18 16:53:33 -07:00
Rich Ercolani
277f2e587b
Avoid save/restoring AMX registers to avoid a SPR erratum
Intel SPR erratum SPR4 says that if you trip into a vmexit while
doing FPU save/restore, your AMX register state might misbehave...
and by misbehave, I mean save all zeroes incorrectly, leading to
explosions if you restore it.

Since we're not using AMX for anything, the simple way to avoid
this is to just not save/restore those when we do anything, since
we're killing preemption of any sort across our save/restores.

If we ever decide to use AMX, it's not clear that we have any
way to mitigate this, on Linux...but I am not an expert.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14989
Closes #15168
2023-08-26 11:25:46 -07:00
Ryan Lahfa
fdb8fff916
linux/spl/kmem_cache: undefine kmem_cache_alloc before defining it
When compiling a kernel with bcachefs and zfs,
the two macros will collide, making it impossible
to have both filesystems.

It is sufficient to just undefine the macro before calling it.

On why this should be in ZFS rather than bcachefs, currently,
bcachefs is not a in-tree filesystem, but,
it has a reasonably high chance of getting included soon.

This avoids the breakage in ZFS early,
this patch may be distributed downstream in NixOS
and is already used there.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Closes #15144
2023-08-07 13:55:59 -07:00
Coleman Kane
6751634d77 Linux 4.20 compat: wrapper function for iov_iter type access
An iov_iter_type() function to access the "type" member of the struct
iov_iter was added at one point. Move the conditional logic to decide
which method to use for accessing it into a macro and simplify the
zpl_uio_init code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15100
2023-08-01 08:42:33 -07:00
Coleman Kane
325505e5c4 Linux 6.4 compat: iter_iov() function now used to get old iov member
The iov_iter->iov member is now iov_iter->__iov and must be accessed via
the accessor function iter_iov(). Create a wrapper that is conditionally
compiled to use the access method appropriate for the target kernel
version.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15100
2023-08-01 08:42:26 -07:00
Coleman Kane
43e8f6e37f
Linux 6.5 compat: blkdev changes
Multiple changes to the blkdev API were introduced in Linux 6.5. This
includes passing (void* holder) to blkdev_put, adding a new
blk_holder_ops* arg to blkdev_get_by_path, adding a new blk_mode_t type
that replaces uses of fmode_t, and removing an argument from the release
handler on block_device_operations that we weren't using. The open
function definition has also changed to take gendisk* and blk_mode_t, so
update it accordingly, too.

Implement local wrappers for blkdev_get_by_path() and
vdev_blkdev_put() so that the in-line calls are cleaner, and place the
conditionally-compiled implementation details inside of both of these
local wrappers. Both calls are exclusively used within vdev_disk.c, at
this time.

Add blk_mode_is_open_write() to test FMODE_WRITE / BLK_OPEN_WRITE
The wrapper function is now used for testing using the appropriate
method for the kernel, whether the open mode is writable or not.

Emphasize fmode_t arg in zvol_release is not used

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15099
2023-08-01 08:37:20 -07:00
Coleman Kane
3b8e318b77
Linux 6.5 compat: use disk_check_media_change when it exists
When disk_check_media_change() exists, then define
zfs_check_media_change() to simply call disk_check_media_change() on
the bd_disk member of its argument. Since disk_check_media_change()
is newer than when revalidate_disk was present in bops, we should
be able to safely do this via a macro, instead of recreating a new
implementation of the inline function that forces revalidation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15101
2023-08-01 08:32:38 -07:00
Rob Norris
6b0a4be5fe linux: implement filesystem-side copy/clone functions for EL7
Redhat have backported copy_file_range and clone_file_range to the EL7
kernel using an "extended file operations" wrapper structure. This
connects all that up to let cloning work there too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050
2023-07-24 16:37:04 -07:00
Rob Norris
9927f219f1 linux: implement filesystem-side clone ioctls
Prior to Linux 4.5, the FICLONE etc ioctls were specific to BTRFS, and
were implemented as regular filesystem-specific ioctls. This implements
those ioctls directly in OpenZFS, allowing cloning to work on older
kernels.

There's no need to gate these behind version checks; on later kernels
Linux will simply never deliver these ioctls, instead calling the
approprate VFS op.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050
2023-07-24 16:36:54 -07:00
Rob Norris
5a35c68b67 linux: implement filesystem-side copy/clone functions
This implements the Linux VFS ops required to service the file
copy/clone APIs:

  .copy_file_range    (4.5+)
  .clone_file_range   (4.5-4.19)
  .dedupe_file_range  (4.5-4.19)
  .remap_file_range   (4.20+)

Note that dedupe_file_range() and remap_file_range(REMAP_FILE_DEDUP) are
hooked up here, but are not implemented yet.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes #15050
2023-07-24 16:36:38 -07:00
Coleman Kane
74f8ce4ca5
Linux 6.5 compat: disk_check_media_change() was added
The disk_check_media_change() function was added which replaces
bdev_check_media_change.  This change was introduced in 6.5rc1
444aa2c58cb3b6cfe3b7cc7db6c294d73393a894 and the new function takes a
gendisk* as its argument, no longer a block_device*. Thus, bdev->bd_disk
is now used to pass the expected data.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15060
2023-07-20 09:09:25 -07:00
Coleman Kane
d3d63cac4d
Linux 6.5 compat: BLK_STS_NEXUS renamed to BLK_STS_RESV_CONFLICT
This change was introduced in Linux commit
7ba150834b840f6f5cdd07ca69a4ccf39df59a66

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15059
2023-07-14 16:33:51 -07:00
Coleman Kane
3a3e0d6fbc
intptr_t definition is canonically signed
Make the version here match that elsewhere in the kernel and system
headers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #15058
2023-07-14 16:32:49 -07:00
Rich Ercolani
35a6247c5f
Add a delay to tearing down threads.
It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...

I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...

So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938
2023-06-26 13:57:12 -07:00
Alexander Motin
70ea484e3e
Finally drop long disabled vdev cache.
It was a vdev level read cache, designed to aggregate many small
reads by speculatively issuing bigger reads instead and caching
the result.  But since it has almost no idea about what is going
on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers,
it was found to make more harm than good, for which reason it was
disabled for the past 12 years.  These days we have much better
instruments to enlarge the I/Os, such as speculative and prescient
prefetches, I/O scheduler, I/O aggregation etc.

Besides just the dead code removal this removes one extra mutex
lock/unlock per write inside vdev_cache_write(), not otherwise
disabled and trying to do some work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14953
2023-06-09 12:40:55 -07:00
Richard Yao
0f03a41161
Use __attribute__((malloc)) on memory allocation functions
This informs the C compiler that pointers returned from these functions
do not alias other functions, which allows it to do better code
optimization and should make the compiled code smaller.

References:
https://stackoverflow.com/a/53654773
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute
https://clang.llvm.org/docs/AttributeReference.html#malloc

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14827
2023-05-26 15:47:52 -07:00
Alexander Motin
b6fbe61fa6
zil: Add some more statistics.
In addition to a number of actual log bytes written, account also a
total written bytes including padding and total allocated bytes (bytes
<= write <= alloc).  It should allow to monitor zil traffic and space
efficiency.

Add dtrace probe for zil block size selection.

Make zilstat report more information and fit it into less width.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14863
2023-05-25 13:51:53 -07:00
Brian Behlendorf
012829df0c
Wrap clang specific pragma
Clang specific pragmas need to be wrapped to prevent a build
warning when compiling with gcc.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14814
2023-05-02 09:21:47 -07:00
Han Gao
6d59d5df98
Add loongarch64 support
Add loongarch64 definitions & lua module setjmp asm

LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Han Gao <gaohan@uniontech.com>
Signed-off-by: WANG Xuerui <xen0n@gentoo.org>
Closes #13422
2023-04-25 16:05:45 -07:00
Richard Yao
135d9a9048 Linux: Suppress -Wordered-compare-function-pointers in tracepoint code
Clang points out that there is a comparison against -1, but we cannot
fix it because that is from the kernel headers, which we must support.
We can workaround this by using a pragma.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738
2023-04-20 10:31:01 -07:00
youzhongyang
d4dc53dad2
Linux 6.3 compat: idmapped mount API changes
Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap 
type for mounts (struct mnt_idmap), we need to detect these changes 
and make zfs work with the new APIs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14682
2023-04-10 14:15:36 -07:00
youzhongyang
c5431f1465
linux 6.3 compat: needs REQ_PREFLUSH | REQ_OP_WRITE
Modify bio_set_flush() so if kernel version is >= 4.10, flags 
REQ_PREFLUSH and REQ_OP_WRITE are set together.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14695
2023-03-31 09:46:22 -07:00
Rich Ercolani
ae0b1f66c7
linux 6.3 compat: add another bdev_io_acct case
Linux 6.3+, and backports from it (6.2.8+), changed the
signatures on bdev_io_{start,end}_acct.  Add a case for it.  

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14658
Closes #14668
2023-03-27 11:29:19 -07:00
Attila Fülöp
5f3611121d
spl: cmn_err_once() should be usable in brace-less if else statements
Commit 11913870 (#14567) added cmn_err_once() by #define'ing a
compound statement but failed to consider usage in a single
statement brace-less if else.

Fix the problem by using the common "do {} while (0)" construct.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14629
2023-03-15 11:13:25 -07:00
WHR
d74f123045 Refactor CONFIG_SPE check on Linux/powerpc
Commit 5401472 adds a check to call enable_kernel_spe and
disable_kernel_spe only if CONFIG_SPE is defined. Refactor this check
in a way similar to what CONFIG_ALTIVEC and CONFIG_VSX are checked, in
order to remove redundant kfpu_begin() and kfpu_end() implementations.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #14623
2023-03-15 10:30:42 -07:00
WHR
f31b0d4e88 Fix missing semicolons in commit 1f196e3
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #14623
2023-03-15 10:30:02 -07:00
Richard Yao
47b994049f Silence clang static analyzer warnings about stored stack addresses
Clang's static analyzer complains that nvs_xdr() and nvs_native()
functions return pointers to stack memory. That is technically true, but
the pointers are stored in stack memory from the caller's stack frame,
are not read by the caller and are deallocated when the caller returns,
so this is harmless. We set the pointers to NULL to silence the
warnings.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14612
2023-03-14 15:25:01 -07:00
Pawel Jakub Dawidek
67a1b03791
Implementation of block cloning for ZFS
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).

The whole design of block cloning is documented in module/zfs/brt.c.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #13392
2023-03-10 11:59:53 -08:00
Low-power
589f59b52a
Workaround for Linux PowerPC GPL-only cpu_has_feature()
Linux since 4.7 makes interface 'cpu_has_feature' to use jump labels on
powerpc if CONFIG_JUMP_LABEL_FEATURE_CHECKS is enabled, in this case
however the inline function references GPL-only symbol
'cpu_feature_keys'.

ZFS currently uses 'cpu_has_feature' either directly or indirectly from
several places; while it is unknown how this issue didn't break ZFS on
64-bit little-endian powerpc, it is known to break ZFS with many Linux
versions on both 32-bit and 64-bit big-endian powerpc.

Until this issue is fixed in Linux, we have to workaround it by
overriding affected inline functions without depending on
'cpu_feature_keys'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #14590
2023-03-10 09:35:00 -08:00
Low-power
1f196e3107
Fix build for Linux/powerpc without CONFIG_ALTIVEC or CONFIG_VSX
This fixes building ZFS for Linux 4.7+ powerpc* architecture, where 
Linux was configured without CONFIG_ALTIVEC or CONFIG_VSX.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #14591
2023-03-07 14:06:52 -08:00
Attila Fülöp
1191387012
spl: Add cmn_err_once() to log a message only on the first call
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14567
2023-03-07 13:44:11 -08:00
Tino Reichardt
cef1253135 Add SHA2 SIMD feature tests for Linux
These are added:
- zfs_neon_available() for arm and aarch64
- zfs_sha256_available() for arm and aarch64
- zfs_sha512_available() for aarch64
- zfs_shani_available() for x86_64

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-Authored-By: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #13741
2023-03-02 13:52:04 -08:00
Tino Reichardt
3e254aaad0 Remove old or redundant SHA2 files
We had three sha2.h headers in different places.
The FreeBSD version, the Linux version and the generic solaris version.

The only assembly used for acceleration was some old x86-64 openssl
implementation for sha256 within the icp module.

For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS,
these files got removed also.

Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13741
2023-03-02 13:50:21 -08:00
Richard Yao
2f76797ad9
Linux: Assert mutex is held in mutex_exit()
A spurious mutex_exit() in a development branch caused weird issues
until I identified it. An assertion prior to mutex_exit() would have
caught it. Rather than adding assertions before invocations of
mutex_exit() in the code, let us simply add an assertion to
mutex_exit(). It is cheap and will likely improve developer
productivity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Closes #14541
2023-02-28 17:27:20 -08:00
Brian Behlendorf
89cd2197b9
Fix buffered/direct/mmap I/O race
When a page is faulted in for memory mapped I/O the page lock
may be dropped before it has been read and marked up to date.
If a buffered read encounters such a page in mappedread() it
must wait until the page has been updated. Failure to do so
will result in a panic on debug builds and incorrect data on
production builds.

The critical part of this change is in mappedread() where pages
which are not up to date are now handled. Additionally, it
includes the following simplifications.

- zfs_getpage() and zfs_fillpage() could be passed an array of
  pages. This could be more efficient if it was used but in
  practice only a single page was ever provided. These
  interfaces were simplified to acknowledge that.

- update_pages() was modified to correctly set the PG_error bit
  on a page when it cannot be read by dmu_read().

- Setting PG_error and PG_uptodate was moved to zfs_fillpage()
  from zpl_readpage_common(). This is consistent with the
  handling in update_pages() and mappedread().

- Minor additional refactoring to comments and variable
  declarations to improve readability.

- Add a test case to exercise concurrent buffered, direct,
  and mmap IO to the same file.

- Reduce the mmap_sync test case default run time.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13608 
Closes #14498
2023-02-23 10:57:24 -08:00
Brian Behlendorf
3fc92adc40
Linux: use filemap_range_has_page()
As of the 4.13 kernel filemap_range_has_page() can be used to
check if there is a page mapped in a given file range.  When
available this interface should be used which eliminates the
need for the zp->z_is_mapped boolean.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14493
2023-02-14 11:04:34 -08:00
Coleman Kane
9cd71c8604
linux 6.2 compat: zpl_set_acl arg2 is now struct dentry
Linux 6.2 changes the second argument of the set_acl operation to be a
"struct dentry *" rather than a "struct inode *". The inode* parameter
is still available as dentry->d_inode, so adjust the call to the _impl
function call to dereference and pass that pointer to it.

Also document that the get_acl -> get_inode_acl member name change from
commit 884a693 was an API change also introduced in Linux 6.2.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #14415
2023-01-24 11:20:50 -08:00
Attila Fülöp
037e4f2536 x86 asm: Replace .align with .balign
The .align directive used to align storage locations is
ambiguous. On some platforms and assemblers it takes a byte count,
on others the argument is interpreted as a shift value. The current
usage expects the first interpretation.

Replace it with the unambiguous .balign directive which always
expects a byte count, regardless of platform and assembler.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #14422
2023-01-24 09:04:39 -08:00
Jorgen Lundman
68c0771cc9
Unify Assembler files between Linux and Windows
Add new macro ASMABI used by Windows to change
calling API to "sysv_abi".

Reviewed-by: Attila Fülöp <attila@fueloep.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #14228
2023-01-17 11:09:19 -08:00
Richard Yao
2e7f664f04
Cleanup of dead code suggested by Clang Static Analyzer (#14380)
I recently gained the ability to run Clang's static analyzer on the
linux kernel modules via a few hacks. This extended coverage to code
that was previously missed since Clang's static analyzer only looked at
code that we built in userspace. Running it against the Linux kernel
modules built from my local branch produced a total of 72 reports
against my local branch. Of those, 50 were reports of logic errors and
22 were reports of dead code. Since we already had cleaned up all of
the previous dead code reports, I felt it would be a good next step to
clean up these dead code reports. Clang did a further breakdown of the
dead code reports into:

Dead assignment	15

Dead increment	2

Dead nested assignment	5

The benefit of cleaning these up, especially in the case of dead nested
assignment, is that they can expose places where our error handling is
incorrect. A number of them were fairly straight forward. However
several were not:

In vdev_disk_physio_completion(), not only were we not using the return
value from the static function vdev_disk_dio_put(), but nothing used it,
so I changed it to return void and removed the existing (void) cast in
the other area where we call it in addition to no longer storing it to a
stack value.

In FSE_createDTable(), the function is dead code. Its helper function
FSE_freeDTable() is also dead code, as are the CPP definitions in
`module/zstd/include/zstd_compat_wrapper.h`. We just delete it all.

In zfs_zevent_wait(), we have an optimization opportunity. cv_wait_sig()
returns 0 if there are waiting signals and 1 if there are none. The
Linux SPL version literally returns `signal_pending(current) ? 0 : 1)`
and FreeBSD implements the same semantics, we can just do
`!cv_wait_sig()` in place of `signal_pending(current)` to avoid
unnecessarily calling it again.

zfs_setattr() on FreeBSD version did not have error handling issue
because the code was removed entirely from FreeBSD version. The error is
from updating the attribute directory's files. After some thought, I
decided to propapage errors on it to userspace.

In zfs_secpolicy_tmp_snapshot(), we ignore a lack of permission from the
first check in favor of checking three other permissions. I assume this
is intentional.

In zfs_create_fs(), the return value of zap_update() was not checked
despite setting an important version number. I see no backward
compatibility reason to permit failures, so we add an assertion to catch
failures. Interestingly, Linux is still using ASSERT(error == 0) from
OpenSolaris while FreeBSD has switched to the improved ASSERT0(error)
from illumos, although illumos has yet to adopt it here. ASSERT(error ==
0) was used on Linux while ASSERT0(error) was used on FreeBSD since the
entire file needs conversion and that should be the subject of
another patch.

dnode_move()'s issue was caused by us not having implemented
POINTER_IS_VALID() on Linux. We have a stub in
`include/os/linux/spl/sys/kmem_cache.h` for it, when it really should be
in `include/os/linux/spl/sys/kmem.h` to be consistent with
Illumos/OpenSolaris. FreeBSD put both `POINTER_IS_VALID()` and
`POINTER_INVALIDATE()` in `include/os/freebsd/spl/sys/kmem.h`, so we
copy what it did.

Whenever a report was in platform-specific code, I checked the FreeBSD
version to see if it also applied to FreeBSD, but it was only relevant a
few times.

Lastly, the patch that enabled Clang's static analyzer to be run on the
Linux kernel modules needs more work before it can be put into a PR. I
plan to do that in the future as part of the on-going static analysis
work that I am doing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14380
2023-01-17 09:57:12 -08:00