Commit Graph

319 Commits

Author SHA1 Message Date
Alan Somers
e197bb24f1
Optionally skip zil_close during zvol_create_minor_impl
If there were no zil entries to replay, skip zil_close.  zil_close waits
for a transaction to sync.  That can take several seconds, for example
during pool import of a resilvering pool.  Skipping zil_close can cut
the time for "zpool import" from 2 hours to 45 seconds on a resilvering
pool with a thousand zvols.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Sponsored-by: Axcient
Closes #13999 
Closes #14015
2022-11-08 12:38:08 -08:00
youzhongyang
f224eddf92
Support idmapped mount in user namespace
Linux 5.17 commit torvalds/linux@5dfbfe71e enables "the idmapping 
infrastructure to support idmapped mounts of filesystems mounted 
with an idmapping". Update the OpenZFS accordingly to improve the 
idmapped mount support. 

This pull request contains the following changes:

- xattr setter functions are fixed to take mnt_ns argument. Without
  this, cp -p would fail for an idmapped mount in a user namespace.
- idmap_util is enhanced/fixed for its use in a user ns context.
- One test case added to test idmapped mount in a user ns.

Reviewed-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14097
2022-11-08 10:28:56 -08:00
Serapheim Dimitropoulos
f66ffe6878
Expose zfs_vdev_open_timeout_ms as a tunable
Some of our customers have been occasionally hitting zfs import failures
in Linux because udevd doesn't create the by-id symbolic links in time
for zpool import to use them. The main issue is that the
systemd-udev-settle.service that zfs-import-cache.service and other
services depend on is racy. There is also an openzfs issue filed (see
https://github.com/openzfs/zfs/issues/10891) outlining the problem and
potential solutions.

With the proper solutions being significant in terms of complexity and
the priority of the issue being low for the time being, this patch
exposes `zfs_vdev_open_timeout_ms` as a tunable so people that are
experiencing this issue often can increase it as a workaround.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14133
2022-11-03 15:02:46 -07:00
Richard Yao
11e3416ae7
Cleanup: Remove branches that always evaluate the same way
Coverity reported that the ASSERT in taskq_create() is always true and
the `*offp > MAXOFFSET_T` check in zfs_file_seek() is always false.

We delete them as cleanup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14130
2022-11-03 10:47:48 -07:00
Brooks Davis
d96303cb07 acl: use uintptr_t for ace walker cookies
Avoid assuming that a pointer can fit in a uint64_t and use uintptr_t
instead.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes #14131
2022-11-03 09:51:34 -07:00
Richard Yao
97143b9d31 Introduce kmem_scnprintf()
`snprintf()` is meant to protect against buffer overflows, but operating
on the buffer using its return value, possibly by calling it again, can
cause a buffer overflow, because it will return how many characters it
would have written if it had enough space even when it did not. In a
number of places, we repeatedly call snprintf() by successively
incrementing a buffer offset and decrementing a buffer length, by its
return value. This is a potentially unsafe usage of `snprintf()`
whenever the buffer length is reached. CodeQL complained about this.

To fix this, we introduce `kmem_scnprintf()`, which will return 0 when
the buffer is zero or the number of written characters, minus 1 to
exclude the NULL character, when the buffer was too small. In all other
cases, it behaves like snprintf(). The name is inspired by the Linux and
XNU kernels' `scnprintf()`. The implementation was written before I
thought to look at `scnprintf()` and had a good name for it, but it
turned out to have identical semantics to the Linux kernel version.
That lead to the name, `kmem_scnprintf()`.

CodeQL only catches this issue in loops, so repeated use of snprintf()
outside of a loop was not caught. As a result, a thorough audit of the
codebase was done to examine all instances of `snprintf()` usage for
potential problems and a few were caught. Fixes for them are included in
this patch.

Unfortunately, ZED is one of the places where `snprintf()` is
potentially used incorrectly. Since using `kmem_scnprintf()` in it would
require changing how it is linked, we modify its usage to make it safe,
no matter what buffer length is used. In addition, there was a bug in
the use of the return value where the NULL format character was not
being written by pwrite(). That has been fixed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14098
2022-10-29 13:05:11 -07:00
Aleksa Sarai
dbf6108b4d zfs_rename: support RENAME_* flags
Implement support for Linux's RENAME_* flags (for renameat2). Aside from
being quite useful for userspace (providing race-free ways to exchange
paths and implement mv --no-clobber), they are used by overlayfs and are
thus required in order to use overlayfs-on-ZFS.

In order for us to represent the new renameat2(2) flags in the ZIL, we
create two new transaction types for the two flags which need
transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT).
RENAME_NOREPLACE does not need any ZIL support because we know that if
the operation succeeded before creating the ZIL entry, there was no file
to be clobbered and thus it can be treated as a regular TX_RENAME.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes #12209
Closes #14070
2022-10-28 09:49:20 -07:00
Aleksa Sarai
e015d6cc0b zfs_rename: restructure to have cleaner fallbacks
This is in preparation for RENAME_EXCHANGE and RENAME_WHITEOUT support
for ZoL, but the changes here allow for far nicer fallbacks than the
previous implementation (the source and target are re-linked in case of
the final link failing).

In addition, a small cleanup was done for the "target exists but is a
different type" codepath so that it's more understandable.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes #12209
Closes #14070
2022-10-28 09:48:58 -07:00
Pavel Snajdr
86db35c447 Remove zpl_revalidate: fix snapshot rollback
Open files, which aren't present in the snapshot, which is being
roll-backed to, need to disappear from the visible VFS image of
the dataset.

Kernel provides d_drop function to drop invalid entry from
the dcache, but inode can be referenced by dentry multiple dentries.

The introduced zpl_d_drop_aliases function walks and invalidates
all aliases of an inode.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #9600
Closes #14070
2022-10-28 09:47:19 -07:00
Richard Yao
4938d01db7
Convert enum zio_flag to uint64_t
We ran out of space in enum zio_flag for additional flags. Rather than
introduce enum zio_flag2 and then modify a bunch of functions to take a
second flags variable, we expand the type to 64 bits via `typedef
uint64_t zio_flag_t`.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14086
2022-10-27 09:54:54 -07:00
Andrew Innes
07de86923b
Aligned free for aligned alloc
Windows port frees memory that was alloc'd aligned in a different way
then alloc'd memory.  So changing frees to be specific.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Innes <andrew.c12@gmail.com>
Co-Authored-By: Jorgen Lundman <lundman@lundman.net>
Closes #14059
2022-10-26 15:08:31 -07:00
youzhongyang
2a068a1394
Support idmapped mount
Adds support for idmapped mounts.  Supported as of Linux 5.12 this 
functionality allows user and group IDs to be remapped without changing 
their state on disk.  This can be useful for portable home directories
and a variety of container related use cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #12923
Closes #13671
2022-10-19 11:17:09 -07:00
Richard Yao
9a8039439a Cleanup: Simplify userspace abd_free_chunks()
Clang's static analyzer complained that we could use after free here if
the inner loop ever iterated. That is a false positive, but upon
inspection, the userland abd_alloc_chunks() function never will put
multiple consecutive pages into a `struct scatterlist`, so there is no
need to loop. We delete the inner loop.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #14042
2022-10-18 15:39:21 -07:00
Tino Reichardt
27218a32fc
Fix declarations of non-global variables
This patch inserts the `static` keyword to non-global variables,
which where found by the analysis tool smatch.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13970
2022-10-18 11:05:32 -07:00
Richard Yao
6a42939fcd
Cleanup: Address Clang's static analyzer's unused code complaints
These were categorized as the following:

 * Dead assignment		23
 * Dead increment		4
 * Dead initialization		6
 * Dead nested assignment	18

Most of these are harmless, but since actual issues can hide among them,
we correct them.

That said, there were a few return values that were being ignored that
appeared to merit some correction:

 * `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from
   `destroy_batched()`. We handle it by returning -1 if there is an
   error.

 * `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from
   `zfs_for_each()`. We handle it by doing a binary OR of the error
   value from the subsequent `zfs_for_each()` call to the existing
   value. This is how errors are mostly handled inside `zfs_for_each()`.
   The error value here is passed to exit from the zfs command, so doing
   a binary or on it is better than what we did previously.

 * `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from
   `dsl_prop_get_ds()` when the property is not of type string. We
   return an error when it does. There is a small concern that the
   `zfs_get_temporary_prop()` call would handle things, but in the case
   that it does not, we would be pushing an uninitialized numval onto
   the lua stack. It is expected that `dsl_prop_get_ds()` will succeed
   anytime that `zfs_get_temporary_prop()` does, so that not giving it a
   chance to fix things is not a problem.

 * `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used
   `nvlist_add_nvlist()` twice in ways in which errors are expected to
   be impossible, so we switch to `fnvlist_add_nvlist()`.

A few notable ones did not merit use of the return value, so we
suppressed it with `(void)`:

 * `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error
   value from `describe_free()`. A look through the commit history
   revealed that this was intentional.

 * `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the
   returned handle from `arc_hdr_realloc()` because it is already
   referenced in lists.

 * `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly
   saying not to use the error from `vdev_label_init()` because whatever
   causes the error could be the reason why a detach is being done.

Unfortunately, I am not presently able to analyze the kernel modules
with Clang's static analyzer, so I could have missed some cases of this.
In cases where reports were present in code that is duplicated between
Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version
too.

After this commit is merged, regressions like dee8934 should become
extremely obvious with Clang's static analyzer since a regression would
appear in the results as the only instance of unused code. That assumes
that Coverity does not catch the issue first.

My local branch with fixes from all of my outstanding non-draft pull
requests shows 118 reports from Clang's static anlayzer after this
patch. That is down by 51 from 169.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13986
2022-10-14 13:37:54 -07:00
Christian Schwarz
4d5aef3ba9
zfs_domount: fix double-disown of dataset / double-free of zfsvfs_t
Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we
leave zfsvfs != NULL. This will lead to execution of the error handling
`if` statement at the `out` label, and hence to a call to
dmu_objset_disown and zfsvfs_free.

However, zfs_umount, which we call upon failure of zfs_root and
d_make_root already does dmu_objset_disown and zfsvfs_free.

I suppose this patch rather adds to the brittleness of this part of the
code base, but I don't want to invest more time in this right now.
To add a regression test, we'd need some kind of fault injection
facility for zfs_root or d_make_root, which doesn't exist right now.
And even then, I think that regression test would be too closely tied
to the implementation.

To repro the double-disown / double-free, do the following:
1. patch zfs_root to always return an error
2. mount a ZFS filesystem

Here's the stack trace you would see then:

  VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000)
  PANIC at dsl_dataset.c:1003:dsl_dataset_disown()
  Showing stack for process 28332
  CPU: 2 PID: 28332 Comm: zpool Tainted: G           O      5.10.103-1.nutanix.el7.x86_64 #1
  Call Trace:
   dump_stack+0x74/0x92
   spl_dumpstack+0x29/0x2b [spl]
   spl_panic+0xd4/0xfc [spl]
   dsl_dataset_disown+0xe9/0x150 [zfs]
   dmu_objset_disown+0xd6/0x150 [zfs]
   zfs_domount+0x17b/0x4b0 [zfs]
   zpl_mount+0x174/0x220 [zfs]
   legacy_get_tree+0x2b/0x50
   vfs_get_tree+0x2a/0xc0
   path_mount+0x2fa/0xa70
   do_mount+0x7c/0xa0
   __x64_sys_mount+0x8b/0xe0
   do_syscall_64+0x38/0x50
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Christian Schwarz <christian.schwarz@nutanix.com>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes #14025
2022-10-14 11:46:47 -07:00
Richard Yao
ab8d9c1783 Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.

Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.

After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:

Parameters that were originally 64-bit on Illumos:

 * dbuf_cache_max_bytes
 * dbuf_metadata_cache_max_bytes
 * l2arc_feed_min_ms
 * l2arc_feed_secs
 * l2arc_headroom
 * l2arc_headroom_boost
 * l2arc_write_boost
 * l2arc_write_max
 * metaslab_aliquot
 * metaslab_force_ganging
 * zfetch_array_rd_sz
 * zfs_arc_max
 * zfs_arc_meta_limit
 * zfs_arc_meta_min
 * zfs_arc_min
 * zfs_async_block_max_blocks
 * zfs_condense_max_obsolete_bytes
 * zfs_condense_min_mapping_bytes
 * zfs_deadman_checktime_ms
 * zfs_deadman_synctime_ms
 * zfs_initialize_chunk_size
 * zfs_initialize_value
 * zfs_lua_max_instrlimit
 * zfs_lua_max_memlimit
 * zil_slog_bulk

Parameters that were originally 32-bit on Illumos:

 * zfs_per_txg_dirty_frees_percent

Parameters that were originally `ssize_t` on Illumos:

 * zfs_immediate_write_sz

Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.

Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:

 * l2arc_rebuild_blocks_min_l2size
 * zfs_key_max_salt_uses
 * zfs_max_log_walking
 * zfs_max_logsm_summary_length
 * zfs_metaslab_max_size_cache_sec
 * zfs_min_metaslabs_to_flush
 * zfs_multihost_interval
 * zfs_unflushed_log_block_max
 * zfs_unflushed_log_block_min
 * zfs_unflushed_log_block_pct
 * zfs_unflushed_max_mem_amt
 * zfs_unflushed_max_mem_ppm

New parameters that do not exist in Illumos:

 * l2arc_trim_ahead
 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_arc_sys_free
 * zfs_deadman_ziotime_ms
 * zfs_delete_blocks
 * zfs_history_output_max
 * zfs_livelist_max_entries
 * zfs_max_async_dedup_frees
 * zfs_max_nvlist_src_size
 * zfs_rebuild_max_segment
 * zfs_rebuild_vdev_limit
 * zfs_unflushed_log_txg_max
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift
 * zfs_vnops_read_chunk_size
 * zvol_max_discard_blocks

Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.

A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:

 * zfs_delete_blocks
 * zfs_key_max_salt_uses
 * zvol_max_discard_blocks

The documentation for a few parameters was found to be incorrect:

 * zfs_deadman_checktime_ms - incorrectly documented as int
 * zfs_delete_blocks - not documented as Linux only
 * zfs_history_output_max - incorrectly documented as int
 * zfs_vnops_read_chunk_size - incorrectly documented as long
 * zvol_max_discard_blocks - incorrectly documented as ulong

The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.

In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:

 * vdev_file_logical_ashift
 * vdev_file_physical_ashift
 * zfs_arc_dnode_limit_percent
 * zfs_arc_dnode_reduce_percent
 * zfs_arc_meta_limit_percent
 * zfs_per_txg_dirty_frees_percent
 * zfs_unflushed_log_block_pct
 * zfs_vdev_max_auto_ashift
 * zfs_vdev_min_auto_ashift

Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.

Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-13 10:03:29 -07:00
Richard Yao
a6ccb36b94
Add defensive assertions
Coverity complains about possible bugs involving referencing NULL return
values and division by zero. The division by zero bugs require that a
block pointer be corrupt, either from in-memory corruption, or on-disk
corruption. The NULL return value complaints are only bugs if
assumptions that we make about the state of data structures are wrong.
Some seem impossible to be wrong and thus are false positives, while
others are hard to analyze.

Rather than dismiss these as false positives by assuming we know better,
we add defensive assertions to let us know when our assumptions are
wrong.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13972
2022-10-12 11:25:18 -07:00
Jorgen Lundman
4b629d04a5
Avoid calling rw_destroy() on uninitialized rwlock
First the function `memset(&key, 0, ...)` but
any call to "goto error;" would call zio_crypt_key_destroy(key) which
calls `rw_destroy()`. The `rw_init()` is moved up to be right after the
memset. This way the rwlock can be released.

The ctx does allocate memory, but that is handled by the memset to 0
and icp skips NULL ptrs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #13976
2022-10-05 17:07:50 -07:00
Tino Reichardt
a2d5643f88
Fix double const qualifier declarations
Some header files define structures like this one:

typedef const struct zio_checksum_info {
	/* ... */
	const char	*ci_name;
} zio_abd_checksum_func_t;

So we can use `zio_abd_checksum_func_t` for const declarations now.
It's not needed that we use the `const` qualifier again like this:
`const zio_abd_checksum_func_t *varname;`

This patch solves the double const qualifiers, which were found by
smatch.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13961
2022-09-30 15:34:39 -07:00
Ameer Hamza
55c12724d3
zed: mark disks as REMOVED when they are removed
ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13797
2022-09-28 09:48:46 -07:00
Richard Yao
fdc2d30371
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.

There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.

Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.

Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.

Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.

The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:

	* metaslab_load_pct
	* zfs_sync_taskq_batch_pct
	* zfs_zil_clean_taskq_nthr_pct
	* zfs_zil_clean_taskq_minalloc
	* zfs_zil_clean_taskq_maxalloc
	* zfs_arc_prune_task_threads

Also, negative values in those parameters was found to be harmless.

The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:

	* zfs_metaslab_switch_threshold
	* zfs_pd_bytes_max
	* zfs_livelist_min_percent_shared

zfs_multihost_history was made static to be consistent with other
parameters.

A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.

Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.

Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:

	* zfs_arc_meta_adjust_restarts
	* zfs_override_estimate_recordsize

They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.

dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.

If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.

This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.

Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models.  As a result, we change the existing parameters
that are uint32_t to use uint_t.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-27 16:42:41 -07:00
Richard Yao
8ef15f9322
Cleanup: Remove ineffective unsigned comparisons against 0
Coverity found a number of places where we either do MAX(unsigned, 0) or
do assertions that a unsigned variable is >= 0. These do nothing, so
let us drop them all.

It also found a spot where we do `if (unsigned >= 0 && ...)`. Let us
also drop the unsigned >= 0 check.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13871
2022-09-26 17:02:38 -07:00
Richard Yao
52afc3443d
Linux: Fix uninitialized variable usage in zio_do_crypt_data()
Coverity complained about this. An error from `hkdf_sha512()` before uio
initialization will cause pointers to uninitialized memory to be passed
to `zio_crypt_destroy_uio()`. This is a regression that was introduced
by cf63739191. Interestingly, this never
affected FreeBSD, since the FreeBSD version never had that patch ported.
Since moving uio initialization to the top of this function would slow
down the qat_crypt() path, we only move the `memset()` calls to the top
of the function. This is sufficient to fix this problem.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13944
2022-09-26 16:44:22 -07:00
Richard Yao
891ac937be
Linux: Fix use-after-free in zfsvfs_create()
Coverity reported that we pass a pointer to zfsvfs to
`dmu_objset_disown()` after freeing zfsvfs in zfsvfs_create_impl() after
a failure in zfsvfs_init().

We have nearly identical duplicate versions of this code for FreeBSD and
Linux, but interestingly, the FreeBSD version of this code differs in
such a way that it does not suffer from this bug. We remove the
difference from the FreeBSD version to fix this bug.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13883
2022-09-19 17:30:58 -07:00
Brian Behlendorf
7dee043af5 zfs_enter rework followup
The zpl_fadvise() function was recently added and was not included
in the initial patch.  Update it accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13831
2022-09-16 14:25:53 -07:00
Chunwei Chen
768eacedef
zfs_enter rework
Replace ZFS_ENTER and ZFS_VERIFY_ZP, which have hidden returns, with
functions that return error code. The reason we want to do this is
because hidden returns are not obvious and had caused some missing fail
path unwinding.

This patch changes the common, linux, and freebsd parts. Also fixes
fail path unwinding in zfs_fsync, zpl_fsync, zpl_xattr_{list,get,set}, and
zfs_lookup().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #13831
2022-09-16 13:36:47 -07:00
Richard Yao
d5d10f2aef
Cleanup dead spa_boot code
Unused code detected by coverity.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13868
2022-09-13 16:40:10 -07:00
Finix1979
320f0c6022
Add Linux posix_fadvise support
The purpose of this PR is to accepts fadvise ioctl from userland
to do read-ahead by demand.

It could dramatically improve sequential read performance especially
when primarycache is set to metadata or zfs_prefetch_disable is 1.

If the file is mmaped, generic_fadvise is also called for page cache
read-ahead besides dmu_prefetch.

Only POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL are supported in
this PR currently.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Finix Yan <yancw@info2soft.com>
Closes #13694
2022-09-08 10:29:41 -07:00
Brian Behlendorf
c26045b435 Linux 5.20 compat: blk_cleanup_disk()
As of the Linux 5.20 kernel blk_cleanup_disk() has been removed,
all callers should use put_disk().

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13728
2022-08-04 16:57:49 -07:00
Brian Behlendorf
bebdf52a16 Linux 5.20 compat: bdevname()
As of the Linux 5.20 kernel bdevname() has been removed, all
callers should use snprintf() and the "%pg" format specifier.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13728
2022-08-04 16:57:33 -07:00
ixhamza
fb087146de
Add support for per dataset zil stats and use wmsum counters
ZIL kstats are reported in an inclusive way, i.e., same counters are
shared to capture all the activities happening in zil. Added support
to report zil stats for every datset individually by combining them
with already exposed dataset kstats.

Wmsum uses per cpu counters and provide less overhead as compared
to atomic operations. Updated zil kstats to replace wmsum counters
to avoid atomic operations.

Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13636
2022-07-20 17:14:06 -07:00
ixhamza
f371cc18f8
Expose ZFS dataset case sensitivity setting via sb_opts
Makes the case sensitivity setting visible on Linux in /proc/mounts.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #13607
2022-07-14 10:38:16 -07:00
Tino Reichardt
1d3ba0bf01
Replace dead opensolaris.org license link
The commit replaces all findings of the link:
http://www.opensolaris.org/os/licensing with this one:
https://opensource.org/licenses/CDDL-1.0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #13619
2022-07-11 14:16:13 -07:00
Brian Behlendorf
e6489be347
Linux: Align MODULE_LICENSE macro text
Specify the lua and zstd license text in the manor in which the
kernel MODULE_LICENSE macro requires it.  The now duplicate entries
were merged and a comment added to make it clear what they apply to.

Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13641
2022-07-11 11:29:12 -07:00
наб
dd66857d92 Remaining {=> const} char|void *tag
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13348
2022-06-29 14:08:59 -07:00
gaoyanping
e7d90362e5
Fix znode group permission different from acl mask
Zp->z_mode is set at the same time inode->i_mode
is being changed. This has the effect of keeping both
in sync without relying on zfs_znode_update_vfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: yanping.gao <yanping.gao@xtaotech.com>
Closes #13581
2022-06-29 13:38:46 -07:00
Andrew
f609739985
expose snapshot count via stat(2) of .zfs/snapshot (#13559)
Increase nlinks in stat results of ./zfs/snapshot based on snapshot
count. This provides quick and efficient method for administrators to
get snapshot counts without having to use libzfs or list the snapdir
contents.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes #13559
2022-06-17 11:44:49 -07:00
Damian Szuberski
9884319666
Fix clang 13 compilation errors
```
os/linux/zfs/zvol_os.c:1111:3: error: ignoring return value of function
  declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
                add_disk(zv->zv_zso->zvo_disk);
                ^~~~~~~~ ~~~~~~~~~~~~~~~~~~~~

zpl_xattr.c:1579:1: warning: no previous prototype for function
  'zpl_posix_acl_release_impl' [-Wmissing-prototypes]
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #13551
2022-06-15 14:20:28 -07:00
Will Andrews
4ed5e25074 Add Linux namespace delegation support
This allows ZFS datasets to be delegated to a user/mount namespace
Within that namespace, only the delegated datasets are visible
Works very similarly to Zones/Jailes on other ZFS OSes

As a user:
```
 $ unshare -Um
 $ zfs list
no datasets available
 $ echo $$
1234
```

As root:
```
 # zfs list
NAME                            ZONED  MOUNTPOINT
containers                      off    /containers
containers/host                 off    /containers/host
containers/host/child           off    /containers/host/child
containers/host/child/gchild    off    /containers/host/child/gchild
containers/unpriv               on     /unpriv
containers/unpriv/child         on     /unpriv/child
containers/unpriv/child/gchild  on     /unpriv/child/gchild

 # zfs zone /proc/1234/ns/user containers/unpriv
```

Back to the user namespace:
```
 $ zfs list
NAME                             USED  AVAIL     REFER  MOUNTPOINT
containers                       129M  47.8G       24K  /containers
containers/unpriv                128M  47.8G       24K  /unpriv
containers/unpriv/child          128M  47.8G      128M  /unpriv/child
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Buddy <https://buddy.works>
Closes #12263
2022-06-10 09:51:46 -07:00
Tony Hutter
6f73d02168
zvol: Support blk-mq for better performance
Add support for the kernel's block multiqueue (blk-mq) interface in
the zvol block driver.  blk-mq creates multiple request queues on
different CPUs rather than having a single request queue.  This can
improve zvol performance with multithreaded reads/writes.

This implementation uses the blk-mq interfaces on 4.13 or newer
kernels.  Building against older kernels will fall back to the
older BIO interfaces.

Note that you must set the `zvol_use_blk_mq` module param to
enable the blk-mq API.  It is disabled by default.

In addition, this commit lets the zvol blk-mq layer process whole
`struct request` IOs at a time, rather than breaking them down
into their individual BIOs.  This reduces dbuf lock contention
and overhead versus the legacy zvol submit_bio() codepath.

	sequential dd to one zvol, 8k volblocksize, no O_DIRECT:

	legacy submit_bio()     292MB/s write  453MB/s read
	this commit             453MB/s write  885MB/s read

It also introduces a new `zvol_blk_mq_chunks_per_thread` module
parameter. This parameter represents how many volblocksize'd chunks
to process per each zvol thread.  It can be used to tune your zvols
for better read vs write performance (higher values favor write,
lower favor read).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #13148
Issue #12483
2022-06-09 08:10:38 -06:00
Brian Behlendorf
c2c2e7bb8b Linux 5.19 compat: aops->read_folio()
As of the Linux 5.19 kernel the readpage() address space operation
has been replaced by read_folio().

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:31 -07:00
Brian Behlendorf
a12a5cb5b8 Linux 5.19 compat: blkdev_issue_secure_erase()
Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure
erase functionality from the blkdev_issue_discard() function.
The blkdev_issue_secure_erase() must now be issued to issue
a secure erase.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:26 -07:00
Brian Behlendorf
e2c31f2bc7 Linux 5.19 compat: bdev_max_secure_erase_sectors()
Linux 5.19 commit torvalds/linux@44abff2c0 removed the
blk_queue_secure_erase() helper function.  The preferred
interface is to now use the bdev_max_secure_erase_sectors()
function to check for discard support.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:22 -07:00
Brian Behlendorf
5e4aedaca7 Linux 5.19 compat: bdev_max_discard_sectors()
Linux 5.19 commit torvalds/linux@70200574cc removed the
blk_queue_discard() helper function.  The preferred interface
is to now use the bdev_max_discard_sectors() function to check
for discard support.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:17 -07:00
Brian Behlendorf
5f264996f4 Linux 5.18 compat: bio_alloc()
As for the Linux 5.18 kernel bio_alloc() expects a block_device struct
as an argument.  This removes the need for the bio_set_dev() compatibility
code for 5.18 and newer kernels.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:03 -07:00
szubersk
e0911f7b7f autoconf: Fail when __copy_from_user_inatomic is a non-GPL symbol
A followup to 849c14e048
Fix https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1009242

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #13389
2022-05-11 10:32:51 -07:00
hping
a18d13c200
abd_os: remove redundant refcount creation for abd_children
Refcount creation for abd_zero_scatter->abd_children is redundant in
abd_alloc_zero_scatter, as it has been done in abd_init_struct.

In addition, abd_children is undefined when ZFS_DEBUG is disabled, the
reference of abd_children in abd_alloc_zero_scatter breaks build of
libzpool when ZFS_DEBUG is disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Ping Huang <huangping@smartx.com>
Closes #13429
2022-05-09 16:30:16 -07:00
Shaan Nobee
411f4a018d
Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes #12662
Closes #12790
2022-05-03 13:23:26 -07:00
Brian Behlendorf
63b18e4097
Fix O_APPEND for Linux 3.15 and older kernels
When using a Linux kernel which predates the iov_iter interface the
O_APPEND flag should be applied in zpl_aio_write() via the call to
generic_write_checks().  The updated pos variable  was incorrectly
ignored resulting in the current offset being used.

This issue should only realistically impact the RHEL/CentOS 7.x
kernels which are based on Linux 3.10.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13370 
Closes #13377
2022-04-27 12:56:17 -07:00