Commit Graph

339 Commits

Author SHA1 Message Date
Will Andrews
4ed5e25074 Add Linux namespace delegation support
This allows ZFS datasets to be delegated to a user/mount namespace
Within that namespace, only the delegated datasets are visible
Works very similarly to Zones/Jailes on other ZFS OSes

As a user:
```
 $ unshare -Um
 $ zfs list
no datasets available
 $ echo $$
1234
```

As root:
```
 # zfs list
NAME                            ZONED  MOUNTPOINT
containers                      off    /containers
containers/host                 off    /containers/host
containers/host/child           off    /containers/host/child
containers/host/child/gchild    off    /containers/host/child/gchild
containers/unpriv               on     /unpriv
containers/unpriv/child         on     /unpriv/child
containers/unpriv/child/gchild  on     /unpriv/child/gchild

 # zfs zone /proc/1234/ns/user containers/unpriv
```

Back to the user namespace:
```
 $ zfs list
NAME                             USED  AVAIL     REFER  MOUNTPOINT
containers                       129M  47.8G       24K  /containers
containers/unpriv                128M  47.8G       24K  /unpriv
containers/unpriv/child          128M  47.8G      128M  /unpriv/child
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Buddy <https://buddy.works>
Closes #12263
2022-06-10 09:51:46 -07:00
Tony Hutter
6f73d02168
zvol: Support blk-mq for better performance
Add support for the kernel's block multiqueue (blk-mq) interface in
the zvol block driver.  blk-mq creates multiple request queues on
different CPUs rather than having a single request queue.  This can
improve zvol performance with multithreaded reads/writes.

This implementation uses the blk-mq interfaces on 4.13 or newer
kernels.  Building against older kernels will fall back to the
older BIO interfaces.

Note that you must set the `zvol_use_blk_mq` module param to
enable the blk-mq API.  It is disabled by default.

In addition, this commit lets the zvol blk-mq layer process whole
`struct request` IOs at a time, rather than breaking them down
into their individual BIOs.  This reduces dbuf lock contention
and overhead versus the legacy zvol submit_bio() codepath.

	sequential dd to one zvol, 8k volblocksize, no O_DIRECT:

	legacy submit_bio()     292MB/s write  453MB/s read
	this commit             453MB/s write  885MB/s read

It also introduces a new `zvol_blk_mq_chunks_per_thread` module
parameter. This parameter represents how many volblocksize'd chunks
to process per each zvol thread.  It can be used to tune your zvols
for better read vs write performance (higher values favor write,
lower favor read).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #13148
Issue #12483
2022-06-09 08:10:38 -06:00
Brian Behlendorf
c2c2e7bb8b Linux 5.19 compat: aops->read_folio()
As of the Linux 5.19 kernel the readpage() address space operation
has been replaced by read_folio().

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:31 -07:00
Brian Behlendorf
a12a5cb5b8 Linux 5.19 compat: blkdev_issue_secure_erase()
Linux 5.19 commit torvalds/linux@44abff2c0 splits the secure
erase functionality from the blkdev_issue_discard() function.
The blkdev_issue_secure_erase() must now be issued to issue
a secure erase.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:26 -07:00
Brian Behlendorf
e2c31f2bc7 Linux 5.19 compat: bdev_max_secure_erase_sectors()
Linux 5.19 commit torvalds/linux@44abff2c0 removed the
blk_queue_secure_erase() helper function.  The preferred
interface is to now use the bdev_max_secure_erase_sectors()
function to check for discard support.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:22 -07:00
Brian Behlendorf
5e4aedaca7 Linux 5.19 compat: bdev_max_discard_sectors()
Linux 5.19 commit torvalds/linux@70200574cc removed the
blk_queue_discard() helper function.  The preferred interface
is to now use the bdev_max_discard_sectors() function to check
for discard support.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:17 -07:00
Brian Behlendorf
5f264996f4 Linux 5.18 compat: bio_alloc()
As for the Linux 5.18 kernel bio_alloc() expects a block_device struct
as an argument.  This removes the need for the bio_set_dev() compatibility
code for 5.18 and newer kernels.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13515
2022-05-31 12:04:03 -07:00
наб
de82164518 linux: spl: generic: ddi_strto*: match solaris ddi_strto*(9)
Recognise initial whitespace, + in both cases,
and - also in unsigneds

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13434
2022-05-13 10:15:47 -07:00
наб
354a1bfb8e linux: spl: generic: ddi_strtou##type: elide unused flag
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13434
2022-05-13 10:15:44 -07:00
наб
c25b281378 Remove hw_serial, ddi_strtoul()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13434
2022-05-13 10:15:31 -07:00
szubersk
e0911f7b7f autoconf: Fail when __copy_from_user_inatomic is a non-GPL symbol
A followup to 849c14e048
Fix https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1009242

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #13389
2022-05-11 10:32:51 -07:00
hping
a18d13c200
abd_os: remove redundant refcount creation for abd_children
Refcount creation for abd_zero_scatter->abd_children is redundant in
abd_alloc_zero_scatter, as it has been done in abd_init_struct.

In addition, abd_children is undefined when ZFS_DEBUG is disabled, the
reference of abd_children in abd_alloc_zero_scatter breaks build of
libzpool when ZFS_DEBUG is disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Ping Huang <huangping@smartx.com>
Closes #13429
2022-05-09 16:30:16 -07:00
Aidan Harris
493b6e5607
Fix functions without a prototype
clang-15 emits the following error message for functions without
a prototype:

fs/zfs/os/linux/spl/spl-kmem-cache.c:1423:27: error:
  a function declaration without a prototype is deprecated
  in all versions of C [-Werror,-Wstrict-prototypes]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Aidan Harris <me@aidanharr.is>
Closes #13421
2022-05-06 11:57:37 -07:00
Shaan Nobee
411f4a018d
Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes #12662
Closes #12790
2022-05-03 13:23:26 -07:00
Brian Behlendorf
63b18e4097
Fix O_APPEND for Linux 3.15 and older kernels
When using a Linux kernel which predates the iov_iter interface the
O_APPEND flag should be applied in zpl_aio_write() via the call to
generic_write_checks().  The updated pos variable  was incorrectly
ignored resulting in the current offset being used.

This issue should only realistically impact the RHEL/CentOS 7.x
kernels which are based on Linux 3.10.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13370 
Closes #13377
2022-04-27 12:56:17 -07:00
Satadru Pramanik
7dde17e860
Linux 5.18 compat: replace __set_page_dirty_nobuffers
Replace __set_page_dirty_nobuffers with filemap_dirty_folio.

Upstream-commit: 6b1f86f8e9c7f9de7ca1cb987b2cf25e99b1ae3a
("Merge tag 'folio-5.18b' of
git://git.infradead.org/users/willy/pagecache ")

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Satadru Pramanik <satadru@gmail.com>
Signed-off-by: Satadru Pramanik <satadru@gmail.com>
Closes #13325
Closes #13380
2022-04-27 12:54:17 -07:00
Damian Szuberski
849c14e048
PPC get_user workaround
Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic()
inline helpers very indirectly include a reference to the GPL'd
array mmu_feature_keys[] and fails to build. Workaround this by
using copy_from_user() and throwing EFAULT for any calls to
__copy_from_user_inatomic(). This is a workaround until a fix
for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458
"powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #11958
Closes #12590
Closes #13367
2022-04-26 10:52:40 -07:00
наб
0cdda2edb3 Linux 5.18 compat: kobj_type.default_attrs replaced with default_groups
Upstream-commit: cdb4f26a63c391317e335e6e683a614358e70aeb ("kobject:
 kobj_type: remove default_attrs")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13357
2022-04-22 14:27:10 -07:00
наб
520a7c8a85 linux: module: zfs: sysfs: constify types and attrs
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13357
2022-04-22 14:27:02 -07:00
Rich Ercolani
b657f2c592
Corrected oversight in ZERO_RANGE behavior
It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do
have differing semantics in some ways - in particular,
one requires KEEP_SIZE, and the other does not.

Also added a zero-range test to catch this, corrected a flaw
that made the punch-hole test succeed vacuously, and a typo
in file_write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #13329 
Closes #13338
2022-04-20 16:07:03 -07:00
наб
ad9e767657 linux: module: weld all but spl.ko into zfs.ko
Originally it was thought it would be useful to split up the kmods
by functionality.  This would allow external consumers to only load
what was needed.  However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward.  Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13274
2022-04-20 13:28:24 -07:00
Brian Behlendorf
026f126b83
Linux 5.17 compat: GENHD_FL_EXT_DEVT / GENHD_FL_NO_PART_SCAN
As of the 5.17 kernel the GENHD_FL_EXT_DEVT flag has been removed
and the GENHD_FL_NO_PART_SCAN flag renamed GENHD_FL_NO_PART. Update
zvol_alloc() to set GENHD_FL_NO_PART for the newer kernels which
is sufficient.  The behavior for prior kernels remains unchanged.

1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT")
46e7eac6 ("block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART")

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13294
Closes #13297
2022-04-19 10:38:04 -07:00
Riccardo Schirone
7d524c068d Linux 5.18 compat: use address_space_operations->readahead
->readpages was removed and replaced by ->readahead. Define
zpl_readahead for kernels that don't have ->readpages.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Riccardo Schirone <rschirone91@gmail.com>
Closes #13278
2022-04-04 09:35:24 -07:00
Riccardo Schirone
036e846abc Linux 5.18 compat: blkg_tryget is moved to private headers
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Riccardo Schirone <rschirone91@gmail.com>
Closes #13278
2022-04-04 09:35:11 -07:00
Andrew
eebfd28e9d
Linux optimize access checks when ACL is trivial
Bypass check of ZFS aces if the ACL is trivial. When an ACL is
trivial its permissions are represented by the mode without any
loss of information. In this case, it is safe to convert the
access request into equivalent mode and then pass desired mask
and inode to generic_permission(). This has the added benefit
of also checking whether entries in a POSIX ACL on the file grant
the desired access.

This commit also skips the ACL check on looking up the xattr dir
since such restrictions don't exist in Linux kernel and it makes
xattr lookup behavior inconsistent between SA and file-based
xattrs. We also don't want to perform a POSIX ACL check while
looking up the POSIX ACL if for some reason it is located in
the xattr dir rather than an SA.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Closes #13237
2022-04-01 09:53:54 -07:00
hpingfs
4d04e41e4d
zfs_ctldir: fix incorrect argument type of rw_destroy
The argument type of rw_destroy is (krwlock_t *) while currently
krwlock_t is passed in zfs_ctldir.c. This error is hidden because
rw_destroy is defined as ((void) 0) in linux. But anyway, this
mismatch should be fixed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ping Huang <huangping@smartx.com>
Closes #13272
2022-03-30 15:40:31 -07:00
hpingfs
abdcef47d2
zvol_os: suppress compiler warning for zvol_open_timeout_ms
When HAVE_BLKDEV_GET_ERESTARTSYS is defined, compiler will complain
"defined but not used" warning for zvol_open_timeout_ms.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ping Huang <huangping@smartx.com>
Closes #13270
2022-03-30 15:39:55 -07:00
наб
86690775b0 Linux 5.18 compat: replace genhd.h with blkdev.h includes
blkdev.h includes genhd.h since dawn of upstream git, so this is
globally safe

Upstream-commit: 322cbb50de711814c42fb088f6d31901502c711a ("block:
 remove genhd.h")

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13251
2022-03-28 09:52:55 -07:00
наб
d1325b4fa2 Linux 5.18 compat: 4-argument bio_alloc()
bio_alloc(gfp_t gfp_mask, unsigned short nr_iovecs)

became

  bio_alloc(struct block_device *bdev, unsigned short nr_vecs,
            unsigned int opf, gfp_t gfp_mask)
passing NULL/0 continues previous behaviour

Upstream-commit: 07888c665b405b1cd3577ddebfeb74f4717a84c4 ("block:
 pass a block_device and opf to bio_alloc")

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13251
2022-03-28 09:51:59 -07:00
Brian Behlendorf
460748d4ae
Switch from _Noreturn to __attribute__((noreturn))
Parts of the Linux kernel build system struggle with _Noreturn.  This
results in the following warnings when building on RHEL 8.5, and likely
other environments.  Switch to using the __attribute__((noreturn)).

  warning: objtool: dbuf_free_range()+0x2b8:
    return with modified stack frame
  warning: objtool: dbuf_free_range()+0x0:
    stack state mismatch: cfa1=7+40 cfa2=7+8
  ...
  WARNING: EXPORT symbol "arc_buf_size" [zfs.ko] version generation
    failed, symbol will not be versioned.
  WARNING: EXPORT symbol "spa_open" [zfs.ko] version generation
    failed, symbol will not be versioned.
  ...

Additionally, __thread_exit() has been renamed spl_thread_exit() and
made a static inline function.  This was needed because the kernel
will generate a warning for symbols which are __attribute__((noreturn))
and then exported with EXPORT_SYMBOL.

While we could continue to use _Noreturn in user space I've also
switched it to __attribute__((noreturn)) purely for consistency
throughout the code base.

Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13238
2022-03-23 08:51:00 -07:00
Ryan Moeller
d42979c6ef
Fix ACL checks for NFS kernel server
This PR changes ZFS ACL checks to evaluate
fsuid / fsgid rather than euid / egid to avoid
accidentally granting elevated permissions to
NFS clients.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Andrew Walker <awalker@ixsystems.com>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #13221
2022-03-18 06:47:57 -06:00
наб
d465fc5844 Forbid b{copy,zero,cmp}(). Don't include <strings.h> for <string.h>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12996
2022-03-15 15:13:48 -07:00
наб
861166b027 Remove bcopy(), bzero(), bcmp()
bcopy() has a confusing argument order and is actually a move, not a
copy; they're all deprecated since POSIX.1-2001 and removed in -2008,
and we shim them out to mem*() on Linux anyway

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12996
2022-03-15 15:13:42 -07:00
Brian Atkinson
becc717f61
Adding ZERO_PAGE detection
On some architectures ZERO_PAGE is unavailable because it references
a GPL exported symbol of empty_zero_page. Originally e08b993 removed
the call to PAGE_ZERO(0) for assignment to the abd_zero_page. However,
a simple check can be done to avoid a kernel allocation and free for
the abd_zero_page if ZERO_PAGE is available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #13199
2022-03-14 12:37:39 -07:00
Umer Saleem
39a4daf742
Expose additional file level attributes
ZFS allows to update and retrieve additional file level attributes for
FreeBSD. This commit allows additional file level attributes to be
updated and retrieved for Linux. These include the flags stored in the
upper half of z_pflags only.

Two new IOCTLs have been added for this purpose. ZFS_IOC_GETDOSFLAGS
can be used to retrieve the attributes, while ZFS_IOC_SETDOSFLAGS can
be used to update the attributes.

Attributes that are allowed to be updated include ZFS_IMMUTABLE,
ZFS_APPENDONLY, ZFS_NOUNLINK, ZFS_ARCHIVE, ZFS_NODUMP, ZFS_SYSTEM,
ZFS_HIDDEN, ZFS_READONLY, ZFS_REPARSE, ZFS_OFFLINE and ZFS_SPARSE.
Flags can be or'd together while calling ZFS_IOC_SETDOSFLAGS.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #13118
2022-03-07 17:52:03 -08:00
Jitendra Patidar
361a7e8211
log xattr=sa create/remove/update to ZIL
As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.

For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.

This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.

Implementation notes:
The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:

1. The feature must be enabled. Otherwise, we don't log the new
    record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
    the ZIL chain will activate the feature for the dataset. Likewise
    for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.

We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.

We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.

One more piece on feature activation, since it's spread across
multiple functions:

zil_commit()
  zil_process_commit_list()
    if lwb == NULL // first zil_commit since zil_open
      zil_create()
        if no log block pointer in ZIL header:
          if feature enabled and not active:
	    // CASE 1
            enable, COALESCE txg wait with dmu_tx that allocated the
	    log block
         else // log block was allocated earlier than this zil_open
          if feature enabled and not active:
	    // CASE 2
            enable, EXPLICIT txg wait
    else // already have an in-DRAM LWB
      if feature enabled and not active:
        // this happens when we enable the feature after zil_create
	// CASE 3
        enable, EXPLICIT txg wait

[1] da6c28aaf6

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #8768 
Closes #9078
2022-02-22 13:06:43 -08:00
Damian Szuberski
a014378dd0
spl: make 'spl_panic_halt' working for all cases
The default behavior where the serious ZFS errors cause FS thread to
stuck is very bad for some production scenario.

In some production scenarios (Linux), it is recommended to make real
kernel PANIC, where system can be rebooted by watchdog or kernel itself.
This patch enables coherent handling of spl_panic_halt parameter.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Wojciech Nizinski <w.nizinski@grinn-global.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #12120
Closes #13109
2022-02-18 11:43:11 -08:00
наб
ef70eff198 module: mark arguments used
Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13110
2022-02-18 09:34:03 -08:00
наб
51c747de43 linux: module/zfs: vnops: make null_xattr static
Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13110
2022-02-18 09:33:20 -08:00
Ryan Moeller
5c0061345b
Cross-platform xattr user namespace compatibility
ZFS on Linux originally implemented xattr namespaces in a way that is
incompatible with other operating systems.  On illumos, xattrs do not
have namespaces.  Every xattr name is visible.  FreeBSD has two
universally defined namespaces: EXTATTR_NAMESPACE_USER and
EXTATTR_NAMESPACE_SYSTEM.  The system namespace is used for protected
FreeBSD-specific attributes such as MAC labels and pnfs state.  These
attributes have the namespace string "freebsd:system:" prefixed to the
name in the encoding scheme used by ZFS.  The user namespace is used
for general purpose user attributes and obeys normal access control
mechanisms.  These attributes have no namespace string prefixed, so
xattrs written on illumos are accessible in the user namespace on
FreeBSD, and xattrs written to the user namespace on FreeBSD are
accessible by the same name on illumos.

Linux has several xattr namespaces.  On Linux, ZFS encodes the
namespace in the xattr name for every namespace, including the user
namespace.  As a consequence, an xattr in the user namespace with the
name "foo" is stored by ZFS with the name "user.foo" and therefore
appears on FreeBSD and illumos to have the name "user.foo" rather than
"foo".  Conversely, none of the xattrs written on FreeBSD or illumos
are accessible on Linux unless the name happens to be prefixed with one
of the Linux xattr namespaces, in which case the namespace is stripped
from the name.  This makes xattrs entirely incompatible between Linux
and other platforms.

We want to make the encoding of user namespace xattrs compatible across
platforms.  A critical requirement of this compatibility is for xattrs
from existing pools from FreeBSD and illumos to be accessible by the
same names in the user namespace on Linux.  It is also necessary that
existing pools with xattrs written by Linux retain access to those
xattrs by the same names on Linux.  Making user namespace xattrs from
Linux accessible by the correct names on other platforms is important.
The handling of other namespaces is not required to be consistent.

Add a fallback mechanism for listing and getting xattrs to treat xattrs
as being in the user namespace if they do not match a known prefix.

Do not allow setting or getting xattrs with a name that is prefixed
with one of the namespace names used by ZFS on supported platforms.

Allow choosing between legacy illumos and FreeBSD compatibility and
legacy Linux compatibility with a new tunable.  This facilitates
replication and migration of pools between hosts with different
compatibility needs.

The tunable controls whether or not to prefix the namespace to the
name.  If the xattr is already present with the alternate prefix,
remove it so only the new version persists.  By default the platform's
existing convention is used.

Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11919
2022-02-15 16:35:30 -08:00
наб
df7b54f1d9 module: icp: rip out insane crypto_req_handle_t mechanism, inline KM_SLEEP
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12901
2022-02-15 16:25:37 -08:00
наб
64e82cea13 module: icp: remove set-but-unused cd_miscdata
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12901
2022-02-15 16:25:13 -08:00
наб
739afd9475 module: icp: fold away all key formats except CRYPTO_KEY_RAW
It's the only one actually used

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12901
2022-02-15 16:25:07 -08:00
наб
eb1e09b7ec module: icp: remove unused CRYPTO_ALWAYS_QUEUE
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12901
2022-02-15 16:24:19 -08:00
наб
710657f51d module: icp: remove other provider types
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12901
2022-02-15 16:23:53 -08:00
Jorgen Lundman
4759342a5e
Add spa _os() hooks
Add hooks for when spa is created, exported, activated and
deactivated. Used by macOS to attach iokit, and lock
kext as busy (to stop unloads).

Userland, Linux, and, FreeBSD have empty stubs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12801
2022-02-15 15:54:25 -08:00
Jorgen Lundman
9a70e97fe1
Rename fallthrough to zfs_fallthrough
Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #13097
2022-02-15 08:58:59 -08:00
drowfx
3819aaaff9
Add dataset_kstats_update.. to mmap read/write paths
This allows reads/writes caused by accesses to mmap files to be
accounted correctly in the per-dataset kstats for both Linux and
FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Signed-off-by: Matthias Blankertz <matthias@blankertz.org>
Closes #12994 
Closes #13044
2022-02-09 14:41:42 -08:00
Jorgen Lundman
c28d6ab08b
Rename EMPTY_TASKQ into taskq_empty
To follow a change in illumos taskq

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #12802
2022-02-09 15:04:26 -07:00
Christian Schwarz
1dccfd7a38
zvol: make calls to platform ops static
There's no need to make the platform ops dynamic dispatch.

This change replaces the dynamic dispatch with static calls to the
platform-specific functions.
To avoid name collisions, prefix all platform-specific functions
with `zvol_os_`.
I actually find `zvol_..._os` slightly nicer to read in the calling
code, but having it as a prefix is useful.

Advantage:
- easier jump-to-definition / grepping
- potential benefits to static analysis
- better legibility

Future work: also prefix remaining `static` functions in zvol_os.c.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes #12965
2022-02-07 10:24:38 -08:00