Commit Graph

976 Commits

Author SHA1 Message Date
Matthew Ahrens
ba67d82142
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes.  Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.

This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`.  The dbuf and arc buf for this
"dirty leaf block" are not instantiated.

Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data).  Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.

Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.

Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU.  On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.

Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%

The code is also restructured in a few ways:

Added a `dr_dnode` field to the dbuf_dirty_record_t.  This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines.  The new field is needed by the lightweight-type dirty record.

To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t.  To do this we keep a hold on the dnode until it's zio's have
completed.  This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).

`dn_dirty_txg` was previously zeroed when the dnode was synced.  This
was not necessary, since its meaning can be "when was this dnode last
dirtied".  This change simplifies the new `dnode_rele_task()` code.

Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 10:26:02 -08:00
Paul Dagnelie
60a4c7d2a2
Implement memory and CPU hotplug
ZFS currently doesn't react to hotplugging cpu or memory into the 
system in any way. This patch changes that by adding logic to the ARC 
that allows the system to take advantage of new memory that is added 
for caching purposes. It also adds logic to the taskq infrastructure 
to support dynamically expanding the number of threads allocated to a 
taskq.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11212
2020-12-10 14:09:23 -08:00
Matthew Macy
1e4732cbda
Decouple arc_read_done callback from arc buf instantiation
Add ARC_FLAG_NO_BUF to indicate that a buffer need not be
instantiated.  This fixes a ~20% performance regression on
cached reads due to zfetch changes.

Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11220 
Closes #11232
2020-12-09 15:05:06 -08:00
Alexander Motin
6f5aac3ca0
Reduce latency effects of non-interactive I/O
Investigating influence of scrub (especially sequential) on random read
latency I've noticed that on some HDDs single 4KB read may take up to 4
seconds!  Deeper investigation shown that many HDDs heavily prioritize
sequential reads even when those are submitted with queue depth of 1.

This patch addresses the latency from two sides:
 - by using _min_active queue depths for non-interactive requests while
   the interactive request(s) are active and few requests after;
 - by throttling it further if no interactive requests has completed
   while configured amount of non-interactive did.

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests to the lowest
priorities.  It should reduce starvation if several non-interactive
processes are running same time with some interactive and I think should
make possible setting of zfs_vdev_max_active to as low as 1.

I've benchmarked this change with 4KB random reads from ZVOL with 16KB
block size on newly written non-fragmented pool.  On fragmented pool I
also saw improvements, but not so dramatic.  Below are log2 histograms
of the random read latency in milliseconds for different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
, that means maximum latency reduction from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
, i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,   70, 107,   45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
, i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90

I've also measured scrub time during the test and on idle pools.  On
idle fragmented pool I've measured scrub getting few percent faster
due to use of QD3 instead of QD2 before.  On idle non-fragmented pool
I've measured no difference.  On busy non-fragmented pool I've measured
scrub time increase about 1.5-1.7x, while IOPS increase reached 5-9x.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11166
2020-11-24 09:26:42 -08:00
Matthew Macy
0ca45cb310
Fix problems in zvol_set_volmode_impl
- Don't leave fstrans set when passed a snapshot
- Don't remove minor if volmode already matches new value
- (FreeBSD) Wait for GEOM ops to complete before trying
  remove (at create time GEOM will be "tasting" in parallel)
- (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL
- (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11199
2020-11-17 09:50:52 -08:00
loli10K
4072f465bc
Fix 'zfs userspace' for received datasets in encrypted root
For encrypted receives, where user accounting is initially disabled on
creation, both 'zfs userspace' and 'zfs groupspace' fails with
EOPNOTSUPP: this is because dmu_objset_id_quota_upgrade_cb() forgets to
set OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after a
successful dmu_objset_space_upgrade().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9501 
Closes #9596
2020-11-16 09:10:29 -08:00
Matthew Ahrens
d66aab7c08
Assertion failure when logging large output of channel program
The output of ZFS channel programs is logged on-disk in the zpool
history, and printed by `zpool history -i`.  Channel programs can use
10MB of memory by default, and up to 100MB by using the `zfs program -m`
flag.  Therefore their output can be up to some fraction of 100MB.

In addition to being somewhat wasteful of the limited space reserved for
the pool history (which for large pools is 1GB), in extreme cases this
can result in a failure of `ASSERT(length <= DMU_MAX_ACCESS);` in
`dmu_buf_hold_array_by_dnode()`.

This commit limits the output size that will be logged to 1MB.  Larger
outputs will not be logged, instead a entry will be logged indicating
the size of the omitted output.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11194
2020-11-14 10:17:16 -08:00
Brian Behlendorf
b2255edcc0
Distributed Spare (dRAID) Feature
This patch adds a new top-level vdev type called dRAID, which stands
for Distributed parity RAID.  This pool configuration allows all dRAID
vdevs to participate when rebuilding to a distributed hot spare device.
This can substantially reduce the total time required to restore full
parity to pool with a failed device.

A dRAID pool can be created using the new top-level `draid` type.
Like `raidz`, the desired redundancy is specified after the type:
`draid[1,2,3]`.  No additional information is required to create the
pool and reasonable default values will be chosen based on the number
of child vdevs in the dRAID vdev.

    zpool create <pool> draid[1,2,3] <vdevs...>

Unlike raidz, additional optional dRAID configuration values can be
provided as part of the draid type as colon separated values. This
allows administrators to fully specify a layout for either performance
or capacity reasons.  The supported options include:

    zpool create <pool> \
        draid[<parity>][:<data>d][:<children>c][:<spares>s] \
        <vdevs...>

    - draid[parity]       - Parity level (default 1)
    - draid[:<data>d]     - Data devices per group (default 8)
    - draid[:<children>c] - Expected number of child vdevs
    - draid[:<spares>s]   - Distributed hot spares (default 0)

Abbreviated example `zpool status` output for a 68 disk dRAID pool
with two distributed spares using special allocation classes.

```
  pool: tank
 state: ONLINE
config:

    NAME                  STATE     READ WRITE CKSUM
    slag7                 ONLINE       0     0     0
      draid2:8d:68c:2s-0  ONLINE       0     0     0
        L0                ONLINE       0     0     0
        L1                ONLINE       0     0     0
        ...
        U25               ONLINE       0     0     0
        U26               ONLINE       0     0     0
        spare-53          ONLINE       0     0     0
          U27             ONLINE       0     0     0
          draid2-0-0      ONLINE       0     0     0
        U28               ONLINE       0     0     0
        U29               ONLINE       0     0     0
        ...
        U42               ONLINE       0     0     0
        U43               ONLINE       0     0     0
    special
      mirror-1            ONLINE       0     0     0
        L5                ONLINE       0     0     0
        U5                ONLINE       0     0     0
      mirror-2            ONLINE       0     0     0
        L6                ONLINE       0     0     0
        U6                ONLINE       0     0     0
    spares
      draid2-0-0          INUSE     currently in use
      draid2-0-1          AVAIL
```

When adding test coverage for the new dRAID vdev type the following
options were added to the ztest command.  These options are leverages
by zloop.sh to test a wide range of dRAID configurations.

    -K draid|raidz|random - kind of RAID to test
    -D <value>            - dRAID data drives per group
    -S <value>            - dRAID distributed hot spares
    -R <value>            - RAID parity (raidz or dRAID)

The zpool_create, zpool_import, redundancy, replacement and fault
test groups have all been updated provide test coverage for the
dRAID feature.

Co-authored-by: Isaac Huang <he.huang@intel.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Don Brady <don.brady@delphix.com>
Co-authored-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10102
2020-11-13 13:51:51 -08:00
Mateusz Guzik
1a0b4f566c
G/C struct znode -> z_moved
The field is yet another leftover from unsupported zfs_znode_move.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11186
2020-11-10 12:42:47 -08:00
Ryan Moeller
8a9634e2f3 Remove redundant oid parameter to update_pages
The oid comes from the znode we are already passing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11176
2020-11-10 10:54:30 -08:00
Brian Behlendorf
8c7d604c62 Linux 5.10 compat: frame.h renamed objtool.h
In Linux 5.10 the linux/frame.h header was renamed linux/objtool.h.
Add a configure check to detect and use the correctly named header.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11085
2020-11-02 22:01:10 +00:00
Christian Schwarz
ab8c935ea6
zfs_vnops: make zfs_get_data OS-independent
Move zfs_get_data() in to platform-independent code. The only
platform-specific aspect of it is the way we release an inode 
(Linux) / vnode_t (FreeBSD). I am not aware of a platform that
could be supported by ZFS that couldn't implement zfs_rele_async 
itself. It's sibling zvol_get_data already is platform-independent.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #10979
2020-11-02 12:07:07 -08:00
Mateusz Guzik
09eb36ce3d
Introduce CPU_SEQID_UNSTABLE
Current CPU_SEQID users don't care about possibly changing CPU ID, but
enclose it within kpreempt disable/enable in order to fend off warnings
from Linux's CONFIG_DEBUG_PREEMPT.

There is no need to do it. The expected way to get CPU ID while allowing
for migration is to use raw_smp_processor_id.

In order to make this future-proof this patch keeps CPU_SEQID as is and
introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers
explicitly want this behavior.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11142
2020-11-02 11:51:12 -08:00
Matthew Macy
8583540c6e
Consolidate zfs_holey and zfs_access
The zfs_holey() and zfs_access() functions can be made common
to both FreeBSD and Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11125
2020-10-31 09:40:08 -07:00
Mateusz Guzik
c4ede65bdf
zstd: track allocator statistics
Note that this only tracks sizes as requested by the caller.
Actual allocated space will almost always be bigger (e.g., rounded up to
the next power of 2 or page size). Additionally the allocated buffer may
be holding other areas hostage. Nonetheless, this is a starting point
for tracking memory usage in zstd.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11129
2020-10-30 15:26:10 -07:00
Matthew Macy
5fa356ea44
Remove UIO_ZEROCOPY functions structures
The original xuio zero copy functionality has always been unused 
on Linux and FreeBSD.  Remove this disabled code to avoid any
confusion and improve readability.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11124
2020-10-30 10:00:33 -07:00
Ryan Moeller
76d04993a6
Update references to nonexistent man pages in code
Refer to the correct section or alternative for FreeBSD and Linux.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11132
2020-10-30 08:55:59 -07:00
Matthew Macy
e53d678d4a
Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD.  With a little refactoring they can be
moved to the common code which is what is done by this commit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11078
2020-10-21 14:08:06 -07:00
Ryan Moeller
485b50bb9e
Cross-platform acltype
The acltype property is currently hidden on FreeBSD and does not
reflect the NFSv4 style ZFS ACLs used on the platform.  This makes it
difficult to observe that a pool imported from FreeBSD on Linux has a
different type of ACL that is being ignored, and vice versa.

Add an nfsv4 acltype and expose the property on FreeBSD.

Make the default acltype nfsv4 on FreeBSD.

Setting acltype to an unhanded style is treated the same as setting
it to off.  The ACLs will not be removed, but they will be ignored.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10520
2020-10-13 21:25:48 -07:00
Warner Losh
b302185a92
FreeBSD: make adjustments for the standalone environment
In FreeBSD, there are three compile environments that are supported:
user land, the kernel and the bootloader / standalone. Adjust the
headers to compile in the standalone environment. Limit kernel-only
items from view when _STANDALONE is defined.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Warner Losh <imp@FreeBSD.org>
Closes #10998
2020-10-13 21:05:49 -07:00
Christian Schwarz
701f656b97
dmu.h: remove stale declaration dmu_objset_snapshot_tmp
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11047
2020-10-13 16:46:00 -07:00
Christian Schwarz
61868bb14d
zil_parse: make callback parameters const
Code cleanup, a follow up commit to 4d55ea81.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ryan Moeller <ryan@freqlabs.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11020
2020-10-09 09:34:54 -07:00
Brian Behlendorf
d0249a4bd0
Replace ZFS on Linux references with OpenZFS
This change updates the documentation to refer to the project
as OpenZFS instead ZFS on Linux.  Web links have been updated
to refer to https://github.com/openzfs/zfs.  The extraneous
zfsonlinux.org web links in the ZED and SPL sources have been
dropped.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11007
2020-10-08 20:10:13 -07:00
Ryan Moeller
4d55ea811d
Throw const on some strings
In C, const indicates to the reader that mutation will not occur.
It can also serve as a hint about ownership.

Add const in a few places where it makes sense.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #10997
2020-10-02 17:44:10 -07:00
Sebastian Gottschall
8a171ccd92
do a cyclic seek for unused memory objects in pool
In non regular use cases allocated memory might stay persistent in memory
pool. This small patch checks every minute if there are old objects which
can be released from memory pool.

Right now with regular use, the pool is checked for old objects on each
allocation attempt from this pool. so basically polling by its use. Now
consider what happens if someone writes a lot of files and stops use of
the volume or even unmounts it. So the code will no longer check if
objects can be released from the pool. Already allocated objects will
still stay in pool cache. this is no big issue for common use. But
someone discovered this issue while doing tests. personally i know this
behavior and I'm aware of it. Its no big issue. just a enhancement

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #10938 
Closes #10969
2020-09-30 13:22:34 -07:00
Brian Behlendorf
96951e0327
Fix objtool configure check
The m4 objtool configure check can incorrectly fail because of a
missing header in the test.  This appears to be the result of a
recent kernel change and was observed on the Fedora 5.8.11-200
kernel.

  In file included from /home/fedora/zfs/build/objtool/objtool.c:75:
  ./arch/x86/include/asm/frame.h💯57: error: 'struct pt_regs'
      declared inside parameter list will not be visible outside
      of this definition or declaration [-Werror]

The consequence of this is that the "stack_frame_non_standard"
check is never run and HAVE_STACK_FRAME_NON_STANDARD is set
incorrectly which results in a build failure.  This change adds
the appropriate header to the "objtool" check so it now behaves
as intended.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10990
2020-09-28 16:40:50 -07:00
Matthew Macy
af20b97078
zfetch: Don't issue new streams when old have not completed
The current dmu_zfetch code implicitly assumes that I/Os complete
within min_sec_reap seconds. With async dmu and a readonly workload
(and thus no exponential backoff in operations from the "write
throttle") such as L2ARC rebuild it is possible to saturate the drives
with I/O requests. These are then effectively compounded with prefetch
requests.

This change reference counts streams and prevents them from being
recycled after their min_sec_reap timeout if they still have
outstanding I/Os.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10900
2020-09-27 17:08:38 -07:00
Matthew Macy
7b8363d7f0
FreeBSD: Add support for procfs_list
The procfs_list interface is required by several kstats. Implement
this functionality for FreeBSD to provide access to these kstats.
                           
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10890
2020-09-23 16:43:51 -07:00
Ryan Moeller
5f8a9e6a02
FreeBSD: Reduce stack usage of Lua
Use the same reduced buffer size for lauxlib that is used on Linux.

Fixes panic on HEAD in lua gsub test designed to exhaust stack space.

With this we can remove the special case to reserve more stack space
on FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10959
2020-09-22 16:03:11 -07:00
George Wilson
c494aa7f57
vdev_ashift should only be set once
== Motivation and Context

The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.

In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.

== Description

This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes #10932
2020-09-18 12:13:47 -07:00
Ryan Moeller
7ead2be3d2
Rename acltype=posixacl to acltype=posix
Prefer acltype=off|posix, retaining the old names as aliases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10918
2020-09-16 12:26:06 -07:00
Toomas Soome
1db9e6e4e4
zfs label bootenv should store data as nvlist
nvlist does allow us to support different data types and systems.

To encapsulate user data to/from nvlist, the libzfsbootenv library is
provided.

Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #10774
2020-09-15 15:42:27 -07:00
George Amanakis
085321621e
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.

This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.

L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.

When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.

Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.

Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
  an ARC buffer. However, prefetches may be issued in a way that
  arc_read_done() is bypassed. Instead, move the related code in
  l2arc_write_eligible() to account for those cases too.

Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.

Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 10:10:44 -07:00
Ryan Moeller
ebc4b52369
Avoid possibility of division by zero
When hz > 1000, msec / (1000 / hz) results in division by zero.

I found somewhere in FreeBSD using howmany(msec * hz, 1000) to convert
ms to ticks, avoiding the potential for a zero in the divisor.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #10894
2020-09-08 11:39:16 -07:00
Don Brady
4f07282786
Avoid posting duplicate zpool events
Duplicate io and checksum ereport events can misrepresent that 
things are worse than they seem. Ideally the zpool events and the 
corresponding vdev stat error counts in a zpool status should be 
for unique errors -- not the same error being counted over and over. 
This can be demonstrated in a simple example. With a single bad 
block in a datafile and just 5 reads of the file we end up with a 
degraded vdev, even though there is only one unique error in the pool.

The proposed solution to the above issue, is to eliminate duplicates 
when posting events and when updating vdev error stats. We now save 
recent error events of interest when posting events so that we can 
easily check for duplicates when posting an error. 

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #10861
2020-09-04 10:34:28 -07:00
Matthew Ahrens
3808032489
nowait synctask must succeed
If a `zfs_space_check_t` other than `ZFS_SPACE_CHECK_NONE` is used with
`dsl_sync_task_nowait()`, the sync task may fail due to ENOSPC.
However, there is no way to notice or communicate this failure, so it's
extremely difficult to use this functionality correctly, and in fact
almost all callers use `ZFS_SPACE_CHECK_NONE`.

This commit removes the `zfs_space_check_t` argument from
`dsl_sync_task_nowait()`, and always uses `ZFS_SPACE_CHECK_NONE`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10855
2020-09-04 10:29:39 -07:00
Matthew Macy
ac6e5fb202
Replace cv_{timed}wait_sig with cv_{timed}wait_idle where appropriate
There are a number of places where cv_?_sig is used simply for
accounting purposes but the surrounding code has no ability to
cope with actually receiving a signal. On FreeBSD it is possible
to send signals to individual kernel threads so this could
enable undesirable behavior.

This patch adds routines on Linux that will do the same idle
accounting as _sig without making the task interruptible. On
FreeBSD cv_*_idle  are all aliases for cv_*

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10843
2020-09-03 20:04:09 -07:00
Ryan Moeller
7b4e27232d
Add 'zfs rename -u' to rename without remounting
Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' property set to
'legacy' or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even
mounted back.

This introduces layering violation, as we need to update
'f_mntfromname' field in statfs structure related to mountpoint (for
the dataset we are renaming and all its children).

In my opinion it is worth it, as it allow to update FreeBSD in even
cleaner way - in ZFS-only configuration root file system is ZFS file
system with 'mountpoint' property set to 'legacy'. If root dataset is
named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone
it (system/oldrootfs), update FreeBSD and if it doesn't boot we can
boot back from system/oldrootfs and rename it back to system/rootfs
while it is mounted as /. Before it was not possible, because
unmounting / was not possible.

Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Matt Macy <mmacy@freebsd.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10839
2020-09-01 16:14:16 -07:00
Ryan Moeller
a2f944a140
zpool: Change base URL for ZFS messages to openzfs-docs
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10820
2020-08-26 21:43:06 -07:00
Sebastian Gottschall
184df27eef
Avoid symbol collision with in-kernel zstdlib
For Linux, when zfs is compiled as an in kernel static variant
and the in kernel zstd library is compiled statically into the kernel
a symbol collision will occur.  This wrapper header renames all
of the relevant zstd functions to avoid this problem.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #10775
2020-08-24 12:20:41 -07:00
Chris McDonough
c3b03d0701
Appease GCC sprintf warnings found on Fedora 32/GCC 10.0.1
Increase the size of DDT_NAMELEN and MNT_LINE_MAX to appease GCC
snprintf truncation warnings.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris McDonough <chrism@plope.com>
Closes #10712
Closes #10766
2020-08-24 10:32:59 -07:00
Ryan Moeller
6fe3498ca3
Import vdev ashift optimization from FreeBSD
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

 1. Existing pools created with devices that have different logical
    and physical block sizes, but were configured to use the logical
    block size (e.g. because the OS version used for pool construction
    reported the logical block size instead of the physical block
    size) will suddenly find that the vdev allocation size has
    increased. This can be easily tolerated for active members of
    the array, but ZFS would prevent replacement of a vdev with
    another identical device because it now appears that the smaller
    allocation size required by the pool is not supported by the new
    device.

 2. The device's physical block size may be too large to be supported
    by ZFS. The optimal allocation size for the vdev may be quite
    large. For example, a RAID controller may export a vdev that
    requires read-modify-write cycles unless accessed using 64k
    aligned/sized requests. ZFS currently has an 8k minimum block
    size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Macy <mmacy@freebsd.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10619
2020-08-21 12:53:17 -07:00
Matthew Macy
1c2725a157
FreeBSD: 11.x arc_stats compatibility
Removing other_size from arc_stats breaks top in 11.x jails
running on HEAD.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10745
2020-08-20 10:55:02 -07:00
Michael Niewöhner
10b3c7f5e4 Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:

- zstd: A basic ZStandard compression algorithm Available compression.
  Levels for zstd are zstd-1 through zstd-19, where the compression
  increases with every level, but speed decreases.

- zstd-fast: A faster version of the ZStandard compression algorithm
  zstd-fast is basically a "negative" level of zstd. The compression
  decreases with every level, but speed increases.

  Available compression levels for zstd-fast:
   - zstd-fast-1 through zstd-fast-10
   - zstd-fast-20 through zstd-fast-100 (in increments of 10)
   - zstd-fast-500 and zstd-fast-1000

For more information check the man page.

Implementation details:

Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.

The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers).  The upper bits are used to store the compression level.

It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.

All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables.  Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).

The userspace tools all use the combined/bit-shifted value.

Additional notes:

zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.

ZSTD is included with all current tests and new tests are added
as-needed.

Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born.  This is currently only used by zstd but can be
extended as needed.

Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-20 10:30:06 -07:00
Matthew Macy
5e7eaf8fbd
Add define to enable autotrim to default to on
In FreeBSD trim has defaulted to on for several
years. In order to minimize POLA violations on
import it's important to maintain this default
when importing vendored openzfs in to FreeBSD
base.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10719
2020-08-18 09:52:30 -07:00
Ryan Moeller
009cc8e884
Make zc_nvlist_src_size limit tunable
We limit the size of nvlists passed to the kernel so a user cannot make
the kernel do an unreasonably large allocation.  On FreeBSD this limit
was 128 kiB, which turns out to be a bit too small when doing some
operations involving a large number of datasets or snapshots, for
example replication.

Make this limit tunable, with a platform-specific auto default.
Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the
system limit on user wired memory, which allows it to scale depending
on system configuration.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Issue #6572 
Closes #10706
2020-08-18 09:33:55 -07:00
Matthew Ahrens
85ec5cbae2
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K.  Therefore, the space used to cache each block
is rounded up to a multiple of 4K.  The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat.  However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.

Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`.  This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is.  It also impacts how the ARC responds to memory
pressure.  As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.

If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste.  If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.

However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large.  On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.

This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`.  As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-17 20:04:04 -07:00
Matthew Ahrens
994de7e4b7
Remove KMC_KMEM and KMC_VMEM
`KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented
caches are `KMC_KVMEM`.

KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we
don't use kmalloc to back the SPL caches, instead we use kvmalloc
(KMC_KVMEM).  The flag, module parameter, /proc entries, and associated
code are removed.

KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to
vmalloc().  The flag, /proc entries, and associated code are removed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10673
2020-08-17 16:04:28 -07:00
Matthew Macy
5f1984f2f8
FreeBSD: fallback to /boot/ to look for zpool.cache
Up until now zpool.cache has always lived in /boot on FreeBSD.
For the sake of compatibility fallback to /boot if zpool.cache
isn't found in /etc/zfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10720
2020-08-17 14:43:47 -07:00
Ryan Moeller
ed726fb063
Move ZVOL_DIR back to zfs.h
This was previously moved because nothing else in-tree uses it, but
evidently DilOS uses it out of tree.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Ryan Moeller <freqlabs@freebsd.org>
Closes #10361 
Closes #10685
2020-08-11 13:12:12 -07:00
Allan Jude
8fb79fdddb
Change the error handling for invalid property values
ZFS recv should return a useful error message when an invalid index
property value is provided in the send stream properties nvlist

With a compression= property outside of the understood range:

Before:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
internal error: Invalid argument
Aborted (core dumped)
```
Note: the recv completes successfully, the abort() is likely just to
make it easier to track the unexpected error code.

After:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
cannot receive compression property on testpool/recv: invalid property
value received 28.9M stream in 1 seconds (28.9M/sec)
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10631
2020-08-01 08:41:31 -07:00
Matthew Ahrens
3442c2a02d
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages.  This happens via 2 code paths:

1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory.  The ARC shrinker callback is invoked from the
page-allocation code path.

2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released.  However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`).  Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).

Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`.  This has several
negative impacts:

1. ZFS doesn't use RAM to cache data effectively.

2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.

3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.

To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.

With this commit:

1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker.  This limits the amount of time it takes
to allocate a single page.

2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system.  Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).

3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing.  This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.

4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.

5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-07-31 21:10:52 -07:00
Matthew Macy
27d96d2254
Rename refcount.h to zfs_refcount.h
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10620
2020-07-29 16:35:33 -07:00
Serapheim Dimitropoulos
843e9ca2e1
Introduce names for ZTHRs
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #10630
2020-07-29 09:43:33 -07:00
Matthew Macy
5678d3f593
Prefix zfs internal endian checks with _ZFS
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10621
2020-07-28 13:02:49 -07:00
Matthew Macy
e64cc4954c
Refactor ccompile.h to not include system headers
This is a step toward being able to vendor the OpenZFS code in FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10625
2020-07-25 20:09:50 -07:00
Kyle Evans
bfafe1780a
Annotate unused parameters on inline definitions as such
* libspl: umem: These are obviously and intentionally unused; annotate 
  them as such to appease -Wunused-parameter builds that include this 
  header.

* sys/dmu.h: In this case, clear_on_evict_dbufp is only used for 
  ZFS_DEBUG builds, so annotate it as __maybe_unused to appease 
  -Wunused-parameter.


Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes #10606
2020-07-23 17:41:48 -07:00
Ryan Moeller
f7a68f99d0
FreeBSD: Remove some code duplication in sysctl_os.c
Drop unnecessary redefinition's of several arcstat values.
Put missing extern declaration of arc_no_grow_shift in arc_impl.h.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10609
2020-07-23 17:35:34 -07:00
Matthew Ahrens
5dd92909c6
Adjust ARC terminology
The process of evicting data from the ARC is referred to as
`arc_adjust`.

This commit changes the term to `arc_evict`, which is more specific.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10592
2020-07-22 09:51:47 -07:00
Ryan Moeller
0421f257b2
FreeBSD: Add legacy arc_min and arc_max
These tunables were renamed from vfs.zfs.arc_min and 
vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max.
Add legacy compat tunables for the old names.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10579
2020-07-19 10:15:34 -07:00
Matthew Ahrens
6774931dfa
Extend zdb to print inconsistencies in livelists and metaslabs
Livelists and spacemaps are data structures that are logs of allocations
and frees.  Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.

These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).

ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab.  However, it
generally halts processing when the error is detected.

When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.

The specific checks added are:

Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]

Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
  spacemap log, then report any double FREEs and double ALLOCs

Verify that livelists are consistenet with spacemaps.  The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes #10515
2020-07-14 17:51:05 -07:00
George Wilson
c15d36c674
Remove dependency on sharetab file and refactor sharing logic
== Motivation and Context

The current implementation of 'sharenfs' and 'sharesmb' relies on
the use of the sharetab file. The use of this file is os-specific
and not required by linux or freebsd. Currently the code must
maintain updates to this file which adds complexity and presents
a significant performance impact when sharing many datasets. In
addition, concurrently running 'zfs sharenfs' command results in
missing entries in the sharetab file leading to unexpected failures.

== Description

This change removes the sharetab logic from the linux and freebsd
implementation of 'sharenfs' and 'sharesmb'. It still preserves an
os-specific library which contains the logic required for sharing
NFS or SMB. The following entry points exist in the vastly simplified
libshare library:

- sa_enable_share -- shares a dataset but may not commit the change
- sa_disable_share -- unshares a dataset but may not commit the change
- sa_is_shared -- determine if a dataset is shared
- sa_commit_share -- notify NFS/SMB subsystem to commit the shares
- sa_validate_shareopts -- determine if sharing options are valid

The sa_commit_share entry point is provided as a performance enhancement
and is not required. The sa_enable_share/sa_disable_share may commit
the share as part of the implementation. Libshare provides a framework
for both NFS and SMB but some operating systems may not fully support
these protocols or all features of the protocol.

NFS Operation:
For linux, libshare updates /etc/exports.d/zfs.exports to add
and remove shares and then commits the changes by invoking
'exportfs -r'. This file, is automatically read by the kernel NFS
implementation which makes for better integration with the NFS systemd
service. For FreeBSD, libshare updates /etc/zfs/exports to add and
remove shares and then commits the changes by sending a SIGHUP to
mountd.

SMB Operation:
For linux, libshare adds and removes files in /var/lib/samba/usershares
by calling the 'net' command directly. There is no need to commit the
changes. FreeBSD does not support SMB.

== Performance Results

To test sharing performance we created a pool with an increasing number
of datasets and invoked various zfs actions that would enable and
disable sharing. The performance testing was limited to NFS sharing.
The following tests were performed on an 8 vCPU system with 128GB and
a pool comprised of 4 50GB SSDs:

Scale testing:
- Share all filesystems in parallel -- zfs sharenfs=on <dataset> &
- Unshare all filesystems in parallel -- zfs sharenfs=off <dataset> &

Functional testing:
- share each filesystem serially -- zfs share -a
- unshare each filesystem serially -- zfs unshare -a
- reset sharenfs property and unshare -- zfs inherit -r sharenfs <pool>

For 'zfs sharenfs=on' scale testing we saw an average reduction in time
of 89.43% and for 'zfs sharenfs=off' we saw an average reduction in time
of 83.36%.

Functional testing also shows a huge improvement:
- zfs share -- 97.97% reduction in time
- zfs unshare -- 96.47% reduction in time
- zfs inhert -r sharenfs -- 99.01% reduction in time

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Bryant G. Ly <bryangly@gmail.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-68690
Closes #1603
Closes #7692
Closes #7943
Closes #10300
2020-07-13 09:19:18 -07:00
Matthew Ahrens
e59a377a8f
filesystem_limit/snapshot_limit is incorrectly enforced against root
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit."  Two types of users are allowed to change
the limit:

1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`.  This works properly.

2. A user with elevated system privileges (e.g. root).  This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.

The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process.  This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.

This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged.  Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.

This problem only impacts Linux.  Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8226
Closes #10545
2020-07-11 17:18:02 -07:00
Mark Johnston
6e00561712 Add a "try" operation for range locks
zfs_rangelock_tryenter() bails immediately instead of waiting for the
lock to become available.  This will be used to resolve a deadlock in
the FreeBSD page-in code.  No functional change intended.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #10519
2020-07-06 11:53:31 -07:00
Brian Behlendorf
9a49d3f3d3
Add device rebuild feature
The device_rebuild feature enables sequential reconstruction when
resilvering.  Mirror vdevs can be rebuilt in LBA order which may
more quickly restore redundancy depending on the pools average block
size, overall fragmentation and the performance characteristics
of the devices.  However, block checksums cannot be verified
as part of the rebuild thus a scrub is automatically started after
the sequential resilver completes.

The new '-s' option has been added to the `zpool attach` and
`zpool replace` command to request sequential reconstruction
instead of healing reconstruction when resilvering.

    zpool attach -s <pool> <existing vdev> <new vdev>
    zpool replace -s <pool> <old vdev> <new vdev>

The `zpool status` output has been updated to report the progress
of sequential resilvering in the same way as healing resilvering.
The one notable difference is that multiple sequential resilvers
may be in progress as long as they're operating on different
top-level vdevs.

The `zpool wait -t resilver` command was extended to wait on
sequential resilvers.  From this perspective they are no different
than healing resilvers.

Sequential resilvers cannot be supported for RAIDZ, but are
compatible with the dRAID feature being developed.

As part of this change the resilver_restart_* tests were moved
in to the functional/replacement directory.  Additionally, the
replacement tests were renamed and extended to verify both
resilvering and rebuilding.

Original-patch-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: John Poduska <jpoduska@datto.com>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10349
2020-07-03 11:05:50 -07:00
Matthew Ahrens
3c42c9ed84
Clean up OS-specific ARC and kmem code
OS-specific code (e.g. under `module/os/linux`) does not need to share
its code structure with any other operating systems.  In particular, the
ARC and kmem code need not be similar to the code in illumos, because we
won't be syncing this OS-specific code between operating systems.  For
example, if/when illumos support is added to the common repo, we would
add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of
this code.

Therefore, we can simplify the code in the OS-specific ARC and kmem
routines.

These changes do not impact system behavior, they are purely code
cleanup.  The changes are:

Arenas are not used on Linux or FreeBSD (they are always `NULL`), so
`heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along
with code that uses them.

In `arc_available_memory()`:
 * `desfree` is unused, remove it
 * rename `freemem` to avoid conflict with pre-existing `#define`
 * remove checks related to arenas
 * use units of bytes, rather than converting from bytes to pages and
   then back to bytes

`SPL_KMEM_CACHE_REAP` is unused, remove it.

`skc_reap` is unused, remove it.

The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove
it.

`vmem_size()` and associated type and macros are unused, remove them.

In `arc_memory_throttle()`, use a less confusing variable name to store
the result of `arc_free_memory()`.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10499
2020-06-29 09:01:07 -07:00
Arvind Sankar
4d8e68c42f Avoid installing kernel headers on FreeBSD
The kernel headers are installed for DKMS on linux, so don't install
them unless we're building on linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10506
2020-06-27 17:40:14 -07:00
Arvind Sankar
6b99fc0620 Fixes for make dist
Reduce the usage of EXTRA_DIST. If files are conditionally included in
_SOURCES, _HEADERS etc, automake is smart enough to dist all files that
could possibly be included, but this does not apply to EXTRA_DIST,
resulting in make dist depending on the configuration.

Add some files that were missing altogether in various Makefile's.

The changes to disted files in this commit (excluding deleted files):

+./cmd/zed/agents/README.md
+./etc/init.d/README.md
+./lib/libspl/os/freebsd/getexecname.c
+./lib/libspl/os/freebsd/gethostid.c
+./lib/libspl/os/freebsd/getmntany.c
+./lib/libspl/os/freebsd/mnttab.c
-./lib/libzfs/libzfs_core.pc
-./lib/libzfs/libzfs.pc
+./lib/libzfs/os/freebsd/libzfs_compat.c
+./lib/libzfs/os/freebsd/libzfs_fsshare.c
+./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c
+./lib/libzfs/os/freebsd/libzfs_zmount.c
+./lib/libzutil/os/freebsd/zutil_compat.c
+./lib/libzutil/os/freebsd/zutil_device_path_os.c
+./lib/libzutil/os/freebsd/zutil_import_os.c
+./module/lua/README.zfs
+./module/os/linux/spl/README.md
+./tests/README.md
+./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh
+./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh
+./tests/zfs-tests/tests/functional/inheritance/README.config
+./tests/zfs-tests/tests/functional/inheritance/README.state
+./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh
+./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10501
2020-06-26 14:20:02 -07:00
Matthew Ahrens
67c0f0dedc
ARC shrinking blocks reads/writes
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5527, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5527.

Note: commit 3ec34e5527 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes #10496
2020-06-26 10:42:27 -07:00
Arvind Sankar
7513807320 Drop unnecessary srcdir paths
There's no need to specify the srcdir explicitly in _HEADERS and
EXTRA_DIST.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10493
2020-06-24 18:20:18 -07:00
Ryan Moeller
9192f27c1d
Add zfs_multihost_interval tunable handler for FreeBSD
This tunable required a handler to be implemented for
ZFS_MODULE_PARAM_CALL.

Add the handler so the tunable can be declared in common code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10490
2020-06-23 13:32:42 -07:00
Arvind Sankar
0ce2de637b Add prototypes
Add prototypes/move prototypes to header files.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:32 -07:00
Arvind Sankar
c3fe42aabd Remove dead code
Delete unused functions.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes #10470
2020-06-18 12:21:18 -07:00
Matthew Macy
8056a75672
Disambiguate condvar API contract
On Illumos callers of cv_timedwait and cv_timedwait_hires
can't distinguish between whether or not the cv was signaled
or the call timed out. Illumos handles this (for some definition
of handles) by calling cv_signal in the return path if we were
signaled but the return value indicates instead that we timed
out. This would make sense if it were possible to query the the
cv for its net signal disposition. However, this isn't possible
and, in spite of the fact that there are places in the code that
clearly take a different and incompatible path if a timeout value
is indicated, this distinction appears to be rather subtle to most
developers. This problem is further compounded by the fact that on
Linux, calling cv_signal in the return path wouldn't even do the
right thing unless there are other waiters.

Since it is possible for the caller to independently determine how
much time is remaining but it is not possible to query if the cv
was in fact signaled, prioritizing signalling over timeout seems
like a cleaner solution. In addition, judging from usage patterns
within the code itself, it is also less error prone.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10471
2020-06-18 10:17:50 -07:00
Matthew Macy
7564073ed6
Add abd_cache_reap_now for abd_chunk_cache users
Apparently missed in the initial port integration was
the need to reap the abd_chunk_cache on FreeBSD. This
change addresses that oversight.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10474
2020-06-17 21:44:13 -07:00
Jorgen Lundman
cd07d7c83f
drr_begin: can't forward declare untagged struct
When compiling with Clang++ it does not allow for untagged structs, so
struct ddr_begin needs to be declared before the struct that uses it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10453
2020-06-16 11:57:04 -07:00
Jorgen Lundman
d366c8fd7a
Make struct vdev_disk_t be platform private
Linux defines different vdev_disk_t members to macOS, but they are
only used in vdev_disk.c so move the declaration there.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10452
2020-06-16 11:43:33 -07:00
George Amanakis
f2edc0078f
Fix gcc10.1 truncation error
gcc10.1 complains with:

../../include/sys/dmu.h:373:24: error: ‘%s’ directive output may be
truncated writing up to 95 bytes into a region of size 75
[-Werror=format-truncation=]
  373 | #define DMU_POOL_DDT   "DDT-%s-%s-%s"
      |                        ^~~~~~~~~~~~~~
../../module/zfs/ddt.c:256:37: note: in expansion of macro
‘DMU_POOL_DDT’
  256 |  (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
      |                                     ^~~~~~~~~~~~
../../include/sys/dmu.h:373:32: note: format string is defined here
  373 | #define DMU_POOL_DDT   "DDT-%s-%s-%s"
      |                                ^~
../../module/zfs/ddt.c:256:9: note: ‘snprintf’ output 7 or more bytes
(assuming 102) into a destination of size 80
  256 |  (void) snprintf(name, DDT_NAMELEN, DMU_POOL_DDT,
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  257 |      zio_checksum_table[ddt->ddt_checksum].ci_name,
      |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  258 |      ddt_ops[type]->ddt_op_name, ddt_class_name[class]);
      |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Increasing DTT_NAMELEN fixes it.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10433
2020-06-13 11:02:00 -07:00
Brian Atkinson
e08b993396
Removing ZERO_PAGE abd_alloc_zero_scatter
For MIPS architectures on Linux the ZERO_PAGE macro references
empty_zero_page, which is exported as a GPL symbol. The call to
ZERO_PAGE in abd_alloc_zero_scatter has been removed and a single
zero'd page is now allocated for each of the pages in abd_zero_scatter
in the kernel ABD code path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10428
2020-06-10 17:54:11 -07:00
Andrea Gelmini
dd4bc569b9
Fix typos
Correct various typos in the comments and tests.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes #10423
2020-06-09 21:24:09 -07:00
Matthew Ahrens
7bcb7f0840
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:

By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks.  By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature.  A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.

When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out.  The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).

Changes:

This commit addresses the problem with several changes to the semantics
of zfs send/receive:

1. "-L to no-L" incrementals are rejected.  If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:

    incremental send stream requires -L (--large-block), to match
    previous receive.

2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.

3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals.  This flag is currently not set on any send streams.  In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.

Implementation notes:

To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.

In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks.  The zio pipeline will recompress
each smaller block individually.

A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224 
Closes #10383
2020-06-09 10:41:01 -07:00
George Amanakis
b7654bd794
Trim L2ARC
The l2arc_evict() function is responsible for evicting buffers which
reference the next bytes of the L2ARC device to be overwritten. Teach
this function to additionally TRIM that vdev space before it is
overwritten if the device has been filled with data. This is done by
vdev_trim_simple() which trims by issuing a new type of TRIM,
TRIM_TYPE_SIMPLE.

We also implement a "Trim Ahead" feature. It is a zfs module parameter,
expressed in % of the current write size. This trims ahead of the
current write size. A minimum of 64MB will be trimmed. The default is 0
which disables TRIM on L2ARC as it can put significant stress to
underlying storage devices. To enable TRIM on L2ARC we set
l2arc_trim_ahead > 0.

We also implement TRIM of the whole cache device upon addition to a
pool, pool creation or when the header of the device is invalid upon
importing a pool or onlining a cache device. This is dependent on
l2arc_trim_ahead > 0. TRIM of the whole device is done with
TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t.
We save the TRIM state for the whole device and the time of completion
on-disk in the header, and restore these upon L2ARC rebuild so that
zpool status -t can correctly report them. Whole device TRIM is done
asynchronously so that the user can export of the pool or remove the
cache device while it is trimming (ie if it is too slow).

We do not TRIM the whole device if persistent L2ARC has been disabled by
l2arc_rebuild_enabled = 0 because we may not want to lose all cached
buffers (eg we may want to import the pool with
l2arc_rebuild_enabled = 0 only once because of memory pressure). If
persistent L2ARC has been disabled by setting the module parameter
l2arc_rebuild_blocks_min_l2size to a value greater than the size of the
cache device then the whole device is trimmed upon creation or import of
a pool if l2arc_trim_ahead > 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam D. Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #9713
Closes #9789 
Closes #10224
2020-06-09 10:15:08 -07:00
Pawel Jakub Dawidek
529246df96
Restore support for in-kernel ZFS ioctls
In Illumos it is possible to call ioctl functions from within the
kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support
that, but it doesn't hurt to keep it around, as all the code is there.

Before this commit it was a dead code and zc_iflags was always zero.
Restore this functionality by allowing to pass a flag to the
zfsdev_ioctl_common() function.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #10417
2020-06-08 13:57:22 -07:00
Jorgen Lundman
c9e319faae
Replace sprintf()->snprintf() and strcpy()->strlcpy()
The strcpy() and sprintf() functions are deprecated on some platforms.
Care is needed to ensure correct size is used.  If some platforms
miss snprintf, we can add a #define to sprintf, likewise strlcpy().

The biggest change is adding a size parameter to zfs_id_to_fuidstr().

The various *_impl_get() functions are only used on linux and have
not yet been updated.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10400
2020-06-07 11:42:12 -07:00
Ryan Moeller
60265072e0
Improve compatibility with C++ consumers
C++ is a little picky about not using keywords for names, or string
constness.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10409
2020-06-06 12:54:04 -07:00
Allan Jude
4547fc4e07
Connect dataset_kstats for FreeBSD
Expand the FreeBSD spl for kstats to support all current types

Move the dataset_kstats_t back to zvol_state_t from zfs_state_os_t
now that it is common once again

```
kstat.zfs/mypool.dataset.objset-0x10b.nunlinked: 0
kstat.zfs/mypool.dataset.objset-0x10b.nunlinks: 0
kstat.zfs/mypool.dataset.objset-0x10b.nread: 150528
kstat.zfs/mypool.dataset.objset-0x10b.reads: 48
kstat.zfs/mypool.dataset.objset-0x10b.nwritten: 134217728
kstat.zfs/mypool.dataset.objset-0x10b.writes: 1024
kstat.zfs/mypool.dataset.objset-0x10b.dataset_name: mypool/datasetname
```

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10386
2020-06-05 17:17:02 -07:00
Jorgen Lundman
081de9a86d
Restore avl_update() calls and related functions
The macOS kmem implementation uses avl_update() and related
functions.  These same function exist in the Solaris AVL code but
were removed because they were unused.  Restore them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10390
2020-06-03 09:49:32 -07:00
Brian Atkinson
fb822260b1
Gang ABD Type
Adding the gang ABD type, which allows for linear and scatter ABDs to
be chained together into a single ABD.

This can be used to avoid doing memory copies to/from ABDs. An example
of this can be found in vdev_queue.c in the vdev_queue_aggregate()
function.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Brian <bwa@clemson.edu>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10069
2020-05-20 18:06:09 -07:00
Ryan Moeller
d2782af461
Fix ZVOL_DIR
We only use ZVOL_DIR on FreeBSD, and on FreeBSD it isn't correct.

Move the definition to the file where it is needed, and define it as
/dev/zvol/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10337
2020-05-16 10:10:38 -07:00
Brian Behlendorf
2ade659eb4
Fix abd_enter/exit_critical wrappers
Commit fc551d7 introduced the wrappers abd_enter_critical() and
abd_exit_critical() to mark critical sections.  On Linux these are
implemented with the local_irq_save() and local_irq_restore() macros
which set the 'flags' argument when saving.  By wrapping them with
a function the local variable is no longer set by the macro and is
no longer properly restored.

Convert abd_enter_critical() and abd_exit_critical() to macros to
resolve this issue and ensure the flags are properly restored.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10332
2020-05-14 20:45:16 -07:00
Matthew Ahrens
8b240f14f9
remove unneeded member drc_err of dmu_recv_cookie_t
The member drc_err of dmu_recv_cookie_t is used only locally in
receive_read, so we can replace it with a local variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10319
2020-05-14 12:10:29 -07:00
Brian Atkinson
fc551d7efb
Combine OS-independent ABD Code into Common Source File
Reorganizing ABD code base so OS-independent ABD code has been placed
into a common abd.c file. OS-dependent ABD code has been left in each
OS's ABD source files, and these source files have been renamed to
abd_os.

The OS-independent ABD code is now under:
module/zfs/abd.c
With the OS-dependent code in:
module/os/linux/zfs/abd_os.c
module/os/freebsd/zfs/abd_os.c

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10293
2020-05-10 12:23:52 -07:00
George Amanakis
657fd33bcf
Improvements on persistent L2ARC
Functional changes:

We implement refcounts of log blocks and their aligned size on the
cache device along with two corresponding arcstats. The refcounts are
reflected in the header of the device and provide valuable information
as to whether log blocks are accounted for correctly. These are
dynamically adjusted as log blocks are committed/evicted. zdb also uses
this information in the device header and compares it to the
corresponding values as reported by dump_l2arc_log_blocks() which
emulates l2arc_rebuild(). If the refcounts saved in the device header
report higher values, zdb exits with an error. For this feature to work
correctly there should be no active writes on the device. This is also
employed in the tests of persistent L2ARC. We extend the structure of
the cache device header by adding the two new variables mirroring the
refcounts after the existing variables to preserve backward
compatibility in terms of persistent L2ARC.

1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which
   reflect the total aligned size of log blocks on the device. This is
   also reflected in the header of the cache device as "dh_lb_asize".
2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count"
   which reflect the total number of L2ARC log blocks present on cache
   devices.  It is also reflected in the header of the cache device as
   "dh_lb_count".

In l2arc_rebuild_vdev() if the amount of committed log entries in a log
block is 0 and the device header is valid we update the device header.
This will facilitate trimming of the whole device in this case when
TRIM for L2ARC is implemented.

Improve loop protection in l2arc_rebuild() by using the starting offset
of the payload of each log block instead of the starting offset of the
log block.

If the zio in l2arc_write_buffers() fails, restore the lbps array in the
header of the device to its previous state in l2arc_write_done().

If l2arc_rebuild() ends the rebuild process without restoring any L2ARC
log blocks in ARC and without any other error, this means that the lbps
array in the header is pointing to non-existent or invalid log blocks.
Reset the device header in this case.

In l2arc_rebuild() change the zfs_dbgmsg messages to
spa_history_log_internal() making them user visible with zpool history
command.

Non-functional changes:

Make the first test in persistent L2ARC use `zdb -lll` to increase
coverage in `zdb.c`.

Rename psize with asize when referring to log blocks, since
L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also
rename dh_log_blk_entries to dh_log_entries to make it clear that
it is a mirror of l2ad_log_entries. Added comments for both changes.

Fix inaccurate comments for example in l2arc_log_blk_restore().

Add asserts at the end in l2arc_evict() and l2arc_write_buffers().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10228
2020-05-07 16:34:03 -07:00
Paul Dagnelie
108a454a46
Add support for boot environment data to be stored in the label
Modern bootloaders leverage data stored in the root filesystem to 
enable some of their powerful features. GRUB specifically has a grubenv 
file which can store large amounts of configuration data that can be 
read and written at boot time and during normal operation. This allows 
sysadmins to configure useful features like automated failover after 
failed boot attempts. Unfortunately, due to the Copy-on-Write nature 
of ZFS, the standard behavior of these tools cannot handle writing to
ZFS files safely at boot time. We need an alternative way to store 
data that allows the bootloader to make changes to the data.

This work is very similar to work that was done on Illumos to enable 
similar functionality in the FreeBSD bootloader. This patch is different 
in that the data being stored is a raw grubenv file; this file can store 
arbitrary variables and values, and the scripting provided by grub is 
powerful enough that special structures are not required to implement 
advanced behavior.

We repurpose the second padding area in each label to store the grubenv 
file, protected by an embedded checksum. We add two ioctls to get and 
set this data, and libzfs_core and libzfs functions to access them more 
easily. There are no direct command line interfaces to these functions; 
these will be added directly to the bootloader utilities.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10009
2020-05-07 09:36:33 -07:00
Matthew Ahrens
196bee4cfd
Remove deduplicated send/receive code
Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such
streams) are deprecated.  Deduplicated send streams can be received by
first converting them to non-deduplicated with the `zstream redup`
command.

This commit removes the code for sending and receiving deduplicated send
streams.  `zfs send -D` will now print a warning, ignore the `-D` flag,
and generate a regular (non-deduplicated) send stream.  `zfs receive` of
a deduplicated send stream will print an error message and fail.

The resulting code simplification (especially in the kernel's support
for receiving dedup streams) should help enable future performance
enhancements.

Several new tests are added which leverage `zstream redup`.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Issue #7887
Issue #10117
Issue #10156
Closes #10212
2020-04-23 10:06:57 -07:00
Matthew Ahrens
32d805c3e2
Use a struct to organize metaslab-group-allocator fields
Each metaslab group (of which there is one per top-level vdev) has
several (4, by default) "metaslab group allocators".  Each "allocator"
has its own metaslab that it prefers to allocate from (the "primary"
allocator), and each can perform allocations concurrently with the other
allocators.  In addition to the primary metaslab, there are several
other fields that need to be tracked separately for each allocator.
These are currently stored as several arrays in the metaslab_group_t,
each array indexed by allocator number.

This change organizes all the metaslab-group-allocator-specific fields
into a new struct, metaslab_group_allocator_t.  The metaslab_group_t now
needs only one array indexed by the allocator number - which contains
the metaslab_group_allocator_t's.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10213
2020-04-22 10:26:56 -07:00
George Amanakis
77f6826b83
Persistent L2ARC
This commit makes the L2ARC persistent across reboots. We implement
a light-weight persistent L2ARC metadata structure that allows L2ARC
contents to be recovered after a reboot. This significantly eases the
impact a reboot has on read performance on systems with large caches.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Saso Kiselkov <skiselkov@gmail.com>
Co-authored-by: Jorgen Lundman <lundman@lundman.net>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Ported-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #925 
Closes #1823 
Closes #2672 
Closes #3744 
Closes #9582
2020-04-10 10:33:35 -07:00
Ryan Moeller
36a6e2335c
Don't ignore zfs_arc_max below allmem/32
Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower
than the default allmem/32 zfs_arc_max can also be set lower.

Add warning messages when tunables are being ignored.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10157
Closes #10158
2020-04-09 15:39:48 -07:00
Matthew Macy
8b27e08ed8
Add separate field for indicating that spa is in middle of split
By default it's not possible to open a device already owned by an
active vdev. It's necessary to make an exception to this for vdev
split. The FreeBSD platform code will make an exception if
spa_is splitting is set to to true.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10178
2020-04-09 09:59:31 -07:00
Ryan Moeller
7e3df9db12
Finish refactoring for ZFS_MODULE_PARAM_CALL
Linux and FreeBSD have different parameters for tunable proc handler.
This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL
macro.

To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create
per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS.

With the declarations wired up we discovered an incorrect scope prefix
for spa_slop_shift, so this is now fixed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10179
2020-04-07 10:06:22 -07:00