Commit Graph

2442 Commits

Author SHA1 Message Date
Mateusz Guzik
09eb36ce3d
Introduce CPU_SEQID_UNSTABLE
Current CPU_SEQID users don't care about possibly changing CPU ID, but
enclose it within kpreempt disable/enable in order to fend off warnings
from Linux's CONFIG_DEBUG_PREEMPT.

There is no need to do it. The expected way to get CPU ID while allowing
for migration is to use raw_smp_processor_id.

In order to make this future-proof this patch keeps CPU_SEQID as is and
introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers
explicitly want this behavior.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #11142
2020-11-02 11:51:12 -08:00
Matthew Macy
8583540c6e
Consolidate zfs_holey and zfs_access
The zfs_holey() and zfs_access() functions can be made common
to both FreeBSD and Linux.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11125
2020-10-31 09:40:08 -07:00
Matthew Macy
5fa356ea44
Remove UIO_ZEROCOPY functions structures
The original xuio zero copy functionality has always been unused 
on Linux and FreeBSD.  Remove this disabled code to avoid any
confusion and improve readability.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11124
2020-10-30 10:00:33 -07:00
Alexander Motin
1199c3e8fb
Yield periodically when rebuilding L2ARC
L2ARC devices of several terabytes filled with 4KB blocks may take 15
minutes to rebuild.  Due to the way L2ARC log reading is implemented
it is quite likely that for all that time rebuild thread will never
sleep.  At least on FreeBSD kernel threads have absolute priority and
can not be preempted by threads with lower priorities.  If some thread
is also bound to that specific CPU it may not get any CPU time for all
the 15 minutes.

Reviewed-by: Cedric Berger <cedric@precidata.com>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11116
2020-10-30 08:57:54 -07:00
Ryan Moeller
76d04993a6
Update references to nonexistent man pages in code
Refer to the correct section or alternative for FreeBSD and Linux.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11132
2020-10-30 08:55:59 -07:00
Ryan Moeller
eb02a4c6fb
Add missing zfs_arc_evict_batch_limit tunable
It's even documented already.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11094
2020-10-22 10:18:26 -07:00
Matthew Macy
e53d678d4a
Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD.  With a little refactoring they can be
moved to the common code which is what is done by this commit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11078
2020-10-21 14:08:06 -07:00
Adam D. Moss
666aa69f32
Non-l2arc pool reads shouldn't be l2arc misses
The current l2_misses accounting behavior treats all reads to pools 
without a configured l2arc as an l2arc miss, IFF there is at least 
one other pool on the system which does have an l2arc configured.

This makes it extremely hard to tune for an improved l2arc hit/miss 
ratio because this ratio will be modulated by reads from pools which 
do not (and should not) have l2arc devices; its upper limit will 
depend on the ratio of reads from l2arc'd pools and non-l2arc'd pools.

This PR prevents ARC reads affecting l2arc stats (n.b. l2_misses is 
the only relevant one) where the target spa doesn't have an l2arc.

Includes new test - l2arc_l2miss_pos.ksh

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Adam Moss <c@yotes.com>
Closes #10921
2020-10-20 11:39:52 -07:00
Don Brady
dff71c7936
Ignore special vdev ashift for spa ashift min/max
The removal of a vdev in the normal class would fail if there was a 
special or deup vdev that had a different ashift than the vdevs in 
the normal class.

Moved the initialization of spa_min_ashift / spa_max_ashift from 
vdev_open so that it occurs after the vdev allocation bias was 
initialized (i.e. after vdev_load).

Caveat -- In order to remove a special/dedup vdev it must have the 
same ashift as the normal pool vdevs.  This could perhaps be lifted 
in the future (i.e. for the case where there is ample space in any 
surviving special class vdevs)

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #9363
Closes #9364
Closes #11053
2020-10-15 14:45:16 -07:00
Christian Schwarz
15a4ca4620
Fix crash caused by invalid snapshot names in redactnvl
This is a follow up fix for commit 0fdd6106bb.  The VERIFY is
only true when we haven't hit an error code path.  See added
test case for a reproducer.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11048
2020-10-14 14:04:19 -07:00
Paul Dagnelie
6a60ef80e2
Fix incorrect deletion order in range_tree_add_impl gap case
After a side-effectful call like add or remove, references to range 
segs stored in btrees can no longer be used safely.  We move the 
remove call to just before the reinsertion call so that the seg 
remains valid for as long as we need it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11044 
Closes #11056
2020-10-14 08:59:54 -07:00
Matthew Macy
57dc5d42b1
dmu_zfetch: don't leak unreferenced stream when zfetch is freed
Currently streams are only freed when:
  - They have no referencing zfetch and and their I/O references
    go to zero.
  - They are more than 2s old and a new I/O request comes in on
    the same zfetch.

This means that we will leak unreferenced streams when their zfetch
structure is freed.

This change checks the reference count on a stream at zfetch free
time. If it is zero we free it immediately. If it has remaining
references we allow the prefetch callback to free it at I/O
completion time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11052
2020-10-13 21:03:36 -07:00
Ryan Moeller
7dfc56d866
Expose zfetch_max_idistance tunable
FreeBSD had this value tunable before the switch to the new OpenZFS.
The tunable name has changed, breaking legacy compat.

Restore legacy compat for this tunable, properly expose the tunable
with the new name on all platforms, and document it in
zfs-module-parameters(5).

While here, clean up the documentation for zfetch_max_distance a bit.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11038
2020-10-13 09:32:34 -07:00
Christian Schwarz
61868bb14d
zil_parse: make callback parameters const
Code cleanup, a follow up commit to 4d55ea81.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ryan Moeller <ryan@freqlabs.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #11020
2020-10-09 09:34:54 -07:00
Brian Behlendorf
d0249a4bd0
Replace ZFS on Linux references with OpenZFS
This change updates the documentation to refer to the project
as OpenZFS instead ZFS on Linux.  Web links have been updated
to refer to https://github.com/openzfs/zfs.  The extraneous
zfsonlinux.org web links in the ZED and SPL sources have been
dropped.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11007
2020-10-08 20:10:13 -07:00
Chuck Tuffli
a8fc1b8743
Fix ubsan: shift exponent is too large
When running libzpool with the Undefined Behavior Sanitizer (ubsan)
enabled, a zpool create causes a run-time error:

    module/zfs/vdev_label.c:600:14: runtime error: shift exponent 64 is
    too large for 64-bit type 'long long unsigned int'`

in vdev_config_generate()

Fix is to convert vdev_removal_max_span to its base-2 logarithm, using
highbit64(), and then compare the "shifts".

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Chuck Tuffli <ctuffli@gmail.com>
Closes #9744
Closes #11024
2020-10-08 16:37:27 -07:00
George Amanakis
a76e4e6761
Make L2ARC tests more robust
Instead of relying on arbitrary timers after pool export/import or cache
device off/online rely on arcstats. This makes the L2ARC tests more
robust. Also cleanup some functions related to persistent L2ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10983
2020-10-05 15:29:05 -07:00
Ryan Moeller
4d55ea811d
Throw const on some strings
In C, const indicates to the reader that mutation will not occur.
It can also serve as a hint about ownership.

Add const in a few places where it makes sense.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #10997
2020-10-02 17:44:10 -07:00
John Poduska
5b525165e9
Mismatched nvlist names in zfs_keys_send_space
This causes "zfs send -vt ..." to fail with:

    cannot resume send: Unknown error 1030

It turns out that some of the name/value pairs in the verification
list for zfs_ioc_send_space(), zfs_keys_send_space, had the wrong
name, so the ioctl got kicked out in zfs_check_input_nvpairs().
Update the names accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: John Poduska <jpoduska@datto.com>
Closes #10978
2020-10-02 17:40:46 -07:00
Matthew Macy
1cb8202b1b
Eliminate gratuitous bzeroing in dbuf_stats_hash_table_data
`dbuf_stats_hash_table_data` can take much longer than it needs to
by repeatedly bzeroing its buffer when in fact the buffer only needs
to be NULL terminated.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10993
2020-09-30 13:24:38 -07:00
Sebastian Gottschall
8a171ccd92
do a cyclic seek for unused memory objects in pool
In non regular use cases allocated memory might stay persistent in memory
pool. This small patch checks every minute if there are old objects which
can be released from memory pool.

Right now with regular use, the pool is checked for old objects on each
allocation attempt from this pool. so basically polling by its use. Now
consider what happens if someone writes a lot of files and stops use of
the volume or even unmounts it. So the code will no longer check if
objects can be released from the pool. Already allocated objects will
still stay in pool cache. this is no big issue for common use. But
someone discovered this issue while doing tests. personally i know this
behavior and I'm aware of it. Its no big issue. just a enhancement

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #10938 
Closes #10969
2020-09-30 13:22:34 -07:00
Ryan Moeller
c0bd2e0fe2
Drop references when skipping dmu_send due to EXDEV
When an invalid incremental send is requested where the "to" ds is
before the "from" ds, make sure to drop the reference to the pool
and the dataset before returning the error.

Add an assert on FreeBSD to make sure we don't hold any locks after
returning from an ioctl.

Add some test coverage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10919
2020-09-30 13:19:49 -07:00
Matthew Macy
af20b97078
zfetch: Don't issue new streams when old have not completed
The current dmu_zfetch code implicitly assumes that I/Os complete
within min_sec_reap seconds. With async dmu and a readonly workload
(and thus no exponential backoff in operations from the "write
throttle") such as L2ARC rebuild it is possible to saturate the drives
with I/O requests. These are then effectively compounded with prefetch
requests.

This change reference counts streams and prevents them from being
recycled after their min_sec_reap timeout if they still have
outstanding I/Os.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10900
2020-09-27 17:08:38 -07:00
Adam D. Moss
acfd2d4641
Add DB_RF_NOPREFETCH to dbuf_read()s in dnode.c
Prefetching of dnodes in dbuf_read() can cause significant mutex 
contention for some workloads and isn't very helpful.  This is  
because we already get 32 dnodes for each block read, and when 
iterating over a directory we prefetch the dnodes in the directory.
Disable this prefetching to prevent the lock contention.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Submitted-by: Adam Moss <c@yotes.com>
Submitted-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Adam Moss <c@yotes.com>
Closes #10877 
Closes #10953
2020-09-25 13:49:22 -07:00
Ryan Moeller
863e38453e Prune dead branch reported by Coverity
wkey is NULL at every `goto error;`.
dcp is never NULL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10884
2020-09-25 13:11:53 -07:00
Christian Schwarz
a5c77dc4d5
zfs_log_write: simplify data copying code for WR_COPIED records
lr_write_t records that are WR_COPIED have the record data directly
appended to them (see lr_write_t type definition).

The data is copied from the debuf using dmu_read_by_dnode.

This function was called, only for WR_COPIED records, as part of a
short-circuiting if-statement's if-expression.

I found this side-effectful call to dmu_read_by_dnode pretty
hard to spot.
This patch improves readability by moving the call to its own line.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #10956
2020-09-25 13:06:34 -07:00
Matthew Macy
7b8363d7f0
FreeBSD: Add support for procfs_list
The procfs_list interface is required by several kstats. Implement
this functionality for FreeBSD to provide access to these kstats.
                           
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10890
2020-09-23 16:43:51 -07:00
Paul Dagnelie
20dfe8cd3b
Don't set numobjs to UINT64_MAX or near it
Resolves an issue with `zfs send` streams from 0.8.4 which
prevents them from being received by versions < 0.7.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10911 
Closes #10916
2020-09-22 16:16:07 -07:00
George Amanakis
c6f5e9d92f
Restore clearing of L2CACHE flag in arc_read_done()
Commit 45152dc removed clearing of L2CACHE flag in arc_read_done() and
moved related code in l2arc_write_eligible(). After careful code
inspection arc_read_done() is not bypassed in the case of prefetches.
Thus restore the old behavior.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: adam moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10951
2020-09-22 16:08:05 -07:00
George Wilson
c494aa7f57
vdev_ashift should only be set once
== Motivation and Context

The new vdev ashift optimization prevents the removal of devices when
a zfs configuration is comprised of disks which have different logical
and physical block sizes. This is caused because we set 'spa_min_ashift'
in vdev_open and then later call 'vdev_ashift_optimize'. This would
result in an inconsistency between spa's ashift calculations and that
of the top-level vdev.

In addition, the optimization logical ignores the overridden ashift
value that would be provided by '-o ashift=<val>'.

== Description

This change reworks the vdev ashift optimization so that it's only
set the first time the device is configured. It still allows the
physical and logical ahsift values to be set every time the device
is opened but those values are only consulted on first open.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
External-Issue: DLPX-71831
Closes #10932
2020-09-18 12:13:47 -07:00
Pavel Snajdr
9569c31161 Fix stack frame size: dnode_dirty_l1range()
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:55 -07:00
Pavel Snajdr
a1c5578ce0 dmu_redact_snap: fix possible memleak
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:45 -07:00
Pavel Snajdr
8c0b16e6e9 Fix stack frame size: dmu_redact_snap()
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:35 -07:00
Pavel Snajdr
c95625769d Fix stack frame size: spa_livelist_delete_cb()
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #10879
2020-09-15 15:55:03 -07:00
Toomas Soome
1db9e6e4e4
zfs label bootenv should store data as nvlist
nvlist does allow us to support different data types and systems.

To encapsulate user data to/from nvlist, the libzfsbootenv library is
provided.

Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #10774
2020-09-15 15:42:27 -07:00
George Amanakis
085321621e
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.

This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.

L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.

When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.

Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.

Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
  an ARC buffer. However, prefetches may be issued in a way that
  arc_read_done() is bypassed. Instead, move the related code in
  l2arc_write_eligible() to account for those cases too.

Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.

Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 10:10:44 -07:00
Olaf Faaland
a74259cea0
Initialize mmp_last_write when the mmp thread starts
A great deal of time may go by between when mmp_init() is called and
the MMP thread starts, particularly if there are bad devices, because
there is I/O checking configs etc.  If this time is too long,

    (gethrtime() - mmp_last_write) > mmp_fail_ns

at the time the MMP thread starts.  If MMP is configured to suspend
the pool, the pool will be suspended immediately.

This can be seen in issue #10838

The value of mmp_last_write doesn't matter before the mmp thread
starts.  To give the MMP thread time to issue and land MMP writes,
initialize mmp_last_write when the MMP thread starts.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #10873
2020-09-09 10:12:54 -07:00
George Amanakis
feb3a7eef1
Introduce ZFS module parameter l2arc_mfuonly
In certain workloads it may be beneficial to reduce wear of L2ARC
devices by not caching MRU metadata and data into L2ARC. This commit
introduces a new tunable l2arc_mfuonly for this purpose.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10710
2020-09-08 11:44:37 -07:00
Toomas Soome
189272f78a
dnode_special_open() error: unchecked function return 'zrl_tryenter'
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #10876
2020-09-08 11:36:52 -07:00
Matthew Macy
7432d29760
FreeBSD: reduce priority of ZIO_TASKQ_ISSUE writes by a larger value
On FreeBSD, if priorities divided by four (RQ_PPQ) are equal then
a difference between them is insignificant. In other words,
incrementing pri by only one as on Linux is insufficient.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10872
2020-09-04 11:13:27 -07:00
Brian Behlendorf
dce63135ae Sequential scrub and resilver updated comments
Commit d4a72f2 which introduced multi-phase scrubs and resilvers
continued the work presented by Nexenta at the 2016 ZFS developer
summit.  Update the source to reflect their contribution.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2020-09-04 10:51:51 -07:00
Don Brady
4f07282786
Avoid posting duplicate zpool events
Duplicate io and checksum ereport events can misrepresent that 
things are worse than they seem. Ideally the zpool events and the 
corresponding vdev stat error counts in a zpool status should be 
for unique errors -- not the same error being counted over and over. 
This can be demonstrated in a simple example. With a single bad 
block in a datafile and just 5 reads of the file we end up with a 
degraded vdev, even though there is only one unique error in the pool.

The proposed solution to the above issue, is to eliminate duplicates 
when posting events and when updating vdev error stats. We now save 
recent error events of interest when posting events so that we can 
easily check for duplicates when posting an error. 

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #10861
2020-09-04 10:34:28 -07:00
Matthew Ahrens
3808032489
nowait synctask must succeed
If a `zfs_space_check_t` other than `ZFS_SPACE_CHECK_NONE` is used with
`dsl_sync_task_nowait()`, the sync task may fail due to ENOSPC.
However, there is no way to notice or communicate this failure, so it's
extremely difficult to use this functionality correctly, and in fact
almost all callers use `ZFS_SPACE_CHECK_NONE`.

This commit removes the `zfs_space_check_t` argument from
`dsl_sync_task_nowait()`, and always uses `ZFS_SPACE_CHECK_NONE`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10855
2020-09-04 10:29:39 -07:00
Ryan Moeller
cd80273909
Retain thread name when resuming a zthr
When created, a zthr is given a name to identify it by.  This name is
lost when a cancelled zthr is resumed.

Retain the name of a zthr so it can be used when resuming.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10881
2020-09-03 20:09:52 -07:00
Matthew Macy
ac6e5fb202
Replace cv_{timed}wait_sig with cv_{timed}wait_idle where appropriate
There are a number of places where cv_?_sig is used simply for
accounting purposes but the surrounding code has no ability to
cope with actually receiving a signal. On FreeBSD it is possible
to send signals to individual kernel threads so this could
enable undesirable behavior.

This patch adds routines on Linux that will do the same idle
accounting as _sig without making the task interruptible. On
FreeBSD cv_*_idle  are all aliases for cv_*

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10843
2020-09-03 20:04:09 -07:00
Ryan Moeller
964791acdc
Make spa_stats.c tunables visible on FreeBSD
Use ZFS_MODULE_PARAM for cross-platform tunables in spa_stats.c, and
add update tunables.cfg in tests for the newly supported ones.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10858
2020-09-01 16:19:19 -07:00
Matthew Macy
e84e49218f
FreeBSD: Fix up after spa_stats.c move
Moving spa_stats added the additional burden of supporting
KSTAT_TYPE_IO.

spa_state_addr will always return a valid value regardless of
the value of 'n'. On FreeBSD this will cause an infinite loop
as it relies on the raw ops addr routine to indicate that there
is no more data.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10860
2020-09-01 16:16:56 -07:00
Ryan Moeller
7b4e27232d
Add 'zfs rename -u' to rename without remounting
Allow to rename file systems without remounting if it is possible.
It is possible for file systems with 'mountpoint' property set to
'legacy' or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even
mounted back.

This introduces layering violation, as we need to update
'f_mntfromname' field in statfs structure related to mountpoint (for
the dataset we are renaming and all its children).

In my opinion it is worth it, as it allow to update FreeBSD in even
cleaner way - in ZFS-only configuration root file system is ZFS file
system with 'mountpoint' property set to 'legacy'. If root dataset is
named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone
it (system/oldrootfs), update FreeBSD and if it doesn't boot we can
boot back from system/oldrootfs and rename it back to system/rootfs
while it is mounted as /. Before it was not possible, because
unmounting / was not possible.

Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Matt Macy <mmacy@freebsd.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10839
2020-09-01 16:14:16 -07:00
Toomas Soome
1144586b57
zio_ereport_post() and zio_ereport_start() return values are ignored
use (void) to silence analyzers.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Toomas Soome <tsoome@me.com>
Closes #10857
2020-08-31 19:35:11 -07:00
Matthew Macy
7bb18b94c7
Move spa_stats.c to common code
Initially it was considered simplest to stub out all
of the functions on FreeBSD. Now that FreeBSD supports
KSTAT_TYPE_RAW at least some of the functionality should
be made available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10842
2020-08-30 14:12:46 -07:00
Patrick Mooney
8d42c98d95
dnode_sync is careless with range tree
Because dnode_sync_free_range() must drop dn_mtx during its processing,
using it as a callback to range_tree_vacate() is not safe.  No other
operations (besides destroy) are allowed once range_tree_vacate() has
begun, and dropping dn_mtx would leave a window open for another thread
to observe that invalid (and unsafe) state via dnode_block_freed().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Patrick Mooney <pmooney@oxide.computer>
Closes #10708 
Closes #10823
2020-08-26 21:48:29 -07:00
Ryan Moeller
a2f944a140
zpool: Change base URL for ZFS messages to openzfs-docs
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10820
2020-08-26 21:43:06 -07:00
Brian Behlendorf
03f5d2fd6a
Remove duplicate dnode.h include
The zfs/sa.c source file accidentally includes sys/dnode.h twice.
Remove the second occurrence.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10816 
Closes #10819
2020-08-26 21:41:09 -07:00
Paul Dagnelie
4aa3b3bd47
Always track temporary fses and snapshots for accounting
The root cause of the issue is that we only occasionally do as the 
comments in the code suggest and actually ignore the %recv dataset when 
it comes to filesystem limit tracking. Specifically, the only time we 
ignore it is when initializing the filesystem and snapshot limit values; 
when creating a new %recv dataset or deleting one, we always update 
the bookkeeping. This causes a problem if you init the fs count on a 
filesystem that already has a %recv dataset, since the bookmarking 
will be decremented but not incremented. This is resolved in this 
patch by simply always tracking the %recv dataset as a child.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10791
2020-08-26 21:38:27 -07:00
Matthew Macy
2dbad44710
FreeBSD: disable neon usage
The neon support code does not build on FreeBSD,
ifdef out references to fix linker issues on arm64.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10809
2020-08-26 09:54:37 -07:00
Alexander Motin
523e1295fe
Introduce limit on size of L2ARC headers
Since L2ARC buffers are not evicted on memory pressure, too large
amount of headers on system with irrationally large L2ARC can render
it slow or even unusable.  This change limits L2ARC writes and
rebuild if unevictable L2ARC-only headers reach dangerous level.

While there, call arc_adapt() on L2ARC rebuild, so that it could
properly grow arc_c, reflecting potentially significant ARC size
increase and avoiding slow growth with hopeless eviction attempts
later when "overflow" is detected.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #10765
2020-08-25 14:33:36 -07:00
Brian Behlendorf
94dac3e880
Export dmu_offset_next() symbol
Export the dmu_offset_next() symbol for use by Lustre.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10796
2020-08-25 08:34:41 -07:00
Sebastian Gottschall
184df27eef
Avoid symbol collision with in-kernel zstdlib
For Linux, when zfs is compiled as an in kernel static variant
and the in kernel zstd library is compiled statically into the kernel
a symbol collision will occur.  This wrapper header renames all
of the relevant zstd functions to avoid this problem.

Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Closes #10775
2020-08-24 12:20:41 -07:00
Ryan Moeller
6fe3498ca3
Import vdev ashift optimization from FreeBSD
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

 1. Existing pools created with devices that have different logical
    and physical block sizes, but were configured to use the logical
    block size (e.g. because the OS version used for pool construction
    reported the logical block size instead of the physical block
    size) will suddenly find that the vdev allocation size has
    increased. This can be easily tolerated for active members of
    the array, but ZFS would prevent replacement of a vdev with
    another identical device because it now appears that the smaller
    allocation size required by the pool is not supported by the new
    device.

 2. The device's physical block size may be too large to be supported
    by ZFS. The optimal allocation size for the vdev may be quite
    large. For example, a RAID controller may export a vdev that
    requires read-modify-write cycles unless accessed using 64k
    aligned/sized requests. ZFS currently has an 8k minimum block
    size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Matthew Macy <mmacy@freebsd.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10619
2020-08-21 12:53:17 -07:00
Matthew Ahrens
3dc18995bd
Fix indentation in dnode_free_range()
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10744
2020-08-20 11:45:20 -07:00
Matthew Macy
1c2725a157
FreeBSD: 11.x arc_stats compatibility
Removing other_size from arc_stats breaks top in 11.x jails
running on HEAD.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10745
2020-08-20 10:55:02 -07:00
Michael Niewöhner
10b3c7f5e4 Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:

- zstd: A basic ZStandard compression algorithm Available compression.
  Levels for zstd are zstd-1 through zstd-19, where the compression
  increases with every level, but speed decreases.

- zstd-fast: A faster version of the ZStandard compression algorithm
  zstd-fast is basically a "negative" level of zstd. The compression
  decreases with every level, but speed increases.

  Available compression levels for zstd-fast:
   - zstd-fast-1 through zstd-fast-10
   - zstd-fast-20 through zstd-fast-100 (in increments of 10)
   - zstd-fast-500 and zstd-fast-1000

For more information check the man page.

Implementation details:

Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.

The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers).  The upper bits are used to store the compression level.

It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.

All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables.  Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).

The userspace tools all use the combined/bit-shifted value.

Additional notes:

zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.

ZSTD is included with all current tests and new tests are added
as-needed.

Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born.  This is currently only used by zstd but can be
extended as needed.

Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-20 10:30:06 -07:00
Brian Behlendorf
cfd59f904b
Fix ARC aggsum access after arc_state_fini()
Commit 85ec5cbae updated abd_update_scatter_stats() such that it
calls arc_space_consume() and arc_space_return() when updating the
scatter stats.  This requires that the global aggsum value for the
ARC be initialized.  Normally this is not an issue, however during
module unload the l2arc_do_free_on_write() function was called in
l2arc_cleanup() after arc_state_fini() destroyed the aggsum values.
We can resolve this issue by performing l2arc_do_free_on_write()
slightly earlier in arc_fini().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #10739
2020-08-18 22:11:34 -07:00
Matthew Macy
716b53d0a1
FreeBSD: Fix UNIX permissions checking
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10727
2020-08-18 09:57:07 -07:00
Ryan Moeller
009cc8e884
Make zc_nvlist_src_size limit tunable
We limit the size of nvlists passed to the kernel so a user cannot make
the kernel do an unreasonably large allocation.  On FreeBSD this limit
was 128 kiB, which turns out to be a bit too small when doing some
operations involving a large number of datasets or snapshots, for
example replication.

Make this limit tunable, with a platform-specific auto default.
Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the
system limit on user wired memory, which allows it to scale depending
on system configuration.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Issue #6572 
Closes #10706
2020-08-18 09:33:55 -07:00
Richard Laager
eaa25f1a8e
Remove GRUB restrictions
The GRUB restrictions are based around the pool's bootfs property.
Given the current situation where GRUB is not staying current with
OpenZFS pool features, having either a non-ZFS /boot or a separate
pool with limited features are pretty much the only long-term answers
for GRUB support.  Only the second case matters in this context.  For
the restrictions to be useful, the bootfs property would have to be set
on the boot pool, because that is where we need the restrictions, as
that is the pool that GRUB reads from. The documentation for bootfs
describes it as pointing to the root pool. That's also how it's used in
the initramfs. ZFS does not allow setting bootfs to point to a dataset
in another pool. (If it did, it'd be difficult-to-impossible to enforce
these restrictions cross-pool). Accordingly, bootfs is pretty much
useless for GRUB scenarios moving forward.

Even for users who have only one pool, the existing restrictions for
GRUB are incomplete. They don't prevent you from enabling the
unsupported checksums, for example. For that reason, I have ripped out
all the GRUB restrictions.

A little longer-term, I think extending the proposed features=portable
system to define a features=grub is a much more useful approach. The
user could set that on the boot pool at creation, and things would
Just Work.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8627
2020-08-17 23:12:39 -07:00
Matthew Ahrens
85ec5cbae2
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K.  Therefore, the space used to cache each block
is rounded up to a multiple of 4K.  The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat.  However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.

Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`.  This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is.  It also impacts how the ARC responds to memory
pressure.  As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.

If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste.  If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.

However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large.  On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.

This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`.  As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-17 20:04:04 -07:00
Ryan Moeller
3df0c2fa32
FreeBSD: fix the build with Clang 11
* Cast void * to uintptr_t before casting to boolean_t.

* Avoid clashing definition of __asm when not on Linux to
  prevent duplicate __volatile__. This was already done in
  some places but not all.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #10723
2020-08-17 15:40:17 -07:00
Serapheim Dimitropoulos
b0099072df
Fix typo in btree.c
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #10725
2020-08-17 15:25:37 -07:00
Matthew Macy
5f1984f2f8
FreeBSD: fallback to /boot/ to look for zpool.cache
Up until now zpool.cache has always lived in /boot on FreeBSD.
For the sake of compatibility fallback to /boot if zpool.cache
isn't found in /etc/zfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10720
2020-08-17 14:43:47 -07:00
Ryan Moeller
3eaf76a8d2
Fix l2arc_dev_rebuild_start thread name
`thread_create` on FreeBSD stringifies the argument passed as the
thread function to create a name for the thread. The thread name for
`l2arc_dev_rebuild_start` ended up with `(void (*)(void *))` in it.

Change the type signature so the function does not need to be cast
when creating the thread.  Rename the function to
`l2arc_dev_rebuild_thread` for clarity and consistency, as well.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10716
2020-08-17 11:02:32 -07:00
Allan Jude
fc34dfba8e
Fix L2ARC reads when compressed ARC disabled
When reading compressed blocks from the L2ARC, with
compressed ARC disabled, arc_hdr_size() returns
LSIZE rather than PSIZE, but the actual read is PSIZE.
This causes l2arc_read_done() to compare the checksum
against the wrong size, resulting in checksum failure.

This manifests as an increase in the kstat l2_cksum_bad
and the read being retried from the main pool, making the
L2ARC ineffective.

Add new L2ARC tests with Compressed ARC enabled/disabled

Blocks are handled differently depending on the state of the
zfs_compressed_arc_enabled tunable.

If a block is compressed on-disk, and compressed_arc is enabled:
- the block is read from disk
- It is NOT decompressed
- It is added to the ARC in its compressed form
- l2arc_write_buffers() may write it to the L2ARC (as is)
- l2arc_read_done() compares the checksum to the BP (compressed)

However, if compressed_arc is disabled:
- the block is read from disk
- It is decompressed
- It is added to the ARC (uncompressed)
- l2arc_write_buffers() will use l2arc_apply_transforms() to
  recompress the block, before writing it to the L2ARC
- l2arc_read_done() compares the checksum to the BP (compressed)
- l2arc_read_done() will use l2arc_untransform() to uncompress it

This test writes out a test file to a pool consisting of one disk
and one cache device, then randomly reads from it. Since the arc_max
in the tests is low, this will feed the L2ARC, and result in reads
from the L2ARC.

We compare the value of the kstat l2_cksum_bad before and after
to determine if any blocks failed to survive the trip through the
L2ARC.

Sponsored-by: The FreeBSD Foundation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #10693
2020-08-13 23:31:20 -07:00
Jorgen Lundman
faa296c73c
Release onexit/events with any missed zfsdev_state
Linux and FreeBSD will most likely never see this issue.
On macOS when kext is unloaded, but zed is still connected, zed
will be issued ENODEV. As the cdevsw is released, the kernel
will not have zfsdev_release() called to release minor/onexit/events,
and it "leaks". This ensures it is cleaned up before unload.

Changed the for loop from zsprev, to zsnext style, for less
code duplication.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #10700
2020-08-13 15:03:23 -07:00
Matthew Ahrens
d64c6a2eee
Use zfs_dbgmsg to log metaslab_load/unload
Metaslabs are now (usually) loaded and unloaded infrequently, but when
that is not the case, it is useful to have a log of when and why these
events happened.

This commit enables the zfs_dbgmsg() in metaslab_load(), and adds a
zfs_dbgmsg() in metaslab_unload().

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10683
2020-08-12 10:10:50 -07:00
Matthew Macy
e111c80247
Restore ARC MFU/MRU pressure
The arc_adapt() function tunes LRU/MLU balance according to 4 types of
cache hits (which is passed as state agrument): ghost LRU, LRU, MRU,
ghost MRU. If this function is called with wrong cache hit (state),
adaptation will be sub-optimal and performance will suffer.

Some time ago upstream received this commit:

6950 ARC should cache compressed data) in arc_read() do next
sequence (access to ghost buffer)

Before this commit, hit to any ghost list was passed arc_adapt() before
call to arc_access() which revive element in cache and change state from
ghost to real hit.

After this commit, the order of calls was reverted and arc_adapt() is
now called only with «real» hits even if hit was in one of two ghost
lists, which renders ghost lists useless and breaks the ARC algorithm.

FreeBSD fixed this problem locally in Change D19094 / Commit r348772.

This change is an adaptation of the above commit to the current arc
code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10548 
Closes #10618
2020-08-12 10:03:24 -07:00
Allan Jude
9777044f1c
Fix typo
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #10694
2020-08-11 13:16:57 -07:00
Paul Dagnelie
12045d0278
Clarify error message when a range-tree double-add occurs
In various other pieces of logic have resulted in situations where 
we double-free space in ZFS. This in turn results in a double-add 
to the range trees. These issues have been much more difficult to 
diagnose than they should have been, because the error handling 
around this case is much weaker than around the double remove case.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #10654
2020-08-07 14:13:13 -07:00
Matthew Ahrens
d87676a9fa
Fix i/o error handling of livelists and zap iteration
Pool-wide metadata is stored in the MOS (Meta Object Set).  This
metadata is stored in triplicate, in addition to any pool-level
reduncancy (e.g. RAIDZ).  However, if all 3+ copies of this metadata are
not available, we can still get EIO/ECKSUM when reading from the MOS.
If we encounter such an error in syncing context, we have typically
already committed to making a change that we now can't do because of the
corrupt/missing metadata.  We typically "handle" this with a `VERIFY()`
or `zfs_panic_recover()`.  This prevents the system from continuing on
in an undefined state, while minimizing the amount of error-handling
code.

However, there are some code paths that ignore these i/o errors, or
`ASSERT()` that they don't happen.  Since assertions are disabled on
non-debug builds, they effectively ignore them as well.  This can lead
to ZFS continuing on in an incorrect state, potentially leading to
on-disk inconsistencies.

This commit adds handling for these i/o errors on MOS metadata,
typically with a `VERIFY()`:

* Handle error return from `zap_cursor_retrieve()` in 4 places in
`dsl_deadlist.c`.

* Handle error return from `zap_contains()` in `dsl_dir_hold_obj()`.
Turns out this call isn't necessary because we can always call
`zap_lookup()`.

* Handle error return from `zap_lookup()` in `dsl_fs_ss_limit_check()`.

* Handle error return from `zap_remove()` in `dsl_dir_rename_sync()`.

* Handle error return from `zap_lookup()` in
`dsl_dir_remove_livelist()`.

* Handle error return from `dsl_process_sub_livelist()` in
`spa_livelist_delete_cb()`.

Additionally:

* Augment the internal history log message for `zfs destroy` to note
which method is used (e.g. bptree, livelist, or, synchronous) and the
mintxg.

* Correct a comment in `dbuf_init()`.

* Correct indentation in `dsl_dir_remove_livelist()`.

Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10643
2020-08-05 10:22:09 -07:00
Matthew Macy
22dcf89181
Add missed thread_exit() to vdev_{autotrim,rebuild}_thread
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10668
2020-08-05 10:17:07 -07:00
George Amanakis
da60484db5
Fix logging in l2arc_rebuild()
In case the L2ARC rebuild was canceled, do not log to spa history
log as the pool may be in the process of being removed and a panic
may occur:

BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:spa_history_log_internal+0xb1/0x120 [zfs]
Call Trace:
 l2arc_rebuild+0x464/0x7c0 [zfs]
 l2arc_dev_rebuild_start+0x2d/0x130 [zfs]
 ? l2arc_rebuild+0x7c0/0x7c0 [zfs]
 thread_generic_wrapper+0x78/0xb0 [spl]
 kthread+0xfb/0x130
 ? IS_ERR+0x10/0x10 [spl]
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10659
2020-08-01 11:17:18 -07:00
Allan Jude
8fb79fdddb
Change the error handling for invalid property values
ZFS recv should return a useful error message when an invalid index
property value is provided in the send stream properties nvlist

With a compression= property outside of the understood range:

Before:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
internal error: Invalid argument
Aborted (core dumped)
```
Note: the recv completes successfully, the abort() is likely just to
make it easier to track the unexpected error code.

After:
```
receiving full stream of zof/zstd_send@send2 into testpool/recv@send2
cannot receive compression property on testpool/recv: invalid property
value received 28.9M stream in 1 seconds (28.9M/sec)
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10631
2020-08-01 08:41:31 -07:00
Matthew Macy
47ed79ff60
Changes to make openzfs build within FreeBSD buildworld
A collection of header changes to enable FreeBSD to build
with vendored OpenZFS.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10635
2020-07-31 21:30:31 -07:00
Matthew Ahrens
3442c2a02d
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages.  This happens via 2 code paths:

1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory.  The ARC shrinker callback is invoked from the
page-allocation code path.

2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.

In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released.  However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`).  Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).

Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`.  This has several
negative impacts:

1. ZFS doesn't use RAM to cache data effectively.

2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.

3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.

To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.

With this commit:

1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker.  This limits the amount of time it takes
to allocate a single page.

2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system.  Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).

3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing.  This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.

4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.

5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-07-31 21:10:52 -07:00
Allan Jude
eabf270b2c
Remove duplicate include of sys/zfeature.h in dmu_objset.c
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #10636
2020-07-31 09:04:45 -07:00
Matthew Ahrens
948423a3d1
zfs promote does not delete livelist of origin
When a clone is promoted, its livelist is no longer accurate, so it is
discarded.  If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.

Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B

If we promote C, it discards C's livelist.  It should discard B's
livelist, but that is not happening.  The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C.  The incorrectly-freed blocks can be reallocated causing
checksum errors.  And when C is destroyed it can double-free the
incorrectly-freed blocks.

The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir.  So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed.  As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10652
2020-07-31 08:59:00 -07:00
Matthew Ahrens
3a92552f75
Fix error handling of vdev_top_zap
In `vdev_load()`, we look up several entries in the `vdev_top_zap`
object.  In most cases, if we encounter an i/o error, it will be
returned to the caller.  However, when handling
`VDEV_TOP_ZAP_ALLOCATION_BIAS`, if we get an i/o error, we may continue
on, which in theory could cause us to not realize that a vdev should be
used only for `special` allocations.

In practice, if we encountered an i/o error while looking for
`VDEV_TOP_ZAP_ALLOCATION_BIAS` in the `vdev_top_zap`, we'd also get an
i/o error while looking for other entries in the same object, and thus
the zpool open/import would fail.  Therefore the impact of this problem
is negligible.

This commit adds error handling for i/o errors while accessing the
`vdev_top_zap`, so that we aren't relying on unrelated code to fail for
us.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10637
2020-07-29 17:04:34 -07:00
Matthew Macy
27d96d2254
Rename refcount.h to zfs_refcount.h
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10620
2020-07-29 16:35:33 -07:00
Serapheim Dimitropoulos
843e9ca2e1
Introduce names for ZTHRs
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #10630
2020-07-29 09:43:33 -07:00
Matthew Macy
5678d3f593
Prefix zfs internal endian checks with _ZFS
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10621
2020-07-28 13:02:49 -07:00
Matthew Macy
e64cc4954c
Refactor ccompile.h to not include system headers
This is a step toward being able to vendor the OpenZFS code in FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10625
2020-07-25 20:09:50 -07:00
Matthew Macy
6d8da84106
Make use of ZFS_DEBUG consistent within kmod sources
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10623
2020-07-25 20:07:44 -07:00
Matthew Macy
f5b189f937
FreeBSD: Fixes required to build ZFS on PowerPC
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #10622
2020-07-25 11:00:23 -07:00
Brian Atkinson
6fba7bfd0e
Add gang ABD child to parent gang ABD
By design a gang ABD can not have another gang ABD as a child. This is
to make sure the logical offset in a gang ABD is consistent with the
individual ABDS it contains as children. If a gang ABD is added as a
child of a gang ABD we will add the individual children of the gang ABD
to the parent gang ABD. This allows for a consistent view of offsets
within the parent gang ABD.

Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10430
2020-07-24 21:09:20 -07:00
Ryan Moeller
8348fac30c
Limit dbuf cache sizes based only on ARC target size by default
Set the initial max sizes to ULONG_MAX to allow the caches to grow
with the ARC.

Recalculate the metadata cache size on demand so it can adapt, too.

Update descriptions in zfs-module-parameters(5).

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #10563 
Closes #10610
2020-07-24 20:38:48 -07:00
Matthew Ahrens
5dd92909c6
Adjust ARC terminology
The process of evicting data from the ARC is referred to as
`arc_adjust`.

This commit changes the term to `arc_evict`, which is more specific.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10592
2020-07-22 09:51:47 -07:00
Matthew Ahrens
026e529cb3
Remove skc_reclaim, hdr_recl, kmem_cache shrinker
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure.  This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function.  This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.

The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`.  When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op.  When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it.  Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.

The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer.  Therefore, additionally waking it from the
shrinker calbacks has little benefit.

The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second.  Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times.  Therefore, this mechanism consumes
significant amounts of CPU time.

The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
 1. Return free objects from the magazine layer to the slab layer
 2. Return entirely-free slabs to the page layer (i.e. free pages)

These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches.  The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.

These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine.  So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.

We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected.  Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.

This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.

Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10576
2020-07-19 09:58:30 -07:00
Matthew Ahrens
6774931dfa
Extend zdb to print inconsistencies in livelists and metaslabs
Livelists and spacemaps are data structures that are logs of allocations
and frees.  Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.

These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).

ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab.  However, it
generally halts processing when the error is detected.

When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.

The specific checks added are:

Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]

Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
  spacemap log, then report any double FREEs and double ALLOCs

Verify that livelists are consistenet with spacemaps.  The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes #10515
2020-07-14 17:51:05 -07:00
Alexander Motin
1743c737f5
Fix LOR between dp_config_rwlock and spa_props_lock
Our QE team during automated API testing hit deadlock in ZFS, caused
by lock order reversal.  From one side dsl_sync_task_sync() locks
dp_config_rwlock as writer and calls spa_sync_props(), which waits
for spa_props_lock.  From another spa_prop_get() locks spa_props_lock
and then calls dsl_pool_config_enter(), trying to lock dp_config_rwlock
as reader.

This patch makes spa_prop_get() lock dp_config_rwlock before
spa_props_lock, making the order consistent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #10553
2020-07-14 12:21:57 -07:00
Brian Atkinson
e4d3d77684
Fixing gang ABD child removal race condition
On linux the list debug code has been setting off a failure when
checking that the node->next->prev value is pointing back at the node.
At times this check evaluates to 0xdead. When removing a child from a
gang ABD we must acquire the child's abd_mtx to make sure that the
same ABD is not being added to another gang ABD while it is being
removed from a gang ABD. This fixes a race condition when checking
if an ABDs link is already active and part of another gang ABD before
adding it to a gang.

Added additional debug code for the gang ABD in abd_verify() to make
sure each child ABD has active links. Also check to make sure another
gang ABD is not added to a gang ABD.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #10511
2020-07-14 11:04:35 -07:00
Matthew Ahrens
e59a377a8f
filesystem_limit/snapshot_limit is incorrectly enforced against root
The filesystem_limit and snapshot_limit properties limit the number of
filesystems or snapshots that can be created below this dataset.
According to the manpage, "The limit is not enforced if the user is
allowed to change the limit."  Two types of users are allowed to change
the limit:

1. Those that have been delegated the `filesystem_limit` or
`snapshot_limit` permission, e.g. with
`zfs allow USER filesystem_limit DATASET`.  This works properly.

2. A user with elevated system privileges (e.g. root).  This does not
work - the root user will incorrectly get an error when trying to create
a snapshot/filesystem, if it exceeds the `_limit` property.

The problem is that `priv_policy_ns()` does not work if the `cred_t` is
not that of the current process.  This happens when
`dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a
sync task's check func) to determine the permissions of the
corresponding user process.

This commit fixes the issue by passing the `task_struct` (typedef'ed as
a `proc_t`) to syncing context, and then using `has_capability()` to
determine if that process is privileged.  Note that we still need to
pass the `cred_t` to syncing context so that we can check if the user
was delegated this permission with `zfs allow`.

This problem only impacts Linux.  Wrappers are added to FreeBSD but it
continues to use `priv_check_cred()`, which works on arbitrary `cred_t`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8226
Closes #10545
2020-07-11 17:18:02 -07:00