2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
2022-07-12 00:16:13 +03:00
|
|
|
* or https://opensource.org/licenses/CDDL-1.0.
|
2008-11-20 23:01:55 +03:00
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
2017-03-16 02:41:52 +03:00
|
|
|
* Copyright (c) 2018, Joyent, Inc.
|
2020-09-04 20:34:28 +03:00
|
|
|
* Copyright (c) 2011, 2020, Delphix. All rights reserved.
|
2020-04-10 20:33:35 +03:00
|
|
|
* Copyright (c) 2014, Saso Kiselkov. All rights reserved.
|
|
|
|
* Copyright (c) 2017, Nexenta Systems, Inc. All rights reserved.
|
2019-10-27 01:22:19 +03:00
|
|
|
* Copyright (c) 2019, loli10K <ezomori.nozomu@gmail.com>. All rights reserved.
|
2020-04-10 20:33:35 +03:00
|
|
|
* Copyright (c) 2020, George Amanakis. All rights reserved.
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
* Copyright (c) 2019, Klara Inc.
|
|
|
|
* Copyright (c) 2019, Allan Jude
|
Fix L2ARC reads when compressed ARC disabled
When reading compressed blocks from the L2ARC, with
compressed ARC disabled, arc_hdr_size() returns
LSIZE rather than PSIZE, but the actual read is PSIZE.
This causes l2arc_read_done() to compare the checksum
against the wrong size, resulting in checksum failure.
This manifests as an increase in the kstat l2_cksum_bad
and the read being retried from the main pool, making the
L2ARC ineffective.
Add new L2ARC tests with Compressed ARC enabled/disabled
Blocks are handled differently depending on the state of the
zfs_compressed_arc_enabled tunable.
If a block is compressed on-disk, and compressed_arc is enabled:
- the block is read from disk
- It is NOT decompressed
- It is added to the ARC in its compressed form
- l2arc_write_buffers() may write it to the L2ARC (as is)
- l2arc_read_done() compares the checksum to the BP (compressed)
However, if compressed_arc is disabled:
- the block is read from disk
- It is decompressed
- It is added to the ARC (uncompressed)
- l2arc_write_buffers() will use l2arc_apply_transforms() to
recompress the block, before writing it to the L2ARC
- l2arc_read_done() compares the checksum to the BP (compressed)
- l2arc_read_done() will use l2arc_untransform() to uncompress it
This test writes out a test file to a pool consisting of one disk
and one cache device, then randomly reads from it. Since the arc_max
in the tests is low, this will feed the L2ARC, and result in reads
from the L2ARC.
We compare the value of the kstat l2_cksum_bad before and after
to determine if any blocks failed to survive the trip through the
L2ARC.
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #10693
2020-08-14 09:31:20 +03:00
|
|
|
* Copyright (c) 2020, The FreeBSD Foundation [1]
|
|
|
|
*
|
|
|
|
* [1] Portions of this software were developed by Allan Jude
|
|
|
|
* under sponsorship from the FreeBSD Foundation.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* DVA-based Adjustable Replacement Cache
|
|
|
|
*
|
|
|
|
* While much of the theory of operation used here is
|
|
|
|
* based on the self-tuning, low overhead replacement cache
|
|
|
|
* presented by Megiddo and Modha at FAST 2003, there are some
|
|
|
|
* significant differences:
|
|
|
|
*
|
|
|
|
* 1. The Megiddo and Modha model assumes any page is evictable.
|
|
|
|
* Pages in its cache cannot be "locked" into memory. This makes
|
|
|
|
* the eviction algorithm simple: evict the last page in the list.
|
|
|
|
* This also make the performance characteristics easy to reason
|
|
|
|
* about. Our cache is not so simple. At any given moment, some
|
|
|
|
* subset of the blocks in the cache are un-evictable because we
|
|
|
|
* have handed out a reference to them. Blocks are only evictable
|
|
|
|
* when there are no external references active. This makes
|
|
|
|
* eviction far more problematic: we choose to evict the evictable
|
|
|
|
* blocks that are the "lowest" in the list.
|
|
|
|
*
|
|
|
|
* There are times when it is not possible to evict the requested
|
|
|
|
* space. In these circumstances we are unable to adjust the cache
|
|
|
|
* size. To prevent the cache growing unbounded at these times we
|
|
|
|
* implement a "cache throttle" that slows the flow of new data
|
|
|
|
* into the cache until we can make space available.
|
|
|
|
*
|
|
|
|
* 2. The Megiddo and Modha model assumes a fixed cache size.
|
|
|
|
* Pages are evicted when the cache is full and there is a cache
|
|
|
|
* miss. Our model has a variable sized cache. It grows with
|
|
|
|
* high use, but also tries to react to memory pressure from the
|
|
|
|
* operating system: decreasing its size when system memory is
|
|
|
|
* tight.
|
|
|
|
*
|
|
|
|
* 3. The Megiddo and Modha model assumes a fixed page size. All
|
2013-06-11 21:12:34 +04:00
|
|
|
* elements of the cache are therefore exactly the same size. So
|
2008-11-20 23:01:55 +03:00
|
|
|
* when adjusting the cache size following a cache miss, its simply
|
|
|
|
* a matter of choosing a single page to evict. In our model, we
|
2019-09-03 03:56:41 +03:00
|
|
|
* have variable sized cache blocks (ranging from 512 bytes to
|
2013-06-11 21:12:34 +04:00
|
|
|
* 128K bytes). We therefore choose a set of blocks to evict to make
|
2008-11-20 23:01:55 +03:00
|
|
|
* space for a cache miss that approximates as closely as possible
|
|
|
|
* the space used by the new block.
|
|
|
|
*
|
|
|
|
* See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
|
|
|
|
* by N. Megiddo & D. Modha, FAST 2003
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The locking model:
|
|
|
|
*
|
|
|
|
* A new reference to a cache buffer can be obtained in two
|
|
|
|
* ways: 1) via a hash table lookup using the DVA as a key,
|
|
|
|
* or 2) via one of the ARC lists. The arc_read() interface
|
2016-07-11 20:45:52 +03:00
|
|
|
* uses method 1, while the internal ARC algorithms for
|
2013-06-11 21:12:34 +04:00
|
|
|
* adjusting the cache use method 2. We therefore provide two
|
2008-11-20 23:01:55 +03:00
|
|
|
* types of locks: 1) the hash table lock array, and 2) the
|
2016-07-11 20:45:52 +03:00
|
|
|
* ARC list locks.
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
2013-01-11 20:54:18 +04:00
|
|
|
* Buffers do not have their own mutexes, rather they rely on the
|
|
|
|
* hash table mutexes for the bulk of their protection (i.e. most
|
|
|
|
* fields in the arc_buf_hdr_t are protected by these mutexes).
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
|
|
|
* buf_hash_find() returns the appropriate mutex (held) when it
|
|
|
|
* locates the requested buffer in the hash table. It returns
|
|
|
|
* NULL for the mutex if the buffer was not in the table.
|
|
|
|
*
|
|
|
|
* buf_hash_remove() expects the appropriate hash mutex to be
|
|
|
|
* already held before it is invoked.
|
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Each ARC state also has a mutex which is used to protect the
|
2008-11-20 23:01:55 +03:00
|
|
|
* buffer list associated with the state. When attempting to
|
2016-07-11 20:45:52 +03:00
|
|
|
* obtain a hash table lock while holding an ARC list lock you
|
2008-11-20 23:01:55 +03:00
|
|
|
* must use: mutex_tryenter() to avoid deadlock. Also note that
|
|
|
|
* the active state mutex must be held before the ghost state mutex.
|
|
|
|
*
|
2011-12-23 00:20:43 +04:00
|
|
|
* It as also possible to register a callback which is run when the
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* metadata limit is reached and no buffers can be safely evicted. In
|
2011-12-23 00:20:43 +04:00
|
|
|
* this case the arc user should drop a reference on some arc buffers so
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* they can be reclaimed. For example, when using the ZPL each dentry
|
|
|
|
* holds a references on a znode. These dentries must be pruned before
|
|
|
|
* the arc buffer holding the znode can be safely evicted.
|
2011-12-23 00:20:43 +04:00
|
|
|
*
|
2008-11-20 23:01:55 +03:00
|
|
|
* Note that the majority of the performance stats are manipulated
|
|
|
|
* with atomic operations.
|
|
|
|
*
|
2014-12-30 06:12:23 +03:00
|
|
|
* The L2ARC uses the l2ad_mtx on each vdev for the following:
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
|
|
|
* - L2ARC buflist creation
|
|
|
|
* - L2ARC buflist eviction
|
|
|
|
* - L2ARC write completion, which walks L2ARC buflists
|
|
|
|
* - ARC header destruction, as it removes from L2ARC buflists
|
|
|
|
* - ARC header release, as it removes from L2ARC buflists
|
|
|
|
*/
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* ARC operation:
|
|
|
|
*
|
|
|
|
* Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.
|
|
|
|
* This structure can point either to a block that is still in the cache or to
|
|
|
|
* one that is only accessible in an L2 ARC device, or it can provide
|
|
|
|
* information about a block that was recently evicted. If a block is
|
|
|
|
* only accessible in the L2ARC, then the arc_buf_hdr_t only has enough
|
|
|
|
* information to retrieve it from the L2ARC device. This information is
|
|
|
|
* stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block
|
|
|
|
* that is in this state cannot access the data directly.
|
|
|
|
*
|
|
|
|
* Blocks that are actively being referenced or have not been evicted
|
|
|
|
* are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within
|
|
|
|
* the arc_buf_hdr_t that will point to the data block in memory. A block can
|
|
|
|
* only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC
|
2016-07-11 20:45:52 +03:00
|
|
|
* caches data in two ways -- in a list of ARC buffers (arc_buf_t) and
|
2016-07-22 18:52:49 +03:00
|
|
|
* also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).
|
2016-07-11 20:45:52 +03:00
|
|
|
*
|
|
|
|
* The L1ARC's data pointer may or may not be uncompressed. The ARC has the
|
2016-07-22 18:52:49 +03:00
|
|
|
* ability to store the physical data (b_pabd) associated with the DVA of the
|
|
|
|
* arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,
|
2016-07-11 20:45:52 +03:00
|
|
|
* it will match its on-disk compression characteristics. This behavior can be
|
|
|
|
* disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the
|
2016-07-22 18:52:49 +03:00
|
|
|
* compressed ARC functionality is disabled, the b_pabd will point to an
|
2016-07-11 20:45:52 +03:00
|
|
|
* uncompressed version of the on-disk data.
|
|
|
|
*
|
|
|
|
* Data in the L1ARC is not accessed by consumers of the ARC directly. Each
|
|
|
|
* arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.
|
|
|
|
* Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC
|
|
|
|
* consumer. The ARC will provide references to this data and will keep it
|
|
|
|
* cached until it is no longer in use. The ARC caches only the L1ARC's physical
|
|
|
|
* data block and will evict any arc_buf_t that is no longer referenced. The
|
|
|
|
* amount of memory consumed by the arc_buf_ts' data buffers can be seen via the
|
2016-06-02 07:04:53 +03:00
|
|
|
* "overhead_size" kstat.
|
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Depending on the consumer, an arc_buf_t can be requested in uncompressed or
|
|
|
|
* compressed form. The typical case is that consumers will want uncompressed
|
|
|
|
* data, and when that happens a new data buffer is allocated where the data is
|
|
|
|
* decompressed for them to use. Currently the only consumer who wants
|
|
|
|
* compressed arc_buf_t's is "zfs send", when it streams data exactly as it
|
|
|
|
* exists on disk. When this happens, the arc_buf_t's data buffer is shared
|
|
|
|
* with the arc_buf_hdr_t.
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The
|
|
|
|
* first one is owned by a compressed send consumer (and therefore references
|
|
|
|
* the same compressed data buffer as the arc_buf_hdr_t) and the second could be
|
|
|
|
* used by any other consumer (and has its own uncompressed copy of the data
|
|
|
|
* buffer).
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* arc_buf_hdr_t
|
|
|
|
* +-----------+
|
|
|
|
* | fields |
|
|
|
|
* | common to |
|
|
|
|
* | L1- and |
|
|
|
|
* | L2ARC |
|
|
|
|
* +-----------+
|
|
|
|
* | l2arc_buf_hdr_t
|
|
|
|
* | |
|
|
|
|
* +-----------+
|
|
|
|
* | l1arc_buf_hdr_t
|
|
|
|
* | | arc_buf_t
|
|
|
|
* | b_buf +------------>+-----------+ arc_buf_t
|
2016-07-22 18:52:49 +03:00
|
|
|
* | b_pabd +-+ |b_next +---->+-----------+
|
2016-07-11 20:45:52 +03:00
|
|
|
* +-----------+ | |-----------| |b_next +-->NULL
|
|
|
|
* | |b_comp = T | +-----------+
|
|
|
|
* | |b_data +-+ |b_comp = F |
|
|
|
|
* | +-----------+ | |b_data +-+
|
|
|
|
* +->+------+ | +-----------+ |
|
|
|
|
* compressed | | | |
|
|
|
|
* data | |<--------------+ | uncompressed
|
|
|
|
* +------+ compressed, | data
|
|
|
|
* shared +-->+------+
|
|
|
|
* data | |
|
|
|
|
* | |
|
|
|
|
* +------+
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
|
|
|
* When a consumer reads a block, the ARC must first look to see if the
|
2016-07-11 20:45:52 +03:00
|
|
|
* arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new
|
|
|
|
* arc_buf_t and either copies uncompressed data into a new data buffer from an
|
2016-07-22 18:52:49 +03:00
|
|
|
* existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a
|
|
|
|
* new data buffer, or shares the hdr's b_pabd buffer, depending on whether the
|
2016-07-11 20:45:52 +03:00
|
|
|
* hdr is compressed and the desired compression characteristics of the
|
|
|
|
* arc_buf_t consumer. If the arc_buf_t ends up sharing data with the
|
|
|
|
* arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be
|
|
|
|
* the last buffer in the hdr's b_buf list, however a shared compressed buf can
|
|
|
|
* be anywhere in the hdr's list.
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
|
|
|
* The diagram below shows an example of an uncompressed ARC hdr that is
|
2016-07-11 20:45:52 +03:00
|
|
|
* sharing its data with an arc_buf_t (note that the shared uncompressed buf is
|
|
|
|
* the last element in the buf list):
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
|
|
|
* arc_buf_hdr_t
|
|
|
|
* +-----------+
|
|
|
|
* | |
|
|
|
|
* | |
|
|
|
|
* | |
|
|
|
|
* +-----------+
|
|
|
|
* l2arc_buf_hdr_t| |
|
|
|
|
* | |
|
|
|
|
* +-----------+
|
|
|
|
* l1arc_buf_hdr_t| |
|
|
|
|
* | | arc_buf_t (shared)
|
|
|
|
* | b_buf +------------>+---------+ arc_buf_t
|
|
|
|
* | | |b_next +---->+---------+
|
2016-07-22 18:52:49 +03:00
|
|
|
* | b_pabd +-+ |---------| |b_next +-->NULL
|
2016-06-02 07:04:53 +03:00
|
|
|
* +-----------+ | | | +---------+
|
|
|
|
* | |b_data +-+ | |
|
|
|
|
* | +---------+ | |b_data +-+
|
|
|
|
* +->+------+ | +---------+ |
|
|
|
|
* | | | |
|
|
|
|
* uncompressed | | | |
|
|
|
|
* data +------+ | |
|
|
|
|
* ^ +->+------+ |
|
|
|
|
* | uncompressed | | |
|
|
|
|
* | data | | |
|
|
|
|
* | +------+ |
|
|
|
|
* +---------------------------------+
|
|
|
|
*
|
2016-07-22 18:52:49 +03:00
|
|
|
* Writing to the ARC requires that the ARC first discard the hdr's b_pabd
|
2016-06-02 07:04:53 +03:00
|
|
|
* since the physical block is about to be rewritten. The new data contents
|
2016-07-11 20:45:52 +03:00
|
|
|
* will be contained in the arc_buf_t. As the I/O pipeline performs the write,
|
|
|
|
* it may compress the data before writing it to disk. The ARC will be called
|
2022-02-25 16:26:54 +03:00
|
|
|
* with the transformed data and will memcpy the transformed on-disk block into
|
2016-07-22 18:52:49 +03:00
|
|
|
* a newly allocated b_pabd. Writes are always done into buffers which have
|
2016-07-11 20:45:52 +03:00
|
|
|
* either been loaned (and hence are new and don't have other readers) or
|
|
|
|
* buffers which have been released (and hence have their own hdr, if there
|
|
|
|
* were originally other readers of the buf's original hdr). This ensures that
|
|
|
|
* the ARC only needs to update a single buf and its hdr after a write occurs.
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
2016-07-22 18:52:49 +03:00
|
|
|
* When the L2ARC is in use, it will also take advantage of the b_pabd. The
|
|
|
|
* L2ARC will always write the contents of b_pabd to the L2ARC. This means
|
2016-07-11 20:45:52 +03:00
|
|
|
* that when compressed ARC is enabled that the L2ARC blocks are identical
|
2016-06-02 07:04:53 +03:00
|
|
|
* to the on-disk block in the main data pool. This provides a significant
|
|
|
|
* advantage since the ARC can leverage the bp's checksum when reading from the
|
|
|
|
* L2ARC to determine if the contents are valid. However, if the compressed
|
2016-07-11 20:45:52 +03:00
|
|
|
* ARC is disabled, then the L2ARC's block must be transformed to look
|
2016-06-02 07:04:53 +03:00
|
|
|
* like the physical block in the main data pool before comparing the
|
|
|
|
* checksum and determining its validity.
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
*
|
|
|
|
* The L1ARC has a slightly different system for storing encrypted data.
|
|
|
|
* Raw (encrypted + possibly compressed) data has a few subtle differences from
|
|
|
|
* data that is just compressed. The biggest difference is that it is not
|
2019-09-03 03:56:41 +03:00
|
|
|
* possible to decrypt encrypted data (or vice-versa) if the keys aren't loaded.
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* The other difference is that encryption cannot be treated as a suggestion.
|
|
|
|
* If a caller would prefer compressed data, but they actually wind up with
|
|
|
|
* uncompressed data the worst thing that could happen is there might be a
|
|
|
|
* performance hit. If the caller requests encrypted data, however, we must be
|
|
|
|
* sure they actually get it or else secret information could be leaked. Raw
|
|
|
|
* data is stored in hdr->b_crypt_hdr.b_rabd. An encrypted header, therefore,
|
|
|
|
* may have both an encrypted version and a decrypted version of its data at
|
|
|
|
* once. When a caller needs a raw arc_buf_t, it is allocated and the data is
|
|
|
|
* copied out of this header. To avoid complications with b_pabd, raw buffers
|
|
|
|
* cannot be shared.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/spa.h>
|
|
|
|
#include <sys/zio.h>
|
2016-06-02 07:04:53 +03:00
|
|
|
#include <sys/spa_impl.h>
|
2013-08-02 00:02:10 +04:00
|
|
|
#include <sys/zio_compress.h>
|
2016-06-02 07:04:53 +03:00
|
|
|
#include <sys/zio_checksum.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/arc.h>
|
2020-07-30 02:35:33 +03:00
|
|
|
#include <sys/zfs_refcount.h>
|
2008-12-03 23:09:06 +03:00
|
|
|
#include <sys/vdev.h>
|
2009-07-03 02:44:48 +04:00
|
|
|
#include <sys/vdev_impl.h>
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
#include <sys/dsl_pool.h>
|
2015-01-13 06:52:19 +03:00
|
|
|
#include <sys/multilist.h>
|
2016-07-22 18:52:49 +03:00
|
|
|
#include <sys/abd.h>
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
#include <sys/zil.h>
|
|
|
|
#include <sys/fm/fs/zfs.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/callb.h>
|
|
|
|
#include <sys/kstat.h>
|
2017-03-16 02:41:52 +03:00
|
|
|
#include <sys/zthr.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <zfs_fletcher.h>
|
2014-10-22 04:59:33 +04:00
|
|
|
#include <sys/arc_impl.h>
|
Enable use of DTRACE_PROBE* macros in "spl" module
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module.
Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.
Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.
Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #9525
2019-10-30 21:02:41 +03:00
|
|
|
#include <sys/trace_zfs.h>
|
2017-05-25 21:32:40 +03:00
|
|
|
#include <sys/aggsum.h>
|
2021-05-27 23:27:29 +03:00
|
|
|
#include <sys/wmsum.h>
|
2020-03-27 19:11:22 +03:00
|
|
|
#include <cityhash.h>
|
2020-06-09 20:15:08 +03:00
|
|
|
#include <sys/vdev_trim.h>
|
2021-02-20 09:34:33 +03:00
|
|
|
#include <sys/zfs_racct.h>
|
2020-09-30 23:22:34 +03:00
|
|
|
#include <sys/zstd/zstd.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-17 01:18:06 +04:00
|
|
|
#ifndef _KERNEL
|
|
|
|
/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
|
|
|
|
boolean_t arc_watch = B_FALSE;
|
|
|
|
#endif
|
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* This thread's job is to keep enough free memory in the system, by
|
|
|
|
* calling arc_kmem_reap_soon() plus arc_reduce_target_size(), which improves
|
|
|
|
* arc_available_memory().
|
|
|
|
*/
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
static zthr_t *arc_reap_zthr;
|
2017-03-16 02:41:52 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This thread's job is to keep arc_size under arc_c, by calling
|
2020-07-22 19:51:47 +03:00
|
|
|
* arc_evict(), which improves arc_is_overflowing().
|
2017-03-16 02:41:52 +03:00
|
|
|
*/
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
static zthr_t *arc_evict_zthr;
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
static arc_buf_hdr_t **arc_state_evict_markers;
|
|
|
|
static int arc_state_evict_marker_count;
|
2017-03-16 02:41:52 +03:00
|
|
|
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
static kmutex_t arc_evict_lock;
|
|
|
|
static boolean_t arc_evict_needed = B_FALSE;
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
static clock_t arc_last_uncached_flush;
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Count of bytes evicted since boot.
|
|
|
|
*/
|
|
|
|
static uint64_t arc_evict_count;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* List of arc_evict_waiter_t's, representing threads waiting for the
|
|
|
|
* arc_evict_count to reach specific values.
|
|
|
|
*/
|
|
|
|
static list_t arc_evict_waiters;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When arc_is_overflowing(), arc_get_data_impl() waits for this percent of
|
|
|
|
* the requested amount of data to be evicted. For example, by default for
|
|
|
|
* every 2KB that's evicted, 1KB of it may be "reused" by a new allocation.
|
|
|
|
* Since this is above 100%, it ensures that progress is made towards getting
|
|
|
|
* arc_size under arc_c. Since this is finite, it ensures that allocations
|
|
|
|
* can still happen, even during the potentially long time that arc_size is
|
|
|
|
* more than arc_c.
|
|
|
|
*/
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t zfs_arc_eviction_pct = 200;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* The number of headers to evict in arc_evict_state_impl() before
|
|
|
|
* dropping the sublist lock and evicting from another sublist. A lower
|
|
|
|
* value means we're more likely to evict the "correct" header (i.e. the
|
|
|
|
* oldest header in the arc state), but comes with higher overhead
|
|
|
|
* (i.e. more invocations of arc_evict_state_impl()).
|
|
|
|
*/
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t zfs_arc_evict_batch_limit = 10;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* number of seconds before growing cache again */
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
uint_t arc_grow_retry = 5;
|
2017-03-16 02:41:52 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Minimum time between calls to arc_kmem_reap_soon().
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static const int arc_kmem_cache_reap_retry_ms = 1000;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
/* shift of arc_c for calculating overflow limit in arc_get_data_impl */
|
2022-01-15 02:37:55 +03:00
|
|
|
static int zfs_arc_overflow_shift = 8;
|
2014-01-03 22:36:26 +04:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
/* log2(fraction of arc to reclaim) */
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
uint_t arc_shrink_shift = 7;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2017-03-16 04:34:56 +03:00
|
|
|
/* percent of pagecache to reclaim arc to */
|
|
|
|
#ifdef _KERNEL
|
2019-10-18 20:23:19 +03:00
|
|
|
uint_t zfs_arc_pc_percent = 0;
|
2017-03-16 04:34:56 +03:00
|
|
|
#endif
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2015-06-26 21:28:18 +03:00
|
|
|
* log2(fraction of ARC which must be free to allow growing).
|
|
|
|
* I.e. If there is less than arc_c >> arc_no_grow_shift free memory,
|
|
|
|
* when reading a new block into the ARC, we will evict an equal-sized block
|
|
|
|
* from the ARC.
|
|
|
|
*
|
|
|
|
* This must be less than arc_shrink_shift, so that when we shrink the ARC,
|
|
|
|
* we will still not allow it to grow.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
uint_t arc_no_grow_shift = 5;
|
2013-07-24 21:14:11 +04:00
|
|
|
|
2014-08-20 21:09:40 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* minimum lifespan of a prefetch block in clock ticks
|
|
|
|
* (initialized in arc_init())
|
|
|
|
*/
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t arc_min_prefetch_ms;
|
|
|
|
static uint_t arc_min_prescient_prefetch_ms;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
|
|
|
* If this percent of memory is free, don't throttle.
|
|
|
|
*/
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
uint_t arc_lotsfree_percent = 10;
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* The arc has filled available memory and has now warmed up.
|
|
|
|
*/
|
2019-10-18 20:23:19 +03:00
|
|
|
boolean_t arc_warm;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* These tunables are for performance analysis.
|
|
|
|
*/
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
uint64_t zfs_arc_max = 0;
|
|
|
|
uint64_t zfs_arc_min = 0;
|
|
|
|
static uint64_t zfs_arc_dnode_limit = 0;
|
|
|
|
static uint_t zfs_arc_dnode_reduce_percent = 10;
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t zfs_arc_grow_retry = 0;
|
|
|
|
static uint_t zfs_arc_shrink_shift = 0;
|
|
|
|
uint_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-09-27 04:45:19 +03:00
|
|
|
/*
|
2022-01-15 02:37:55 +03:00
|
|
|
* ARC dirty data constraints for arc_tempreserve_space() throttle:
|
|
|
|
* * total dirty data limit
|
|
|
|
* * anon block dirty limit
|
|
|
|
* * each pool's anon allowance
|
2017-09-27 04:45:19 +03:00
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static const unsigned long zfs_arc_dirty_limit_percent = 50;
|
|
|
|
static const unsigned long zfs_arc_anon_limit_percent = 25;
|
|
|
|
static const unsigned long zfs_arc_pool_dirty_percent = 20;
|
2017-09-27 04:45:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Enable or disable compressed arc buffers.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
int zfs_compressed_arc_enabled = B_TRUE;
|
|
|
|
|
2016-08-11 06:15:37 +03:00
|
|
|
/*
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* Balance between metadata and data on ghost hits. Values above 100
|
|
|
|
* increase metadata caching by proportionally reducing effect of ghost
|
|
|
|
* data hits on target data/metadata rate.
|
2016-08-11 06:15:37 +03:00
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
static uint_t zfs_arc_meta_balance = 500;
|
2016-08-11 06:15:37 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Percentage that can be consumed by dnodes of ARC meta buffers.
|
|
|
|
*/
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
static uint_t zfs_arc_dnode_limit_percent = 10;
|
2016-08-11 06:15:37 +03:00
|
|
|
|
2015-03-18 01:08:22 +03:00
|
|
|
/*
|
2022-01-15 02:37:55 +03:00
|
|
|
* These tunables are Linux-specific
|
2015-03-18 01:08:22 +03:00
|
|
|
*/
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
static uint64_t zfs_arc_sys_free = 0;
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t zfs_arc_min_prefetch_ms = 0;
|
|
|
|
static uint_t zfs_arc_min_prescient_prefetch_ms = 0;
|
|
|
|
static uint_t zfs_arc_lotsfree_percent = 10;
|
2015-03-18 01:08:22 +03:00
|
|
|
|
2021-12-23 04:07:13 +03:00
|
|
|
/*
|
|
|
|
* Number of arc_prune threads
|
|
|
|
*/
|
|
|
|
static int zfs_arc_prune_task_threads = 1;
|
|
|
|
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
/* The 7 states: */
|
2019-10-02 02:35:05 +03:00
|
|
|
arc_state_t ARC_anon;
|
|
|
|
arc_state_t ARC_mru;
|
|
|
|
arc_state_t ARC_mru_ghost;
|
|
|
|
arc_state_t ARC_mfu;
|
|
|
|
arc_state_t ARC_mfu_ghost;
|
|
|
|
arc_state_t ARC_l2c_only;
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
arc_state_t ARC_uncached;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-10-18 20:23:19 +03:00
|
|
|
arc_stats_t arc_stats = {
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "hits", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "iohits", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "demand_data_hits", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "demand_data_iohits", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "demand_data_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "demand_metadata_iohits", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "prefetch_data_iohits", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "prefetch_metadata_iohits", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
{ "uncached_hits", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "deleted", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mutex_miss", KSTAT_DATA_UINT64 },
|
2018-01-08 20:52:36 +03:00
|
|
|
{ "access_skip", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "evict_skip", KSTAT_DATA_UINT64 },
|
2015-01-13 06:52:19 +03:00
|
|
|
{ "evict_not_enough", KSTAT_DATA_UINT64 },
|
2010-05-29 00:45:14 +04:00
|
|
|
{ "evict_l2_cached", KSTAT_DATA_UINT64 },
|
|
|
|
{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
{ "evict_l2_eligible_mfu", KSTAT_DATA_UINT64 },
|
|
|
|
{ "evict_l2_eligible_mru", KSTAT_DATA_UINT64 },
|
2010-05-29 00:45:14 +04:00
|
|
|
{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
|
2015-01-13 06:52:19 +03:00
|
|
|
{ "evict_l2_skip", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "hash_elements", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_elements_max", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_collisions", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_chains", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_chain_max", KSTAT_DATA_UINT64 },
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
{ "meta", KSTAT_DATA_UINT64 },
|
|
|
|
{ "pd", KSTAT_DATA_UINT64 },
|
|
|
|
{ "pm", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "c", KSTAT_DATA_UINT64 },
|
|
|
|
{ "c_min", KSTAT_DATA_UINT64 },
|
|
|
|
{ "c_max", KSTAT_DATA_UINT64 },
|
|
|
|
{ "size", KSTAT_DATA_UINT64 },
|
2016-06-02 07:04:53 +03:00
|
|
|
{ "compressed_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "uncompressed_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "overhead_size", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "hdr_size", KSTAT_DATA_UINT64 },
|
2009-02-18 23:51:31 +03:00
|
|
|
{ "data_size", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "metadata_size", KSTAT_DATA_UINT64 },
|
2016-07-13 15:42:40 +03:00
|
|
|
{ "dbuf_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "dnode_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "bonus_size", KSTAT_DATA_UINT64 },
|
2020-08-20 20:55:02 +03:00
|
|
|
#if defined(COMPAT_FREEBSD11)
|
|
|
|
{ "other_size", KSTAT_DATA_UINT64 },
|
|
|
|
#endif
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "anon_size", KSTAT_DATA_UINT64 },
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
{ "anon_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "anon_metadata", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "anon_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "anon_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mru_size", KSTAT_DATA_UINT64 },
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
{ "mru_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_metadata", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mru_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mru_ghost_size", KSTAT_DATA_UINT64 },
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
{ "mru_ghost_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_ghost_metadata", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mru_ghost_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mfu_size", KSTAT_DATA_UINT64 },
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
{ "mfu_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_metadata", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mfu_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mfu_ghost_size", KSTAT_DATA_UINT64 },
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
{ "mfu_ghost_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_ghost_metadata", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
{ "uncached_size", KSTAT_DATA_UINT64 },
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
{ "uncached_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "uncached_metadata", KSTAT_DATA_UINT64 },
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
{ "uncached_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "uncached_evictable_metadata", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_misses", KSTAT_DATA_UINT64 },
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
{ "l2_prefetch_asize", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_mru_asize", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_mfu_asize", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_bufc_data_asize", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_bufc_metadata_asize", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_feeds", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rw_clash", KSTAT_DATA_UINT64 },
|
2009-02-18 23:51:31 +03:00
|
|
|
{ "l2_read_bytes", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_write_bytes", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_writes_sent", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_writes_done", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_writes_error", KSTAT_DATA_UINT64 },
|
2015-01-13 06:52:19 +03:00
|
|
|
{ "l2_writes_lock_retry", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_evict_reading", KSTAT_DATA_UINT64 },
|
2014-12-30 06:12:23 +03:00
|
|
|
{ "l2_evict_l1cached", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_free_on_write", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_io_error", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_size", KSTAT_DATA_UINT64 },
|
2013-08-02 00:02:10 +04:00
|
|
|
{ "l2_asize", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_hdr_size", KSTAT_DATA_UINT64 },
|
2020-04-10 20:33:35 +03:00
|
|
|
{ "l2_log_blk_writes", KSTAT_DATA_UINT64 },
|
2020-05-08 02:34:03 +03:00
|
|
|
{ "l2_log_blk_avg_asize", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_log_blk_asize", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_log_blk_count", KSTAT_DATA_UINT64 },
|
2020-04-10 20:33:35 +03:00
|
|
|
{ "l2_data_to_meta_ratio", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_success", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_unsupported", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_io_errors", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_dh_errors", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_cksum_lb_errors", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_lowmem", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_size", KSTAT_DATA_UINT64 },
|
2020-05-08 02:34:03 +03:00
|
|
|
{ "l2_rebuild_asize", KSTAT_DATA_UINT64 },
|
2020-04-10 20:33:35 +03:00
|
|
|
{ "l2_rebuild_bufs", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_bufs_precached", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rebuild_log_blks", KSTAT_DATA_UINT64 },
|
2011-03-24 22:13:55 +03:00
|
|
|
{ "memory_throttle_count", KSTAT_DATA_UINT64 },
|
2011-03-30 05:08:59 +04:00
|
|
|
{ "memory_direct_count", KSTAT_DATA_UINT64 },
|
|
|
|
{ "memory_indirect_count", KSTAT_DATA_UINT64 },
|
2017-10-11 01:19:19 +03:00
|
|
|
{ "memory_all_bytes", KSTAT_DATA_UINT64 },
|
|
|
|
{ "memory_free_bytes", KSTAT_DATA_UINT64 },
|
|
|
|
{ "memory_available_bytes", KSTAT_DATA_INT64 },
|
2011-03-24 22:13:55 +03:00
|
|
|
{ "arc_no_grow", KSTAT_DATA_UINT64 },
|
|
|
|
{ "arc_tempreserve", KSTAT_DATA_UINT64 },
|
|
|
|
{ "arc_loaned_bytes", KSTAT_DATA_UINT64 },
|
2011-12-23 00:20:43 +04:00
|
|
|
{ "arc_prune", KSTAT_DATA_UINT64 },
|
2011-03-24 22:13:55 +03:00
|
|
|
{ "arc_meta_used", KSTAT_DATA_UINT64 },
|
2016-07-13 15:42:40 +03:00
|
|
|
{ "arc_dnode_limit", KSTAT_DATA_UINT64 },
|
2017-12-21 20:13:06 +03:00
|
|
|
{ "async_upgrade_sync", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "predictive_prefetch", KSTAT_DATA_UINT64 },
|
2015-12-27 00:10:31 +03:00
|
|
|
{ "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "demand_iohit_predictive_prefetch", KSTAT_DATA_UINT64 },
|
|
|
|
{ "prescient_prefetch", KSTAT_DATA_UINT64 },
|
2017-11-16 04:27:01 +03:00
|
|
|
{ "demand_hit_prescient_prefetch", KSTAT_DATA_UINT64 },
|
2022-12-22 23:10:24 +03:00
|
|
|
{ "demand_iohit_prescient_prefetch", KSTAT_DATA_UINT64 },
|
2015-07-27 23:17:32 +03:00
|
|
|
{ "arc_need_free", KSTAT_DATA_UINT64 },
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{ "arc_sys_free", KSTAT_DATA_UINT64 },
|
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 20:51:04 +03:00
|
|
|
{ "arc_raw_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "cached_only_in_progress", KSTAT_DATA_UINT64 },
|
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-18 06:04:04 +03:00
|
|
|
{ "abd_chunk_waste_size", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
};
|
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
arc_sums_t arc_sums;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#define ARCSTAT_MAX(stat, val) { \
|
|
|
|
uint64_t m; \
|
|
|
|
while ((val) > (m = arc_stats.stat.value.ui64) && \
|
|
|
|
(m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
|
|
|
|
continue; \
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We define a macro to allow ARC hits/misses to be easily broken down by
|
|
|
|
* two separate conditions, giving a total of four different subtypes for
|
|
|
|
* each of hits and misses (so eight statistics total).
|
|
|
|
*/
|
|
|
|
#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
|
|
|
|
if (cond1) { \
|
|
|
|
if (cond2) { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
|
|
|
|
} else { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
|
|
|
|
} \
|
|
|
|
} else { \
|
|
|
|
if (cond2) { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
|
|
|
|
} else { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
|
|
|
|
} \
|
|
|
|
}
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* This macro allows us to use kstats as floating averages. Each time we
|
|
|
|
* update this kstat, we first factor it and the update value by
|
|
|
|
* ARCSTAT_AVG_FACTOR to shrink the new value's contribution to the overall
|
|
|
|
* average. This macro assumes that integer loads and stores are atomic, but
|
|
|
|
* is not safe for multiple writers updating the kstat in parallel (only the
|
|
|
|
* last writer's update will remain).
|
|
|
|
*/
|
|
|
|
#define ARCSTAT_F_AVG_FACTOR 3
|
|
|
|
#define ARCSTAT_F_AVG(stat, value) \
|
|
|
|
do { \
|
|
|
|
uint64_t x = ARCSTAT(stat); \
|
|
|
|
x = x - x / ARCSTAT_F_AVG_FACTOR + \
|
|
|
|
(value) / ARCSTAT_F_AVG_FACTOR; \
|
|
|
|
ARCSTAT(stat) = x; \
|
|
|
|
} while (0)
|
|
|
|
|
2022-01-15 02:37:55 +03:00
|
|
|
static kstat_t *arc_ksp;
|
2019-10-18 20:23:19 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* There are several ARC variables that are critical to export as kstats --
|
|
|
|
* but we don't want to have to grovel around in the kstat whenever we wish to
|
|
|
|
* manipulate them. For these variables, we therefore define them to be in
|
|
|
|
* terms of the statistic variable. This assures that we are not introducing
|
|
|
|
* the possibility of inconsistency by having shadow copies of the variables,
|
|
|
|
* while still allowing the code to be readable.
|
|
|
|
*/
|
2011-03-24 22:13:55 +03:00
|
|
|
#define arc_tempreserve ARCSTAT(arcstat_tempreserve)
|
|
|
|
#define arc_loaned_bytes ARCSTAT(arcstat_loaned_bytes)
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
#define arc_dnode_limit ARCSTAT(arcstat_dnode_limit) /* max size for dnodes */
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
#define arc_need_free ARCSTAT(arcstat_need_free) /* waiting to be evicted */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-10-18 20:23:19 +03:00
|
|
|
hrtime_t arc_growtime;
|
|
|
|
list_t arc_prune_list;
|
|
|
|
kmutex_t arc_prune_mtx;
|
|
|
|
taskq_t *arc_prune_taskq;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#define GHOST_STATE(state) \
|
|
|
|
((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
|
|
|
|
(state) == arc_l2c_only)
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
#define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
|
|
|
|
#define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
|
|
|
|
#define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_FLAG_IO_ERROR)
|
|
|
|
#define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_FLAG_PREFETCH)
|
2017-11-16 04:27:01 +03:00
|
|
|
#define HDR_PRESCIENT_PREFETCH(hdr) \
|
|
|
|
((hdr)->b_flags & ARC_FLAG_PRESCIENT_PREFETCH)
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_COMPRESSION_ENABLED(hdr) \
|
|
|
|
((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
#define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_FLAG_L2CACHE)
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
#define HDR_UNCACHED(hdr) ((hdr)->b_flags & ARC_FLAG_UNCACHED)
|
2014-12-06 20:24:32 +03:00
|
|
|
#define HDR_L2_READING(hdr) \
|
2016-06-02 07:04:53 +03:00
|
|
|
(((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \
|
|
|
|
((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
|
2014-12-06 20:24:32 +03:00
|
|
|
#define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
|
|
|
|
#define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
|
|
|
|
#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
#define HDR_PROTECTED(hdr) ((hdr)->b_flags & ARC_FLAG_PROTECTED)
|
|
|
|
#define HDR_NOAUTH(hdr) ((hdr)->b_flags & ARC_FLAG_NOAUTH)
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_SHARED_DATA(hdr) ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
#define HDR_ISTYPE_METADATA(hdr) \
|
2016-06-02 07:04:53 +03:00
|
|
|
((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
|
2014-12-30 06:12:23 +03:00
|
|
|
#define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr))
|
|
|
|
|
|
|
|
#define HDR_HAS_L1HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
|
|
|
|
#define HDR_HAS_L2HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
#define HDR_HAS_RABD(hdr) \
|
|
|
|
(HDR_HAS_L1HDR(hdr) && HDR_PROTECTED(hdr) && \
|
|
|
|
(hdr)->b_crypt_hdr.b_rabd != NULL)
|
|
|
|
#define HDR_ENCRYPTED(hdr) \
|
|
|
|
(HDR_PROTECTED(hdr) && DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))
|
|
|
|
#define HDR_AUTHENTICATED(hdr) \
|
|
|
|
(HDR_PROTECTED(hdr) && !DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* For storing compression mode in b_flags */
|
|
|
|
#define HDR_COMPRESS_OFFSET (highbit64(ARC_FLAG_COMPRESS_0) - 1)
|
|
|
|
|
|
|
|
#define HDR_GET_COMPRESS(hdr) ((enum zio_compress)BF32_GET((hdr)->b_flags, \
|
|
|
|
HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
|
|
|
|
#define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
|
|
|
|
HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));
|
|
|
|
|
|
|
|
#define ARC_BUF_LAST(buf) ((buf)->b_next == NULL)
|
2016-07-14 00:17:41 +03:00
|
|
|
#define ARC_BUF_SHARED(buf) ((buf)->b_flags & ARC_BUF_FLAG_SHARED)
|
|
|
|
#define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
#define ARC_BUF_ENCRYPTED(buf) ((buf)->b_flags & ARC_BUF_FLAG_ENCRYPTED)
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Other sizes
|
|
|
|
*/
|
|
|
|
|
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
2023-10-03 18:57:48 +03:00
|
|
|
#define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
|
2014-12-30 06:12:23 +03:00
|
|
|
#define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Hash table routines
|
|
|
|
*/
|
|
|
|
|
2021-07-01 18:30:31 +03:00
|
|
|
#define BUF_LOCKS 2048
|
2008-11-20 23:01:55 +03:00
|
|
|
typedef struct buf_hash_table {
|
|
|
|
uint64_t ht_mask;
|
|
|
|
arc_buf_hdr_t **ht_table;
|
2021-07-01 18:30:31 +03:00
|
|
|
kmutex_t ht_locks[BUF_LOCKS] ____cacheline_aligned;
|
2008-11-20 23:01:55 +03:00
|
|
|
} buf_hash_table_t;
|
|
|
|
|
|
|
|
static buf_hash_table_t buf_hash_table;
|
|
|
|
|
|
|
|
#define BUF_HASH_INDEX(spa, dva, birth) \
|
|
|
|
(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
|
2021-07-01 18:30:31 +03:00
|
|
|
#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
|
2010-05-29 00:45:14 +04:00
|
|
|
#define HDR_LOCK(hdr) \
|
|
|
|
(BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
uint64_t zfs_crc64_table[256];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Level 2 ARC
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
|
2013-08-02 00:02:10 +04:00
|
|
|
#define L2ARC_HEADROOM 2 /* num of writes */
|
2016-02-10 21:42:01 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/*
|
|
|
|
* If we discover during ARC scan any buffers to be compressed, we boost
|
|
|
|
* our headroom for the next scanning cycle by this percentage multiple.
|
|
|
|
*/
|
|
|
|
#define L2ARC_HEADROOM_BOOST 200
|
2009-02-18 23:51:31 +03:00
|
|
|
#define L2ARC_FEED_SECS 1 /* caching interval secs */
|
|
|
|
#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-11-01 02:04:01 +03:00
|
|
|
/*
|
|
|
|
* We can feed L2ARC from two states of ARC buffers, mru and mfu,
|
|
|
|
* and each of the state has two types: data and metadata.
|
|
|
|
*/
|
|
|
|
#define L2ARC_FEED_TYPES 4
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* L2ARC Performance Tunables */
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* def max write size */
|
|
|
|
uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra warmup write */
|
|
|
|
uint64_t l2arc_headroom = L2ARC_HEADROOM; /* # of dev writes */
|
|
|
|
uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
|
|
|
|
uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
|
|
|
|
uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval msecs */
|
2011-07-08 23:41:57 +04:00
|
|
|
int l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
|
|
|
|
int l2arc_feed_again = B_TRUE; /* turbo warmup */
|
2013-07-24 20:57:56 +04:00
|
|
|
int l2arc_norw = B_FALSE; /* no reads during writes */
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
static uint_t l2arc_meta_percent = 33; /* limit on headers size */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* L2ARC Internals
|
|
|
|
*/
|
|
|
|
static list_t L2ARC_dev_list; /* device list */
|
|
|
|
static list_t *l2arc_dev_list; /* device list pointer */
|
|
|
|
static kmutex_t l2arc_dev_mtx; /* device list mutex */
|
|
|
|
static l2arc_dev_t *l2arc_dev_last; /* last device used */
|
|
|
|
static list_t L2ARC_free_on_write; /* free after write buf list */
|
|
|
|
static list_t *l2arc_free_on_write; /* free after write list ptr */
|
|
|
|
static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
|
|
|
|
static uint64_t l2arc_ndev; /* number of devices */
|
|
|
|
|
|
|
|
typedef struct l2arc_read_callback {
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_hdr_t *l2rcb_hdr; /* read header */
|
2013-08-02 00:02:10 +04:00
|
|
|
blkptr_t l2rcb_bp; /* original blkptr */
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t l2rcb_zb; /* original bookmark */
|
2013-08-02 00:02:10 +04:00
|
|
|
int l2rcb_flags; /* original flags */
|
2017-06-27 03:32:43 +03:00
|
|
|
abd_t *l2rcb_abd; /* temporary buffer */
|
2008-11-20 23:01:55 +03:00
|
|
|
} l2arc_read_callback_t;
|
|
|
|
|
|
|
|
typedef struct l2arc_data_free {
|
|
|
|
/* protected by l2arc_free_on_write_mtx */
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_t *l2df_abd;
|
2008-11-20 23:01:55 +03:00
|
|
|
size_t l2df_size;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_contents_t l2df_type;
|
2008-11-20 23:01:55 +03:00
|
|
|
list_node_t l2df_list_node;
|
|
|
|
} l2arc_data_free_t;
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
typedef enum arc_fill_flags {
|
|
|
|
ARC_FILL_LOCKED = 1 << 0, /* hdr lock is held */
|
|
|
|
ARC_FILL_COMPRESSED = 1 << 1, /* fill with compressed data */
|
|
|
|
ARC_FILL_ENCRYPTED = 1 << 2, /* fill with encrypted data */
|
|
|
|
ARC_FILL_NOAUTH = 1 << 3, /* don't attempt to authenticate */
|
|
|
|
ARC_FILL_IN_PLACE = 1 << 4 /* fill in place (special case) */
|
|
|
|
} arc_fill_flags_t;
|
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
typedef enum arc_ovf_level {
|
|
|
|
ARC_OVF_NONE, /* ARC within target size. */
|
|
|
|
ARC_OVF_SOME, /* ARC is slightly overflowed. */
|
|
|
|
ARC_OVF_SEVERE /* ARC is severely overflowed. */
|
|
|
|
} arc_ovf_level_t;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static kmutex_t l2arc_feed_thr_lock;
|
|
|
|
static kcondvar_t l2arc_feed_thr_cv;
|
|
|
|
static uint8_t l2arc_thread_exit;
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
static kmutex_t l2arc_rebuild_thr_lock;
|
|
|
|
static kcondvar_t l2arc_rebuild_thr_cv;
|
|
|
|
|
2020-08-12 20:03:24 +03:00
|
|
|
enum arc_hdr_alloc_flags {
|
|
|
|
ARC_HDR_ALLOC_RDATA = 0x1,
|
2021-08-17 19:15:54 +03:00
|
|
|
ARC_HDR_USE_RESERVE = 0x4,
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
ARC_HDR_ALLOC_LINEAR = 0x8,
|
2020-08-12 20:03:24 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
|
2022-04-19 21:49:30 +03:00
|
|
|
static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, const void *, int);
|
|
|
|
static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, const void *);
|
|
|
|
static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, const void *, int);
|
|
|
|
static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, const void *);
|
|
|
|
static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, const void *);
|
|
|
|
static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size,
|
|
|
|
const void *tag);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
static void arc_hdr_free_abd(arc_buf_hdr_t *, boolean_t);
|
2020-08-12 20:03:24 +03:00
|
|
|
static void arc_hdr_alloc_abd(arc_buf_hdr_t *, int);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
static void arc_hdr_destroy(arc_buf_hdr_t *);
|
2022-12-22 23:10:24 +03:00
|
|
|
static void arc_access(arc_buf_hdr_t *, arc_flags_t, boolean_t);
|
2014-12-06 20:24:32 +03:00
|
|
|
static void arc_buf_watch(arc_buf_t *);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
static void arc_change_state(arc_state_t *, arc_buf_hdr_t *);
|
2014-12-06 20:24:32 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
|
|
|
|
static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
|
2016-06-02 07:04:53 +03:00
|
|
|
static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
|
|
|
|
static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
|
|
|
|
static void l2arc_read_done(zio_t *);
|
2020-08-19 08:11:34 +03:00
|
|
|
static void l2arc_do_free_on_write(void);
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
static void l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,
|
|
|
|
boolean_t state_only);
|
|
|
|
|
2023-10-31 02:56:04 +03:00
|
|
|
static void arc_prune_async(uint64_t adjust);
|
|
|
|
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
#define l2arc_hdr_arcstats_increment(hdr) \
|
|
|
|
l2arc_hdr_arcstats_update((hdr), B_TRUE, B_FALSE)
|
|
|
|
#define l2arc_hdr_arcstats_decrement(hdr) \
|
|
|
|
l2arc_hdr_arcstats_update((hdr), B_FALSE, B_FALSE)
|
|
|
|
#define l2arc_hdr_arcstats_increment_state(hdr) \
|
|
|
|
l2arc_hdr_arcstats_update((hdr), B_TRUE, B_TRUE)
|
|
|
|
#define l2arc_hdr_arcstats_decrement_state(hdr) \
|
|
|
|
l2arc_hdr_arcstats_update((hdr), B_FALSE, B_TRUE)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-11-11 23:52:16 +03:00
|
|
|
/*
|
|
|
|
* l2arc_exclude_special : A zfs module parameter that controls whether buffers
|
|
|
|
* present on special vdevs are eligibile for caching in L2ARC. If
|
|
|
|
* set to 1, exclude dbufs on special vdevs from being cached to
|
|
|
|
* L2ARC.
|
|
|
|
*/
|
|
|
|
int l2arc_exclude_special = 0;
|
|
|
|
|
2020-09-08 21:44:37 +03:00
|
|
|
/*
|
|
|
|
* l2arc_mfuonly : A ZFS module parameter that controls whether only MFU
|
|
|
|
* metadata and data are cached from ARC into L2ARC.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static int l2arc_mfuonly = 0;
|
2020-09-08 21:44:37 +03:00
|
|
|
|
2020-06-09 20:15:08 +03:00
|
|
|
/*
|
|
|
|
* L2ARC TRIM
|
|
|
|
* l2arc_trim_ahead : A ZFS module parameter that controls how much ahead of
|
|
|
|
* the current write size (l2arc_write_max) we should TRIM if we
|
|
|
|
* have filled the device. It is defined as a percentage of the
|
|
|
|
* write size. If set to 100 we trim twice the space required to
|
|
|
|
* accommodate upcoming writes. A minimum of 64MB will be trimmed.
|
|
|
|
* It also enables TRIM of the whole L2ARC device upon creation or
|
|
|
|
* addition to an existing pool or if the header of the device is
|
|
|
|
* invalid upon importing a pool or onlining a cache device. The
|
|
|
|
* default is 0, which disables TRIM on L2ARC altogether as it can
|
|
|
|
* put significant stress on the underlying storage devices. This
|
|
|
|
* will vary depending of how well the specific device handles
|
|
|
|
* these commands.
|
|
|
|
*/
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
static uint64_t l2arc_trim_ahead = 0;
|
2020-06-09 20:15:08 +03:00
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Performance tuning of L2ARC persistence:
|
|
|
|
*
|
|
|
|
* l2arc_rebuild_enabled : A ZFS module parameter that controls whether adding
|
|
|
|
* an L2ARC device (either at pool import or later) will attempt
|
|
|
|
* to rebuild L2ARC buffer contents.
|
|
|
|
* l2arc_rebuild_blocks_min_l2size : A ZFS module parameter that controls
|
|
|
|
* whether log blocks are written to the L2ARC device. If the L2ARC
|
|
|
|
* device is less than 1GB, the amount of data l2arc_evict()
|
|
|
|
* evicts is significant compared to the amount of restored L2ARC
|
|
|
|
* data. In this case do not write log blocks in L2ARC in order
|
|
|
|
* not to waste space.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static int l2arc_rebuild_enabled = B_TRUE;
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
static uint64_t l2arc_rebuild_blocks_min_l2size = 1024 * 1024 * 1024;
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
/* L2ARC persistence rebuild control routines. */
|
|
|
|
void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen);
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void l2arc_dev_rebuild_thread(void *arg);
|
2020-04-10 20:33:35 +03:00
|
|
|
static int l2arc_rebuild(l2arc_dev_t *dev);
|
|
|
|
|
|
|
|
/* L2ARC persistence read I/O routines. */
|
|
|
|
static int l2arc_dev_hdr_read(l2arc_dev_t *dev);
|
|
|
|
static int l2arc_log_blk_read(l2arc_dev_t *dev,
|
|
|
|
const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,
|
|
|
|
l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
|
|
|
|
zio_t *this_io, zio_t **next_io);
|
|
|
|
static zio_t *l2arc_log_blk_fetch(vdev_t *vd,
|
|
|
|
const l2arc_log_blkptr_t *lp, l2arc_log_blk_phys_t *lb);
|
|
|
|
static void l2arc_log_blk_fetch_abort(zio_t *zio);
|
|
|
|
|
|
|
|
/* L2ARC persistence block restoration routines. */
|
|
|
|
static void l2arc_log_blk_restore(l2arc_dev_t *dev,
|
2020-10-06 01:29:05 +03:00
|
|
|
const l2arc_log_blk_phys_t *lb, uint64_t lb_asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,
|
|
|
|
l2arc_dev_t *dev);
|
|
|
|
|
|
|
|
/* L2ARC persistence write I/O routines. */
|
2023-06-06 22:32:37 +03:00
|
|
|
static uint64_t l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_write_callback_t *cb);
|
|
|
|
|
2020-06-10 07:24:09 +03:00
|
|
|
/* L2ARC persistence auxiliary routines. */
|
2020-04-10 20:33:35 +03:00
|
|
|
boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,
|
|
|
|
const l2arc_log_blkptr_t *lbp);
|
|
|
|
static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,
|
|
|
|
const arc_buf_hdr_t *ab);
|
|
|
|
boolean_t l2arc_range_check_overlap(uint64_t bottom,
|
|
|
|
uint64_t top, uint64_t check);
|
|
|
|
static void l2arc_blk_fetch_done(zio_t *zio);
|
|
|
|
static inline uint64_t
|
|
|
|
l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev);
|
2017-05-25 21:32:40 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We use Cityhash for this. It's fast, and has good hash properties without
|
|
|
|
* requiring any large static buffers.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static uint64_t
|
2009-02-18 23:51:31 +03:00
|
|
|
buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-05-25 21:32:40 +03:00
|
|
|
return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_EMPTY(hdr) \
|
|
|
|
((hdr)->b_dva.dva_word[0] == 0 && \
|
|
|
|
(hdr)->b_dva.dva_word[1] == 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-03-16 00:17:38 +03:00
|
|
|
#define HDR_EMPTY_OR_LOCKED(hdr) \
|
|
|
|
(HDR_EMPTY(hdr) || MUTEX_HELD(HDR_LOCK(hdr)))
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_EQUAL(spa, dva, birth, hdr) \
|
|
|
|
((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
|
|
|
|
((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
|
|
|
|
((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
buf_discard_identity(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
hdr->b_dva.dva_word[0] = 0;
|
|
|
|
hdr->b_dva.dva_word[1] = 0;
|
|
|
|
hdr->b_birth = 0;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static arc_buf_hdr_t *
|
2014-06-06 01:19:08 +04:00
|
|
|
buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
const dva_t *dva = BP_IDENTITY(bp);
|
|
|
|
uint64_t birth = BP_PHYSICAL_BIRTH(bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
|
|
|
|
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(hash_lock);
|
2014-12-06 20:24:32 +03:00
|
|
|
for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
|
|
|
|
hdr = hdr->b_hash_next) {
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_EQUAL(spa, dva, birth, hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
*lockp = hash_lock;
|
2014-12-06 20:24:32 +03:00
|
|
|
return (hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
*lockp = NULL;
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Insert an entry into the hash table. If there is already an element
|
|
|
|
* equal to elem in the hash table, then the already existing element
|
|
|
|
* will be returned and the new element will not be inserted.
|
|
|
|
* Otherwise returns NULL.
|
2014-12-30 06:12:23 +03:00
|
|
|
* If lockp == NULL, the caller is assumed to already hold the hash lock.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static arc_buf_hdr_t *
|
2014-12-06 20:24:32 +03:00
|
|
|
buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-06 20:24:32 +03:00
|
|
|
uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *fhdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint32_t i;
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
|
|
|
|
ASSERT(hdr->b_birth != 0);
|
|
|
|
ASSERT(!HDR_IN_HASH_TABLE(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
if (lockp != NULL) {
|
|
|
|
*lockp = hash_lock;
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
} else {
|
|
|
|
ASSERT(MUTEX_HELD(hash_lock));
|
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
|
|
|
|
fhdr = fhdr->b_hash_next, i++) {
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
|
2014-12-06 20:24:32 +03:00
|
|
|
return (fhdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hdr->b_hash_next = buf_hash_table.ht_table[idx];
|
|
|
|
buf_hash_table.ht_table[idx] = hdr;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* collect some hash table performance data */
|
|
|
|
if (i > 0) {
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_collisions);
|
|
|
|
if (i == 1)
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_chains);
|
|
|
|
|
|
|
|
ARCSTAT_MAX(arcstat_hash_chain_max, i);
|
|
|
|
}
|
2021-06-17 03:19:34 +03:00
|
|
|
uint64_t he = atomic_inc_64_nv(
|
|
|
|
&arc_stats.arcstat_hash_elements.value.ui64);
|
|
|
|
ARCSTAT_MAX(arcstat_hash_elements_max, he);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2014-12-06 20:24:32 +03:00
|
|
|
buf_hash_remove(arc_buf_hdr_t *hdr)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *fhdr, **hdrp;
|
|
|
|
uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(HDR_IN_HASH_TABLE(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hdrp = &buf_hash_table.ht_table[idx];
|
|
|
|
while ((fhdr = *hdrp) != hdr) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(fhdr, !=, NULL);
|
2014-12-06 20:24:32 +03:00
|
|
|
hdrp = &fhdr->b_hash_next;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-12-06 20:24:32 +03:00
|
|
|
*hdrp = hdr->b_hash_next;
|
|
|
|
hdr->b_hash_next = NULL;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* collect some hash table performance data */
|
2021-06-17 03:19:34 +03:00
|
|
|
atomic_dec_64(&arc_stats.arcstat_hash_elements.value.ui64);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (buf_hash_table.ht_table[idx] &&
|
|
|
|
buf_hash_table.ht_table[idx]->b_hash_next == NULL)
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_hash_chains);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Global data structures and functions for the buf kmem cache.
|
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
static kmem_cache_t *hdr_full_cache;
|
|
|
|
static kmem_cache_t *hdr_l2only_cache;
|
2008-11-20 23:01:55 +03:00
|
|
|
static kmem_cache_t *buf_cache;
|
|
|
|
|
|
|
|
static void
|
|
|
|
buf_fini(void)
|
|
|
|
{
|
2018-02-16 04:53:18 +03:00
|
|
|
#if defined(_KERNEL)
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* Large allocations which do not require contiguous pages
|
|
|
|
* should be using vmem_free() in the linux kernel\
|
|
|
|
*/
|
2010-08-26 22:46:09 +04:00
|
|
|
vmem_free(buf_hash_table.ht_table,
|
|
|
|
(buf_hash_table.ht_mask + 1) * sizeof (void *));
|
|
|
|
#else
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(buf_hash_table.ht_table,
|
|
|
|
(buf_hash_table.ht_mask + 1) * sizeof (void *));
|
2010-08-26 22:46:09 +04:00
|
|
|
#endif
|
2021-12-12 18:06:44 +03:00
|
|
|
for (int i = 0; i < BUF_LOCKS; i++)
|
2021-07-01 18:30:31 +03:00
|
|
|
mutex_destroy(BUF_HASH_LOCK(i));
|
2014-12-30 06:12:23 +03:00
|
|
|
kmem_cache_destroy(hdr_full_cache);
|
|
|
|
kmem_cache_destroy(hdr_l2only_cache);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_cache_destroy(buf_cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Constructor callback - called when the cache is empty
|
|
|
|
* and a new buf is requested.
|
|
|
|
*/
|
|
|
|
static int
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_full_cons(void *vbuf, void *unused, int kmflag)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused, (void) kmflag;
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_hdr_t *hdr = vbuf;
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(hdr, 0, HDR_FULL_SIZE);
|
2017-11-08 22:12:59 +03:00
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_create(&hdr->b_l1hdr.b_refcnt);
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_link_init(&hdr->b_l1hdr.b_arc_node);
|
2022-12-22 23:10:24 +03:00
|
|
|
list_link_init(&hdr->b_l2hdr.b_l2node);
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
hdr_l2only_cons(void *vbuf, void *unused, int kmflag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused, (void) kmflag;
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr = vbuf;
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(hdr, 0, HDR_L2ONLY_SIZE);
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
static int
|
|
|
|
buf_cons(void *vbuf, void *unused, int kmflag)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused, (void) kmflag;
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_buf_t *buf = vbuf;
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(buf, 0, sizeof (arc_buf_t));
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Destructor callback - called when a cached buf is
|
|
|
|
* no longer required.
|
|
|
|
*/
|
|
|
|
static void
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_full_dest(void *vbuf, void *unused)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused;
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr = vbuf;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_destroy(&hdr->b_l1hdr.b_refcnt);
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
hdr_l2only_dest(void *vbuf, void *unused)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused;
|
|
|
|
arc_buf_hdr_t *hdr = vbuf;
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
static void
|
|
|
|
buf_dest(void *vbuf, void *unused)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused;
|
2023-01-09 21:45:17 +03:00
|
|
|
(void) vbuf;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
buf_init(void)
|
|
|
|
{
|
2016-10-01 01:04:21 +03:00
|
|
|
uint64_t *ct = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t hsize = 1ULL << 12;
|
|
|
|
int i, j;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The hash table is big enough to fill all of physical memory
|
2014-08-20 21:09:40 +04:00
|
|
|
* with an average block size of zfs_arc_average_blocksize (default 8K).
|
|
|
|
* By default, the table will take up
|
|
|
|
* totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-10-31 22:24:54 +03:00
|
|
|
while (hsize * zfs_arc_average_blocksize < arc_all_memory())
|
2008-11-20 23:01:55 +03:00
|
|
|
hsize <<= 1;
|
|
|
|
retry:
|
|
|
|
buf_hash_table.ht_mask = hsize - 1;
|
2018-02-16 04:53:18 +03:00
|
|
|
#if defined(_KERNEL)
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* Large allocations which do not require contiguous pages
|
|
|
|
* should be using vmem_alloc() in the linux kernel
|
|
|
|
*/
|
2010-08-26 22:46:09 +04:00
|
|
|
buf_hash_table.ht_table =
|
|
|
|
vmem_zalloc(hsize * sizeof (void*), KM_SLEEP);
|
|
|
|
#else
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_hash_table.ht_table =
|
|
|
|
kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
|
2010-08-26 22:46:09 +04:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
if (buf_hash_table.ht_table == NULL) {
|
|
|
|
ASSERT(hsize > (1ULL << 8));
|
|
|
|
hsize >>= 1;
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
|
Remove skc_reclaim, hdr_recl, kmem_cache shrinker
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure. This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function. This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.
The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it. Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.
The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer. Therefore, additionally waking it from the
shrinker calbacks has little benefit.
The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second. Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times. Therefore, this mechanism consumes
significant amounts of CPU time.
The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
1. Return free objects from the magazine layer to the slab layer
2. Return entirely-free slabs to the page layer (i.e. free pages)
These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches. The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.
These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine. So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.
We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected. Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.
This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10576
2020-07-19 19:58:30 +03:00
|
|
|
0, hdr_full_cons, hdr_full_dest, NULL, NULL, NULL, 0);
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
|
Remove skc_reclaim, hdr_recl, kmem_cache shrinker
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure. This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function. This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.
The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it. Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.
The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer. Therefore, additionally waking it from the
shrinker calbacks has little benefit.
The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second. Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times. Therefore, this mechanism consumes
significant amounts of CPU time.
The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
1. Return free objects from the magazine layer to the slab layer
2. Return entirely-free slabs to the page layer (i.e. free pages)
These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches. The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.
These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine. So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.
We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected. Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.
This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10576
2020-07-19 19:58:30 +03:00
|
|
|
HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, NULL,
|
2014-12-30 06:12:23 +03:00
|
|
|
NULL, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
|
2008-12-03 23:09:06 +03:00
|
|
|
0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < 256; i++)
|
|
|
|
for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
|
|
|
|
*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
|
|
|
|
|
2021-07-01 18:30:31 +03:00
|
|
|
for (i = 0; i < BUF_LOCKS; i++)
|
|
|
|
mutex_init(BUF_HASH_LOCK(i), NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
#define ARC_MINTIME (hz>>4) /* 62 ms */
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* This is the size that the buf occupies in memory. If the buf is compressed,
|
|
|
|
* it will correspond to the compressed size. You should use this method of
|
|
|
|
* getting the buf size unless you explicitly need the logical size.
|
|
|
|
*/
|
|
|
|
uint64_t
|
|
|
|
arc_buf_size(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (ARC_BUF_COMPRESSED(buf) ?
|
|
|
|
HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
arc_buf_lsize(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (HDR_GET_LSIZE(buf->b_hdr));
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* This function will return B_TRUE if the buffer is encrypted in memory.
|
|
|
|
* This buffer can be decrypted by calling arc_untransform().
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
arc_is_encrypted(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (ARC_BUF_ENCRYPTED(buf) != 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns B_TRUE if the buffer represents data that has not had its MAC
|
|
|
|
* verified yet.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
arc_is_unauthenticated(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (HDR_NOAUTH(buf->b_hdr) != 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_get_raw_params(arc_buf_t *buf, boolean_t *byteorder, uint8_t *salt,
|
|
|
|
uint8_t *iv, uint8_t *mac)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
|
|
|
ASSERT(HDR_PROTECTED(hdr));
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);
|
|
|
|
memcpy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);
|
|
|
|
memcpy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
*byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?
|
|
|
|
ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Indicates how this buffer is compressed in memory. If it is not compressed
|
|
|
|
* the value will be ZIO_COMPRESS_OFF. It can be made normally readable with
|
|
|
|
* arc_untransform() as long as it is also unencrypted.
|
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
enum zio_compress
|
|
|
|
arc_get_compression(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (ARC_BUF_COMPRESSED(buf) ?
|
|
|
|
HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Return the compression algorithm used to store this data in the ARC. If ARC
|
|
|
|
* compression is enabled or this is an encrypted block, this will be the same
|
|
|
|
* as what's used to store it on-disk. Otherwise, this will be ZIO_COMPRESS_OFF.
|
|
|
|
*/
|
|
|
|
static inline enum zio_compress
|
|
|
|
arc_hdr_get_compress(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
return (HDR_COMPRESSION_ENABLED(hdr) ?
|
|
|
|
HDR_GET_COMPRESS(hdr) : ZIO_COMPRESS_OFF);
|
|
|
|
}
|
|
|
|
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
uint8_t
|
|
|
|
arc_get_complevel(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (buf->b_hdr->b_complevel);
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static inline boolean_t
|
|
|
|
arc_buf_is_shared(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
boolean_t shared = (buf->b_data != NULL &&
|
2016-07-22 18:52:49 +03:00
|
|
|
buf->b_hdr->b_l1hdr.b_pabd != NULL &&
|
|
|
|
abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&
|
|
|
|
buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));
|
2016-06-02 07:04:53 +03:00
|
|
|
IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));
|
2023-10-20 22:38:37 +03:00
|
|
|
EQUIV(shared, ARC_BUF_SHARED(buf));
|
2016-07-11 20:45:52 +03:00
|
|
|
IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* It would be nice to assert arc_can_share() too, but the "hdr isn't
|
|
|
|
* already being shared" requirement prevents us from doing that.
|
|
|
|
*/
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (shared);
|
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
/*
|
|
|
|
* Free the checksum associated with this header. If there is no checksum, this
|
|
|
|
* is a no-op.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
static inline void
|
|
|
|
arc_cksum_free(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
|
|
|
|
if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
|
|
|
|
kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
|
|
|
|
hdr->b_l1hdr.b_freeze_cksum = NULL;
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
/*
|
|
|
|
* Return true iff at least one of the bufs on hdr is not compressed.
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* Encrypted buffers count as compressed.
|
2017-04-12 00:56:54 +03:00
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY_OR_LOCKED(hdr));
|
2018-08-20 21:03:56 +03:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {
|
|
|
|
if (!ARC_BUF_COMPRESSED(b)) {
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
|
|
|
|
* matches the checksum that is stored in the hdr. If there is no checksum,
|
|
|
|
* or if the buf is compressed, this is a no-op.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_cksum_verify(arc_buf_t *buf)
|
|
|
|
{
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_cksum_t zc;
|
|
|
|
|
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
|
|
|
return;
|
|
|
|
|
2018-08-20 21:03:56 +03:00
|
|
|
if (ARC_BUF_COMPRESSED(buf))
|
2016-07-14 00:17:41 +03:00
|
|
|
return;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
|
|
|
mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
|
2018-08-20 21:03:56 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
|
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
|
2008-11-20 23:01:55 +03:00
|
|
|
panic("buffer modified while frozen!");
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* This function makes the assumption that data stored in the L2ARC
|
|
|
|
* will be transformed exactly as it is in the main pool. Because of
|
|
|
|
* this we can verify the checksum against the reading process's bp.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
static boolean_t
|
|
|
|
arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
|
|
|
|
VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Block pointers always store the checksum for the logical data.
|
|
|
|
* If the block pointer has the gang bit set, then the checksum
|
|
|
|
* it represents is for the reconstituted data and not for an
|
|
|
|
* individual gang member. The zio pipeline, however, must be able to
|
|
|
|
* determine the checksum of each of the gang constituents so it
|
|
|
|
* treats the checksum comparison differently than what we need
|
|
|
|
* for l2arc blocks. This prevents us from using the
|
|
|
|
* zio_checksum_error() interface directly. Instead we must call the
|
|
|
|
* zio_checksum_error_impl() so that we can ensure the checksum is
|
|
|
|
* generated using the correct checksum algorithm and accounts for the
|
|
|
|
* logical I/O size and not just a gang fragment.
|
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
return (zio_checksum_error_impl(zio->io_spa, zio->io_bp,
|
2016-07-22 18:52:49 +03:00
|
|
|
BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,
|
2016-06-02 07:04:53 +03:00
|
|
|
zio->io_offset, NULL) == 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
|
|
|
|
* checksum and attaches it to the buf's hdr so that we can ensure that the buf
|
|
|
|
* isn't modified later on. If buf is compressed or there is already a checksum
|
|
|
|
* on the hdr, this is a no-op (we only checksum uncompressed bufs).
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_cksum_compute(arc_buf_t *buf)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2023-01-05 20:15:31 +03:00
|
|
|
mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
|
2018-08-20 21:03:56 +03:00
|
|
|
if (hdr->b_l1hdr.b_freeze_cksum != NULL || ARC_BUF_COMPRESSED(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
2016-07-11 20:45:52 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!ARC_BUF_ENCRYPTED(buf));
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT(!ARC_BUF_COMPRESSED(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
|
|
|
|
KM_SLEEP);
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_freeze_cksum);
|
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_watch(buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef _KERNEL
|
|
|
|
void
|
|
|
|
arc_buf_sigsegv(int sig, siginfo_t *si, void *unused)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) sig, (void) unused;
|
2016-12-12 21:46:26 +03:00
|
|
|
panic("Got SIGSEGV at address: 0x%lx\n", (long)si->si_addr);
|
2013-05-17 01:18:06 +04:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_buf_unwatch(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
#ifndef _KERNEL
|
|
|
|
if (arc_watch) {
|
2017-04-12 00:56:54 +03:00
|
|
|
ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),
|
2013-05-17 01:18:06 +04:00
|
|
|
PROT_READ | PROT_WRITE));
|
|
|
|
}
|
2021-12-12 18:06:44 +03:00
|
|
|
#else
|
|
|
|
(void) buf;
|
2013-05-17 01:18:06 +04:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_buf_watch(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
#ifndef _KERNEL
|
|
|
|
if (arc_watch)
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),
|
2016-06-02 07:04:53 +03:00
|
|
|
PROT_READ));
|
2021-12-12 18:06:44 +03:00
|
|
|
#else
|
|
|
|
(void) buf;
|
2013-05-17 01:18:06 +04:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
static arc_buf_contents_t
|
|
|
|
arc_buf_type(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_contents_t type;
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_ISTYPE_METADATA(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
type = ARC_BUFC_METADATA;
|
2014-12-30 06:12:23 +03:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
type = ARC_BUFC_DATA;
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3U(hdr->b_type, ==, type);
|
|
|
|
return (type);
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
boolean_t
|
|
|
|
arc_is_metadata(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
static uint32_t
|
|
|
|
arc_bufc_to_flags(arc_buf_contents_t type)
|
|
|
|
{
|
|
|
|
switch (type) {
|
|
|
|
case ARC_BUFC_DATA:
|
|
|
|
/* metadata field is 0 if buffer contains normal data */
|
|
|
|
return (0);
|
|
|
|
case ARC_BUFC_METADATA:
|
|
|
|
return (ARC_FLAG_BUFC_METADATA);
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
panic("undefined ARC buffer type!");
|
|
|
|
return ((uint32_t)-1);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
arc_buf_thaw(arc_buf_t *buf)
|
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_cksum_verify(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
2018-08-20 21:03:56 +03:00
|
|
|
* Compressed buffers do not manipulate the b_freeze_cksum.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
2018-08-20 21:03:56 +03:00
|
|
|
if (ARC_BUF_COMPRESSED(buf))
|
2016-07-11 20:45:52 +03:00
|
|
|
return;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
arc_cksum_free(hdr);
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_unwatch(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_buf_freeze(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
|
|
|
return;
|
|
|
|
|
2018-08-20 21:03:56 +03:00
|
|
|
if (ARC_BUF_COMPRESSED(buf))
|
2016-07-11 20:45:52 +03:00
|
|
|
return;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-08-20 21:03:56 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(buf->b_hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_cksum_compute(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
|
|
|
|
* the following functions should be used to ensure that the flags are
|
|
|
|
* updated in a thread-safe way. When manipulating the flags either
|
|
|
|
* the hash_lock must be held or the hdr must be undiscoverable. This
|
|
|
|
* ensures that we're not racing with any other threads when updating
|
|
|
|
* the flags.
|
|
|
|
*/
|
|
|
|
static inline void
|
|
|
|
arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
|
|
|
|
{
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_flags |= flags;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
|
|
|
|
{
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_flags &= ~flags;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Setting the compression bits in the arc_buf_hdr_t's b_flags is
|
|
|
|
* done in a special way since we have to clear and set bits
|
|
|
|
* at the same time. Consumers that wish to set the compression bits
|
|
|
|
* must use this function to ensure that the flags are updated in
|
|
|
|
* thread-safe manner.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
|
|
|
|
{
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Holes and embedded blocks will always have a psize = 0 so
|
|
|
|
* we ignore the compression of the blkptr and set the
|
|
|
|
* want to uncompress them. Mark them as uncompressed.
|
|
|
|
*/
|
|
|
|
if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
|
|
|
|
ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
|
|
|
|
} else {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
|
|
|
|
ASSERT(HDR_COMPRESSION_ENABLED(hdr));
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
HDR_SET_COMPRESS(hdr, cmp);
|
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Looks for another buf on the same hdr which has the data decompressed, copies
|
|
|
|
* from it, and returns true. If no such buf exists, returns false.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
boolean_t copied = B_FALSE;
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
|
|
|
ASSERT(!ARC_BUF_COMPRESSED(buf));
|
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
|
2016-07-14 00:17:41 +03:00
|
|
|
from = from->b_next) {
|
|
|
|
/* can't use our own data buffer */
|
|
|
|
if (from == buf) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ARC_BUF_COMPRESSED(from)) {
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(buf->b_data, from->b_data, arc_buf_size(buf));
|
2016-07-14 00:17:41 +03:00
|
|
|
copied = B_TRUE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* There were no decompressed bufs, so there should not be a
|
|
|
|
* checksum on the hdr either.
|
|
|
|
*/
|
2019-07-03 23:01:54 +03:00
|
|
|
if (zfs_flags & ZFS_DEBUG_MODIFY)
|
|
|
|
EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
return (copied);
|
|
|
|
}
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Allocates an ARC buf header that's in an evicted & L2-cached state.
|
|
|
|
* This is used during l2arc reconstruction to make empty ARC buffers
|
|
|
|
* which circumvent the regular disk->arc->l2arc path and instead come
|
|
|
|
* into being in the reverse order, i.e. l2arc->arc.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static arc_buf_hdr_t *
|
2020-04-10 20:33:35 +03:00
|
|
|
arc_buf_alloc_l2only(size_t size, arc_buf_contents_t type, l2arc_dev_t *dev,
|
|
|
|
dva_t dva, uint64_t daddr, int32_t psize, uint64_t birth,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
enum zio_compress compress, uint8_t complevel, boolean_t protected,
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
boolean_t prefetch, arc_state_type_t arcs_state)
|
2020-04-10 20:33:35 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
|
|
|
|
ASSERT(size != 0);
|
|
|
|
hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);
|
|
|
|
hdr->b_birth = birth;
|
|
|
|
hdr->b_type = type;
|
|
|
|
hdr->b_flags = 0;
|
|
|
|
arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);
|
|
|
|
HDR_SET_LSIZE(hdr, size);
|
|
|
|
HDR_SET_PSIZE(hdr, psize);
|
|
|
|
arc_hdr_set_compress(hdr, compress);
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
hdr->b_complevel = complevel;
|
2020-04-10 20:33:35 +03:00
|
|
|
if (protected)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);
|
|
|
|
if (prefetch)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
|
|
|
|
hdr->b_spa = spa_load_guid(dev->l2ad_vdev->vdev_spa);
|
|
|
|
|
|
|
|
hdr->b_dva = dva;
|
|
|
|
|
|
|
|
hdr->b_l2hdr.b_dev = dev;
|
|
|
|
hdr->b_l2hdr.b_daddr = daddr;
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
hdr->b_l2hdr.b_arcs_state = arcs_state;
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
return (hdr);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_hdr_size(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
uint64_t size;
|
|
|
|
|
|
|
|
if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
HDR_GET_PSIZE(hdr) > 0) {
|
|
|
|
size = HDR_GET_PSIZE(hdr);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);
|
|
|
|
size = HDR_GET_LSIZE(hdr);
|
|
|
|
}
|
|
|
|
return (size);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
uint64_t csize;
|
|
|
|
uint64_t lsize = HDR_GET_LSIZE(hdr);
|
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
|
|
|
void *tmpbuf = NULL;
|
|
|
|
abd_t *abd = hdr->b_l1hdr.b_pabd;
|
|
|
|
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(HDR_AUTHENTICATED(hdr));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The MAC is calculated on the compressed data that is stored on disk.
|
|
|
|
* However, if compressed arc is disabled we will only have the
|
|
|
|
* decompressed data available to us now. Compress it into a temporary
|
|
|
|
* abd so we can verify the MAC. The performance overhead of this will
|
|
|
|
* be relatively low, since most objects in an encrypted objset will
|
|
|
|
* be encrypted (instead of authenticated) anyway.
|
|
|
|
*/
|
|
|
|
if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
!HDR_COMPRESSION_ENABLED(hdr)) {
|
2023-02-28 01:41:02 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
csize = zio_compress_data(HDR_GET_COMPRESS(hdr),
|
2023-02-28 01:41:02 +03:00
|
|
|
hdr->b_l1hdr.b_pabd, &tmpbuf, lsize, hdr->b_complevel);
|
|
|
|
ASSERT3P(tmpbuf, !=, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(csize, <=, psize);
|
2023-02-28 01:41:02 +03:00
|
|
|
abd = abd_get_from_buf(tmpbuf, lsize);
|
|
|
|
abd_take_ownership_of_buf(abd, B_TRUE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
abd_zero_off(abd, csize, psize - csize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Authentication is best effort. We authenticate whenever the key is
|
|
|
|
* available. If we succeed we clear ARC_FLAG_NOAUTH.
|
|
|
|
*/
|
|
|
|
if (hdr->b_crypt_hdr.b_ot == DMU_OT_OBJSET) {
|
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT3U(lsize, ==, psize);
|
|
|
|
ret = spa_do_crypt_objset_mac_abd(B_FALSE, spa, dsobj, abd,
|
|
|
|
psize, hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);
|
|
|
|
} else {
|
|
|
|
ret = spa_do_crypt_mac_abd(B_FALSE, spa, dsobj, abd, psize,
|
|
|
|
hdr->b_crypt_hdr.b_mac);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ret == 0)
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_NOAUTH);
|
|
|
|
else if (ret != ENOENT)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
if (tmpbuf != NULL)
|
|
|
|
abd_free(abd);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
error:
|
|
|
|
if (tmpbuf != NULL)
|
|
|
|
abd_free(abd);
|
|
|
|
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function will take a header that only has raw encrypted data in
|
|
|
|
* b_crypt_hdr.b_rabd and decrypt it into a new buffer which is stored in
|
|
|
|
* b_l1hdr.b_pabd. If designated in the header flags, this function will
|
|
|
|
* also decompress the data.
|
|
|
|
*/
|
|
|
|
static int
|
2018-05-03 01:36:20 +03:00
|
|
|
arc_hdr_decrypt(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
abd_t *cabd = NULL;
|
|
|
|
void *tmp = NULL;
|
|
|
|
boolean_t no_crypt = B_FALSE;
|
|
|
|
boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);
|
|
|
|
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(HDR_ENCRYPTED(hdr));
|
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_hdr_alloc_abd(hdr, 0);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2018-05-03 01:36:20 +03:00
|
|
|
ret = spa_do_crypt_abd(B_FALSE, spa, zb, hdr->b_crypt_hdr.b_ot,
|
|
|
|
B_FALSE, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv,
|
|
|
|
hdr->b_crypt_hdr.b_mac, HDR_GET_PSIZE(hdr), hdr->b_l1hdr.b_pabd,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
hdr->b_crypt_hdr.b_rabd, &no_crypt);
|
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
|
|
|
|
|
|
|
if (no_crypt) {
|
|
|
|
abd_copy(hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd,
|
|
|
|
HDR_GET_PSIZE(hdr));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this header has disabled arc compression but the b_pabd is
|
|
|
|
* compressed after decrypting it, we need to decompress the newly
|
|
|
|
* decrypted data.
|
|
|
|
*/
|
|
|
|
if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
!HDR_COMPRESSION_ENABLED(hdr)) {
|
|
|
|
/*
|
|
|
|
* We want to make sure that we are correctly honoring the
|
|
|
|
* zfs_abd_scatter_enabled setting, so we allocate an abd here
|
|
|
|
* and then loan a buffer from it, rather than allocating a
|
|
|
|
* linear buffer and wrapping it in an abd later.
|
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, 0);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr));
|
|
|
|
|
|
|
|
ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),
|
|
|
|
hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr),
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
HDR_GET_LSIZE(hdr), &hdr->b_complevel);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ret != 0) {
|
|
|
|
abd_return_buf(cabd, tmp, arc_hdr_size(hdr));
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr));
|
|
|
|
arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
|
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
hdr->b_l1hdr.b_pabd = cabd;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
error:
|
|
|
|
arc_hdr_free_abd(hdr, B_FALSE);
|
|
|
|
if (cabd != NULL)
|
|
|
|
arc_free_data_buf(hdr, cabd, arc_hdr_size(hdr), hdr);
|
|
|
|
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is called during arc_buf_fill() to prepare the header's
|
|
|
|
* abd plaintext pointer for use. This involves authenticated protected
|
|
|
|
* data and decrypting encrypted data into the plaintext abd.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
arc_fill_hdr_crypt(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, spa_t *spa,
|
2018-05-03 01:36:20 +03:00
|
|
|
const zbookmark_phys_t *zb, boolean_t noauth)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ASSERT(HDR_PROTECTED(hdr));
|
|
|
|
|
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
|
|
|
|
if (HDR_NOAUTH(hdr) && !noauth) {
|
|
|
|
/*
|
|
|
|
* The caller requested authenticated data but our data has
|
|
|
|
* not been authenticated yet. Verify the MAC now if we can.
|
|
|
|
*/
|
2018-05-03 01:36:20 +03:00
|
|
|
ret = arc_hdr_authenticate(hdr, spa, zb->zb_objset);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
|
|
|
} else if (HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd == NULL) {
|
|
|
|
/*
|
|
|
|
* If we only have the encrypted version of the data, but the
|
|
|
|
* unencrypted version was requested we take this opportunity
|
|
|
|
* to store the decrypted version in the header for future use.
|
|
|
|
*/
|
2018-05-03 01:36:20 +03:00
|
|
|
ret = arc_hdr_decrypt(hdr, spa, zb);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
|
|
|
|
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
error:
|
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is used by the dbuf code to decrypt bonus buffers in place.
|
|
|
|
* The dbuf code itself doesn't have any locking for decrypting a shared dnode
|
|
|
|
* block, so we use the hash lock here to protect against concurrent calls to
|
|
|
|
* arc_buf_fill().
|
|
|
|
*/
|
|
|
|
static void
|
2021-12-12 18:06:44 +03:00
|
|
|
arc_buf_untransform_in_place(arc_buf_t *buf)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
|
|
|
ASSERT(HDR_ENCRYPTED(hdr));
|
|
|
|
ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
|
|
|
|
|
|
|
zio_crypt_copy_dnode_bonus(hdr->b_l1hdr.b_pabd, buf->b_data,
|
|
|
|
arc_buf_size(buf));
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
|
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Given a buf that has a data buffer attached to it, this function will
|
|
|
|
* efficiently fill the buf with data of the specified compression setting from
|
|
|
|
* the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
|
|
|
|
* are already sharing a data buf, no copy is performed.
|
|
|
|
*
|
|
|
|
* If the buf is marked as compressed but uncompressed data was requested, this
|
|
|
|
* will allocate a new data buffer for the buf, remove that flag, and fill the
|
|
|
|
* buf with uncompressed data. You can't request a compressed buf on a hdr with
|
|
|
|
* uncompressed data, and (since we haven't added support for it yet) if you
|
|
|
|
* want compressed data your buf must already be marked as compressed and have
|
|
|
|
* the correct-sized data buffer.
|
|
|
|
*/
|
|
|
|
static int
|
2018-05-03 01:36:20 +03:00
|
|
|
arc_buf_fill(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,
|
|
|
|
arc_fill_flags_t flags)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
int error = 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
boolean_t hdr_compressed =
|
|
|
|
(arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
|
|
|
|
boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0;
|
|
|
|
boolean_t encrypted = (flags & ARC_FILL_ENCRYPTED) != 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
kmutex_t *hash_lock = (flags & ARC_FILL_LOCKED) ? NULL : HDR_LOCK(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
IMPLY(compressed, hdr_compressed || ARC_BUF_ENCRYPTED(buf));
|
2016-07-14 00:17:41 +03:00
|
|
|
IMPLY(compressed, ARC_BUF_COMPRESSED(buf));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
IMPLY(encrypted, HDR_ENCRYPTED(hdr));
|
|
|
|
IMPLY(encrypted, ARC_BUF_ENCRYPTED(buf));
|
|
|
|
IMPLY(encrypted, ARC_BUF_COMPRESSED(buf));
|
2023-10-20 22:38:37 +03:00
|
|
|
IMPLY(encrypted, !arc_buf_is_shared(buf));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the caller wanted encrypted data we just need to copy it from
|
|
|
|
* b_rabd and potentially byteswap it. We won't be able to do any
|
|
|
|
* further transforms on it.
|
|
|
|
*/
|
|
|
|
if (encrypted) {
|
|
|
|
ASSERT(HDR_HAS_RABD(hdr));
|
|
|
|
abd_copy_to_buf(buf->b_data, hdr->b_crypt_hdr.b_rabd,
|
|
|
|
HDR_GET_PSIZE(hdr));
|
|
|
|
goto byteswap;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* Adjust encrypted and authenticated headers to accommodate
|
2018-06-28 19:20:34 +03:00
|
|
|
* the request if needed. Dnode blocks (ARC_FILL_IN_PLACE) are
|
|
|
|
* allowed to fail decryption due to keys not being loaded
|
|
|
|
* without being marked as an IO error.
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
*/
|
|
|
|
if (HDR_PROTECTED(hdr)) {
|
|
|
|
error = arc_fill_hdr_crypt(hdr, hash_lock, spa,
|
2018-05-03 01:36:20 +03:00
|
|
|
zb, !!(flags & ARC_FILL_NOAUTH));
|
2018-06-28 19:20:34 +03:00
|
|
|
if (error == EACCES && (flags & ARC_FILL_IN_PLACE) != 0) {
|
|
|
|
return (error);
|
|
|
|
} else if (error != 0) {
|
2018-06-06 20:16:41 +03:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_enter(hash_lock);
|
2018-05-01 21:24:20 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
|
2018-06-06 20:16:41 +03:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
return (error);
|
2018-05-01 21:24:20 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There is a special case here for dnode blocks which are
|
|
|
|
* decrypting their bonus buffers. These blocks may request to
|
|
|
|
* be decrypted in-place. This is necessary because there may
|
|
|
|
* be many dnodes pointing into this buffer and there is
|
|
|
|
* currently no method to synchronize replacing the backing
|
|
|
|
* b_data buffer and updating all of the pointers. Here we use
|
|
|
|
* the hash lock to ensure there are no races. If the need
|
|
|
|
* arises for other types to be decrypted in-place, they must
|
|
|
|
* add handling here as well.
|
|
|
|
*/
|
|
|
|
if ((flags & ARC_FILL_IN_PLACE) != 0) {
|
|
|
|
ASSERT(!hdr_compressed);
|
|
|
|
ASSERT(!compressed);
|
|
|
|
ASSERT(!encrypted);
|
|
|
|
|
|
|
|
if (HDR_ENCRYPTED(hdr) && ARC_BUF_ENCRYPTED(buf)) {
|
|
|
|
ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);
|
|
|
|
|
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_enter(hash_lock);
|
2021-12-12 18:06:44 +03:00
|
|
|
arc_buf_untransform_in_place(buf);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
/* Compute the hdr's checksum if necessary */
|
|
|
|
arc_cksum_compute(buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
if (hdr_compressed == compressed) {
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf)) {
|
|
|
|
ASSERT(arc_buf_is_shared(buf));
|
|
|
|
} else {
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_buf_size(buf));
|
2016-07-11 20:45:52 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT(hdr_compressed);
|
|
|
|
ASSERT(!compressed);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* If the buf is sharing its data with the hdr, unlink it and
|
|
|
|
* allocate a new data buffer for the buf.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf)) {
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT(ARC_BUF_COMPRESSED(buf));
|
|
|
|
|
2019-09-03 03:56:41 +03:00
|
|
|
/* We need to give the buf its own b_data */
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
|
2016-07-11 20:45:52 +03:00
|
|
|
buf->b_data =
|
|
|
|
arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/* Previously overhead was 0; just add new overhead */
|
2016-07-11 20:45:52 +03:00
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));
|
2016-07-14 00:17:41 +03:00
|
|
|
} else if (ARC_BUF_COMPRESSED(buf)) {
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/* We need to reallocate the buf's b_data */
|
|
|
|
arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),
|
|
|
|
buf);
|
|
|
|
buf->b_data =
|
|
|
|
arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
|
|
|
|
|
|
|
|
/* We increased the size of b_data; update overhead */
|
|
|
|
ARCSTAT_INCR(arcstat_overhead_size,
|
|
|
|
HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Regardless of the buf's previous compression settings, it
|
|
|
|
* should not be compressed at the end of this function.
|
|
|
|
*/
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try copying the data from another buf which already has a
|
|
|
|
* decompressed version. If that's not possible, it's time to
|
|
|
|
* bite the bullet and decompress the data from the hdr.
|
|
|
|
*/
|
|
|
|
if (arc_buf_try_copy_decompressed_data(buf)) {
|
|
|
|
/* Skip byteswapping and checksumming (already done) */
|
|
|
|
return (0);
|
|
|
|
} else {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
|
2016-07-22 18:52:49 +03:00
|
|
|
hdr->b_l1hdr.b_pabd, buf->b_data,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr),
|
|
|
|
&hdr->b_complevel);
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Absent hardware errors or software bugs, this should
|
|
|
|
* be impossible, but log it anyway so we can debug it.
|
|
|
|
*/
|
|
|
|
if (error != 0) {
|
|
|
|
zfs_dbgmsg(
|
2019-04-05 04:57:06 +03:00
|
|
|
"hdr %px, compress %d, psize %d, lsize %d",
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
hdr, arc_hdr_get_compress(hdr),
|
2016-07-14 00:17:41 +03:00
|
|
|
HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
|
2018-06-06 20:16:41 +03:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_enter(hash_lock);
|
2018-05-01 21:24:20 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
|
2018-06-06 20:16:41 +03:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
2016-07-14 00:17:41 +03:00
|
|
|
return (SET_ERROR(EIO));
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
}
|
2016-07-14 00:17:41 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
byteswap:
|
2016-07-14 00:17:41 +03:00
|
|
|
/* Byteswap the buf's data if necessary */
|
2016-06-02 07:04:53 +03:00
|
|
|
if (bswap != DMU_BSWAP_NUMFUNCS) {
|
|
|
|
ASSERT(!HDR_SHARED_DATA(hdr));
|
|
|
|
ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);
|
|
|
|
dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/* Compute the hdr's checksum if necessary */
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_cksum_compute(buf);
|
2016-07-14 00:17:41 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* If this function is being called to decrypt an encrypted buffer or verify an
|
|
|
|
* authenticated one, the key must be loaded and a mapping must be made
|
|
|
|
* available in the keystore via spa_keystore_create_mapping() or one of its
|
|
|
|
* callers.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
int
|
2018-03-31 21:12:51 +03:00
|
|
|
arc_untransform(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,
|
|
|
|
boolean_t in_place)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
2018-03-31 21:12:51 +03:00
|
|
|
int ret;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_fill_flags_t flags = 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (in_place)
|
|
|
|
flags |= ARC_FILL_IN_PLACE;
|
|
|
|
|
2018-05-03 01:36:20 +03:00
|
|
|
ret = arc_buf_fill(buf, spa, zb, flags);
|
2018-03-31 21:12:51 +03:00
|
|
|
if (ret == ECKSUM) {
|
|
|
|
/*
|
|
|
|
* Convert authentication and decryption errors to EIO
|
|
|
|
* (and generate an ereport) before leaving the ARC.
|
|
|
|
*/
|
|
|
|
ret = SET_ERROR(EIO);
|
2023-03-29 02:51:58 +03:00
|
|
|
spa_log_error(spa, zb, &buf->b_hdr->b_birth);
|
2020-09-01 05:35:11 +03:00
|
|
|
(void) zfs_ereport_post(FM_EREPORT_ZFS_AUTHENTICATION,
|
2020-09-04 20:34:28 +03:00
|
|
|
spa, NULL, zb, NULL, 0);
|
2018-03-31 21:12:51 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return (ret);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Increment the amount of evictable space in the arc_state_t's refcount.
|
|
|
|
* We account for the space used by the hdr and the arc buf individually
|
|
|
|
* so that we can add and remove them from the refcount individually.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)
|
|
|
|
{
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
|
|
|
if (GHOST_STATE(state)) {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(&state->arcs_esize[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (HDR_HAS_RABD(hdr)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(&state->arcs_esize[type],
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
HDR_GET_PSIZE(hdr), hdr);
|
|
|
|
}
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
|
|
|
|
buf = buf->b_next) {
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Decrement the amount of evictable space in the arc_state_t's refcount.
|
|
|
|
* We account for the space used by the hdr and the arc buf individually
|
|
|
|
* so that we can add and remove them from the refcount individually.
|
|
|
|
*/
|
|
|
|
static void
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
|
|
|
if (GHOST_STATE(state)) {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_esize[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (HDR_HAS_RABD(hdr)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_esize[type],
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
HDR_GET_PSIZE(hdr), hdr);
|
|
|
|
}
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
|
|
|
|
buf = buf->b_next) {
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add a reference to this hdr indicating that someone is actively
|
|
|
|
* referencing that memory. When the refcount transitions from 0 to 1,
|
|
|
|
* we remove it from the respective arc_state_t list to indicate that
|
|
|
|
* it is not evictable.
|
|
|
|
*/
|
|
|
|
static void
|
2022-04-19 21:38:30 +03:00
|
|
|
add_reference(arc_buf_hdr_t *hdr, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2019-03-16 00:17:38 +03:00
|
|
|
if (!HDR_EMPTY(hdr) && !MUTEX_HELD(HDR_LOCK(hdr))) {
|
2022-12-22 23:10:24 +03:00
|
|
|
ASSERT(state == arc_anon);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-09-26 20:29:26 +03:00
|
|
|
if ((zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&
|
2022-12-22 23:10:24 +03:00
|
|
|
state != arc_anon && state != arc_l2c_only) {
|
2014-12-30 06:12:23 +03:00
|
|
|
/* We don't use the L2-only state list. */
|
2022-12-22 23:10:24 +03:00
|
|
|
multilist_remove(&state->arcs_list[arc_buf_type(hdr)], hdr);
|
|
|
|
arc_evictable_space_decrement(hdr, state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Remove a reference from this hdr. When the reference transitions from
|
|
|
|
* 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's
|
|
|
|
* list making it eligible for eviction.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
2022-12-22 23:10:24 +03:00
|
|
|
remove_reference(arc_buf_hdr_t *hdr, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int cnt;
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2022-12-22 23:10:24 +03:00
|
|
|
ASSERT(state == arc_anon || MUTEX_HELD(HDR_LOCK(hdr)));
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
ASSERT(!GHOST_STATE(state)); /* arc_l2c_only counts as a ghost. */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if ((cnt = zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) != 0)
|
|
|
|
return (cnt);
|
|
|
|
|
|
|
|
if (state == arc_anon) {
|
|
|
|
arc_hdr_destroy(hdr);
|
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if (state == arc_uncached && !HDR_PREFETCH(hdr)) {
|
|
|
|
arc_change_state(arc_anon, hdr);
|
|
|
|
arc_hdr_destroy(hdr);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
multilist_insert(&state->arcs_list[arc_buf_type(hdr)], hdr);
|
|
|
|
arc_evictable_space_increment(hdr, state);
|
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-10-03 04:11:19 +04:00
|
|
|
/*
|
|
|
|
* Returns detailed information about a specific arc buffer. When the
|
|
|
|
* state_index argument is set the function will calculate the arc header
|
|
|
|
* list position for its arc state. Since this requires a linear traversal
|
|
|
|
* callers are strongly encourage not to do this. However, it can be helpful
|
|
|
|
* for targeted analysis so the functionality is provided.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) state_index;
|
2013-10-03 04:11:19 +04:00
|
|
|
arc_buf_hdr_t *hdr = ab->b_hdr;
|
2014-12-30 06:12:23 +03:00
|
|
|
l1arc_buf_hdr_t *l1hdr = NULL;
|
|
|
|
l2arc_buf_hdr_t *l2hdr = NULL;
|
|
|
|
arc_state_t *state = NULL;
|
|
|
|
|
2016-07-10 17:09:02 +03:00
|
|
|
memset(abi, 0, sizeof (arc_buf_info_t));
|
|
|
|
|
|
|
|
if (hdr == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
abi->abi_flags = hdr->b_flags;
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
|
|
|
l1hdr = &hdr->b_l1hdr;
|
|
|
|
state = l1hdr->b_state;
|
|
|
|
}
|
|
|
|
if (HDR_HAS_L2HDR(hdr))
|
|
|
|
l2hdr = &hdr->b_l2hdr;
|
2013-10-03 04:11:19 +04:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (l1hdr) {
|
2023-10-06 18:56:17 +03:00
|
|
|
abi->abi_bufcnt = 0;
|
|
|
|
for (arc_buf_t *buf = l1hdr->b_buf; buf; buf = buf->b_next)
|
|
|
|
abi->abi_bufcnt++;
|
2014-12-30 06:12:23 +03:00
|
|
|
abi->abi_access = l1hdr->b_arc_access;
|
|
|
|
abi->abi_mru_hits = l1hdr->b_mru_hits;
|
|
|
|
abi->abi_mru_ghost_hits = l1hdr->b_mru_ghost_hits;
|
|
|
|
abi->abi_mfu_hits = l1hdr->b_mfu_hits;
|
|
|
|
abi->abi_mfu_ghost_hits = l1hdr->b_mfu_ghost_hits;
|
2018-10-01 20:42:05 +03:00
|
|
|
abi->abi_holds = zfs_refcount_count(&l1hdr->b_refcnt);
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (l2hdr) {
|
|
|
|
abi->abi_l2arc_dattr = l2hdr->b_daddr;
|
|
|
|
abi->abi_l2arc_hits = l2hdr->b_hits;
|
|
|
|
}
|
|
|
|
|
2013-10-03 04:11:19 +04:00
|
|
|
abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON;
|
2014-12-30 06:12:23 +03:00
|
|
|
abi->abi_state_contents = arc_buf_type(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
abi->abi_size = arc_hdr_size(hdr);
|
2013-10-03 04:11:19 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Move the supplied buffer to the indicated state. The hash lock
|
2008-11-20 23:01:55 +03:00
|
|
|
* for the buffer must be held by the caller.
|
|
|
|
*/
|
|
|
|
static void
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_state_t *old_state;
|
|
|
|
int64_t refcnt;
|
2016-06-02 07:04:53 +03:00
|
|
|
boolean_t update_old, update_new;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We almost always have an L1 hdr here, since we call arc_hdr_realloc()
|
|
|
|
* in arc_read() when bringing a buffer out of the L2ARC. However, the
|
|
|
|
* L1 hdr doesn't always exist when we change state to arc_anon before
|
|
|
|
* destroying a header, in which case reallocating to add the L1 hdr is
|
|
|
|
* pointless.
|
|
|
|
*/
|
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
|
|
|
old_state = hdr->b_l1hdr.b_state;
|
2018-10-01 20:42:05 +03:00
|
|
|
refcnt = zfs_refcount_count(&hdr->b_l1hdr.b_refcnt);
|
2023-10-06 18:56:17 +03:00
|
|
|
update_old = (hdr->b_l1hdr.b_buf != NULL ||
|
|
|
|
hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));
|
2022-12-22 23:10:24 +03:00
|
|
|
|
|
|
|
IMPLY(GHOST_STATE(old_state), hdr->b_l1hdr.b_buf == NULL);
|
|
|
|
IMPLY(GHOST_STATE(new_state), hdr->b_l1hdr.b_buf == NULL);
|
2023-10-06 18:56:17 +03:00
|
|
|
IMPLY(old_state == arc_anon, hdr->b_l1hdr.b_buf == NULL ||
|
|
|
|
ARC_BUF_LAST(hdr->b_l1hdr.b_buf));
|
2014-12-30 06:12:23 +03:00
|
|
|
} else {
|
|
|
|
old_state = arc_l2c_only;
|
|
|
|
refcnt = 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
update_old = B_FALSE;
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
update_new = update_old;
|
2022-12-22 23:10:24 +03:00
|
|
|
if (GHOST_STATE(old_state))
|
|
|
|
update_old = B_TRUE;
|
|
|
|
if (GHOST_STATE(new_state))
|
|
|
|
update_new = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
ASSERT3P(new_state, !=, old_state);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this buffer is evictable, transfer it from the
|
|
|
|
* old state list to the new state list.
|
|
|
|
*/
|
|
|
|
if (refcnt == 0) {
|
2014-12-30 06:12:23 +03:00
|
|
|
if (old_state != arc_anon && old_state != arc_l2c_only) {
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
/* remove_reference() saves on insert. */
|
|
|
|
if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
multilist_remove(&old_state->arcs_list[type],
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
hdr);
|
|
|
|
arc_evictable_space_decrement(hdr, old_state);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
if (new_state != arc_anon && new_state != arc_l2c_only) {
|
|
|
|
/*
|
|
|
|
* An L1 header always exists here, since if we're
|
|
|
|
* moving to some L1-cached state (i.e. not l2c_only or
|
|
|
|
* anonymous), we realloc the header to add an L1hdr
|
|
|
|
* beforehand.
|
|
|
|
*/
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
multilist_insert(&new_state->arcs_list[type], hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_evictable_space_increment(hdr, new_state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!HDR_EMPTY(hdr));
|
2014-12-06 20:24:32 +03:00
|
|
|
if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
|
|
|
|
buf_hash_remove(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/* adjust state sizes (ignore arc_l2c_only) */
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (update_new && new_state != arc_l2c_only) {
|
2015-06-27 01:14:45 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
if (GHOST_STATE(new_state)) {
|
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* When moving a header to a ghost state, we first
|
2023-10-06 18:56:17 +03:00
|
|
|
* remove all arc buffers. Thus, we'll have no arc
|
|
|
|
* buffer to use for the reference. As a result, we
|
|
|
|
* use the arc header pointer for the reference.
|
2015-06-27 01:14:45 +03:00
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
(void) zfs_refcount_add_many(
|
|
|
|
&new_state->arcs_size[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2015-06-27 01:14:45 +03:00
|
|
|
} else {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Each individual buffer holds a unique reference,
|
|
|
|
* thus we must remove each of these references one
|
|
|
|
* at a time.
|
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
|
2015-06-27 01:14:45 +03:00
|
|
|
buf = buf->b_next) {
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* When the arc_buf_t is sharing the data
|
|
|
|
* block with the hdr, the owner of the
|
|
|
|
* reference belongs to the hdr. Only
|
|
|
|
* add to the refcount if the arc_buf_t is
|
|
|
|
* not shared.
|
|
|
|
*/
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&new_state->arcs_size[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&new_state->arcs_size[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_size(hdr), hdr);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (HDR_HAS_RABD(hdr)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&new_state->arcs_size[type],
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
HDR_GET_PSIZE(hdr), hdr);
|
2015-06-27 01:14:45 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (update_old && old_state != arc_l2c_only) {
|
2015-06-27 01:14:45 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
if (GHOST_STATE(old_state)) {
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2015-06-27 01:14:45 +03:00
|
|
|
/*
|
|
|
|
* When moving a header off of a ghost state,
|
2016-06-02 07:04:53 +03:00
|
|
|
* the header will not contain any arc buffers.
|
|
|
|
* We use the arc header pointer for the reference
|
|
|
|
* which is exactly what we did when we put the
|
|
|
|
* header on the ghost state.
|
2015-06-27 01:14:45 +03:00
|
|
|
*/
|
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
(void) zfs_refcount_remove_many(
|
|
|
|
&old_state->arcs_size[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
2015-06-27 01:14:45 +03:00
|
|
|
} else {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Each individual buffer holds a unique reference,
|
|
|
|
* thus we must remove each of these references one
|
|
|
|
* at a time.
|
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;
|
2015-06-27 01:14:45 +03:00
|
|
|
buf = buf->b_next) {
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* When the arc_buf_t is sharing the data
|
|
|
|
* block with the hdr, the owner of the
|
|
|
|
* reference belongs to the hdr. Only
|
|
|
|
* add to the refcount if the arc_buf_t is
|
|
|
|
* not shared.
|
|
|
|
*/
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&old_state->arcs_size[type],
|
|
|
|
arc_buf_size(buf), buf);
|
2015-06-27 01:14:45 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_pabd != NULL ||
|
|
|
|
HDR_HAS_RABD(hdr));
|
|
|
|
|
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&old_state->arcs_size[type],
|
|
|
|
arc_hdr_size(hdr), hdr);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (HDR_HAS_RABD(hdr)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&old_state->arcs_size[type],
|
|
|
|
HDR_GET_PSIZE(hdr), hdr);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
2015-06-27 01:14:45 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-06-27 01:14:45 +03:00
|
|
|
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_state = new_state;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr) && new_state != arc_l2c_only) {
|
|
|
|
l2arc_hdr_arcstats_decrement_state(hdr);
|
|
|
|
hdr->b_l2hdr.b_arcs_state = new_state->arcs_state;
|
|
|
|
l2arc_hdr_arcstats_increment_state(hdr);
|
|
|
|
}
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_consume(uint64_t space, arc_space_type_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2009-02-18 23:51:31 +03:00
|
|
|
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
|
|
|
|
|
|
|
|
switch (type) {
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2009-02-18 23:51:31 +03:00
|
|
|
case ARC_SPACE_DATA:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_data_size, space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
2014-02-04 00:41:47 +04:00
|
|
|
case ARC_SPACE_META:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_metadata_size, space);
|
2014-02-04 00:41:47 +04:00
|
|
|
break;
|
2016-07-13 15:42:40 +03:00
|
|
|
case ARC_SPACE_BONUS:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_bonus_size, space);
|
2016-07-13 15:42:40 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_DNODE:
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
ARCSTAT_INCR(arcstat_dnode_size, space);
|
2016-07-13 15:42:40 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_DBUF:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_dbuf_size, space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_HDRS:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_hdr_size, space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_L2HDRS:
|
2021-06-17 03:19:34 +03:00
|
|
|
aggsum_add(&arc_sums.arcstat_l2_hdr_size, space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-18 06:04:04 +03:00
|
|
|
case ARC_SPACE_ABD_CHUNK_WASTE:
|
|
|
|
/*
|
|
|
|
* Note: this includes space wasted by all scatter ABD's, not
|
|
|
|
* just those allocated by the ARC. But the vast majority of
|
|
|
|
* scatter ABD's come from the ARC, because other users are
|
|
|
|
* very short-lived.
|
|
|
|
*/
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_abd_chunk_waste_size, space);
|
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-18 06:04:04 +03:00
|
|
|
break;
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
|
|
|
|
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-18 06:04:04 +03:00
|
|
|
if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
ARCSTAT_INCR(arcstat_meta_used, space);
|
2014-02-04 00:41:47 +04:00
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
aggsum_add(&arc_sums.arcstat_size, space);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_return(uint64_t space, arc_space_type_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2009-02-18 23:51:31 +03:00
|
|
|
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
|
|
|
|
|
|
|
|
switch (type) {
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2009-02-18 23:51:31 +03:00
|
|
|
case ARC_SPACE_DATA:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_data_size, -space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
2014-02-04 00:41:47 +04:00
|
|
|
case ARC_SPACE_META:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_metadata_size, -space);
|
2014-02-04 00:41:47 +04:00
|
|
|
break;
|
2016-07-13 15:42:40 +03:00
|
|
|
case ARC_SPACE_BONUS:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_bonus_size, -space);
|
2016-07-13 15:42:40 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_DNODE:
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
ARCSTAT_INCR(arcstat_dnode_size, -space);
|
2016-07-13 15:42:40 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_DBUF:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_dbuf_size, -space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_HDRS:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_hdr_size, -space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_L2HDRS:
|
2021-06-17 03:19:34 +03:00
|
|
|
aggsum_add(&arc_sums.arcstat_l2_hdr_size, -space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-18 06:04:04 +03:00
|
|
|
case ARC_SPACE_ABD_CHUNK_WASTE:
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_INCR(arcstat_abd_chunk_waste_size, -space);
|
Include scatter_chunk_waste in arc_size
The ARC caches data in scatter ABD's, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up, it only accounts for the
size that it requested from the ABD subsystem.
Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.
If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.
However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been been observed to be 10-30% of
memory.
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10701
2020-08-18 06:04:04 +03:00
|
|
|
break;
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)
|
|
|
|
ARCSTAT_INCR(arcstat_meta_used, -space);
|
2014-02-04 00:41:47 +04:00
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
ASSERT(aggsum_compare(&arc_sums.arcstat_size, space) >= 0);
|
|
|
|
aggsum_add(&arc_sums.arcstat_size, -space);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* Given a hdr and a buf, returns whether that buf can share its b_data buffer
|
2016-07-22 18:52:49 +03:00
|
|
|
* with the hdr's b_pabd.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
2016-07-14 00:17:41 +03:00
|
|
|
static boolean_t
|
|
|
|
arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The criteria for sharing a hdr's data are:
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* 1. the buffer is not encrypted
|
|
|
|
* 2. the hdr's compression matches the buf's compression
|
|
|
|
* 3. the hdr doesn't need to be byteswapped
|
|
|
|
* 4. the hdr isn't already being shared
|
|
|
|
* 5. the buf is either compressed or it is the last buf in the hdr list
|
2016-07-14 00:17:41 +03:00
|
|
|
*
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* Criterion #5 maintains the invariant that shared uncompressed
|
2016-07-14 00:17:41 +03:00
|
|
|
* bufs must be the final buf in the hdr's b_buf list. Reading this, you
|
|
|
|
* might ask, "if a compressed buf is allocated first, won't that be the
|
|
|
|
* last thing in the list?", but in that case it's impossible to create
|
|
|
|
* a shared uncompressed buf anyway (because the hdr must be compressed
|
|
|
|
* to have the compressed buf). You might also think that #3 is
|
|
|
|
* sufficient to make this guarantee, however it's possible
|
|
|
|
* (specifically in the rare L2ARC write race mentioned in
|
|
|
|
* arc_buf_alloc_impl()) there will be an existing uncompressed buf that
|
2019-09-03 03:56:41 +03:00
|
|
|
* is shareable, but wasn't at the time of its allocation. Rather than
|
2016-07-14 00:17:41 +03:00
|
|
|
* allow a new shared uncompressed buf to be created and then shuffle
|
|
|
|
* the list around to make it the last element, this simply disallows
|
|
|
|
* sharing if the new buf isn't the first to be added.
|
|
|
|
*/
|
|
|
|
ASSERT3P(buf->b_hdr, ==, hdr);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
boolean_t hdr_compressed =
|
|
|
|
arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF;
|
2017-04-12 00:56:54 +03:00
|
|
|
boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
return (!ARC_BUF_ENCRYPTED(buf) &&
|
|
|
|
buf_compressed == hdr_compressed &&
|
2016-07-14 00:17:41 +03:00
|
|
|
hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
|
|
|
|
!HDR_SHARED_DATA(hdr) &&
|
|
|
|
(ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a buf for this hdr. If you care about the data that's in the hdr,
|
|
|
|
* or if you want a compressed buffer, pass those flags in. Returns 0 if the
|
|
|
|
* copy was made successfully, or an error code otherwise.
|
|
|
|
*/
|
|
|
|
static int
|
2018-05-03 01:36:20 +03:00
|
|
|
arc_buf_alloc_impl(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb,
|
2022-04-19 21:38:30 +03:00
|
|
|
const void *tag, boolean_t encrypted, boolean_t compressed,
|
|
|
|
boolean_t noauth, boolean_t fill, arc_buf_t **ret)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
arc_buf_t *buf;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_fill_flags_t flags = ARC_FILL_LOCKED;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
|
|
|
|
VERIFY(hdr->b_type == ARC_BUFC_DATA ||
|
|
|
|
hdr->b_type == ARC_BUFC_METADATA);
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT3P(ret, !=, NULL);
|
|
|
|
ASSERT3P(*ret, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
IMPLY(encrypted, compressed);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf->b_hdr = hdr;
|
|
|
|
buf->b_data = NULL;
|
2016-07-11 20:45:52 +03:00
|
|
|
buf->b_next = hdr->b_l1hdr.b_buf;
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags = 0;
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
add_reference(hdr, tag);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We're about to change the hdr's b_flags. We must either
|
|
|
|
* hold the hash_lock or be undiscoverable.
|
|
|
|
*/
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* Only honor requests for compressed bufs if the hdr is actually
|
2019-09-03 03:56:41 +03:00
|
|
|
* compressed. This must be overridden if the buffer is encrypted since
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* encrypted buffers cannot be decompressed.
|
2016-07-14 00:17:41 +03:00
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (encrypted) {
|
|
|
|
buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
|
|
|
|
buf->b_flags |= ARC_BUF_FLAG_ENCRYPTED;
|
|
|
|
flags |= ARC_FILL_COMPRESSED | ARC_FILL_ENCRYPTED;
|
|
|
|
} else if (compressed &&
|
|
|
|
arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
flags |= ARC_FILL_COMPRESSED;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (noauth) {
|
|
|
|
ASSERT0(encrypted);
|
|
|
|
flags |= ARC_FILL_NOAUTH;
|
|
|
|
}
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the hdr's data can be shared then we share the data buffer and
|
|
|
|
* set the appropriate bit in the hdr's b_flags to indicate the hdr is
|
single-chunk scatter ABDs can be treated as linear
Scatter ABD's are allocated from a number of pages. In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer. Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer. This can have a
measurable performance overhead on some workloads.
https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter. For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path. For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).
This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.
All single-page (4K) ABD's can be represented this way. Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages). This is often the case for 2-page (8K)
ABD's.
Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.
Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space. This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8580
2019-06-11 19:02:31 +03:00
|
|
|
* sharing it's b_pabd with the arc_buf_t. Otherwise, we allocate a new
|
|
|
|
* buffer to store the buf's data.
|
2016-07-14 00:17:41 +03:00
|
|
|
*
|
2016-07-22 18:52:49 +03:00
|
|
|
* There are two additional restrictions here because we're sharing
|
|
|
|
* hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be
|
|
|
|
* actively involved in an L2ARC write, because if this buf is used by
|
|
|
|
* an arc_write() then the hdr's data buffer will be released when the
|
2016-07-14 00:17:41 +03:00
|
|
|
* write completes, even though the L2ARC write might still be using it.
|
2016-07-22 18:52:49 +03:00
|
|
|
* Second, the hdr's ABD must be linear so that the buf's user doesn't
|
single-chunk scatter ABDs can be treated as linear
Scatter ABD's are allocated from a number of pages. In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer. Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer. This can have a
measurable performance overhead on some workloads.
https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter. For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path. For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).
This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.
All single-page (4K) ABD's can be represented this way. Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages). This is often the case for 2-page (8K)
ABD's.
Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.
Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space. This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8580
2019-06-11 19:02:31 +03:00
|
|
|
* need to be ABD-aware. It must be allocated via
|
|
|
|
* zio_[data_]buf_alloc(), not as a page, because we need to be able
|
|
|
|
* to abd_release_ownership_of_buf(), which isn't allowed on "linear
|
|
|
|
* page" buffers because the ABD code needs to handle freeing them
|
|
|
|
* specially.
|
|
|
|
*/
|
|
|
|
boolean_t can_share = arc_can_share(hdr, buf) &&
|
|
|
|
!HDR_L2_WRITING(hdr) &&
|
|
|
|
hdr->b_l1hdr.b_pabd != NULL &&
|
|
|
|
abd_is_linear(hdr->b_l1hdr.b_pabd) &&
|
|
|
|
!abd_is_linear_page(hdr->b_l1hdr.b_pabd);
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/* Set up b_data and sharing */
|
|
|
|
if (can_share) {
|
2016-07-22 18:52:49 +03:00
|
|
|
buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags |= ARC_BUF_FLAG_SHARED;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
|
|
|
|
} else {
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_data =
|
|
|
|
arc_get_data_buf(hdr, arc_buf_size(buf), buf);
|
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
VERIFY3P(buf->b_data, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
hdr->b_l1hdr.b_buf = buf;
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* If the user wants the data from the hdr, we need to either copy or
|
|
|
|
* decompress the data.
|
|
|
|
*/
|
|
|
|
if (fill) {
|
2018-05-03 01:36:20 +03:00
|
|
|
ASSERT3P(zb, !=, NULL);
|
|
|
|
return (arc_buf_fill(buf, spa, zb, flags));
|
2016-07-14 00:17:41 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2022-04-19 21:38:30 +03:00
|
|
|
static const char *arc_onloan_tag = "onloan";
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
static inline void
|
|
|
|
arc_loaned_bytes_update(int64_t delta)
|
|
|
|
{
|
|
|
|
atomic_add_64(&arc_loaned_bytes, delta);
|
|
|
|
|
|
|
|
/* assert that it did not wrap around */
|
|
|
|
ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Loan out an anonymous arc buffer. Loaned buffers are not counted as in
|
|
|
|
* flight data by arc_tempreserve_space() until they are "returned". Loaned
|
|
|
|
* buffers must be returned to the arc before they can be used by the DMU or
|
|
|
|
* freed.
|
|
|
|
*/
|
|
|
|
arc_buf_t *
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
|
|
|
|
is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2018-02-20 08:31:14 +03:00
|
|
|
arc_loaned_bytes_update(arc_buf_size(buf));
|
2017-04-12 00:56:54 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_t *
|
|
|
|
arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
enum zio_compress compression_type, uint8_t complevel)
|
2016-07-11 20:45:52 +03:00
|
|
|
{
|
|
|
|
arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
psize, lsize, compression_type, complevel);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2018-02-20 08:31:14 +03:00
|
|
|
arc_loaned_bytes_update(arc_buf_size(buf));
|
2017-04-12 00:56:54 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_buf_t *
|
|
|
|
arc_loan_raw_buf(spa_t *spa, uint64_t dsobj, boolean_t byteorder,
|
|
|
|
const uint8_t *salt, const uint8_t *iv, const uint8_t *mac,
|
|
|
|
dmu_object_type_t ot, uint64_t psize, uint64_t lsize,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
enum zio_compress compression_type, uint8_t complevel)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
arc_buf_t *buf = arc_alloc_raw_buf(spa, arc_onloan_tag, dsobj,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
byteorder, salt, iv, mac, ot, psize, lsize, compression_type,
|
|
|
|
complevel);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
atomic_add_64(&arc_loaned_bytes, psize);
|
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Return a loaned arc buffer to the arc.
|
|
|
|
*/
|
|
|
|
void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_return_buf(arc_buf_t *buf, const void *tag)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2018-09-26 20:29:26 +03:00
|
|
|
(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag);
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_loaned_bytes_update(-arc_buf_size(buf));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Detach an arc_buf from a dbuf (tag) */
|
|
|
|
void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_loan_inuse_buf(arc_buf_t *buf, const void *tag)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2018-09-26 20:29:26 +03:00
|
|
|
(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_loaned_bytes_update(arc_buf_size(buf));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
2016-07-22 18:52:49 +03:00
|
|
|
l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
df->l2df_abd = abd;
|
2016-06-02 07:04:53 +03:00
|
|
|
df->l2df_size = size;
|
|
|
|
df->l2df_type = type;
|
|
|
|
mutex_enter(&l2arc_free_on_write_mtx);
|
|
|
|
list_insert_head(l2arc_free_on_write, df);
|
|
|
|
mutex_exit(&l2arc_free_on_write_mtx);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_on_write(arc_buf_hdr_t *hdr, boolean_t free_rdata)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);
|
2012-12-22 02:57:09 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* protected by hash lock, if in the hash table */
|
|
|
|
if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(state != arc_anon && state != arc_l2c_only);
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_esize[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
size, hdr);
|
2012-12-22 02:57:09 +04:00
|
|
|
}
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_size[type], size, hdr);
|
2017-02-28 01:47:33 +03:00
|
|
|
if (type == ARC_BUFC_METADATA) {
|
|
|
|
arc_space_return(size, ARC_SPACE_META);
|
|
|
|
} else {
|
|
|
|
ASSERT(type == ARC_BUFC_DATA);
|
|
|
|
arc_space_return(size, ARC_SPACE_DATA);
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (free_rdata) {
|
|
|
|
l2arc_free_abd_on_write(hdr->b_crypt_hdr.b_rabd, size, type);
|
|
|
|
} else {
|
|
|
|
l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Share the arc_buf_t's data with the hdr. Whenever we are sharing the
|
|
|
|
* data buffer, we transfer the refcount ownership to the hdr and update
|
|
|
|
* the appropriate kstats.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT(arc_can_share(hdr, buf));
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!ARC_BUF_ENCRYPTED(buf));
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Start sharing the data buffer. We transfer the
|
|
|
|
* refcount ownership to the hdr since it always owns
|
|
|
|
* the refcount whenever an arc_buf_t is shared.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
zfs_refcount_transfer_ownership_many(
|
|
|
|
&hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)],
|
2018-10-09 00:58:21 +03:00
|
|
|
arc_hdr_size(hdr), buf, hdr);
|
2016-07-22 18:52:49 +03:00
|
|
|
hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));
|
|
|
|
abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,
|
|
|
|
HDR_ISTYPE_METADATA(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags |= ARC_BUF_FLAG_SHARED;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Since we've transferred ownership to the hdr we need
|
|
|
|
* to increment its compressed and uncompressed kstats and
|
|
|
|
* decrement the overhead size.
|
|
|
|
*/
|
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
|
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
static void
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
2015-01-13 06:52:19 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(arc_buf_is_shared(buf));
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* We are no longer sharing this buffer so we need
|
|
|
|
* to transfer its ownership to the rightful owner.
|
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
zfs_refcount_transfer_ownership_many(
|
|
|
|
&hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)],
|
2018-10-09 00:58:21 +03:00
|
|
|
arc_hdr_size(hdr), hdr, buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);
|
2021-01-20 22:24:37 +03:00
|
|
|
abd_free(hdr->b_l1hdr.b_pabd);
|
2016-07-22 18:52:49 +03:00
|
|
|
hdr->b_l1hdr.b_pabd = NULL;
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since the buffer is no longer shared between
|
|
|
|
* the arc buf and the hdr, count it as overhead.
|
|
|
|
*/
|
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
|
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Remove an arc_buf_t from the hdr's buf list and return the last
|
|
|
|
* arc_buf_t on the list. If no buffers remain on the list then return
|
|
|
|
* NULL.
|
|
|
|
*/
|
|
|
|
static arc_buf_t *
|
|
|
|
arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;
|
|
|
|
arc_buf_t *lastbuf = NULL;
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* Remove the buf from the hdr list and locate the last
|
|
|
|
* remaining buffer on the list.
|
|
|
|
*/
|
|
|
|
while (*bufp != NULL) {
|
|
|
|
if (*bufp == buf)
|
|
|
|
*bufp = buf->b_next;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we've removed a buffer in the middle of
|
|
|
|
* the list then update the lastbuf and update
|
|
|
|
* bufp.
|
|
|
|
*/
|
|
|
|
if (*bufp != NULL) {
|
|
|
|
lastbuf = *bufp;
|
|
|
|
bufp = &(*bufp)->b_next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
buf->b_next = NULL;
|
|
|
|
ASSERT3P(lastbuf, !=, buf);
|
|
|
|
IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));
|
|
|
|
|
|
|
|
return (lastbuf);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's
|
2016-07-11 20:45:52 +03:00
|
|
|
* list and free it.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_destroy_impl(arc_buf_t *buf)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* Free up the data associated with the buf but only if we're not
|
|
|
|
* sharing this with the hdr. If we are sharing it with the hdr, the
|
|
|
|
* hdr is responsible for doing the free.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (buf->b_data != NULL) {
|
|
|
|
/*
|
|
|
|
* We're about to change the hdr's b_flags. We must either
|
|
|
|
* hold the hash_lock or be undiscoverable.
|
|
|
|
*/
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_cksum_verify(buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_unwatch(buf);
|
|
|
|
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
|
|
|
|
} else {
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
2016-07-11 20:45:52 +03:00
|
|
|
uint64_t size = arc_buf_size(buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_free_data_buf(hdr, buf->b_data, size, buf);
|
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, -size);
|
|
|
|
}
|
|
|
|
buf->b_data = NULL;
|
|
|
|
|
2023-10-06 18:56:17 +03:00
|
|
|
/*
|
|
|
|
* If we have no more encrypted buffers and we've already
|
|
|
|
* gotten a copy of the decrypted data we can free b_rabd
|
|
|
|
* to save some space.
|
|
|
|
*/
|
|
|
|
if (ARC_BUF_ENCRYPTED(buf) && HDR_HAS_RABD(hdr) &&
|
|
|
|
hdr->b_l1hdr.b_pabd != NULL && !HDR_IO_IN_PROGRESS(hdr)) {
|
|
|
|
arc_buf_t *b;
|
|
|
|
for (b = hdr->b_l1hdr.b_buf; b; b = b->b_next) {
|
|
|
|
if (b != buf && ARC_BUF_ENCRYPTED(b))
|
|
|
|
break;
|
2017-11-18 02:11:39 +03:00
|
|
|
}
|
2023-10-06 18:56:17 +03:00
|
|
|
if (b == NULL)
|
|
|
|
arc_hdr_free_abd(hdr, B_TRUE);
|
2017-09-28 18:49:13 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* If the current arc_buf_t is sharing its data buffer with the
|
2016-07-22 18:52:49 +03:00
|
|
|
* hdr, then reassign the hdr's b_pabd to share it with the new
|
2016-07-14 00:17:41 +03:00
|
|
|
* buffer at the end of the list. The shared buffer is always
|
|
|
|
* the last one on the hdr's buffer list.
|
|
|
|
*
|
|
|
|
* There is an equivalent case for compressed bufs, but since
|
|
|
|
* they aren't guaranteed to be the last buf in the list and
|
|
|
|
* that is an exceedingly rare case, we just allow that space be
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* wasted temporarily. We must also be careful not to share
|
|
|
|
* encrypted buffers, since they cannot be shared.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (lastbuf != NULL && !ARC_BUF_ENCRYPTED(lastbuf)) {
|
2016-07-14 00:17:41 +03:00
|
|
|
/* Only one buf can be shared at once */
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(lastbuf));
|
2016-07-14 00:17:41 +03:00
|
|
|
/* hdr is uncompressed so can't have compressed buf */
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!ARC_BUF_COMPRESSED(lastbuf));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_abd(hdr, B_FALSE);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* We must setup a new shared block between the
|
|
|
|
* last buffer and the hdr. The data would have
|
|
|
|
* been allocated by the arc buf so we need to transfer
|
|
|
|
* ownership to the hdr since it's now being shared.
|
|
|
|
*/
|
|
|
|
arc_share_buf(hdr, lastbuf);
|
|
|
|
}
|
|
|
|
} else if (HDR_SHARED_DATA(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Uncompressed shared buffers are always at the end
|
|
|
|
* of the list. Compressed buffers don't have the
|
|
|
|
* same requirements. This makes it hard to
|
|
|
|
* simply assert that the lastbuf is shared so
|
|
|
|
* we rely on the hdr's compression flags to determine
|
|
|
|
* if we have a compressed, shared buffer.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3P(lastbuf, !=, NULL);
|
|
|
|
ASSERT(arc_buf_is_shared(lastbuf) ||
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
/*
|
|
|
|
* Free the checksum if we're removing the last uncompressed buf from
|
|
|
|
* this hdr.
|
|
|
|
*/
|
|
|
|
if (!arc_hdr_has_uncompressed_buf(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_cksum_free(hdr);
|
2017-04-12 00:56:54 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/* clean up the buf */
|
|
|
|
buf->b_hdr = NULL;
|
|
|
|
kmem_cache_free(buf_cache, buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2020-08-12 20:03:24 +03:00
|
|
|
arc_hdr_alloc_abd(arc_buf_hdr_t *hdr, int alloc_flags)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
uint64_t size;
|
2020-08-12 20:03:24 +03:00
|
|
|
boolean_t alloc_rdata = ((alloc_flags & ARC_HDR_ALLOC_RDATA) != 0);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_SHARED_DATA(hdr) || alloc_rdata);
|
|
|
|
IMPLY(alloc_rdata, HDR_PROTECTED(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (alloc_rdata) {
|
|
|
|
size = HDR_GET_PSIZE(hdr);
|
|
|
|
ASSERT3P(hdr->b_crypt_hdr.b_rabd, ==, NULL);
|
2020-08-12 20:03:24 +03:00
|
|
|
hdr->b_crypt_hdr.b_rabd = arc_get_data_abd(hdr, size, hdr,
|
2021-08-17 19:15:54 +03:00
|
|
|
alloc_flags);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3P(hdr->b_crypt_hdr.b_rabd, !=, NULL);
|
|
|
|
ARCSTAT_INCR(arcstat_raw_size, size);
|
|
|
|
} else {
|
|
|
|
size = arc_hdr_size(hdr);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
2020-08-12 20:03:24 +03:00
|
|
|
hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, size, hdr,
|
2021-08-17 19:15:54 +03:00
|
|
|
alloc_flags);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, size);
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_abd(arc_buf_hdr_t *hdr, boolean_t free_rdata)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));
|
|
|
|
IMPLY(free_rdata, HDR_HAS_RABD(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* If the hdr is currently being written to the l2arc then
|
|
|
|
* we defer freeing the data by adding it to the l2arc_free_on_write
|
|
|
|
* list. The l2arc will free the data once it's finished
|
|
|
|
* writing it to the l2arc device.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_L2_WRITING(hdr)) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_on_write(hdr, free_rdata);
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_free_on_write);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
} else if (free_rdata) {
|
|
|
|
arc_free_data_abd(hdr, hdr->b_crypt_hdr.b_rabd, size, hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, size, hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (free_rdata) {
|
|
|
|
hdr->b_crypt_hdr.b_rabd = NULL;
|
|
|
|
ARCSTAT_INCR(arcstat_raw_size, -size);
|
|
|
|
} else {
|
|
|
|
hdr->b_l1hdr.b_pabd = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (hdr->b_l1hdr.b_pabd == NULL && !HDR_HAS_RABD(hdr))
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
|
|
|
|
|
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, -size);
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
|
|
|
|
2021-08-17 19:15:54 +03:00
|
|
|
/*
|
|
|
|
* Allocate empty anonymous ARC header. The header will get its identity
|
|
|
|
* assigned and buffers attached later as part of read or write operations.
|
|
|
|
*
|
|
|
|
* In case of read arc_read() assigns header its identify (b_dva + b_birth),
|
|
|
|
* inserts it into ARC hash to become globally visible and allocates physical
|
|
|
|
* (b_pabd) or raw (b_rabd) ABD buffer to read into from disk. On disk read
|
|
|
|
* completion arc_read_done() allocates ARC buffer(s) as needed, potentially
|
|
|
|
* sharing one of them with the physical ABD buffer.
|
|
|
|
*
|
|
|
|
* In case of write arc_alloc_buf() allocates ARC buffer to be filled with
|
|
|
|
* data. Then after compression and/or encryption arc_write_ready() allocates
|
|
|
|
* and fills (or potentially shares) physical (b_pabd) or raw (b_rabd) ABD
|
|
|
|
* buffer. On disk write completion arc_write_done() assigns the header its
|
|
|
|
* new identity (b_dva + b_birth) and inserts into ARC hash.
|
|
|
|
*
|
|
|
|
* In case of partial overwrite the old data is read first as described. Then
|
|
|
|
* arc_release() either allocates new anonymous ARC header and moves the ARC
|
|
|
|
* buffer to it, or reuses the old ARC header by discarding its identity and
|
|
|
|
* removing it from ARC hash. After buffer modification normal write process
|
|
|
|
* follows as described.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
static arc_buf_hdr_t *
|
|
|
|
arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
boolean_t protected, enum zio_compress compression_type, uint8_t complevel,
|
2021-08-17 19:15:54 +03:00
|
|
|
arc_buf_contents_t type)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
|
|
|
|
VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
|
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
2023-10-03 18:57:48 +03:00
|
|
|
hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2016-06-02 07:04:53 +03:00
|
|
|
HDR_SET_PSIZE(hdr, psize);
|
|
|
|
HDR_SET_LSIZE(hdr, lsize);
|
|
|
|
hdr->b_spa = spa;
|
|
|
|
hdr->b_type = type;
|
|
|
|
hdr->b_flags = 0;
|
|
|
|
arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_hdr_set_compress(hdr, compression_type);
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
hdr->b_complevel = complevel;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (protected)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_state = arc_anon;
|
|
|
|
hdr->b_l1hdr.b_arc_access = 0;
|
2021-08-17 18:50:31 +03:00
|
|
|
hdr->b_l1hdr.b_mru_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mru_ghost_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_ghost_hits = 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_buf = NULL;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
2014-07-15 11:43:18 +04:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Transition between the two allocation states for the arc_buf_hdr struct.
|
|
|
|
* The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without
|
|
|
|
* (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller
|
|
|
|
* version is used when a cache buffer is only in the L2ARC in order to reduce
|
|
|
|
* memory usage.
|
2014-07-15 11:43:18 +04:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
static arc_buf_hdr_t *
|
|
|
|
arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-11-04 23:25:13 +03:00
|
|
|
ASSERT(HDR_HAS_L2HDR(hdr));
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *nhdr;
|
|
|
|
l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||
|
|
|
|
(old == hdr_l2only_cache && new == hdr_full_cache));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
|
|
|
|
buf_hash_remove(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(nhdr, hdr, HDR_L2ONLY_SIZE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
2023-10-03 18:57:48 +03:00
|
|
|
if (new == hdr_full_cache) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
|
|
|
|
/*
|
|
|
|
* arc_access and arc_change_state need to be aware that a
|
|
|
|
* header has just come out of L2ARC, so we set its state to
|
|
|
|
* l2c_only even though it's about to change.
|
|
|
|
*/
|
|
|
|
nhdr->b_l1hdr.b_state = arc_l2c_only;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* Verify previous threads set to NULL before freeing */
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* If we've reached here, We must have been called from
|
|
|
|
* arc_evict_hdr(), as such we should have already been
|
|
|
|
* removed from any ghost list we were previously on
|
|
|
|
* (which protects us from racing with arc_evict_state),
|
|
|
|
* thus no locking is needed during this check.
|
|
|
|
*/
|
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2012-12-22 02:57:09 +04:00
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* A buffer must not be moved into the arc_l2c_only
|
|
|
|
* state if it's not finished being written out to the
|
2016-07-22 18:52:49 +03:00
|
|
|
* l2arc device. Otherwise, the b_l1hdr.b_pabd field
|
2016-06-02 07:04:53 +03:00
|
|
|
* might try to be accessed, even though it was removed.
|
2012-12-22 02:57:09 +04:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY(!HDR_L2_WRITING(hdr));
|
2016-07-22 18:52:49 +03:00
|
|
|
VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* The header has been reallocated so we need to re-insert it into any
|
|
|
|
* lists it was on.
|
|
|
|
*/
|
|
|
|
(void) buf_hash_insert(nhdr, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We must place the realloc'ed header back into the list at
|
|
|
|
* the same spot. Otherwise, if it's placed earlier in the list,
|
|
|
|
* l2arc_write_buffers() could find it during the function's
|
|
|
|
* write phase, and try to write it out to the l2arc.
|
|
|
|
*/
|
|
|
|
list_insert_after(&dev->l2ad_buflist, hdr, nhdr);
|
|
|
|
list_remove(&dev->l2ad_buflist, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Since we're using the pointer address as the tag when
|
|
|
|
* incrementing and decrementing the l2ad_alloc refcount, we
|
|
|
|
* must remove the old pointer (that we're about to destroy) and
|
|
|
|
* add the new pointer to the refcount. Otherwise we'd remove
|
|
|
|
* the wrong pointer address when calling arc_hdr_destroy() later.
|
|
|
|
*/
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&dev->l2ad_alloc,
|
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
(void) zfs_refcount_add_many(&dev->l2ad_alloc,
|
|
|
|
arc_hdr_size(nhdr), nhdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
buf_discard_identity(hdr);
|
|
|
|
kmem_cache_free(old, hdr);
|
|
|
|
|
|
|
|
return (nhdr);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* This function is used by the send / receive code to convert a newly
|
|
|
|
* allocated arc_buf_t to one that is suitable for a raw encrypted write. It
|
2019-09-03 03:56:41 +03:00
|
|
|
* is also used to allow the root objset block to be updated without altering
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* its embedded MACs. Both block types will always be uncompressed so we do not
|
|
|
|
* have to worry about compression type or psize.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder,
|
|
|
|
dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv,
|
|
|
|
const uint8_t *mac)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
|
|
|
ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET);
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
|
|
|
|
|
|
|
|
buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED);
|
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
2023-10-03 18:57:48 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
hdr->b_crypt_hdr.b_dsobj = dsobj;
|
|
|
|
hdr->b_crypt_hdr.b_ot = ot;
|
|
|
|
hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
|
|
|
|
DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
|
|
|
|
if (!arc_hdr_has_uncompressed_buf(hdr))
|
|
|
|
arc_cksum_free(hdr);
|
|
|
|
|
|
|
|
if (salt != NULL)
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (iv != NULL)
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (mac != NULL)
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.
|
|
|
|
* The buf is returned thawed since we expect the consumer to modify it.
|
|
|
|
*/
|
|
|
|
arc_buf_t *
|
2022-04-19 21:38:30 +03:00
|
|
|
arc_alloc_buf(spa_t *spa, const void *tag, arc_buf_contents_t type,
|
|
|
|
int32_t size)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
|
2021-08-17 19:15:54 +03:00
|
|
|
B_FALSE, ZIO_COMPRESS_OFF, 0, type);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_buf_t *buf = NULL;
|
2018-05-03 01:36:20 +03:00
|
|
|
VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_FALSE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
B_FALSE, B_FALSE, &buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_thaw(buf);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
|
|
|
|
* for bufs containing metadata.
|
|
|
|
*/
|
|
|
|
arc_buf_t *
|
2022-04-19 21:38:30 +03:00
|
|
|
arc_alloc_compressed_buf(spa_t *spa, const void *tag, uint64_t psize,
|
|
|
|
uint64_t lsize, enum zio_compress compression_type, uint8_t complevel)
|
2016-07-11 20:45:52 +03:00
|
|
|
{
|
|
|
|
ASSERT3U(lsize, >, 0);
|
|
|
|
ASSERT3U(lsize, >=, psize);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(compression_type, >, ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
|
2021-08-17 19:15:54 +03:00
|
|
|
B_FALSE, compression_type, complevel, ARC_BUFC_DATA);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_buf_t *buf = NULL;
|
2018-05-03 01:36:20 +03:00
|
|
|
VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
B_TRUE, B_FALSE, B_FALSE, &buf));
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_thaw(buf);
|
|
|
|
|
2021-08-17 19:15:54 +03:00
|
|
|
/*
|
|
|
|
* To ensure that the hdr has the correct data in it if we call
|
|
|
|
* arc_untransform() on this buf before it's been written to disk,
|
|
|
|
* it's easiest if we just set up sharing between the buf and the hdr.
|
|
|
|
*/
|
|
|
|
arc_share_buf(hdr, buf);
|
2016-07-22 18:52:49 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_buf_t *
|
2022-04-19 21:38:30 +03:00
|
|
|
arc_alloc_raw_buf(spa_t *spa, const void *tag, uint64_t dsobj,
|
|
|
|
boolean_t byteorder, const uint8_t *salt, const uint8_t *iv,
|
|
|
|
const uint8_t *mac, dmu_object_type_t ot, uint64_t psize, uint64_t lsize,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
enum zio_compress compression_type, uint8_t complevel)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
arc_buf_t *buf;
|
|
|
|
arc_buf_contents_t type = DMU_OT_IS_METADATA(ot) ?
|
|
|
|
ARC_BUFC_METADATA : ARC_BUFC_DATA;
|
|
|
|
|
|
|
|
ASSERT3U(lsize, >, 0);
|
|
|
|
ASSERT3U(lsize, >=, psize);
|
|
|
|
ASSERT3U(compression_type, >=, ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);
|
|
|
|
|
|
|
|
hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_TRUE,
|
2021-08-17 19:15:54 +03:00
|
|
|
compression_type, complevel, type);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
hdr->b_crypt_hdr.b_dsobj = dsobj;
|
|
|
|
hdr->b_crypt_hdr.b_ot = ot;
|
|
|
|
hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?
|
|
|
|
DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);
|
|
|
|
memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);
|
|
|
|
memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This buffer will be considered encrypted even if the ot is not an
|
|
|
|
* encrypted type. It will become authenticated instead in
|
|
|
|
* arc_write_ready().
|
|
|
|
*/
|
|
|
|
buf = NULL;
|
2018-05-03 01:36:20 +03:00
|
|
|
VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_TRUE, B_TRUE,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
B_FALSE, B_FALSE, &buf));
|
|
|
|
arc_buf_thaw(buf);
|
|
|
|
|
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
static void
|
|
|
|
l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,
|
|
|
|
boolean_t state_only)
|
|
|
|
{
|
|
|
|
l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
|
|
|
|
l2arc_dev_t *dev = l2hdr->b_dev;
|
|
|
|
uint64_t lsize = HDR_GET_LSIZE(hdr);
|
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
|
|
|
uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
|
|
|
|
arc_buf_contents_t type = hdr->b_type;
|
|
|
|
int64_t lsize_s;
|
|
|
|
int64_t psize_s;
|
|
|
|
int64_t asize_s;
|
|
|
|
|
|
|
|
if (incr) {
|
|
|
|
lsize_s = lsize;
|
|
|
|
psize_s = psize;
|
|
|
|
asize_s = asize;
|
|
|
|
} else {
|
|
|
|
lsize_s = -lsize;
|
|
|
|
psize_s = -psize;
|
|
|
|
asize_s = -asize;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* If the buffer is a prefetch, count it as such. */
|
|
|
|
if (HDR_PREFETCH(hdr)) {
|
|
|
|
ARCSTAT_INCR(arcstat_l2_prefetch_asize, asize_s);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We use the value stored in the L2 header upon initial
|
|
|
|
* caching in L2ARC. This value will be updated in case
|
|
|
|
* an MRU/MRU_ghost buffer transitions to MFU but the L2ARC
|
|
|
|
* metadata (log entry) cannot currently be updated. Having
|
|
|
|
* the ARC state in the L2 header solves the problem of a
|
|
|
|
* possibly absent L1 header (apparent in buffers restored
|
|
|
|
* from persistent L2ARC).
|
|
|
|
*/
|
|
|
|
switch (hdr->b_l2hdr.b_arcs_state) {
|
|
|
|
case ARC_STATE_MRU_GHOST:
|
|
|
|
case ARC_STATE_MRU:
|
|
|
|
ARCSTAT_INCR(arcstat_l2_mru_asize, asize_s);
|
|
|
|
break;
|
|
|
|
case ARC_STATE_MFU_GHOST:
|
|
|
|
case ARC_STATE_MFU:
|
|
|
|
ARCSTAT_INCR(arcstat_l2_mfu_asize, asize_s);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (state_only)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ARCSTAT_INCR(arcstat_l2_psize, psize_s);
|
|
|
|
ARCSTAT_INCR(arcstat_l2_lsize, lsize_s);
|
|
|
|
|
|
|
|
switch (type) {
|
|
|
|
case ARC_BUFC_DATA:
|
|
|
|
ARCSTAT_INCR(arcstat_l2_bufc_data_asize, asize_s);
|
|
|
|
break;
|
|
|
|
case ARC_BUFC_METADATA:
|
|
|
|
ARCSTAT_INCR(arcstat_l2_bufc_metadata_asize, asize_s);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2015-06-16 02:12:19 +03:00
|
|
|
static void
|
|
|
|
arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
|
|
|
|
l2arc_dev_t *dev = l2hdr->b_dev;
|
2019-01-31 20:16:39 +03:00
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
|
|
|
uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
|
|
|
|
ASSERT(HDR_HAS_L2HDR(hdr));
|
|
|
|
|
|
|
|
list_remove(&dev->l2ad_buflist, hdr);
|
|
|
|
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
l2arc_hdr_arcstats_decrement(hdr);
|
2019-01-31 20:16:39 +03:00
|
|
|
vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2019-01-31 20:16:39 +03:00
|
|
|
(void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr),
|
|
|
|
hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
|
2015-06-16 02:12:19 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_hdr_destroy(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!HDR_IN_HASH_TABLE(hdr));
|
|
|
|
|
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
2015-06-16 02:12:19 +03:00
|
|
|
l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
|
|
|
|
boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-06-16 02:12:19 +03:00
|
|
|
if (!buflist_held)
|
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
2015-06-16 02:12:19 +03:00
|
|
|
* Even though we checked this conditional above, we
|
|
|
|
* need to check this again now that we have the
|
|
|
|
* l2ad_mtx. This is because we could be racing with
|
|
|
|
* another thread calling l2arc_evict() which might have
|
|
|
|
* destroyed this header's L2 portion as we were waiting
|
|
|
|
* to acquire the l2ad_mtx. If that happens, we don't
|
|
|
|
* want to re-destroy the header's L2 portion.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2021-09-16 19:40:15 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
|
|
|
|
|
|
|
if (!HDR_EMPTY(hdr))
|
|
|
|
buf_discard_identity(hdr);
|
|
|
|
|
2015-06-16 02:12:19 +03:00
|
|
|
arc_hdr_l2hdr_destroy(hdr);
|
2021-09-16 19:40:15 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (!buflist_held)
|
2015-06-16 02:12:19 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-03-16 00:17:38 +03:00
|
|
|
/*
|
|
|
|
* The header's identify can only be safely discarded once it is no
|
|
|
|
* longer discoverable. This requires removing it from the hash table
|
|
|
|
* and the l2arc header list. After this point the hash lock can not
|
|
|
|
* be used to protect the header.
|
|
|
|
*/
|
|
|
|
if (!HDR_EMPTY(hdr))
|
|
|
|
buf_discard_identity(hdr);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
|
|
|
arc_cksum_free(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
while (hdr->b_l1hdr.b_buf != NULL)
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-03-16 00:17:38 +03:00
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_abd(hdr, B_FALSE);
|
|
|
|
|
2017-09-28 18:49:13 +03:00
|
|
|
if (HDR_HAS_RABD(hdr))
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_abd(hdr, B_TRUE);
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT3P(hdr->b_hash_next, ==, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
|
|
|
#endif
|
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
2023-10-03 18:57:48 +03:00
|
|
|
kmem_cache_free(hdr_full_cache, hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
} else {
|
|
|
|
kmem_cache_free(hdr_l2only_cache, hdr);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_buf_destroy(arc_buf_t *buf, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon) {
|
2023-10-06 18:56:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);
|
|
|
|
ASSERT(ARC_BUF_LAST(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2022-12-22 23:10:24 +03:00
|
|
|
VERIFY0(remove_reference(hdr, tag));
|
2016-06-02 07:04:53 +03:00
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-03-16 00:17:38 +03:00
|
|
|
kmutex_t *hash_lock = HDR_LOCK(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(hash_lock);
|
2019-03-16 00:17:38 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr, ==, buf->b_hdr);
|
2023-10-06 18:56:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);
|
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_destroy_impl(buf);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
(void) remove_reference(hdr, tag);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Evict the arc_buf_hdr that is provided as a parameter. The resultant
|
|
|
|
* state of the header is dependent on its state prior to entering this
|
|
|
|
* function. The following transitions are possible:
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
2015-01-13 06:52:19 +03:00
|
|
|
* - arc_mru -> arc_mru_ghost
|
|
|
|
* - arc_mfu -> arc_mfu_ghost
|
|
|
|
* - arc_mru_ghost -> arc_l2c_only
|
|
|
|
* - arc_mru_ghost -> deleted
|
|
|
|
* - arc_mfu_ghost -> arc_l2c_only
|
|
|
|
* - arc_mfu_ghost -> deleted
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
* - arc_uncached -> deleted
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
*
|
|
|
|
* Return total size of evicted data buffers for eviction progress tracking.
|
|
|
|
* When evicting from ghost states return logical buffer size to make eviction
|
|
|
|
* progress at the same (or at least comparable) rate as from non-ghost states.
|
|
|
|
*
|
|
|
|
* Return *real_evicted for actual ARC size reduction to wake up threads
|
|
|
|
* waiting for it. For non-ghost states it includes size of evicted data
|
|
|
|
* buffers (the headers are not freed there). For ghost states it includes
|
|
|
|
* only the evicted headers size.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
static int64_t
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_evict_hdr(arc_buf_hdr_t *hdr, uint64_t *real_evicted)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_state_t *evicted_state, *state;
|
|
|
|
int64_t bytes_evicted = 0;
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
uint_t min_lifetime = HDR_PRESCIENT_PREFETCH(hdr) ?
|
2017-11-16 04:27:01 +03:00
|
|
|
arc_min_prescient_prefetch_ms : arc_min_prefetch_ms;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2022-12-22 23:10:24 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2023-01-09 21:45:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt));
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
*real_evicted = 0;
|
2015-01-13 06:52:19 +03:00
|
|
|
state = hdr->b_l1hdr.b_state;
|
|
|
|
if (GHOST_STATE(state)) {
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* l2arc_write_buffers() relies on a header's L1 portion
|
2016-07-22 18:52:49 +03:00
|
|
|
* (i.e. its b_pabd field) during it's write phase.
|
2015-01-13 06:52:19 +03:00
|
|
|
* Thus, we cannot push a header onto the arc_l2c_only
|
|
|
|
* state (removing its L1 piece) until the header is
|
|
|
|
* done being written to the l2arc.
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
|
|
|
|
ARCSTAT_BUMP(arcstat_evict_l2_skip);
|
|
|
|
return (bytes_evicted);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_deleted);
|
2016-06-02 07:04:53 +03:00
|
|
|
bytes_evicted += HDR_GET_LSIZE(hdr);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_pabd == NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* This buffer is cached on the 2nd Level ARC;
|
|
|
|
* don't destroy the header.
|
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_l2c_only, hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* dropping from L1+L2 cached to L2-only,
|
|
|
|
* realloc to remove the L1 header.
|
|
|
|
*/
|
Cleanup: Address Clang's static analyzer's unused code complaints
These were categorized as the following:
* Dead assignment 23
* Dead increment 4
* Dead initialization 6
* Dead nested assignment 18
Most of these are harmless, but since actual issues can hide among them,
we correct them.
That said, there were a few return values that were being ignored that
appeared to merit some correction:
* `destroy_callback()` in `cmd/zfs/zfs_main.c` ignored the error from
`destroy_batched()`. We handle it by returning -1 if there is an
error.
* `zfs_do_upgrade()` in `cmd/zfs/zfs_main.c` ignored the error from
`zfs_for_each()`. We handle it by doing a binary OR of the error
value from the subsequent `zfs_for_each()` call to the existing
value. This is how errors are mostly handled inside `zfs_for_each()`.
The error value here is passed to exit from the zfs command, so doing
a binary or on it is better than what we did previously.
* `get_zap_prop()` in `module/zfs/zcp_get.c` ignored the error from
`dsl_prop_get_ds()` when the property is not of type string. We
return an error when it does. There is a small concern that the
`zfs_get_temporary_prop()` call would handle things, but in the case
that it does not, we would be pushing an uninitialized numval onto
the lua stack. It is expected that `dsl_prop_get_ds()` will succeed
anytime that `zfs_get_temporary_prop()` does, so that not giving it a
chance to fix things is not a problem.
* `draid_merge_impl()` in `tests/zfs-tests/cmd/draid.c` used
`nvlist_add_nvlist()` twice in ways in which errors are expected to
be impossible, so we switch to `fnvlist_add_nvlist()`.
A few notable ones did not merit use of the return value, so we
suppressed it with `(void)`:
* `write_free_diffs()` in `lib/libzfs/libzfs_diff.c` ignored the error
value from `describe_free()`. A look through the commit history
revealed that this was intentional.
* `arc_evict_hdr()` in `module/zfs/arc.c` did not need to use the
returned handle from `arc_hdr_realloc()` because it is already
referenced in lists.
* `spa_vdev_detach()` in `module/zfs/spa.c` has a comment explicitly
saying not to use the error from `vdev_label_init()` because whatever
causes the error could be the reason why a detach is being done.
Unfortunately, I am not presently able to analyze the kernel modules
with Clang's static analyzer, so I could have missed some cases of this.
In cases where reports were present in code that is duplicated between
Linux and FreeBSD, I made a conscious effort to fix the FreeBSD version
too.
After this commit is merged, regressions like dee8934 should become
extremely obvious with Clang's static analyzer since a regression would
appear in the results as the only instance of unused code. That assumes
that Coverity does not catch the issue first.
My local branch with fixes from all of my outstanding non-draft pull
requests shows 118 reports from Clang's static anlayzer after this
patch. That is down by 51 from 169.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Cedric Berger <cedric@precidata.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13986
2022-10-14 23:37:54 +03:00
|
|
|
(void) arc_hdr_realloc(hdr, hdr_full_cache,
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr_l2only_cache);
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
*real_evicted += HDR_FULL_SIZE - HDR_L2ONLY_SIZE;
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_anon, hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_hdr_destroy(hdr);
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
*real_evicted += HDR_FULL_SIZE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
return (bytes_evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
ASSERT(state == arc_mru || state == arc_mfu || state == arc_uncached);
|
|
|
|
evicted_state = (state == arc_uncached) ? arc_anon :
|
|
|
|
((state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/* prefetch buffers have a minimum lifespan */
|
2022-12-22 23:10:24 +03:00
|
|
|
if ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
|
2018-02-06 03:57:53 +03:00
|
|
|
ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <
|
2022-12-22 23:10:24 +03:00
|
|
|
MSEC_TO_TICK(min_lifetime)) {
|
2015-01-13 06:52:19 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_evict_skip);
|
|
|
|
return (bytes_evicted);
|
Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
if (l2arc_write_eligible(hdr->b_spa, hdr)) {
|
|
|
|
ARCSTAT_INCR(arcstat_evict_l2_eligible,
|
|
|
|
HDR_GET_LSIZE(hdr));
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
|
|
|
|
switch (state->arcs_state) {
|
|
|
|
case ARC_STATE_MRU:
|
|
|
|
ARCSTAT_INCR(
|
|
|
|
arcstat_evict_l2_eligible_mru,
|
|
|
|
HDR_GET_LSIZE(hdr));
|
|
|
|
break;
|
|
|
|
case ARC_STATE_MFU:
|
|
|
|
ARCSTAT_INCR(
|
|
|
|
arcstat_evict_l2_eligible_mfu,
|
|
|
|
HDR_GET_LSIZE(hdr));
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
|
|
|
ARCSTAT_INCR(arcstat_evict_l2_ineligible,
|
|
|
|
HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2023-01-09 21:45:17 +03:00
|
|
|
bytes_evicted += arc_hdr_size(hdr);
|
|
|
|
*real_evicted += arc_hdr_size(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2023-01-09 21:45:17 +03:00
|
|
|
/*
|
|
|
|
* If this hdr is being evicted and has a compressed buffer then we
|
|
|
|
* discard it here before we change states. This ensures that the
|
|
|
|
* accounting is updated correctly in arc_free_data_impl().
|
|
|
|
*/
|
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL)
|
|
|
|
arc_hdr_free_abd(hdr, B_FALSE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2023-01-09 21:45:17 +03:00
|
|
|
if (HDR_HAS_RABD(hdr))
|
|
|
|
arc_hdr_free_abd(hdr, B_TRUE);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2023-01-09 21:45:17 +03:00
|
|
|
arc_change_state(evicted_state, hdr);
|
|
|
|
DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);
|
|
|
|
if (evicted_state == arc_anon) {
|
|
|
|
arc_hdr_destroy(hdr);
|
|
|
|
*real_evicted += HDR_FULL_SIZE;
|
|
|
|
} else {
|
|
|
|
ASSERT(HDR_IN_HASH_TABLE(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
return (bytes_evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
static void
|
|
|
|
arc_set_need_free(void)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&arc_evict_lock));
|
|
|
|
int64_t remaining = arc_free_memory() - arc_sys_free / 2;
|
|
|
|
arc_evict_waiter_t *aw = list_tail(&arc_evict_waiters);
|
|
|
|
if (aw == NULL) {
|
|
|
|
arc_need_free = MAX(-remaining, 0);
|
|
|
|
} else {
|
|
|
|
arc_need_free =
|
|
|
|
MAX(-remaining, (int64_t)(aw->aew_count - arc_evict_count));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
static uint64_t
|
|
|
|
arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
|
2021-07-20 17:13:21 +03:00
|
|
|
uint64_t spa, uint64_t bytes)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_sublist_t *mls;
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
uint64_t bytes_evicted = 0, real_evicted = 0;
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_buf_hdr_t *hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock;
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
uint_t evict_count = zfs_arc_evict_batch_limit;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT3P(marker, !=, NULL);
|
|
|
|
|
|
|
|
mls = multilist_sublist_lock(ml, idx);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2021-07-20 17:13:21 +03:00
|
|
|
for (hdr = multilist_sublist_prev(mls, marker); likely(hdr != NULL);
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr = multilist_sublist_prev(mls, marker)) {
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
if ((evict_count == 0) || (bytes_evicted >= bytes))
|
2015-01-13 06:52:19 +03:00
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* To keep our iteration location, move the marker
|
|
|
|
* forward. Since we're not holding hdr's hash lock, we
|
|
|
|
* must be very careful and not remove 'hdr' from the
|
|
|
|
* sublist. Otherwise, other consumers might mistake the
|
|
|
|
* 'hdr' as not being on a sublist when they call the
|
|
|
|
* multilist_link_active() function (they all rely on
|
|
|
|
* the hash lock protecting concurrent insertions and
|
|
|
|
* removals). multilist_sublist_move_forward() was
|
|
|
|
* specifically implemented to ensure this is the case
|
|
|
|
* (only 'marker' will be removed and re-inserted).
|
|
|
|
*/
|
|
|
|
multilist_sublist_move_forward(mls, marker);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The only case where the b_spa field should ever be
|
|
|
|
* zero, is the marker headers inserted by
|
|
|
|
* arc_evict_state(). It's possible for multiple threads
|
|
|
|
* to be calling arc_evict_state() concurrently (e.g.
|
|
|
|
* dsl_pool_close() and zio_inject_fault()), so we must
|
|
|
|
* skip any markers we see from these other threads.
|
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
if (hdr->b_spa == 0)
|
2010-08-27 01:24:34 +04:00
|
|
|
continue;
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/* we're only interested in evicting buffers of a certain spa */
|
|
|
|
if (spa != 0 && hdr->b_spa != spa) {
|
|
|
|
ARCSTAT_BUMP(arcstat_evict_skip);
|
2010-05-29 00:45:14 +04:00
|
|
|
continue;
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
hash_lock = HDR_LOCK(hdr);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* We aren't calling this function from any code path
|
|
|
|
* that would already be holding a hash lock, so we're
|
|
|
|
* asserting on this assumption to be defensive in case
|
|
|
|
* this ever changes. Without this check, it would be
|
|
|
|
* possible to incorrectly increment arcstat_mutex_miss
|
|
|
|
* below (e.g. if the code changed such that we called
|
|
|
|
* this function with a hash lock held).
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(!MUTEX_HELD(hash_lock));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (mutex_tryenter(hash_lock)) {
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
uint64_t revicted;
|
2022-12-22 23:10:24 +03:00
|
|
|
uint64_t evicted = arc_evict_hdr(hdr, &revicted);
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
bytes_evicted += evicted;
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
real_evicted += revicted;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* If evicted is zero, arc_evict_hdr() must have
|
|
|
|
* decided to skip this header, don't increment
|
|
|
|
* evict_count in this case.
|
2010-08-27 01:24:34 +04:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
if (evicted != 0)
|
2021-07-20 17:13:21 +03:00
|
|
|
evict_count--;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
} else {
|
2015-01-13 06:52:19 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mutex_miss);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_sublist_unlock(mls);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
/*
|
|
|
|
* Increment the count of evicted bytes, and wake up any threads that
|
|
|
|
* are waiting for the count to reach this value. Since the list is
|
|
|
|
* ordered by ascending aew_count, we pop off the beginning of the
|
|
|
|
* list until we reach the end, or a waiter that's past the current
|
|
|
|
* "count". Doing this outside the loop reduces the number of times
|
|
|
|
* we need to acquire the global arc_evict_lock.
|
|
|
|
*
|
|
|
|
* Only wake when there's sufficient free memory in the system
|
|
|
|
* (specifically, arc_sys_free/2, which by default is a bit more than
|
|
|
|
* 1/64th of RAM). See the comments in arc_wait_for_eviction().
|
|
|
|
*/
|
|
|
|
mutex_enter(&arc_evict_lock);
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
arc_evict_count += real_evicted;
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
2021-01-08 07:06:32 +03:00
|
|
|
if (arc_free_memory() > arc_sys_free / 2) {
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
arc_evict_waiter_t *aw;
|
|
|
|
while ((aw = list_head(&arc_evict_waiters)) != NULL &&
|
|
|
|
aw->aew_count <= arc_evict_count) {
|
|
|
|
list_remove(&arc_evict_waiters, aw);
|
|
|
|
cv_broadcast(&aw->aew_cv);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
arc_set_need_free();
|
|
|
|
mutex_exit(&arc_evict_lock);
|
|
|
|
|
ARC shrinking blocks reads/writes
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`. Before commit 3ec34e55271d433e3c, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`. However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.
Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`. Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC. During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).
To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e55271d433e3c.
Note: commit 3ec34e55271d433e3c is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.
Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.
To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes #10496
2020-06-26 20:42:27 +03:00
|
|
|
/*
|
|
|
|
* If the ARC size is reduced from arc_c_max to arc_c_min (especially
|
|
|
|
* if the average cached block is small), eviction can be on-CPU for
|
|
|
|
* many seconds. To ensure that other threads that may be bound to
|
|
|
|
* this CPU are able to make progress, make a voluntary preemption
|
|
|
|
* call here.
|
|
|
|
*/
|
Cleanup: Use OpenSolaris functions to call scheduler
In our codebase, `cond_resched() and `schedule()` are Linux kernel
functions that have replaced the OpenSolaris `kpreempt()` functions in
the codebase to such an extent that `kpreempt()` in zfs_context.h was
broken. Nobody noticed because we did not actually use it. The header
had defined `kpreempt()` as `yield()`, which works on OpenSolaris and
Illumos where `sched_yield()` is a wrapper for `yield()`, but that does
not work on any other platform.
The FreeBSD platform specific code implemented shims for these, but the
shim for `schedule()` forced us to wait, which is different than merely
rescheduling to another thread as the original Linux code does, while
the shim for `cond_resched()` had the same definition as its kernel
kpreempt() shim.
After studying this, I have concluded that we should reintroduce the
kpreempt() function in platform independent code with the following
definitions:
- In the Linux kernel:
kpreempt(unused) -> cond_resched()
- In the FreeBSD kernel:
kpreempt(unused) -> kern_yield(PRI_USER)
- In userspace:
kpreempt(unused) -> sched_yield()
In userspace, nothing changes from this cleanup. In the kernels, the
function `fm_fini()` will now call `kern_yield(PRI_USER)` on FreeBSD and
`cond_resched()` on Linux. This is instead of `pause("schedule", 1)` on
FreeBSD and `schedule()` on Linux. This makes our behavior consistent
across platforms.
Note that Linux's SPL continues to use `cond_resched()` and
`schedule()`. However, those functions have been removed from both the
FreeBSD code and userspace code.
This should have the benefit of making it slightly easier to port the
code to new platforms by making how things should be mapped less
confusing.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13845
2022-09-12 19:55:37 +03:00
|
|
|
kpreempt(KPREEMPT_SYNC);
|
ARC shrinking blocks reads/writes
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`. Before commit 3ec34e55271d433e3c, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`. However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.
Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`. Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC. During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).
To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e55271d433e3c.
Note: commit 3ec34e55271d433e3c is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.
Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.
To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes #10496
2020-06-26 20:42:27 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
return (bytes_evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
/*
|
|
|
|
* Allocate an array of buffer headers used as placeholders during arc state
|
|
|
|
* eviction.
|
|
|
|
*/
|
|
|
|
static arc_buf_hdr_t **
|
|
|
|
arc_state_alloc_markers(int count)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t **markers;
|
|
|
|
|
|
|
|
markers = kmem_zalloc(sizeof (*markers) * count, KM_SLEEP);
|
|
|
|
for (int i = 0; i < count; i++) {
|
|
|
|
markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A b_spa of 0 is used to indicate that this header is
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* a marker. This fact is used in arc_evict_state_impl().
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
*/
|
|
|
|
markers[i]->b_spa = 0;
|
|
|
|
|
|
|
|
}
|
|
|
|
return (markers);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_state_free_markers(arc_buf_hdr_t **markers, int count)
|
|
|
|
{
|
|
|
|
for (int i = 0; i < count; i++)
|
|
|
|
kmem_cache_free(hdr_full_cache, markers[i]);
|
|
|
|
kmem_free(markers, sizeof (*markers) * count);
|
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Evict buffers from the given arc state, until we've removed the
|
|
|
|
* specified number of bytes. Move the removed buffers to the
|
|
|
|
* appropriate evict state.
|
|
|
|
*
|
|
|
|
* This function makes a "best effort". It skips over any buffers
|
|
|
|
* it can't get a hash_lock on, and so, may not catch all candidates.
|
|
|
|
* It may also return without evicting as much space as requested.
|
|
|
|
*
|
|
|
|
* If bytes is specified using the special value ARC_EVICT_ALL, this
|
|
|
|
* will evict all available (i.e. unlocked and evictable) buffers from
|
|
|
|
* the given arc state; which is used by arc_flush().
|
|
|
|
*/
|
|
|
|
static uint64_t
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa,
|
|
|
|
uint64_t bytes)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
uint64_t total_evicted = 0;
|
2021-06-10 19:42:31 +03:00
|
|
|
multilist_t *ml = &state->arcs_list[type];
|
2015-01-13 06:52:19 +03:00
|
|
|
int num_sublists;
|
|
|
|
arc_buf_hdr_t **markers;
|
|
|
|
|
|
|
|
num_sublists = multilist_get_num_sublists(ml);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* If we've tried to evict from each sublist, made some
|
|
|
|
* progress, but still have not hit the target number of bytes
|
|
|
|
* to evict, we want to keep trying. The markers allow us to
|
|
|
|
* pick up where we left off for each individual sublist, rather
|
|
|
|
* than starting from the tail each time.
|
2009-02-18 23:51:31 +03:00
|
|
|
*/
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
if (zthr_iscurthread(arc_evict_zthr)) {
|
|
|
|
markers = arc_state_evict_markers;
|
|
|
|
ASSERT3S(num_sublists, <=, arc_state_evict_marker_count);
|
|
|
|
} else {
|
|
|
|
markers = arc_state_alloc_markers(num_sublists);
|
|
|
|
}
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < num_sublists; i++) {
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_sublist_t *mls;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mls = multilist_sublist_lock(ml, i);
|
|
|
|
multilist_sublist_insert_tail(mls, markers[i]);
|
|
|
|
multilist_sublist_unlock(mls);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* While we haven't hit our target number of bytes to evict, or
|
|
|
|
* we're evicting all available buffers.
|
2009-02-18 23:51:31 +03:00
|
|
|
*/
|
2021-07-20 17:13:21 +03:00
|
|
|
while (total_evicted < bytes) {
|
2016-07-13 15:42:40 +03:00
|
|
|
int sublist_idx = multilist_get_random_index(ml);
|
|
|
|
uint64_t scan_evicted = 0;
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Start eviction using a randomly selected sublist,
|
|
|
|
* this is to try and evenly balance eviction across all
|
|
|
|
* sublists. Always starting at the same sublist
|
|
|
|
* (e.g. index 0) would cause evictions to favor certain
|
|
|
|
* sublists over others.
|
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < num_sublists; i++) {
|
2015-01-13 06:52:19 +03:00
|
|
|
uint64_t bytes_remaining;
|
|
|
|
uint64_t bytes_evicted;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2021-07-20 17:13:21 +03:00
|
|
|
if (total_evicted < bytes)
|
2015-01-13 06:52:19 +03:00
|
|
|
bytes_remaining = bytes - total_evicted;
|
|
|
|
else
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
bytes_evicted = arc_evict_state_impl(ml, sublist_idx,
|
|
|
|
markers[sublist_idx], spa, bytes_remaining);
|
|
|
|
|
|
|
|
scan_evicted += bytes_evicted;
|
|
|
|
total_evicted += bytes_evicted;
|
|
|
|
|
|
|
|
/* we've reached the end, wrap to the beginning */
|
|
|
|
if (++sublist_idx >= num_sublists)
|
|
|
|
sublist_idx = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we didn't evict anything during this scan, we have
|
|
|
|
* no reason to believe we'll evict more during another
|
|
|
|
* scan, so break the loop.
|
|
|
|
*/
|
|
|
|
if (scan_evicted == 0) {
|
|
|
|
/* This isn't possible, let's make that obvious */
|
|
|
|
ASSERT3S(bytes, !=, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* When bytes is ARC_EVICT_ALL, the only way to
|
|
|
|
* break the loop is when scan_evicted is zero.
|
|
|
|
* In that case, we actually have evicted enough,
|
|
|
|
* so we don't want to increment the kstat.
|
|
|
|
*/
|
|
|
|
if (bytes != ARC_EVICT_ALL) {
|
|
|
|
ASSERT3S(total_evicted, <, bytes);
|
|
|
|
ARCSTAT_BUMP(arcstat_evict_not_enough);
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
break;
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < num_sublists; i++) {
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
|
|
|
|
multilist_sublist_remove(mls, markers[i]);
|
|
|
|
multilist_sublist_unlock(mls);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
if (markers != arc_state_evict_markers)
|
|
|
|
arc_state_free_markers(markers, num_sublists);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
return (total_evicted);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flush all "evictable" data of the given type from the arc state
|
|
|
|
* specified. This will not evict any "active" buffers (i.e. referenced).
|
|
|
|
*
|
2016-06-02 07:04:53 +03:00
|
|
|
* When 'retry' is set to B_FALSE, the function will make a single pass
|
2015-01-13 06:52:19 +03:00
|
|
|
* over the state and evict any buffers that it can. Since it doesn't
|
|
|
|
* continually retry the eviction, it might end up leaving some buffers
|
|
|
|
* in the ARC due to lock misses.
|
|
|
|
*
|
2016-06-02 07:04:53 +03:00
|
|
|
* When 'retry' is set to B_TRUE, the function will continually retry the
|
2015-01-13 06:52:19 +03:00
|
|
|
* eviction until *all* evictable buffers have been removed from the
|
|
|
|
* state. As a result, if concurrent insertions into the state are
|
|
|
|
* allowed (e.g. if the ARC isn't shutting down), this function might
|
|
|
|
* wind up in an infinite loop, continually trying to evict buffers.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,
|
|
|
|
boolean_t retry)
|
|
|
|
{
|
|
|
|
uint64_t evicted = 0;
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
while (zfs_refcount_count(&state->arcs_esize[type]) != 0) {
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
evicted += arc_evict_state(state, type, spa, ARC_EVICT_ALL);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
if (!retry)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* Evict the specified number of bytes from the state specified. This
|
|
|
|
* function prevents us from trying to evict more from a state's list
|
|
|
|
* than is "evictable", and to skip evicting altogether when passed a
|
2015-01-13 06:52:19 +03:00
|
|
|
* negative value for "bytes". In contrast, arc_evict_state() will
|
|
|
|
* evict everything it can, when passed a negative value for "bytes".
|
|
|
|
*/
|
|
|
|
static uint64_t
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_evict_impl(arc_state_t *state, arc_buf_contents_t type, int64_t bytes)
|
2015-01-13 06:52:19 +03:00
|
|
|
{
|
2021-07-20 17:13:21 +03:00
|
|
|
uint64_t delta;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
if (bytes > 0 && zfs_refcount_count(&state->arcs_esize[type]) > 0) {
|
|
|
|
delta = MIN(zfs_refcount_count(&state->arcs_esize[type]),
|
|
|
|
bytes);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
return (arc_evict_state(state, type, 0, delta));
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* Adjust specified fraction, taking into account initial ghost state(s) size,
|
|
|
|
* ghost hit bytes towards increasing the fraction, ghost hit bytes towards
|
|
|
|
* decreasing it, plus a balance factor, controlling the decrease rate, used
|
|
|
|
* to balance metadata vs data.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
|
|
|
static uint64_t
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_evict_adj(uint64_t frac, uint64_t total, uint64_t up, uint64_t down,
|
|
|
|
uint_t balance)
|
2015-01-13 06:52:19 +03:00
|
|
|
{
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
if (total < 8 || up + down == 0)
|
|
|
|
return (frac);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* We should not have more ghost hits than ghost size, but they
|
|
|
|
* may get close. Restrict maximum adjustment in that case.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
if (up + down >= total / 4) {
|
|
|
|
uint64_t scale = (up + down) / (total / 8);
|
|
|
|
up /= scale;
|
|
|
|
down /= scale;
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
/* Get maximal dynamic range by choosing optimal shifts. */
|
|
|
|
int s = highbit64(total);
|
|
|
|
s = MIN(64 - s, 32);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
uint64_t ofrac = (1ULL << 32) - frac;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
if (frac >= 4 * ofrac)
|
|
|
|
up /= frac / (2 * ofrac + 1);
|
|
|
|
up = (up << s) / (total >> (32 - s));
|
|
|
|
if (ofrac >= 4 * frac)
|
|
|
|
down /= ofrac / (2 * frac + 1);
|
|
|
|
down = (down << s) / (total >> (32 - s));
|
|
|
|
down = down * 100 / balance;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
return (frac + up - down);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2021-06-17 03:19:34 +03:00
|
|
|
* Evict buffers from the cache, such that arcstat_size is capped by arc_c.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
|
|
|
static uint64_t
|
2020-07-22 19:51:47 +03:00
|
|
|
arc_evict(void)
|
2015-01-13 06:52:19 +03:00
|
|
|
{
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
uint64_t asize, bytes, total_evicted = 0;
|
|
|
|
int64_t e, mrud, mrum, mfud, mfum, w;
|
|
|
|
static uint64_t ogrd, ogrm, ogfd, ogfm;
|
|
|
|
static uint64_t gsrd, gsrm, gsfd, gsfm;
|
|
|
|
uint64_t ngrd, ngrm, ngfd, ngfm;
|
|
|
|
|
|
|
|
/* Get current size of ARC states we can evict from. */
|
|
|
|
mrud = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_DATA]) +
|
|
|
|
zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
mrum = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA]) +
|
|
|
|
zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
mfud = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
mfum = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
uint64_t d = mrud + mfud;
|
|
|
|
uint64_t m = mrum + mfum;
|
|
|
|
uint64_t t = d + m;
|
|
|
|
|
|
|
|
/* Get ARC ghost hits since last eviction. */
|
|
|
|
ngrd = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA]);
|
|
|
|
uint64_t grd = ngrd - ogrd;
|
|
|
|
ogrd = ngrd;
|
|
|
|
ngrm = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]);
|
|
|
|
uint64_t grm = ngrm - ogrm;
|
|
|
|
ogrm = ngrm;
|
|
|
|
ngfd = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]);
|
|
|
|
uint64_t gfd = ngfd - ogfd;
|
|
|
|
ogfd = ngfd;
|
|
|
|
ngfm = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]);
|
|
|
|
uint64_t gfm = ngfm - ogfm;
|
|
|
|
ogfm = ngfm;
|
|
|
|
|
|
|
|
/* Adjust ARC states balance based on ghost hits. */
|
|
|
|
arc_meta = arc_evict_adj(arc_meta, gsrd + gsrm + gsfd + gsfm,
|
|
|
|
grm + gfm, grd + gfd, zfs_arc_meta_balance);
|
|
|
|
arc_pd = arc_evict_adj(arc_pd, gsrd + gsfd, grd, gfd, 100);
|
|
|
|
arc_pm = arc_evict_adj(arc_pm, gsrm + gsfm, grm, gfm, 100);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
asize = aggsum_value(&arc_sums.arcstat_size);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
int64_t wt = t - (asize - arc_c);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to reduce pinned dnodes if more than 3/4 of wanted metadata
|
|
|
|
* target is not evictable or if they go over arc_dnode_limit.
|
|
|
|
*/
|
|
|
|
int64_t prune = 0;
|
|
|
|
int64_t dn = wmsum_value(&arc_sums.arcstat_dnode_size);
|
2023-04-05 20:42:22 +03:00
|
|
|
w = wt * (int64_t)(arc_meta >> 16) >> 16;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
if (zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA]) +
|
|
|
|
zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA]) -
|
|
|
|
zfs_refcount_count(&arc_mru->arcs_esize[ARC_BUFC_METADATA]) -
|
|
|
|
zfs_refcount_count(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]) >
|
|
|
|
w * 3 / 4) {
|
|
|
|
prune = dn / sizeof (dnode_t) *
|
|
|
|
zfs_arc_dnode_reduce_percent / 100;
|
|
|
|
} else if (dn > arc_dnode_limit) {
|
|
|
|
prune = (dn - arc_dnode_limit) / sizeof (dnode_t) *
|
|
|
|
zfs_arc_dnode_reduce_percent / 100;
|
|
|
|
}
|
|
|
|
if (prune > 0)
|
|
|
|
arc_prune_async(prune);
|
|
|
|
|
|
|
|
/* Evict MRU metadata. */
|
2023-04-05 20:42:22 +03:00
|
|
|
w = wt * (int64_t)(arc_meta * arc_pm >> 48) >> 16;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
e = MIN((int64_t)(asize - arc_c), (int64_t)(mrum - w));
|
|
|
|
bytes = arc_evict_impl(arc_mru, ARC_BUFC_METADATA, e);
|
2015-01-13 06:52:19 +03:00
|
|
|
total_evicted += bytes;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
mrum -= bytes;
|
|
|
|
asize -= bytes;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
/* Evict MFU metadata. */
|
2023-04-05 20:42:22 +03:00
|
|
|
w = wt * (int64_t)(arc_meta >> 16) >> 16;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
e = MIN((int64_t)(asize - arc_c), (int64_t)(m - w));
|
|
|
|
bytes = arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e);
|
|
|
|
total_evicted += bytes;
|
|
|
|
mfum -= bytes;
|
|
|
|
asize -= bytes;
|
|
|
|
|
|
|
|
/* Evict MRU data. */
|
|
|
|
wt -= m - total_evicted;
|
2023-04-05 20:42:22 +03:00
|
|
|
w = wt * (int64_t)(arc_pd >> 16) >> 16;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
e = MIN((int64_t)(asize - arc_c), (int64_t)(mrud - w));
|
|
|
|
bytes = arc_evict_impl(arc_mru, ARC_BUFC_DATA, e);
|
|
|
|
total_evicted += bytes;
|
|
|
|
mrud -= bytes;
|
|
|
|
asize -= bytes;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
/* Evict MFU data. */
|
|
|
|
e = asize - arc_c;
|
|
|
|
bytes = arc_evict_impl(arc_mfu, ARC_BUFC_DATA, e);
|
|
|
|
mfud -= bytes;
|
|
|
|
total_evicted += bytes;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* Evict ghost lists
|
2015-01-13 06:52:19 +03:00
|
|
|
*
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* Size of each state's ghost list represents how much that state
|
|
|
|
* may grow by shrinking the other states. Would it need to shrink
|
|
|
|
* other states to zero (that is unlikely), its ghost size would be
|
|
|
|
* equal to sum of other three state sizes. But excessive ghost
|
|
|
|
* size may result in false ghost hits (too far back), that may
|
|
|
|
* never result in real cache hits if several states are competing.
|
|
|
|
* So choose some arbitraty point of 1/2 of other state sizes.
|
|
|
|
*/
|
|
|
|
gsrd = (mrum + mfud + mfum) / 2;
|
|
|
|
e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]) -
|
|
|
|
gsrd;
|
|
|
|
(void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_DATA, e);
|
|
|
|
|
|
|
|
gsrm = (mrud + mfud + mfum) / 2;
|
|
|
|
e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]) -
|
|
|
|
gsrm;
|
|
|
|
(void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_METADATA, e);
|
|
|
|
|
|
|
|
gsfd = (mrud + mrum + mfum) / 2;
|
|
|
|
e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]) -
|
|
|
|
gsfd;
|
|
|
|
(void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_DATA, e);
|
|
|
|
|
|
|
|
gsfm = (mrud + mrum + mfud) / 2;
|
|
|
|
e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]) -
|
|
|
|
gsfm;
|
|
|
|
(void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_METADATA, e);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
return (total_evicted);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_flush(spa_t *spa, boolean_t retry)
|
2011-12-23 00:20:43 +04:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
uint64_t guid = 0;
|
2014-01-03 23:40:52 +04:00
|
|
|
|
2015-03-18 01:08:22 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* If retry is B_TRUE, a spa must not be specified since we have
|
2015-01-13 06:52:19 +03:00
|
|
|
* no good way to determine if all of a spa's buffers have been
|
|
|
|
* evicted from an arc state.
|
2015-03-18 01:08:22 +03:00
|
|
|
*/
|
2023-01-11 00:58:21 +03:00
|
|
|
ASSERT(!retry || spa == NULL);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (spa != NULL)
|
2011-11-12 02:07:54 +04:00
|
|
|
guid = spa_load_guid(spa);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
(void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
|
|
|
|
|
|
|
|
(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
|
|
|
|
|
|
|
|
(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
|
|
|
|
(void) arc_flush_state(arc_uncached, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_uncached, guid, ARC_BUFC_METADATA, retry);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-10-18 20:23:19 +03:00
|
|
|
void
|
2017-03-16 02:41:52 +03:00
|
|
|
arc_reduce_target_size(int64_t to_free)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
uint64_t c = arc_c;
|
|
|
|
|
|
|
|
if (c <= arc_c_min)
|
|
|
|
return;
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* All callers want the ARC to actually evict (at least) this much
|
|
|
|
* memory. Therefore we reduce from the lower of the current size and
|
|
|
|
* the target size. This way, even if arc_c is much higher than
|
|
|
|
* arc_size (as can be the case after many calls to arc_freed(), we will
|
|
|
|
* immediately have arc_c < arc_size and therefore the arc_evict_zthr
|
|
|
|
* will evict.
|
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
uint64_t asize = aggsum_value(&arc_sums.arcstat_size);
|
|
|
|
if (asize < c)
|
|
|
|
to_free += c - asize;
|
|
|
|
arc_c = MAX((int64_t)c - to_free, (int64_t)arc_c_min);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
/* See comment in arc_evict_cb_check() on why lock+flag */
|
|
|
|
mutex_enter(&arc_evict_lock);
|
|
|
|
arc_evict_needed = B_TRUE;
|
|
|
|
mutex_exit(&arc_evict_lock);
|
|
|
|
zthr_wakeup(arc_evict_zthr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-06-26 21:28:18 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine if the system is under memory pressure and is asking
|
2016-06-02 07:04:53 +03:00
|
|
|
* to reclaim memory. A return value of B_TRUE indicates that the system
|
2015-06-26 21:28:18 +03:00
|
|
|
* is under memory pressure and that the arc should adjust accordingly.
|
|
|
|
*/
|
2019-10-18 20:23:19 +03:00
|
|
|
boolean_t
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_reclaim_needed(void)
|
|
|
|
{
|
|
|
|
return (arc_available_memory() < 0);
|
|
|
|
}
|
|
|
|
|
2019-10-18 20:23:19 +03:00
|
|
|
void
|
2017-03-16 02:41:52 +03:00
|
|
|
arc_kmem_reap_soon(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
size_t i;
|
|
|
|
kmem_cache_t *prev_cache = NULL;
|
|
|
|
kmem_cache_t *prev_data_cache = NULL;
|
|
|
|
|
2017-10-11 01:19:19 +03:00
|
|
|
#ifdef _KERNEL
|
|
|
|
#if defined(_ILP32)
|
|
|
|
/*
|
|
|
|
* Reclaim unused memory from all kmem caches.
|
|
|
|
*/
|
|
|
|
kmem_reap();
|
|
|
|
#endif
|
|
|
|
#endif
|
2015-05-30 17:57:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
|
2017-10-11 01:19:19 +03:00
|
|
|
#if defined(_ILP32)
|
2015-10-31 00:34:22 +03:00
|
|
|
/* reach upper limit of cache size on 32-bit */
|
|
|
|
if (zio_buf_cache[i] == NULL)
|
|
|
|
break;
|
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
if (zio_buf_cache[i] != prev_cache) {
|
|
|
|
prev_cache = zio_buf_cache[i];
|
|
|
|
kmem_cache_reap_now(zio_buf_cache[i]);
|
|
|
|
}
|
|
|
|
if (zio_data_buf_cache[i] != prev_data_cache) {
|
|
|
|
prev_data_cache = zio_data_buf_cache[i];
|
|
|
|
kmem_cache_reap_now(zio_data_buf_cache[i]);
|
|
|
|
}
|
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
kmem_cache_reap_now(buf_cache);
|
2014-12-30 06:12:23 +03:00
|
|
|
kmem_cache_reap_now(hdr_full_cache);
|
|
|
|
kmem_cache_reap_now(hdr_l2only_cache);
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
kmem_cache_reap_now(zfs_btree_leaf_cache);
|
2020-06-18 07:44:13 +03:00
|
|
|
abd_cache_reap_now();
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
static boolean_t
|
2020-07-22 19:51:47 +03:00
|
|
|
arc_evict_cb_check(void *arg, zthr_t *zthr)
|
2017-03-16 02:41:52 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) arg, (void) zthr;
|
|
|
|
|
2020-12-17 22:16:42 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* This is necessary in order to keep the kstat information
|
|
|
|
* up to date for tools that display kstat data such as the
|
|
|
|
* mdb ::arc dcmd and the Linux crash utility. These tools
|
|
|
|
* typically do not call kstat's update function, but simply
|
|
|
|
* dump out stats from the most recent update. Without
|
|
|
|
* this call, these commands may show stale stats for the
|
2020-12-17 22:16:42 +03:00
|
|
|
* anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even
|
|
|
|
* with this call, the data might be out of date if the
|
|
|
|
* evict thread hasn't been woken recently; but that should
|
|
|
|
* suffice. The arc_state_t structures can be queried
|
|
|
|
* directly if more accurate information is needed.
|
2017-03-16 02:41:52 +03:00
|
|
|
*/
|
|
|
|
if (arc_ksp != NULL)
|
|
|
|
arc_ksp->ks_update(arc_ksp, KSTAT_READ);
|
2020-12-17 22:16:42 +03:00
|
|
|
#endif
|
2017-03-16 02:41:52 +03:00
|
|
|
|
|
|
|
/*
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
* We have to rely on arc_wait_for_eviction() to tell us when to
|
|
|
|
* evict, rather than checking if we are overflowing here, so that we
|
|
|
|
* are sure to not leave arc_wait_for_eviction() waiting on aew_cv.
|
|
|
|
* If we have become "not overflowing" since arc_wait_for_eviction()
|
|
|
|
* checked, we need to wake it up. We could broadcast the CV here,
|
|
|
|
* but arc_wait_for_eviction() may have not yet gone to sleep. We
|
|
|
|
* would need to use a mutex to ensure that this function doesn't
|
|
|
|
* broadcast until arc_wait_for_eviction() has gone to sleep (e.g.
|
|
|
|
* the arc_evict_lock). However, the lock ordering of such a lock
|
|
|
|
* would necessarily be incorrect with respect to the zthr_lock,
|
|
|
|
* which is held before this function is called, and is held by
|
|
|
|
* arc_wait_for_eviction() when it calls zthr_wakeup().
|
2017-03-16 02:41:52 +03:00
|
|
|
*/
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if (arc_evict_needed)
|
|
|
|
return (B_TRUE);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have buffers in uncached state, evict them periodically.
|
|
|
|
*/
|
|
|
|
return ((zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_DATA]) +
|
|
|
|
zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]) &&
|
|
|
|
ddi_get_lbolt() - arc_last_uncached_flush >
|
|
|
|
MSEC_TO_TICK(arc_min_prefetch_ms / 2)));
|
2017-03-16 02:41:52 +03:00
|
|
|
}
|
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/*
|
2020-07-22 19:51:47 +03:00
|
|
|
* Keep arc_size under arc_c by running arc_evict which evicts data
|
2017-03-16 02:41:52 +03:00
|
|
|
* from the ARC.
|
2012-03-14 01:29:16 +04:00
|
|
|
*/
|
2019-01-13 21:09:46 +03:00
|
|
|
static void
|
2020-07-22 19:51:47 +03:00
|
|
|
arc_evict_cb(void *arg, zthr_t *zthr)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) arg, (void) zthr;
|
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
uint64_t evicted = 0;
|
|
|
|
fstrans_cookie_t cookie = spl_fstrans_mark();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
/* Always try to evict from uncached state. */
|
|
|
|
arc_last_uncached_flush = ddi_get_lbolt();
|
|
|
|
evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_DATA, B_FALSE);
|
|
|
|
evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_METADATA, B_FALSE);
|
|
|
|
|
|
|
|
/* Evict from other states only if told to. */
|
|
|
|
if (arc_evict_needed)
|
|
|
|
evicted += arc_evict();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* If evicted is zero, we couldn't evict anything
|
2020-07-22 19:51:47 +03:00
|
|
|
* via arc_evict(). This could be due to hash lock
|
2017-03-16 02:41:52 +03:00
|
|
|
* collisions, but more likely due to the majority of
|
|
|
|
* arc buffers being unevictable. Therefore, even if
|
|
|
|
* arc_size is above arc_c, another pass is unlikely to
|
|
|
|
* be helpful and could potentially cause us to enter an
|
|
|
|
* infinite loop. Additionally, zthr_iscancelled() is
|
|
|
|
* checked here so that if the arc is shutting down, the
|
2020-07-22 19:51:47 +03:00
|
|
|
* broadcast will wake any remaining arc evict waiters.
|
2017-03-16 02:41:52 +03:00
|
|
|
*/
|
2020-07-22 19:51:47 +03:00
|
|
|
mutex_enter(&arc_evict_lock);
|
|
|
|
arc_evict_needed = !zthr_iscancelled(arc_evict_zthr) &&
|
2021-06-17 03:19:34 +03:00
|
|
|
evicted > 0 && aggsum_compare(&arc_sums.arcstat_size, arc_c) > 0;
|
2020-07-22 19:51:47 +03:00
|
|
|
if (!arc_evict_needed) {
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
2017-03-16 02:41:52 +03:00
|
|
|
* We're either no longer overflowing, or we
|
|
|
|
* can't evict anything more, so we should wake
|
|
|
|
* arc_get_data_impl() sooner.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
arc_evict_waiter_t *aw;
|
|
|
|
while ((aw = list_remove_head(&arc_evict_waiters)) != NULL) {
|
|
|
|
cv_broadcast(&aw->aew_cv);
|
|
|
|
}
|
|
|
|
arc_set_need_free();
|
2017-03-16 02:41:52 +03:00
|
|
|
}
|
2020-07-22 19:51:47 +03:00
|
|
|
mutex_exit(&arc_evict_lock);
|
2017-03-16 02:41:52 +03:00
|
|
|
spl_fstrans_unmark(cookie);
|
|
|
|
}
|
|
|
|
|
|
|
|
static boolean_t
|
|
|
|
arc_reap_cb_check(void *arg, zthr_t *zthr)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) arg, (void) zthr;
|
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
int64_t free_memory = arc_available_memory();
|
2020-09-30 23:22:34 +03:00
|
|
|
static int reap_cb_check_counter = 0;
|
2017-03-16 02:41:52 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If a kmem reap is already active, don't schedule more. We must
|
|
|
|
* check for this because kmem_cache_reap_soon() won't actually
|
|
|
|
* block on the cache being reaped (this is to prevent callers from
|
|
|
|
* becoming implicitly blocked by a system-wide kmem reap -- which,
|
|
|
|
* on a system with many, many full magazines, can take minutes).
|
|
|
|
*/
|
|
|
|
if (!kmem_cache_reap_active() && free_memory < 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
arc_no_grow = B_TRUE;
|
|
|
|
arc_warm = B_TRUE;
|
2017-02-04 20:21:25 +03:00
|
|
|
/*
|
2017-03-16 02:41:52 +03:00
|
|
|
* Wait at least zfs_grow_retry (default 5) seconds
|
|
|
|
* before considering growing.
|
2017-02-04 20:21:25 +03:00
|
|
|
*/
|
2017-03-16 02:41:52 +03:00
|
|
|
arc_growtime = gethrtime() + SEC2NSEC(arc_grow_retry);
|
|
|
|
return (B_TRUE);
|
|
|
|
} else if (free_memory < arc_c >> arc_no_grow_shift) {
|
|
|
|
arc_no_grow = B_TRUE;
|
|
|
|
} else if (gethrtime() >= arc_growtime) {
|
|
|
|
arc_no_grow = B_FALSE;
|
|
|
|
}
|
2017-02-04 20:21:25 +03:00
|
|
|
|
2020-09-30 23:22:34 +03:00
|
|
|
/*
|
|
|
|
* Called unconditionally every 60 seconds to reclaim unused
|
|
|
|
* zstd compression and decompression context. This is done
|
|
|
|
* here to avoid the need for an independent thread.
|
|
|
|
*/
|
|
|
|
if (!((reap_cb_check_counter++) % 60))
|
|
|
|
zfs_zstd_cache_reap_now();
|
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
return (B_FALSE);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* Keep enough free memory in the system by reaping the ARC's kmem
|
|
|
|
* caches. To cause more slabs to be reapable, we may reduce the
|
2020-07-22 19:51:47 +03:00
|
|
|
* target size of the cache (arc_c), causing the arc_evict_cb()
|
2017-03-16 02:41:52 +03:00
|
|
|
* to free more buffers.
|
|
|
|
*/
|
2019-01-13 21:09:46 +03:00
|
|
|
static void
|
2017-03-16 02:41:52 +03:00
|
|
|
arc_reap_cb(void *arg, zthr_t *zthr)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) arg, (void) zthr;
|
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
int64_t free_memory;
|
|
|
|
fstrans_cookie_t cookie = spl_fstrans_mark();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* Kick off asynchronous kmem_reap()'s of all our caches.
|
|
|
|
*/
|
|
|
|
arc_kmem_reap_soon();
|
2011-03-31 05:59:17 +04:00
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* Wait at least arc_kmem_cache_reap_retry_ms between
|
|
|
|
* arc_kmem_reap_soon() calls. Without this check it is possible to
|
|
|
|
* end up in a situation where we spend lots of time reaping
|
|
|
|
* caches, while we're near arc_c_min. Waiting here also gives the
|
|
|
|
* subsequent free memory check a chance of finding that the
|
|
|
|
* asynchronous reap has already freed enough memory, and we don't
|
|
|
|
* need to call arc_reduce_target_size().
|
|
|
|
*/
|
|
|
|
delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* Reduce the target size as needed to maintain the amount of free
|
|
|
|
* memory in the system at a fraction of the arc_size (1/128th by
|
|
|
|
* default). If oversubscribed (free_memory < 0) then reduce the
|
|
|
|
* target arc_size by the deficit amount plus the fractional
|
2021-04-03 04:38:53 +03:00
|
|
|
* amount. If free memory is positive but less than the fractional
|
2017-03-16 02:41:52 +03:00
|
|
|
* amount, reduce by what is needed to hit the fractional amount.
|
|
|
|
*/
|
|
|
|
free_memory = arc_available_memory();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-09-02 23:21:18 +03:00
|
|
|
int64_t can_free = arc_c - arc_c_min;
|
|
|
|
if (can_free > 0) {
|
|
|
|
int64_t to_free = (can_free >> arc_shrink_shift) - free_memory;
|
|
|
|
if (to_free > 0)
|
|
|
|
arc_reduce_target_size(to_free);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
spl_fstrans_unmark(cookie);
|
|
|
|
}
|
|
|
|
|
2011-03-30 05:08:59 +04:00
|
|
|
#ifdef _KERNEL
|
|
|
|
/*
|
2012-03-14 01:29:16 +04:00
|
|
|
* Determine the amount of memory eligible for eviction contained in the
|
|
|
|
* ARC. All clean data reported by the ghost lists can always be safely
|
|
|
|
* evicted. Due to arc_c_min, the same does not hold for all clean data
|
|
|
|
* contained by the regular mru and mfu lists.
|
|
|
|
*
|
|
|
|
* In the case of the regular mru and mfu lists, we need to report as
|
|
|
|
* much clean data as possible, such that evicting that same reported
|
|
|
|
* data will not bring arc_size below arc_c_min. Thus, in certain
|
|
|
|
* circumstances, the total amount of clean data in the mru and mfu
|
|
|
|
* lists might not actually be evictable.
|
|
|
|
*
|
|
|
|
* The following two distinct cases are accounted for:
|
|
|
|
*
|
|
|
|
* 1. The sum of the amount of dirty data contained by both the mru and
|
|
|
|
* mfu lists, plus the ARC's other accounting (e.g. the anon list),
|
|
|
|
* is greater than or equal to arc_c_min.
|
|
|
|
* (i.e. amount of dirty data >= arc_c_min)
|
|
|
|
*
|
|
|
|
* This is the easy case; all clean data contained by the mru and mfu
|
|
|
|
* lists is evictable. Evicting all clean data can only drop arc_size
|
|
|
|
* to the amount of dirty data, which is greater than arc_c_min.
|
|
|
|
*
|
|
|
|
* 2. The sum of the amount of dirty data contained by both the mru and
|
|
|
|
* mfu lists, plus the ARC's other accounting (e.g. the anon list),
|
|
|
|
* is less than arc_c_min.
|
|
|
|
* (i.e. arc_c_min > amount of dirty data)
|
|
|
|
*
|
|
|
|
* 2.1. arc_size is greater than or equal arc_c_min.
|
|
|
|
* (i.e. arc_size >= arc_c_min > amount of dirty data)
|
|
|
|
*
|
|
|
|
* In this case, not all clean data from the regular mru and mfu
|
|
|
|
* lists is actually evictable; we must leave enough clean data
|
|
|
|
* to keep arc_size above arc_c_min. Thus, the maximum amount of
|
|
|
|
* evictable data from the two lists combined, is exactly the
|
|
|
|
* difference between arc_size and arc_c_min.
|
|
|
|
*
|
|
|
|
* 2.2. arc_size is less than arc_c_min
|
|
|
|
* (i.e. arc_c_min > arc_size > amount of dirty data)
|
|
|
|
*
|
|
|
|
* In this case, none of the data contained in the mru and mfu
|
|
|
|
* lists is evictable, even if it's clean. Since arc_size is
|
|
|
|
* already below arc_c_min, evicting any more would only
|
|
|
|
* increase this negative difference.
|
2011-03-30 05:08:59 +04:00
|
|
|
*/
|
|
|
|
|
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Adapt arc info given the number of bytes we are trying to add and
|
2017-01-03 20:31:18 +03:00
|
|
|
* the state that we are coming from. This function is only called
|
2008-11-20 23:01:55 +03:00
|
|
|
* when we are adding new content to the cache.
|
|
|
|
*/
|
|
|
|
static void
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_adapt(uint64_t bytes)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-03-16 02:41:52 +03:00
|
|
|
/*
|
|
|
|
* Wake reap thread if we do not have any available memory
|
|
|
|
*/
|
2015-06-26 21:28:18 +03:00
|
|
|
if (arc_reclaim_needed()) {
|
2017-03-16 02:41:52 +03:00
|
|
|
zthr_wakeup(arc_reap_zthr);
|
2015-06-26 21:28:18 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (arc_no_grow)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (arc_c >= arc_c_max)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're within (2 * maxblocksize) bytes of the target
|
|
|
|
* cache size, increment the target cache size
|
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
if (aggsum_upper_bound(&arc_sums.arcstat_size) +
|
|
|
|
2 * SPA_MAXBLOCKSIZE >= arc_c) {
|
|
|
|
uint64_t dc = MAX(bytes, SPA_OLD_MAXBLOCKSIZE);
|
|
|
|
if (atomic_add_64_nv(&arc_c, dc) > arc_c_max)
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_c = arc_c_max;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Check if arc_size has grown past our upper threshold, determined by
|
|
|
|
* zfs_arc_overflow_shift.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
static arc_ovf_level_t
|
2021-08-17 19:15:54 +03:00
|
|
|
arc_is_overflowing(boolean_t use_reserve)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
/* Always allow at least one block of overflow */
|
2019-06-10 19:52:25 +03:00
|
|
|
int64_t overflow = MAX(SPA_MAXBLOCKSIZE,
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_c >> zfs_arc_overflow_shift);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-05-25 21:32:40 +03:00
|
|
|
/*
|
|
|
|
* We just compare the lower bound here for performance reasons. Our
|
|
|
|
* primary goals are to make sure that the arc never grows without
|
|
|
|
* bound, and that it can reach its maximum size. This check
|
|
|
|
* accomplishes both goals. The maximum amount we could run over by is
|
|
|
|
* 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block
|
|
|
|
* in the ARC. In practice, that's in the tens of MB, which is low
|
|
|
|
* enough to be safe.
|
|
|
|
*/
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
int64_t over = aggsum_lower_bound(&arc_sums.arcstat_size) -
|
|
|
|
arc_c - overflow / 2;
|
2021-08-17 19:15:54 +03:00
|
|
|
if (!use_reserve)
|
|
|
|
overflow /= 2;
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
return (over < 0 ? ARC_OVF_NONE :
|
|
|
|
over < overflow ? ARC_OVF_SOME : ARC_OVF_SEVERE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
static abd_t *
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,
|
2021-08-17 19:15:54 +03:00
|
|
|
int alloc_flags)
|
2016-07-22 18:52:49 +03:00
|
|
|
{
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
2021-08-17 19:15:54 +03:00
|
|
|
arc_get_data_impl(hdr, size, tag, alloc_flags);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if (alloc_flags & ARC_HDR_ALLOC_LINEAR)
|
|
|
|
return (abd_alloc_linear(size, type == ARC_BUFC_METADATA));
|
|
|
|
else
|
|
|
|
return (abd_alloc(size, type == ARC_BUFC_METADATA));
|
2016-07-22 18:52:49 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void *
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)
|
2016-07-22 18:52:49 +03:00
|
|
|
{
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_get_data_impl(hdr, size, tag, 0);
|
2016-07-22 18:52:49 +03:00
|
|
|
if (type == ARC_BUFC_METADATA) {
|
|
|
|
return (zio_buf_alloc(size));
|
|
|
|
} else {
|
|
|
|
ASSERT(type == ARC_BUFC_DATA);
|
|
|
|
return (zio_data_buf_alloc(size));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
/*
|
|
|
|
* Wait for the specified amount of data (in bytes) to be evicted from the
|
|
|
|
* ARC, and for there to be sufficient free memory in the system. Waiting for
|
|
|
|
* eviction ensures that the memory used by the ARC decreases. Waiting for
|
|
|
|
* free memory ensures that the system won't run out of free pages, regardless
|
|
|
|
* of ARC behavior and settings. See arc_lowmem_init().
|
|
|
|
*/
|
|
|
|
void
|
2021-08-17 19:15:54 +03:00
|
|
|
arc_wait_for_eviction(uint64_t amount, boolean_t use_reserve)
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
{
|
2021-08-17 19:15:54 +03:00
|
|
|
switch (arc_is_overflowing(use_reserve)) {
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
case ARC_OVF_NONE:
|
|
|
|
return;
|
|
|
|
case ARC_OVF_SOME:
|
|
|
|
/*
|
|
|
|
* This is a bit racy without taking arc_evict_lock, but the
|
|
|
|
* worst that can happen is we either call zthr_wakeup() extra
|
|
|
|
* time due to race with other thread here, or the set flag
|
|
|
|
* get cleared by arc_evict_cb(), which is unlikely due to
|
|
|
|
* big hysteresis, but also not important since at this level
|
|
|
|
* of overflow the eviction is purely advisory. Same time
|
|
|
|
* taking the global lock here every time without waiting for
|
|
|
|
* the actual eviction creates a significant lock contention.
|
|
|
|
*/
|
|
|
|
if (!arc_evict_needed) {
|
|
|
|
arc_evict_needed = B_TRUE;
|
|
|
|
zthr_wakeup(arc_evict_zthr);
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
case ARC_OVF_SEVERE:
|
|
|
|
default:
|
|
|
|
{
|
|
|
|
arc_evict_waiter_t aw;
|
|
|
|
list_link_init(&aw.aew_node);
|
|
|
|
cv_init(&aw.aew_cv, NULL, CV_DEFAULT, NULL);
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
uint64_t last_count = 0;
|
|
|
|
mutex_enter(&arc_evict_lock);
|
|
|
|
if (!list_is_empty(&arc_evict_waiters)) {
|
|
|
|
arc_evict_waiter_t *last =
|
|
|
|
list_tail(&arc_evict_waiters);
|
|
|
|
last_count = last->aew_count;
|
|
|
|
} else if (!arc_evict_needed) {
|
|
|
|
arc_evict_needed = B_TRUE;
|
|
|
|
zthr_wakeup(arc_evict_zthr);
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Note, the last waiter's count may be less than
|
|
|
|
* arc_evict_count if we are low on memory in which
|
|
|
|
* case arc_evict_state_impl() may have deferred
|
|
|
|
* wakeups (but still incremented arc_evict_count).
|
|
|
|
*/
|
|
|
|
aw.aew_count = MAX(last_count, arc_evict_count) + amount;
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
list_insert_tail(&arc_evict_waiters, &aw);
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
arc_set_need_free();
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
DTRACE_PROBE3(arc__wait__for__eviction,
|
|
|
|
uint64_t, amount,
|
|
|
|
uint64_t, arc_evict_count,
|
|
|
|
uint64_t, aw.aew_count);
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
/*
|
|
|
|
* We will be woken up either when arc_evict_count reaches
|
|
|
|
* aew_count, or when the ARC is no longer overflowing and
|
|
|
|
* eviction completes.
|
|
|
|
* In case of "false" wakeup, we will still be on the list.
|
|
|
|
*/
|
|
|
|
do {
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
cv_wait(&aw.aew_cv, &arc_evict_lock);
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
} while (list_link_active(&aw.aew_node));
|
|
|
|
mutex_exit(&arc_evict_lock);
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
Fix ARC ghost states eviction accounting
arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.
To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.
To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock, reducing the contention.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #12279
2021-07-13 18:41:59 +03:00
|
|
|
cv_destroy(&aw.aew_cv);
|
|
|
|
}
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Allocate a block and return it to the caller. If we are hitting the
|
|
|
|
* hard limit for the cache size, we must sleep, waiting for the eviction
|
|
|
|
* thread to catch up. If we're past the target size but below the hard
|
|
|
|
* limit, we'll only signal the reclaim thread and continue on.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-07-22 18:52:49 +03:00
|
|
|
static void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,
|
2021-08-17 19:15:54 +03:00
|
|
|
int alloc_flags)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_adapt(size);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
* If arc_size is currently overflowing, we must be adding data
|
|
|
|
* faster than we are evicting. To ensure we don't compound the
|
2015-01-13 06:52:19 +03:00
|
|
|
* problem by adding more data and forcing arc_size to grow even
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
* further past it's target size, we wait for the eviction thread to
|
|
|
|
* make some progress. We also wait for there to be sufficient free
|
|
|
|
* memory in the system, as measured by arc_free_memory().
|
|
|
|
*
|
|
|
|
* Specifically, we wait for zfs_arc_eviction_pct percent of the
|
|
|
|
* requested size to be evicted. This should be more than 100%, to
|
|
|
|
* ensure that that progress is also made towards getting arc_size
|
|
|
|
* under arc_c. See the comment above zfs_arc_eviction_pct.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2021-08-17 19:15:54 +03:00
|
|
|
arc_wait_for_eviction(size * zfs_arc_eviction_pct / 100,
|
|
|
|
alloc_flags & ARC_HDR_USE_RESERVE);
|
2011-12-23 00:20:43 +04:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
|
|
|
if (type == ARC_BUFC_METADATA) {
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_space_consume(size, ARC_SPACE_META);
|
|
|
|
} else {
|
|
|
|
arc_space_consume(size, ARC_SPACE_DATA);
|
Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Update the state size. Note that ghost states have a
|
|
|
|
* "ghost size" and so don't need to be updated.
|
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!GHOST_STATE(state)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
(void) zfs_refcount_add_many(&state->arcs_size[type], size,
|
|
|
|
tag);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this is reached via arc_read, the link is
|
|
|
|
* protected by the hash lock. If reached via
|
|
|
|
* arc_buf_alloc, the header should not be accessed by
|
|
|
|
* any other thread. And, if reached via arc_read_done,
|
|
|
|
* the hash lock will protect it if it's found in the
|
|
|
|
* hash table; otherwise no other thread should be
|
|
|
|
* trying to [add|remove]_reference it.
|
|
|
|
*/
|
|
|
|
if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
|
|
|
(void) zfs_refcount_add_many(&state->arcs_esize[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
size, tag);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
2016-07-22 18:52:49 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size,
|
|
|
|
const void *tag)
|
2016-07-22 18:52:49 +03:00
|
|
|
{
|
|
|
|
arc_free_data_impl(hdr, size, tag);
|
|
|
|
abd_free(abd);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, const void *tag)
|
2016-07-22 18:52:49 +03:00
|
|
|
{
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
|
|
|
arc_free_data_impl(hdr, size, tag);
|
|
|
|
if (type == ARC_BUFC_METADATA) {
|
|
|
|
zio_buf_free(buf, size);
|
|
|
|
} else {
|
|
|
|
ASSERT(type == ARC_BUFC_DATA);
|
|
|
|
zio_data_buf_free(buf, size);
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free the arc data buffer.
|
|
|
|
*/
|
|
|
|
static void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
|
|
|
/* protected by hash lock, if in the hash table */
|
|
|
|
if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(state != arc_anon && state != arc_l2c_only);
|
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_esize[type],
|
2016-06-02 07:04:53 +03:00
|
|
|
size, tag);
|
|
|
|
}
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_size[type], size, tag);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
VERIFY3U(hdr->b_type, ==, type);
|
|
|
|
if (type == ARC_BUFC_METADATA) {
|
|
|
|
arc_space_return(size, ARC_SPACE_META);
|
|
|
|
} else {
|
|
|
|
ASSERT(type == ARC_BUFC_DATA);
|
|
|
|
arc_space_return(size, ARC_SPACE_DATA);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This routine is called whenever a buffer is accessed.
|
|
|
|
*/
|
|
|
|
static void
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_access(arc_buf_hdr_t *hdr, arc_flags_t arc_flags, boolean_t hit)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2022-12-22 23:10:24 +03:00
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
/*
|
|
|
|
* Update buffer prefetch status.
|
|
|
|
*/
|
|
|
|
boolean_t was_prefetch = HDR_PREFETCH(hdr);
|
|
|
|
boolean_t now_prefetch = arc_flags & ARC_FLAG_PREFETCH;
|
|
|
|
if (was_prefetch != now_prefetch) {
|
|
|
|
if (was_prefetch) {
|
|
|
|
ARCSTAT_CONDSTAT(hit, demand_hit, demand_iohit,
|
|
|
|
HDR_PRESCIENT_PREFETCH(hdr), prescient, predictive,
|
|
|
|
prefetch);
|
|
|
|
}
|
|
|
|
if (HDR_HAS_L2HDR(hdr))
|
|
|
|
l2arc_hdr_arcstats_decrement_state(hdr);
|
|
|
|
if (was_prefetch) {
|
|
|
|
arc_hdr_clear_flags(hdr,
|
|
|
|
ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH);
|
|
|
|
} else {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
|
|
|
|
}
|
|
|
|
if (HDR_HAS_L2HDR(hdr))
|
|
|
|
l2arc_hdr_arcstats_increment_state(hdr);
|
|
|
|
}
|
|
|
|
if (now_prefetch) {
|
|
|
|
if (arc_flags & ARC_FLAG_PRESCIENT_PREFETCH) {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH);
|
|
|
|
ARCSTAT_BUMP(arcstat_prescient_prefetch);
|
|
|
|
} else {
|
|
|
|
ARCSTAT_BUMP(arcstat_predictive_prefetch);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (arc_flags & ARC_FLAG_L2CACHE)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
|
|
|
|
|
|
|
|
clock_t now = ddi_get_lbolt();
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon) {
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
arc_state_t *new_state;
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
* This buffer is not in the cache, and does not appear in
|
|
|
|
* our "ghost" lists. Add it to the MRU or uncached state.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT0(hdr->b_l1hdr.b_arc_access);
|
2022-12-22 23:10:24 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if (HDR_UNCACHED(hdr)) {
|
|
|
|
new_state = arc_uncached;
|
|
|
|
DTRACE_PROBE1(new_state__uncached, arc_buf_hdr_t *,
|
|
|
|
hdr);
|
|
|
|
} else {
|
|
|
|
new_state = arc_mru;
|
|
|
|
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
|
|
|
|
}
|
|
|
|
arc_change_state(new_state, hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mru) {
|
2022-12-22 23:10:24 +03:00
|
|
|
/*
|
|
|
|
* This buffer has been accessed once recently and either
|
|
|
|
* its read is still in progress or it is in the cache.
|
|
|
|
*/
|
|
|
|
if (HDR_IO_IN_PROGRESS(hdr)) {
|
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
hdr->b_l1hdr.b_mru_hits++;
|
|
|
|
ARCSTAT_BUMP(arcstat_mru_hits);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2022-12-22 23:10:24 +03:00
|
|
|
* If the previous access was a prefetch, then it already
|
|
|
|
* handled possible promotion, so nothing more to do for now.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
if (was_prefetch) {
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-12-22 23:10:24 +03:00
|
|
|
* If more than ARC_MINTIME have passed from the previous
|
|
|
|
* hit, promote the buffer to the MFU state.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access +
|
|
|
|
ARC_MINTIME)) {
|
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_mfu, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_state_t *new_state;
|
|
|
|
/*
|
2022-12-22 23:10:24 +03:00
|
|
|
* This buffer has been accessed once recently, but was
|
|
|
|
* evicted from the cache. Would we have bigger MRU, it
|
|
|
|
* would be an MRU hit, so handle it the same way, except
|
|
|
|
* we don't need to check the previous access time.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
hdr->b_l1hdr.b_mru_ghost_hits++;
|
|
|
|
ARCSTAT_BUMP(arcstat_mru_ghost_hits);
|
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_add(&arc_mru_ghost->arcs_hits[arc_buf_type(hdr)],
|
|
|
|
arc_hdr_size(hdr));
|
2022-12-22 23:10:24 +03:00
|
|
|
if (was_prefetch) {
|
2008-11-20 23:01:55 +03:00
|
|
|
new_state = arc_mru;
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
new_state = arc_mfu;
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(new_state, hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mfu) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2022-12-22 23:10:24 +03:00
|
|
|
* This buffer has been accessed more than once and either
|
|
|
|
* still in the cache or being restored from one of ghosts.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
if (!HDR_IO_IN_PROGRESS(hdr)) {
|
|
|
|
hdr->b_l1hdr.b_mfu_hits++;
|
|
|
|
ARCSTAT_BUMP(arcstat_mfu_hits);
|
|
|
|
}
|
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2022-12-22 23:10:24 +03:00
|
|
|
* This buffer has been accessed more than once recently, but
|
|
|
|
* has been evicted from the cache. Would we have bigger MFU
|
|
|
|
* it would stay in cache, so move it back to MFU state.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2021-08-17 18:50:31 +03:00
|
|
|
hdr->b_l1hdr.b_mfu_ghost_hits++;
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_add(&arc_mfu_ghost->arcs_hits[arc_buf_type(hdr)],
|
|
|
|
arc_hdr_size(hdr));
|
2022-12-22 23:10:24 +03:00
|
|
|
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_change_state(arc_mfu, hdr);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_uncached) {
|
|
|
|
/*
|
|
|
|
* This buffer is uncacheable, but we got a hit. Probably
|
|
|
|
* a demand read after prefetch. Nothing more to do here.
|
|
|
|
*/
|
|
|
|
if (!HDR_IO_IN_PROGRESS(hdr))
|
|
|
|
ARCSTAT_BUMP(arcstat_uncached_hits);
|
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2022-12-22 23:10:24 +03:00
|
|
|
* This buffer is on the 2nd Level ARC and was not accessed
|
|
|
|
* for a long time, so treat it as new and put into MRU.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
|
|
|
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_change_state(arc_mru, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2014-12-30 06:12:23 +03:00
|
|
|
cmn_err(CE_PANIC, "invalid arc state 0x%p",
|
|
|
|
hdr->b_l1hdr.b_state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-01-08 20:52:36 +03:00
|
|
|
/*
|
|
|
|
* This routine is called by dbuf_hold() to update the arc_access() state
|
|
|
|
* which otherwise would be skipped for entries in the dbuf cache.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_buf_access(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Avoid taking the hash_lock when possible as an optimization.
|
|
|
|
* The header must be checked again under the hash_lock in order
|
|
|
|
* to handle the case where it is concurrently being released.
|
|
|
|
*/
|
2023-01-09 21:45:17 +03:00
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr))
|
2018-01-08 20:52:36 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
kmutex_t *hash_lock = HDR_LOCK(hdr);
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
|
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
ARCSTAT_BUMP(arcstat_access_skip);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
hdr->b_l1hdr.b_state == arc_mfu ||
|
|
|
|
hdr->b_l1hdr.b_state == arc_uncached);
|
2018-01-08 20:52:36 +03:00
|
|
|
|
|
|
|
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_access(hdr, 0, B_TRUE);
|
2018-01-08 20:52:36 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
ARCSTAT_BUMP(arcstat_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
ARCSTAT_CONDSTAT(B_TRUE /* demand */, demand, prefetch,
|
|
|
|
!HDR_ISTYPE_METADATA(hdr), data, metadata, hits);
|
2018-01-08 20:52:36 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/* a generic arc_read_done_func_t which you can use */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2017-11-16 04:27:01 +03:00
|
|
|
arc_bcopy_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,
|
|
|
|
arc_buf_t *buf, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) zio, (void) zb, (void) bp;
|
|
|
|
|
2017-11-16 04:27:01 +03:00
|
|
|
if (buf == NULL)
|
|
|
|
return;
|
|
|
|
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(arg, buf->b_data, arc_buf_size(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(buf, arg);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/* a generic arc_read_done_func_t */
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2017-11-16 04:27:01 +03:00
|
|
|
arc_getbuf_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,
|
|
|
|
arc_buf_t *buf, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) zb, (void) bp;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_t **bufp = arg;
|
2017-11-16 04:27:01 +03:00
|
|
|
|
|
|
|
if (buf == NULL) {
|
2018-08-29 21:33:33 +03:00
|
|
|
ASSERT(zio == NULL || zio->io_error != 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
*bufp = NULL;
|
|
|
|
} else {
|
2018-08-29 21:33:33 +03:00
|
|
|
ASSERT(zio == NULL || zio->io_error == 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
*bufp = buf;
|
2018-08-29 21:33:33 +03:00
|
|
|
ASSERT(buf->b_data != NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
|
|
|
arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp)
|
|
|
|
{
|
|
|
|
if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
|
|
|
|
ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(arc_hdr_get_compress(hdr), ==, ZIO_COMPRESS_OFF);
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
|
|
|
if (HDR_COMPRESSION_ENABLED(hdr)) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(arc_hdr_get_compress(hdr), ==,
|
2016-06-02 07:04:53 +03:00
|
|
|
BP_GET_COMPRESS(bp));
|
|
|
|
}
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
|
|
|
|
ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(!!HDR_PROTECTED(hdr), ==, BP_IS_PROTECTED(bp));
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_read_done(zio_t *zio)
|
|
|
|
{
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
blkptr_t *bp = zio->io_bp;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = zio->io_private;
|
2014-06-06 01:19:08 +04:00
|
|
|
kmutex_t *hash_lock = NULL;
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_callback_t *callback_list;
|
|
|
|
arc_callback_t *acb;
|
2017-04-12 00:56:54 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* The hdr was inserted into hash-table and removed from lists
|
|
|
|
* prior to starting I/O. We should find this header, since
|
|
|
|
* it's in the hash table, and it should be legit since it's
|
|
|
|
* not possible to evict it during the I/O. The only possible
|
|
|
|
* reason for it not to be found is if we were freed during the
|
|
|
|
* read.
|
|
|
|
*/
|
2014-06-06 01:19:08 +04:00
|
|
|
if (HDR_IN_HASH_TABLE(hdr)) {
|
2017-11-07 21:42:15 +03:00
|
|
|
arc_buf_hdr_t *found;
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp));
|
|
|
|
ASSERT3U(hdr->b_dva.dva_word[0], ==,
|
|
|
|
BP_IDENTITY(zio->io_bp)->dva_word[0]);
|
|
|
|
ASSERT3U(hdr->b_dva.dva_word[1], ==,
|
|
|
|
BP_IDENTITY(zio->io_bp)->dva_word[1]);
|
|
|
|
|
2017-11-07 21:42:15 +03:00
|
|
|
found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock);
|
2014-06-06 01:19:08 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT((found == hdr &&
|
2014-06-06 01:19:08 +04:00
|
|
|
DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
|
|
|
|
(found == hdr && HDR_L2_READING(hdr)));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hash_lock, !=, NULL);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (BP_IS_PROTECTED(bp)) {
|
|
|
|
hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);
|
|
|
|
hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;
|
|
|
|
zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,
|
|
|
|
hdr->b_crypt_hdr.b_iv);
|
|
|
|
|
2021-12-04 00:13:21 +03:00
|
|
|
if (zio->io_error == 0) {
|
|
|
|
if (BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG) {
|
|
|
|
void *tmpbuf;
|
|
|
|
|
|
|
|
tmpbuf = abd_borrow_buf_copy(zio->io_abd,
|
|
|
|
sizeof (zil_chain_t));
|
|
|
|
zio_crypt_decode_mac_zil(tmpbuf,
|
|
|
|
hdr->b_crypt_hdr.b_mac);
|
|
|
|
abd_return_buf(zio->io_abd, tmpbuf,
|
|
|
|
sizeof (zil_chain_t));
|
|
|
|
} else {
|
|
|
|
zio_crypt_decode_mac_bp(bp,
|
|
|
|
hdr->b_crypt_hdr.b_mac);
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-11-16 04:27:01 +03:00
|
|
|
if (zio->io_error == 0) {
|
2016-06-02 07:04:53 +03:00
|
|
|
/* byteswap if necessary */
|
|
|
|
if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
|
|
|
|
if (BP_GET_LEVEL(zio->io_bp) > 0) {
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
|
|
|
|
} else {
|
|
|
|
hdr->b_l1hdr.b_byteswap =
|
|
|
|
DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
|
|
|
|
}
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
if (!HDR_L2_READING(hdr)) {
|
|
|
|
hdr->b_complevel = zio->io_prop.zp_complevel;
|
|
|
|
}
|
2014-06-06 01:19:08 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);
|
2020-09-23 02:08:05 +03:00
|
|
|
if (l2arc_noprefetch && HDR_PREFETCH(hdr))
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
callback_list = hdr->b_l1hdr.b_acb;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(callback_list, !=, NULL);
|
2022-12-22 23:10:24 +03:00
|
|
|
hdr->b_l1hdr.b_acb = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* If a read request has a callback (i.e. acb_done is not NULL), then we
|
|
|
|
* make a buf containing the data according to the parameters which were
|
|
|
|
* passed in. The implementation of arc_buf_alloc_impl() ensures that we
|
|
|
|
* aren't needlessly decompressing the data multiple times.
|
|
|
|
*/
|
2017-04-12 00:56:54 +03:00
|
|
|
int callback_cnt = 0;
|
2016-07-11 20:45:52 +03:00
|
|
|
for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
|
2022-12-22 23:10:24 +03:00
|
|
|
|
|
|
|
/* We need the last one to call below in original order. */
|
|
|
|
callback_list = acb;
|
|
|
|
|
2020-12-13 03:00:00 +03:00
|
|
|
if (!acb->acb_done || acb->acb_nobuf)
|
2016-07-11 20:45:52 +03:00
|
|
|
continue;
|
|
|
|
|
|
|
|
callback_cnt++;
|
2016-07-14 00:17:41 +03:00
|
|
|
|
2017-11-16 04:27:01 +03:00
|
|
|
if (zio->io_error != 0)
|
|
|
|
continue;
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
int error = arc_buf_alloc_impl(hdr, zio->io_spa,
|
2018-05-03 01:36:20 +03:00
|
|
|
&acb->acb_zb, acb->acb_private, acb->acb_encrypted,
|
2017-11-16 04:27:01 +03:00
|
|
|
acb->acb_compressed, acb->acb_noauth, B_TRUE,
|
2017-09-28 18:49:13 +03:00
|
|
|
&acb->acb_buf);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
/*
|
2017-09-28 18:49:13 +03:00
|
|
|
* Assert non-speculative zios didn't fail because an
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* encryption key wasn't loaded
|
|
|
|
*/
|
2018-03-31 21:12:51 +03:00
|
|
|
ASSERT((zio->io_flags & ZIO_FLAG_SPECULATIVE) ||
|
2018-05-03 01:36:20 +03:00
|
|
|
error != EACCES);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we failed to decrypt, report an error now (as the zio
|
|
|
|
* layer would have done if it had done the transforms).
|
|
|
|
*/
|
|
|
|
if (error == ECKSUM) {
|
|
|
|
ASSERT(BP_IS_PROTECTED(bp));
|
|
|
|
error = SET_ERROR(EIO);
|
|
|
|
if ((zio->io_flags & ZIO_FLAG_SPECULATIVE) == 0) {
|
2023-03-29 02:51:58 +03:00
|
|
|
spa_log_error(zio->io_spa, &acb->acb_zb,
|
|
|
|
&zio->io_bp->blk_birth);
|
2020-09-01 05:35:11 +03:00
|
|
|
(void) zfs_ereport_post(
|
|
|
|
FM_EREPORT_ZFS_AUTHENTICATION,
|
2020-09-04 20:34:28 +03:00
|
|
|
zio->io_spa, NULL, &acb->acb_zb, zio, 0);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-08-29 21:33:33 +03:00
|
|
|
if (error != 0) {
|
|
|
|
/*
|
|
|
|
* Decompression or decryption failed. Set
|
|
|
|
* io_error so that when we call acb_done
|
|
|
|
* (below), we will indicate that the read
|
|
|
|
* failed. Note that in the unusual case
|
|
|
|
* where one callback is compressed and another
|
|
|
|
* uncompressed, we will mark all of them
|
|
|
|
* as failed, even though the uncompressed
|
|
|
|
* one can't actually fail. In this case,
|
|
|
|
* the hdr will not be anonymous, because
|
|
|
|
* if there are multiple callbacks, it's
|
|
|
|
* because multiple threads found the same
|
|
|
|
* arc buf in the hash table.
|
|
|
|
*/
|
2016-07-14 00:17:41 +03:00
|
|
|
zio->io_error = error;
|
2018-08-29 21:33:33 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2018-08-29 21:33:33 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If there are multiple callbacks, we must have the hash lock,
|
|
|
|
* because the only way for multiple threads to find this hdr is
|
|
|
|
* in the hash table. This ensures that if there are multiple
|
|
|
|
* callbacks, the hdr is not anonymous. If it were anonymous,
|
|
|
|
* we couldn't use arc_buf_destroy() in the error case below.
|
|
|
|
*/
|
|
|
|
ASSERT(callback_cnt < 2 || hash_lock != NULL);
|
|
|
|
|
2017-11-16 04:27:01 +03:00
|
|
|
if (zio->io_error == 0) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_verify(hdr, zio->io_bp);
|
|
|
|
} else {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_l1hdr.b_state != arc_anon)
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_anon, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (HDR_IN_HASH_TABLE(hdr))
|
|
|
|
buf_hash_remove(hdr);
|
|
|
|
}
|
|
|
|
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
|
|
|
(void) remove_reference(hdr, hdr);
|
|
|
|
|
|
|
|
if (hash_lock != NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
/* execute each callback and free its structure */
|
|
|
|
while ((acb = callback_list) != NULL) {
|
2018-08-29 21:33:33 +03:00
|
|
|
if (acb->acb_done != NULL) {
|
|
|
|
if (zio->io_error != 0 && acb->acb_buf != NULL) {
|
|
|
|
/*
|
|
|
|
* If arc_buf_alloc_impl() fails during
|
|
|
|
* decompression, the buf will still be
|
|
|
|
* allocated, and needs to be freed here.
|
|
|
|
*/
|
|
|
|
arc_buf_destroy(acb->acb_buf,
|
|
|
|
acb->acb_private);
|
|
|
|
acb->acb_buf = NULL;
|
|
|
|
}
|
2017-11-16 04:27:01 +03:00
|
|
|
acb->acb_done(zio, &zio->io_bookmark, zio->io_bp,
|
|
|
|
acb->acb_buf, acb->acb_private);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (acb->acb_zio_dummy != NULL) {
|
|
|
|
acb->acb_zio_dummy->io_error = zio->io_error;
|
|
|
|
zio_nowait(acb->acb_zio_dummy);
|
|
|
|
}
|
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
callback_list = acb->acb_prev;
|
|
|
|
if (acb->acb_wait) {
|
|
|
|
mutex_enter(&acb->acb_wait_lock);
|
|
|
|
acb->acb_wait_error = zio->io_error;
|
|
|
|
acb->acb_wait = B_FALSE;
|
|
|
|
cv_signal(&acb->acb_wait_cv);
|
|
|
|
mutex_exit(&acb->acb_wait_lock);
|
|
|
|
/* acb will be freed by the waiting thread. */
|
|
|
|
} else {
|
|
|
|
kmem_free(acb, sizeof (arc_callback_t));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-01-11 20:54:18 +04:00
|
|
|
* "Read" the block at the specified DVA (in bp) via the
|
2008-11-20 23:01:55 +03:00
|
|
|
* cache. If the block is found in the cache, invoke the provided
|
|
|
|
* callback immediately and return. Note that the `zio' parameter
|
|
|
|
* in the callback will be NULL in this case, since no IO was
|
|
|
|
* required. If the block is not in the cache pass the read request
|
|
|
|
* on to the spa with a substitute callback function, so that the
|
|
|
|
* requested block will be added to the cache.
|
|
|
|
*
|
|
|
|
* If a read request arrives for a block that has a read in-progress,
|
|
|
|
* either wait for the in-progress read to complete (and return the
|
|
|
|
* results); or, if this is a read with a "done" func, add a record
|
|
|
|
* to the read to invoke the "done" func when the read completes,
|
|
|
|
* and return; or just return.
|
|
|
|
*
|
|
|
|
* arc_read_done() will invoke all the requested "done" functions
|
|
|
|
* for readers of this block.
|
|
|
|
*/
|
|
|
|
int
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
|
|
|
|
arc_read_done_func_t *done, void *private, zio_priority_t priority,
|
|
|
|
int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_buf_hdr_t *hdr = NULL;
|
|
|
|
kmutex_t *hash_lock = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_t *rzio;
|
2011-11-12 02:07:54 +04:00
|
|
|
uint64_t guid = spa_load_guid(spa);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW_COMPRESS) != 0;
|
|
|
|
boolean_t encrypted_read = BP_IS_ENCRYPTED(bp) &&
|
|
|
|
(zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;
|
|
|
|
boolean_t noauth_read = BP_IS_AUTHENTICATED(bp) &&
|
|
|
|
(zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;
|
2019-02-04 20:33:30 +03:00
|
|
|
boolean_t embedded_bp = !!BP_IS_EMBEDDED(bp);
|
2020-12-10 02:05:06 +03:00
|
|
|
boolean_t no_buf = *arc_flags & ARC_FLAG_NO_BUF;
|
2023-02-14 00:21:53 +03:00
|
|
|
arc_buf_t *buf = NULL;
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
int rc = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-02-04 20:33:30 +03:00
|
|
|
ASSERT(!embedded_bp ||
|
2014-06-06 01:19:08 +04:00
|
|
|
BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
ASSERT(!BP_IS_HOLE(bp));
|
|
|
|
ASSERT(!BP_IS_REDACTED(bp));
|
2014-06-06 01:19:08 +04:00
|
|
|
|
2020-03-12 20:24:43 +03:00
|
|
|
/*
|
|
|
|
* Normally SPL_FSTRANS will already be set since kernel threads which
|
|
|
|
* expect to call the DMU interfaces will set it when created. System
|
|
|
|
* calls are similarly handled by setting/cleaning the bit in the
|
|
|
|
* registered callback (module/os/.../zfs/zpl_*).
|
|
|
|
*
|
|
|
|
* External consumers such as Lustre which call the exported DMU
|
|
|
|
* interfaces may not have set SPL_FSTRANS. To avoid a deadlock
|
|
|
|
* on the hash_lock always set and clear the bit.
|
|
|
|
*/
|
|
|
|
fstrans_cookie_t cookie = spl_fstrans_mark();
|
2008-11-20 23:01:55 +03:00
|
|
|
top:
|
2021-09-10 04:02:07 +03:00
|
|
|
/*
|
|
|
|
* Verify the block pointer contents are reasonable. This should
|
|
|
|
* always be the case since the blkptr is protected by a checksum.
|
|
|
|
* However, if there is damage it's desirable to detect this early
|
|
|
|
* and treat it as a checksum error. This allows an alternate blkptr
|
|
|
|
* to be tried when one is available (e.g. ditto blocks).
|
|
|
|
*/
|
Verify block pointers before writing them out
If a block pointer is corrupted (but the block containing it checksums
correctly, e.g. due to a bug that overwrites random memory), we can
often detect it before the block is read, with the `zfs_blkptr_verify()`
function, which is used in `arc_read()`, `zio_free()`, etc.
However, such corruption is not typically recoverable. To recover from
it we would need to detect the memory error before the block pointer is
written to disk.
This PR verifies BP's that are contained in indirect blocks and dnodes
before they are written to disk, in `dbuf_write_ready()`. This way,
we'll get a panic before the on-disk data is corrupted. This will help
us to diagnose what's causing the corruption, as well as being much
easier to recover from.
To minimize performance impact, only checks that can be done without
holding the spa_config_lock are performed.
Additionally, when corruption is detected, the raw words of the block
pointer are logged. (Note that `dprintf_bp()` is a no-op by default,
but if enabled it is not safe to use with invalid block pointers.)
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14817
2023-05-08 21:20:23 +03:00
|
|
|
if (!zfs_blkptr_verify(spa, bp, (zio_flags & ZIO_FLAG_CONFIG_WRITER) ?
|
|
|
|
BLK_CONFIG_HELD : BLK_CONFIG_NEEDED, BLK_VERIFY_LOG)) {
|
2021-09-10 04:02:07 +03:00
|
|
|
rc = SET_ERROR(ECKSUM);
|
2023-02-14 00:21:53 +03:00
|
|
|
goto done;
|
2021-09-10 04:02:07 +03:00
|
|
|
}
|
|
|
|
|
2019-02-04 20:33:30 +03:00
|
|
|
if (!embedded_bp) {
|
2014-06-06 01:19:08 +04:00
|
|
|
/*
|
|
|
|
* Embedded BP's have no DVA and require no I/O to "read".
|
|
|
|
* Create an anonymous arc buf to back it.
|
|
|
|
*/
|
|
|
|
hdr = buf_hash_find(guid, bp, &hash_lock);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Determine if we have an L1 cache hit or a cache miss. For simplicity
|
2019-09-03 03:56:41 +03:00
|
|
|
* we maintain encrypted data separately from compressed / uncompressed
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* data. If the user is requesting raw encrypted data and we don't have
|
|
|
|
* that in the header we will read from disk to guarantee that we can
|
|
|
|
* get it even if the encryption keys aren't loaded.
|
|
|
|
*/
|
|
|
|
if (hdr != NULL && HDR_HAS_L1HDR(hdr) && (HDR_HAS_RABD(hdr) ||
|
|
|
|
(hdr->b_l1hdr.b_pabd != NULL && !encrypted_read))) {
|
2022-12-22 23:10:24 +03:00
|
|
|
boolean_t is_data = !HDR_ISTYPE_METADATA(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (HDR_IO_IN_PROGRESS(hdr)) {
|
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 20:51:04 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_CACHED_ONLY) {
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
ARCSTAT_BUMP(arcstat_cached_only_in_progress);
|
|
|
|
rc = SET_ERROR(ENOENT);
|
2023-02-14 00:21:53 +03:00
|
|
|
goto done;
|
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 20:51:04 +03:00
|
|
|
}
|
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
zio_t *head_zio = hdr->b_l1hdr.b_acb->acb_zio_head;
|
2017-12-21 20:13:06 +03:00
|
|
|
ASSERT3P(head_zio, !=, NULL);
|
2015-12-27 00:10:31 +03:00
|
|
|
if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&
|
|
|
|
priority == ZIO_PRIORITY_SYNC_READ) {
|
|
|
|
/*
|
2017-12-21 20:13:06 +03:00
|
|
|
* This is a sync read that needs to wait for
|
|
|
|
* an in-flight async read. Request that the
|
|
|
|
* zio have its priority upgraded.
|
2015-12-27 00:10:31 +03:00
|
|
|
*/
|
2017-12-21 20:13:06 +03:00
|
|
|
zio_change_priority(head_zio, priority);
|
|
|
|
DTRACE_PROBE1(arc__async__upgrade__sync,
|
2015-12-27 00:10:31 +03:00
|
|
|
arc_buf_hdr_t *, hdr);
|
2017-12-21 20:13:06 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_async_upgrade_sync);
|
2015-12-27 00:10:31 +03:00
|
|
|
}
|
2022-12-22 23:10:24 +03:00
|
|
|
|
|
|
|
DTRACE_PROBE1(arc__iohit, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_access(hdr, *arc_flags, B_FALSE);
|
2015-12-27 00:10:31 +03:00
|
|
|
|
2022-08-09 02:55:13 +03:00
|
|
|
/*
|
|
|
|
* If there are multiple threads reading the same block
|
|
|
|
* and that block is not yet in the ARC, then only one
|
|
|
|
* thread will do the physical I/O and all other
|
|
|
|
* threads will wait until that I/O completes.
|
2022-12-22 23:10:24 +03:00
|
|
|
* Synchronous reads use the acb_wait_cv whereas nowait
|
|
|
|
* reads register a callback. Both are signalled/called
|
|
|
|
* in arc_read_done.
|
2022-08-09 02:55:13 +03:00
|
|
|
*
|
2022-12-22 23:10:24 +03:00
|
|
|
* Errors of the physical I/O may need to be propagated.
|
|
|
|
* Synchronous read errors are returned here from
|
|
|
|
* arc_read_done via acb_wait_error. Nowait reads
|
2022-08-09 02:55:13 +03:00
|
|
|
* attach the acb_zio_dummy zio to pio and
|
|
|
|
* arc_read_done propagates the physical I/O's io_error
|
|
|
|
* to acb_zio_dummy, and thereby to pio.
|
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_callback_t *acb = NULL;
|
|
|
|
if (done || pio || *arc_flags & ARC_FLAG_WAIT) {
|
2008-11-20 23:01:55 +03:00
|
|
|
acb = kmem_zalloc(sizeof (arc_callback_t),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
acb->acb_done = done;
|
|
|
|
acb->acb_private = private;
|
2017-04-12 00:56:54 +03:00
|
|
|
acb->acb_compressed = compressed_read;
|
2017-09-28 18:49:13 +03:00
|
|
|
acb->acb_encrypted = encrypted_read;
|
|
|
|
acb->acb_noauth = noauth_read;
|
2020-12-13 03:00:00 +03:00
|
|
|
acb->acb_nobuf = no_buf;
|
2022-12-22 23:10:24 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_WAIT) {
|
|
|
|
acb->acb_wait = B_TRUE;
|
|
|
|
mutex_init(&acb->acb_wait_lock, NULL,
|
|
|
|
MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&acb->acb_wait_cv, NULL,
|
|
|
|
CV_DEFAULT, NULL);
|
|
|
|
}
|
2018-05-03 01:36:20 +03:00
|
|
|
acb->acb_zb = *zb;
|
2022-12-22 23:10:24 +03:00
|
|
|
if (pio != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
acb->acb_zio_dummy = zio_null(pio,
|
2009-02-18 23:51:31 +03:00
|
|
|
spa, NULL, NULL, NULL, zio_flags);
|
2022-12-22 23:10:24 +03:00
|
|
|
}
|
2017-12-21 20:13:06 +03:00
|
|
|
acb->acb_zio_head = head_zio;
|
2014-12-30 06:12:23 +03:00
|
|
|
acb->acb_next = hdr->b_l1hdr.b_acb;
|
2023-10-05 00:45:00 +03:00
|
|
|
hdr->b_l1hdr.b_acb->acb_prev = acb;
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_acb = acb;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
2022-12-22 23:10:24 +03:00
|
|
|
|
|
|
|
ARCSTAT_BUMP(arcstat_iohits);
|
|
|
|
ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),
|
|
|
|
demand, prefetch, is_data, data, metadata, iohits);
|
|
|
|
|
|
|
|
if (*arc_flags & ARC_FLAG_WAIT) {
|
|
|
|
mutex_enter(&acb->acb_wait_lock);
|
|
|
|
while (acb->acb_wait) {
|
|
|
|
cv_wait(&acb->acb_wait_cv,
|
|
|
|
&acb->acb_wait_lock);
|
|
|
|
}
|
|
|
|
rc = acb->acb_wait_error;
|
|
|
|
mutex_exit(&acb->acb_wait_lock);
|
|
|
|
mutex_destroy(&acb->acb_wait_lock);
|
|
|
|
cv_destroy(&acb->acb_wait_cv);
|
|
|
|
kmem_free(acb, sizeof (arc_callback_t));
|
|
|
|
}
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
hdr->b_l1hdr.b_state == arc_mfu ||
|
|
|
|
hdr->b_l1hdr.b_state == arc_uncached);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_access(hdr, *arc_flags, B_TRUE);
|
2017-11-16 04:27:01 +03:00
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
if (done && !no_buf) {
|
2019-02-04 20:33:30 +03:00
|
|
|
ASSERT(!embedded_bp || !BP_IS_HOLE(bp));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/* Get a buf with the desired data in it. */
|
2018-05-03 01:36:20 +03:00
|
|
|
rc = arc_buf_alloc_impl(hdr, spa, zb, private,
|
|
|
|
encrypted_read, compressed_read, noauth_read,
|
|
|
|
B_TRUE, &buf);
|
2018-03-31 21:12:51 +03:00
|
|
|
if (rc == ECKSUM) {
|
|
|
|
/*
|
|
|
|
* Convert authentication and decryption errors
|
2018-05-03 01:36:20 +03:00
|
|
|
* to EIO (and generate an ereport if needed)
|
|
|
|
* before leaving the ARC.
|
2018-03-31 21:12:51 +03:00
|
|
|
*/
|
|
|
|
rc = SET_ERROR(EIO);
|
2018-05-03 01:36:20 +03:00
|
|
|
if ((zio_flags & ZIO_FLAG_SPECULATIVE) == 0) {
|
2023-03-29 02:51:58 +03:00
|
|
|
spa_log_error(spa, zb, &hdr->b_birth);
|
2020-09-01 05:35:11 +03:00
|
|
|
(void) zfs_ereport_post(
|
2018-05-03 01:36:20 +03:00
|
|
|
FM_EREPORT_ZFS_AUTHENTICATION,
|
2020-09-04 20:34:28 +03:00
|
|
|
spa, NULL, zb, NULL, 0);
|
2018-05-03 01:36:20 +03:00
|
|
|
}
|
2018-03-31 21:12:51 +03:00
|
|
|
}
|
2017-11-16 04:27:01 +03:00
|
|
|
if (rc != 0) {
|
2018-05-01 21:24:20 +03:00
|
|
|
arc_buf_destroy_impl(buf);
|
2017-11-16 04:27:01 +03:00
|
|
|
buf = NULL;
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
(void) remove_reference(hdr, private);
|
2017-11-16 04:27:01 +03:00
|
|
|
}
|
|
|
|
|
2018-03-31 21:12:51 +03:00
|
|
|
/* assert any errors weren't due to unloaded keys */
|
|
|
|
ASSERT((zio_flags & ZIO_FLAG_SPECULATIVE) ||
|
2018-05-03 01:36:20 +03:00
|
|
|
rc != EACCES);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
ARCSTAT_BUMP(arcstat_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),
|
|
|
|
demand, prefetch, is_data, data, metadata, hits);
|
|
|
|
*arc_flags |= ARC_FLAG_CACHED;
|
2023-02-14 00:21:53 +03:00
|
|
|
goto done;
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t lsize = BP_GET_LSIZE(bp);
|
|
|
|
uint64_t psize = BP_GET_PSIZE(bp);
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_callback_t *acb;
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_t *vd = NULL;
|
2013-02-11 10:21:05 +04:00
|
|
|
uint64_t addr = 0;
|
2009-02-18 23:51:31 +03:00
|
|
|
boolean_t devw = B_FALSE;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t size;
|
2017-09-28 18:49:13 +03:00
|
|
|
abd_t *hdr_abd;
|
2020-08-12 20:03:24 +03:00
|
|
|
int alloc_flags = encrypted_read ? ARC_HDR_ALLOC_RDATA : 0;
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 20:51:04 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_CACHED_ONLY) {
|
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
2023-02-14 00:21:53 +03:00
|
|
|
rc = SET_ERROR(ENOENT);
|
|
|
|
goto done;
|
Improve zfs send performance by bypassing the ARC
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with `zfs send --compressed`, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the `send_prefetch` thread, which is 100% on
CPU.
The main job of the `send_prefetch` thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
`arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
* Since the data was only prefetched, dmu_send()->dmu_dump_write() will
need to call arc_read() again to get the data. This will have to
look up the arc_hdr in the hash table and copy the data from the
scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes
27% of one CPU.
* dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one
CPU.
* arc_adjust() will need to evict this arc_hdr, taking about 50% of one
CPU.
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes `zfs send` to check for the
data in the ARC, and if it is not found then we directly call
`zio_read()`, reading the data into a linear ABD which is used by
dmu_dump_write() directly.
The performance improvement is best expressed in terms of how many
blocks can be processed by `zfs send` in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes
to 58 minutes in this test case).
In addition to improving the performance of `zfs send`, this change
makes `zfs send` not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10067
2020-03-10 20:51:04 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (hdr == NULL) {
|
2019-02-04 20:33:30 +03:00
|
|
|
/*
|
|
|
|
* This block is not in the cache or it has
|
|
|
|
* embedded data.
|
|
|
|
*/
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_buf_hdr_t *exists = NULL;
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
|
2021-08-17 19:15:54 +03:00
|
|
|
BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), 0, type);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2019-02-04 20:33:30 +03:00
|
|
|
if (!embedded_bp) {
|
2014-06-06 01:19:08 +04:00
|
|
|
hdr->b_dva = *BP_IDENTITY(bp);
|
|
|
|
hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
|
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
|
|
|
}
|
|
|
|
if (exists != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/* somebody beat us to the hash insert */
|
|
|
|
mutex_exit(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
buf_discard_identity(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_destroy(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
goto top; /* restart the IO request */
|
|
|
|
}
|
|
|
|
} else {
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* This block is in the ghost cache or encrypted data
|
|
|
|
* was requested and we didn't have it. If it was
|
|
|
|
* L2-only (and thus didn't have an L1 hdr),
|
|
|
|
* we realloc the header to add an L1 hdr.
|
2014-12-30 06:12:23 +03:00
|
|
|
*/
|
|
|
|
if (!HDR_HAS_L1HDR(hdr)) {
|
|
|
|
hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
|
|
|
|
hdr_full_cache);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (GHOST_STATE(hdr->b_l1hdr.b_state)) {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT0(zfs_refcount_count(
|
|
|
|
&hdr->b_l1hdr.b_refcnt));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
2023-01-05 20:15:31 +03:00
|
|
|
#endif
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
} else if (HDR_IO_IN_PROGRESS(hdr)) {
|
|
|
|
/*
|
|
|
|
* If this header already had an IO in progress
|
|
|
|
* and we are performing another IO to fetch
|
|
|
|
* encrypted data we must wait until the first
|
|
|
|
* IO completes so as not to confuse
|
|
|
|
* arc_read_done(). This should be very rare
|
|
|
|
* and so the performance impact shouldn't
|
|
|
|
* matter.
|
|
|
|
*/
|
2023-10-05 00:45:00 +03:00
|
|
|
arc_callback_t *acb = kmem_zalloc(
|
|
|
|
sizeof (arc_callback_t), KM_SLEEP);
|
|
|
|
acb->acb_wait = B_TRUE;
|
|
|
|
mutex_init(&acb->acb_wait_lock, NULL,
|
|
|
|
MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&acb->acb_wait_cv, NULL, CV_DEFAULT,
|
|
|
|
NULL);
|
|
|
|
acb->acb_zio_head =
|
|
|
|
hdr->b_l1hdr.b_acb->acb_zio_head;
|
|
|
|
acb->acb_next = hdr->b_l1hdr.b_acb;
|
|
|
|
hdr->b_l1hdr.b_acb->acb_prev = acb;
|
|
|
|
hdr->b_l1hdr.b_acb = acb;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
mutex_exit(hash_lock);
|
2023-10-05 00:45:00 +03:00
|
|
|
mutex_enter(&acb->acb_wait_lock);
|
|
|
|
while (acb->acb_wait) {
|
|
|
|
cv_wait(&acb->acb_wait_cv,
|
|
|
|
&acb->acb_wait_lock);
|
|
|
|
}
|
|
|
|
mutex_exit(&acb->acb_wait_lock);
|
|
|
|
mutex_destroy(&acb->acb_wait_lock);
|
|
|
|
cv_destroy(&acb->acb_wait_cv);
|
|
|
|
kmem_free(acb, sizeof (arc_callback_t));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
goto top;
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_UNCACHED) {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);
|
|
|
|
if (!encrypted_read)
|
|
|
|
alloc_flags |= ARC_HDR_ALLOC_LINEAR;
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
/*
|
|
|
|
* Take additional reference for IO_IN_PROGRESS. It stops
|
|
|
|
* arc_access() from putting this header without any buffers
|
|
|
|
* and so other references but obviously nonevictable onto
|
|
|
|
* the evictable list of MRU or MFU state.
|
|
|
|
*/
|
|
|
|
add_reference(hdr, hdr);
|
|
|
|
if (!embedded_bp)
|
|
|
|
arc_access(hdr, *arc_flags, B_FALSE);
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
2021-08-17 19:15:54 +03:00
|
|
|
arc_hdr_alloc_abd(hdr, alloc_flags);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (encrypted_read) {
|
|
|
|
ASSERT(HDR_HAS_RABD(hdr));
|
|
|
|
size = HDR_GET_PSIZE(hdr);
|
|
|
|
hdr_abd = hdr->b_crypt_hdr.b_rabd;
|
2016-06-02 07:04:53 +03:00
|
|
|
zio_flags |= ZIO_FLAG_RAW;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
} else {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
|
|
|
size = arc_hdr_size(hdr);
|
|
|
|
hdr_abd = hdr->b_l1hdr.b_pabd;
|
|
|
|
|
|
|
|
if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
|
|
|
|
zio_flags |= ZIO_FLAG_RAW_COMPRESS;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For authenticated bp's, we do not ask the ZIO layer
|
|
|
|
* to authenticate them since this will cause the entire
|
|
|
|
* IO to fail if the key isn't loaded. Instead, we
|
|
|
|
* defer authentication until arc_buf_fill(), which will
|
|
|
|
* verify the data when the key is available.
|
|
|
|
*/
|
|
|
|
if (BP_IS_AUTHENTICATED(bp))
|
|
|
|
zio_flags |= ZIO_FLAG_RAW_ENCRYPT;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (BP_IS_AUTHENTICATED(bp))
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);
|
2016-06-02 07:04:53 +03:00
|
|
|
if (BP_GET_LEVEL(bp) > 0)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2014-11-21 03:09:39 +03:00
|
|
|
acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
acb->acb_done = done;
|
|
|
|
acb->acb_private = private;
|
2016-07-11 20:45:52 +03:00
|
|
|
acb->acb_compressed = compressed_read;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
acb->acb_encrypted = encrypted_read;
|
|
|
|
acb->acb_noauth = noauth_read;
|
2018-05-03 01:36:20 +03:00
|
|
|
acb->acb_zb = *zb;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_acb = acb;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr) &&
|
|
|
|
(vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
|
|
|
|
devw = hdr->b_l2hdr.b_dev->l2ad_writing;
|
|
|
|
addr = hdr->b_l2hdr.b_daddr;
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
* Lock out L2ARC device removal.
|
2008-12-03 23:09:06 +03:00
|
|
|
*/
|
|
|
|
if (vdev_is_dead(vd) ||
|
|
|
|
!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
|
|
|
|
vd = NULL;
|
|
|
|
}
|
|
|
|
|
2017-12-21 20:13:06 +03:00
|
|
|
/*
|
|
|
|
* We count both async reads and scrub IOs as asynchronous so
|
|
|
|
* that both can be upgraded in the event of a cache hit while
|
|
|
|
* the read IO is still in-flight.
|
|
|
|
*/
|
|
|
|
if (priority == ZIO_PRIORITY_ASYNC_READ ||
|
|
|
|
priority == ZIO_PRIORITY_SCRUB)
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
|
|
|
|
else
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/*
|
2019-02-04 20:33:30 +03:00
|
|
|
* At this point, we have a level 1 cache miss or a blkptr
|
|
|
|
* with embedded data. Try again in L2ARC if possible.
|
2013-06-11 21:12:34 +04:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
|
|
|
|
|
2019-02-04 20:33:30 +03:00
|
|
|
/*
|
|
|
|
* Skip ARC stat bump for block pointers with embedded
|
|
|
|
* data. The data are read from the blkptr itself via
|
|
|
|
* decode_embedded_bp_compressed().
|
|
|
|
*/
|
|
|
|
if (!embedded_bp) {
|
|
|
|
DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr,
|
|
|
|
blkptr_t *, bp, uint64_t, lsize,
|
|
|
|
zbookmark_phys_t *, zb);
|
|
|
|
ARCSTAT_BUMP(arcstat_misses);
|
2022-12-22 23:10:24 +03:00
|
|
|
ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),
|
2019-02-04 20:33:30 +03:00
|
|
|
demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data,
|
|
|
|
metadata, misses);
|
2021-02-20 09:34:33 +03:00
|
|
|
zfs_racct_read(size, 1);
|
2019-02-04 20:33:30 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-10-20 21:39:52 +03:00
|
|
|
/* Check if the spa even has l2 configured */
|
|
|
|
const boolean_t spa_has_l2 = l2arc_ndev != 0 &&
|
|
|
|
spa->spa_l2cache.sav_count > 0;
|
|
|
|
|
|
|
|
if (vd != NULL && spa_has_l2 && !(l2arc_norw && devw)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Read from the L2ARC if the following are true:
|
2008-12-03 23:09:06 +03:00
|
|
|
* 1. The L2ARC vdev was previously cached.
|
|
|
|
* 2. This buffer still has L2ARC metadata.
|
|
|
|
* 3. This buffer isn't currently writing to the L2ARC.
|
|
|
|
* 4. The L2ARC entry wasn't evicted, which may
|
|
|
|
* also have invalidated the vdev.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr) &&
|
Read prefetched buffers from L2ARC
Prefetched buffers are currently read from L2ARC if, and only if,
l2arc_noprefetch is set to non-default value of 0. This means that
a streaming read which can be served from L2ARC will instead engage
the main pool.
For example, consider what happens when a file is sequentially read:
- application requests contiguous data, engaging the prefetcher;
- ARC buffers are initially marked as prefetched but, as the calling
application consumes data, the prefetch tag is cleared;
- these "normal" buffers become eligible for L2ARC and are copied to it;
- re-reading the same file will *not* engage L2ARC even if it contains
the required buffers;
- main pool has to suffer another sequential read load, which (due to
most NCQ-enabled HDDs preferring sequential loads) can dramatically
increase latency for uncached random reads.
In other words, current behavior is to write data to L2ARC (wearing it)
without using the very same cache when reading back the same data. This
was probably useful many years ago to preserve L2ARC read bandwidth but,
with current SSD speed/size/price, it is vastly sub-optimal.
Setting l2arc_noprefetch=1, while enabling L2ARC to serve these reads,
means that even prefetched but unused buffers will be copied into L2ARC,
further increasing wear and load for potentially not-useful data.
This patch enable prefetched buffer to be read from L2ARC even when
l2arc_noprefetch=1 (default), increasing sequential read speed and
reducing load on the main pool without polluting L2ARC with not-useful
(ie: unused) prefetched data. Moreover, it clear users confusion about
L2ARC size increasing but not serving any IO when doing sequential
reads.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #15451
2023-10-26 19:40:21 +03:00
|
|
|
!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
l2arc_read_callback_t *cb;
|
2017-06-27 03:32:43 +03:00
|
|
|
abd_t *abd;
|
|
|
|
uint64_t asize;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_hits);
|
2021-08-17 18:50:31 +03:00
|
|
|
hdr->b_l2hdr.b_hits++;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2016-06-02 07:04:53 +03:00
|
|
|
cb->l2rcb_hdr = hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
cb->l2rcb_bp = *bp;
|
|
|
|
cb->l2rcb_zb = *zb;
|
2008-12-03 23:09:06 +03:00
|
|
|
cb->l2rcb_flags = zio_flags;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Fix L2ARC reads when compressed ARC disabled
When reading compressed blocks from the L2ARC, with
compressed ARC disabled, arc_hdr_size() returns
LSIZE rather than PSIZE, but the actual read is PSIZE.
This causes l2arc_read_done() to compare the checksum
against the wrong size, resulting in checksum failure.
This manifests as an increase in the kstat l2_cksum_bad
and the read being retried from the main pool, making the
L2ARC ineffective.
Add new L2ARC tests with Compressed ARC enabled/disabled
Blocks are handled differently depending on the state of the
zfs_compressed_arc_enabled tunable.
If a block is compressed on-disk, and compressed_arc is enabled:
- the block is read from disk
- It is NOT decompressed
- It is added to the ARC in its compressed form
- l2arc_write_buffers() may write it to the L2ARC (as is)
- l2arc_read_done() compares the checksum to the BP (compressed)
However, if compressed_arc is disabled:
- the block is read from disk
- It is decompressed
- It is added to the ARC (uncompressed)
- l2arc_write_buffers() will use l2arc_apply_transforms() to
recompress the block, before writing it to the L2ARC
- l2arc_read_done() compares the checksum to the BP (compressed)
- l2arc_read_done() will use l2arc_untransform() to uncompress it
This test writes out a test file to a pool consisting of one disk
and one cache device, then randomly reads from it. Since the arc_max
in the tests is low, this will feed the L2ARC, and result in reads
from the L2ARC.
We compare the value of the kstat l2_cksum_bad before and after
to determine if any blocks failed to survive the trip through the
L2ARC.
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #10693
2020-08-14 09:31:20 +03:00
|
|
|
/*
|
|
|
|
* When Compressed ARC is disabled, but the
|
|
|
|
* L2ARC block is compressed, arc_hdr_size()
|
|
|
|
* will have returned LSIZE rather than PSIZE.
|
|
|
|
*/
|
|
|
|
if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
!HDR_COMPRESSION_ENABLED(hdr) &&
|
|
|
|
HDR_GET_PSIZE(hdr) != 0) {
|
|
|
|
size = HDR_GET_PSIZE(hdr);
|
|
|
|
}
|
|
|
|
|
2017-06-27 03:32:43 +03:00
|
|
|
asize = vdev_psize_to_asize(vd, size);
|
|
|
|
if (asize != size) {
|
|
|
|
abd = abd_alloc_for_io(asize,
|
|
|
|
HDR_ISTYPE_METADATA(hdr));
|
|
|
|
cb->l2rcb_abd = abd;
|
|
|
|
} else {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
abd = hdr_abd;
|
2017-06-27 03:32:43 +03:00
|
|
|
}
|
|
|
|
|
2013-02-11 10:21:05 +04:00
|
|
|
ASSERT(addr >= VDEV_LABEL_START_SIZE &&
|
2017-06-27 03:32:43 +03:00
|
|
|
addr + asize <= vd->vdev_psize -
|
2013-02-11 10:21:05 +04:00
|
|
|
VDEV_LABEL_END_SIZE);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* l2arc read. The SCL_L2ARC lock will be
|
|
|
|
* released by l2arc_read_done().
|
2013-08-02 00:02:10 +04:00
|
|
|
* Issue a null zio if the underlying buffer
|
|
|
|
* was squashed to zero size by compression.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(arc_hdr_get_compress(hdr), !=,
|
2016-06-02 07:04:53 +03:00
|
|
|
ZIO_COMPRESS_EMPTY);
|
|
|
|
rzio = zio_read_phys(pio, vd, addr,
|
2017-06-27 03:32:43 +03:00
|
|
|
asize, abd,
|
2016-06-02 07:04:53 +03:00
|
|
|
ZIO_CHECKSUM_OFF,
|
|
|
|
l2arc_read_done, cb, priority,
|
2023-06-09 22:40:55 +03:00
|
|
|
zio_flags | ZIO_FLAG_CANFAIL |
|
2016-06-02 07:04:53 +03:00
|
|
|
ZIO_FLAG_DONT_PROPAGATE |
|
|
|
|
ZIO_FLAG_DONT_RETRY, B_FALSE);
|
2017-12-21 20:13:06 +03:00
|
|
|
acb->acb_zio_head = rzio;
|
|
|
|
|
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
|
|
|
|
zio_t *, rzio);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_read_bytes,
|
|
|
|
HDR_GET_PSIZE(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_NOWAIT) {
|
2008-12-03 23:09:06 +03:00
|
|
|
zio_nowait(rzio);
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(*arc_flags & ARC_FLAG_WAIT);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (zio_wait(rzio) == 0)
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/* l2arc read error; goto zio_read() */
|
2017-12-21 20:13:06 +03:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_enter(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
DTRACE_PROBE1(l2arc__miss,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_misses);
|
|
|
|
if (HDR_L2_WRITING(hdr))
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rw_clash);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
} else {
|
|
|
|
if (vd != NULL)
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
2020-10-20 21:39:52 +03:00
|
|
|
|
2019-02-04 20:33:30 +03:00
|
|
|
/*
|
2020-10-20 21:39:52 +03:00
|
|
|
* Only a spa with l2 should contribute to l2
|
|
|
|
* miss stats. (Including the case of having a
|
|
|
|
* faulted cache device - that's also a miss.)
|
2019-02-04 20:33:30 +03:00
|
|
|
*/
|
2020-10-20 21:39:52 +03:00
|
|
|
if (spa_has_l2) {
|
|
|
|
/*
|
|
|
|
* Skip ARC stat bump for block pointers with
|
|
|
|
* embedded data. The data are read from the
|
|
|
|
* blkptr itself via
|
|
|
|
* decode_embedded_bp_compressed().
|
|
|
|
*/
|
|
|
|
if (!embedded_bp) {
|
|
|
|
DTRACE_PROBE1(l2arc__miss,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_misses);
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
rzio = zio_read(pio, spa, bp, hdr_abd, size,
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_read_done, hdr, priority, zio_flags, zb);
|
2017-12-21 20:13:06 +03:00
|
|
|
acb->acb_zio_head = rzio;
|
|
|
|
|
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_WAIT) {
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
rc = zio_wait(rzio);
|
|
|
|
goto out;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_nowait(rzio);
|
|
|
|
}
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
|
|
|
|
out:
|
2018-03-24 07:35:19 +03:00
|
|
|
/* embedded bps don't actually go to disk */
|
2019-02-04 20:33:30 +03:00
|
|
|
if (!embedded_bp)
|
2018-03-24 07:35:19 +03:00
|
|
|
spa_read_history_add(spa, zb, *arc_flags);
|
2020-03-12 20:24:43 +03:00
|
|
|
spl_fstrans_unmark(cookie);
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
return (rc);
|
2023-02-14 00:21:53 +03:00
|
|
|
|
|
|
|
done:
|
|
|
|
if (done)
|
|
|
|
done(NULL, zb, bp, buf, private);
|
|
|
|
if (pio && rc != 0) {
|
|
|
|
zio_t *zio = zio_null(pio, spa, NULL, NULL, NULL, zio_flags);
|
|
|
|
zio->io_error = rc;
|
|
|
|
zio_nowait(zio);
|
|
|
|
}
|
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_prune_t *
|
|
|
|
arc_add_prune_callback(arc_prune_func_t *func, void *private)
|
|
|
|
{
|
|
|
|
arc_prune_t *p;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
p = kmem_alloc(sizeof (*p), KM_SLEEP);
|
2011-12-23 00:20:43 +04:00
|
|
|
p->p_pfunc = func;
|
|
|
|
p->p_private = private;
|
|
|
|
list_link_init(&p->p_node);
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_create(&p->p_refcnt);
|
2011-12-23 00:20:43 +04:00
|
|
|
|
|
|
|
mutex_enter(&arc_prune_mtx);
|
2018-09-26 20:29:26 +03:00
|
|
|
zfs_refcount_add(&p->p_refcnt, &arc_prune_list);
|
2011-12-23 00:20:43 +04:00
|
|
|
list_insert_head(&arc_prune_list, p);
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
|
|
|
|
return (p);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_remove_prune_callback(arc_prune_t *p)
|
|
|
|
{
|
2016-05-23 21:58:21 +03:00
|
|
|
boolean_t wait = B_FALSE;
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
list_remove(&arc_prune_list, p);
|
2018-10-01 20:42:05 +03:00
|
|
|
if (zfs_refcount_remove(&p->p_refcnt, &arc_prune_list) > 0)
|
2016-05-23 21:58:21 +03:00
|
|
|
wait = B_TRUE;
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_exit(&arc_prune_mtx);
|
2016-05-23 21:58:21 +03:00
|
|
|
|
|
|
|
/* wait for arc_prune_task to finish */
|
|
|
|
if (wait)
|
|
|
|
taskq_wait_outstanding(arc_prune_taskq, 0);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT0(zfs_refcount_count(&p->p_refcnt));
|
|
|
|
zfs_refcount_destroy(&p->p_refcnt);
|
2016-05-23 21:58:21 +03:00
|
|
|
kmem_free(p, sizeof (*p));
|
2011-12-23 00:20:43 +04:00
|
|
|
}
|
|
|
|
|
2023-10-31 02:56:04 +03:00
|
|
|
/*
|
|
|
|
* Helper function for arc_prune_async() it is responsible for safely
|
|
|
|
* handling the execution of a registered arc_prune_func_t.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_prune_task(void *ptr)
|
|
|
|
{
|
|
|
|
arc_prune_t *ap = (arc_prune_t *)ptr;
|
|
|
|
arc_prune_func_t *func = ap->p_pfunc;
|
|
|
|
|
|
|
|
if (func != NULL)
|
|
|
|
func(ap->p_adjust, ap->p_private);
|
|
|
|
|
|
|
|
zfs_refcount_remove(&ap->p_refcnt, func);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Notify registered consumers they must drop holds on a portion of the ARC
|
|
|
|
* buffers they reference. This provides a mechanism to ensure the ARC can
|
|
|
|
* honor the metadata limit and reclaim otherwise pinned ARC buffers.
|
|
|
|
*
|
|
|
|
* This operation is performed asynchronously so it may be safely called
|
|
|
|
* in the context of the arc_reclaim_thread(). A reference is taken here
|
|
|
|
* for each registered arc_prune_t and the arc_prune_task() is responsible
|
|
|
|
* for releasing it once the registered arc_prune_func_t has completed.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_prune_async(uint64_t adjust)
|
|
|
|
{
|
|
|
|
arc_prune_t *ap;
|
|
|
|
|
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
for (ap = list_head(&arc_prune_list); ap != NULL;
|
|
|
|
ap = list_next(&arc_prune_list, ap)) {
|
|
|
|
|
|
|
|
if (zfs_refcount_count(&ap->p_refcnt) >= 2)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
zfs_refcount_add(&ap->p_refcnt, ap->p_pfunc);
|
|
|
|
ap->p_adjust = adjust;
|
|
|
|
if (taskq_dispatch(arc_prune_taskq, arc_prune_task,
|
|
|
|
ap, TQ_SLEEP) == TASKQID_INVALID) {
|
|
|
|
zfs_refcount_remove(&ap->p_refcnt, ap->p_pfunc);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ARCSTAT_BUMP(arcstat_prune);
|
|
|
|
}
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
}
|
|
|
|
|
Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-07 02:46:55 +04:00
|
|
|
/*
|
|
|
|
* Notify the arc that a block was freed, and thus will never be used again.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_freed(spa_t *spa, const blkptr_t *bp)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
uint64_t guid = spa_load_guid(spa);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp));
|
|
|
|
|
|
|
|
hdr = buf_hash_find(guid, bp, &hash_lock);
|
Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-07 02:46:55 +04:00
|
|
|
if (hdr == NULL)
|
|
|
|
return;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* We might be trying to free a block that is still doing I/O
|
2022-12-22 23:10:24 +03:00
|
|
|
* (i.e. prefetch) or has some other reference (i.e. a dedup-ed,
|
|
|
|
* dmu_sync-ed block). A block may also have a reference if it is
|
2016-06-02 07:04:53 +03:00
|
|
|
* part of a dedup-ed, dmu_synced write. The dmu_sync() function would
|
|
|
|
* have written the new block to its final resting place on disk but
|
|
|
|
* without the dedup flag set. This would have left the hdr in the MRU
|
|
|
|
* state and discoverable. When the txg finally syncs it detects that
|
|
|
|
* the block was overridden in open context and issues an override I/O.
|
|
|
|
* Since this is a dedup block, the override I/O will determine if the
|
|
|
|
* block is already in the DDT. If so, then it will replace the io_bp
|
|
|
|
* with the bp from the DDT and allow the I/O to finish. When the I/O
|
|
|
|
* reaches the done callback, dbuf_write_override_done, it will
|
|
|
|
* check to see if the io_bp and io_bp_override are identical.
|
|
|
|
* If they are not, then it indicates that the bp was replaced with
|
|
|
|
* the bp in the DDT and the override bp is freed. This allows
|
|
|
|
* us to arrive here with a reference on a block that is being
|
|
|
|
* freed. So if we have an I/O in progress, or a reference to
|
|
|
|
* this hdr, then we don't destroy the hdr.
|
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
if (!HDR_HAS_L1HDR(hdr) ||
|
|
|
|
zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
|
|
|
|
arc_change_state(arc_anon, hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_destroy(hdr);
|
Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-07 02:46:55 +04:00
|
|
|
mutex_exit(hash_lock);
|
2014-07-15 11:43:18 +04:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-06-11 21:12:34 +04:00
|
|
|
* Release this buffer from the cache, making it an anonymous buffer. This
|
|
|
|
* must be done after a read and prior to modifying the buffer contents.
|
2008-11-20 23:01:55 +03:00
|
|
|
* If the buffer has more than one reference, we must make
|
2008-12-03 23:09:06 +03:00
|
|
|
* a new hdr for the buffer.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
void
|
2022-04-19 21:49:30 +03:00
|
|
|
arc_release(arc_buf_t *buf, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* It would be nice to assert that if its DMU metadata (level >
|
2010-05-29 00:45:14 +04:00
|
|
|
* 0 || it's the dnode file), then it must be syncing context.
|
|
|
|
* But we don't know that information at this level.
|
|
|
|
*/
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
|
|
|
* We don't grab the hash lock prior to this check, because if
|
|
|
|
* the buffer's header is in the arc_anon state, it won't be
|
|
|
|
* linked into the hash table.
|
|
|
|
*/
|
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon) {
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
|
|
|
ASSERT(!HDR_IN_HASH_TABLE(hdr));
|
|
|
|
ASSERT(!HDR_HAS_L2HDR(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2023-10-06 18:56:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);
|
|
|
|
ASSERT(ARC_BUF_LAST(buf));
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);
|
2022-12-22 23:10:24 +03:00
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
hdr->b_l1hdr.b_arc_access = 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the buf is being overridden then it may already
|
|
|
|
* have a hdr that is not empty.
|
|
|
|
*/
|
|
|
|
buf_discard_identity(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_thaw(buf);
|
|
|
|
|
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
kmutex_t *hash_lock = HDR_LOCK(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_enter(hash_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This assignment is only valid as long as the hash_lock is
|
|
|
|
* held, we must be careful not to reference state or the
|
|
|
|
* b_state field after dropping the lock.
|
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
|
|
|
ASSERT3P(state, !=, arc_anon);
|
|
|
|
|
|
|
|
/* this buffer is not on any list */
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
|
|
|
mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
2015-06-16 02:12:19 +03:00
|
|
|
* We have to recheck this conditional again now that
|
|
|
|
* we're holding the l2ad_mtx to prevent a race with
|
|
|
|
* another thread which might be concurrently calling
|
|
|
|
* l2arc_evict(). In that case, l2arc_evict() might have
|
|
|
|
* destroyed the header's L2 portion as we were waiting
|
|
|
|
* to acquire the l2ad_mtx.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2015-06-16 02:12:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr))
|
|
|
|
arc_hdr_l2hdr_destroy(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Do we have more than one buf?
|
|
|
|
*/
|
2023-10-06 18:56:17 +03:00
|
|
|
if (hdr->b_l1hdr.b_buf != buf || !ARC_BUF_LAST(buf)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_hdr_t *nhdr;
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t spa = hdr->b_spa;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
|
|
|
uint64_t lsize = HDR_GET_LSIZE(hdr);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
boolean_t protected = HDR_PROTECTED(hdr);
|
|
|
|
enum zio_compress compress = arc_hdr_get_compress(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3U(hdr->b_type, ==, type);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
VERIFY3S(remove_reference(hdr, tag), >, 0);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT(ARC_BUF_LAST(buf));
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Pull the data off of this hdr and attach it to
|
2016-06-02 07:04:53 +03:00
|
|
|
* a new anonymous hdr. Also find the last buffer
|
|
|
|
* in the hdr's buffer list.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2017-04-12 00:56:54 +03:00
|
|
|
arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(lastbuf, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* If the current arc_buf_t and the hdr are sharing their data
|
2016-07-14 00:17:41 +03:00
|
|
|
* buffer, then we must stop sharing that block.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(lastbuf));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* First, sever the block sharing relationship between
|
2017-04-12 00:56:54 +03:00
|
|
|
* buf and the arc_buf_hdr_t.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
|
|
|
arc_unshare_buf(hdr, buf);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
/*
|
2016-07-22 18:52:49 +03:00
|
|
|
* Now we need to recreate the hdr's b_pabd. Since we
|
2016-07-14 00:17:41 +03:00
|
|
|
* have lastbuf handy, we try to share with it, but if
|
2016-07-22 18:52:49 +03:00
|
|
|
* we can't then we allocate a new b_pabd and copy the
|
2016-07-14 00:17:41 +03:00
|
|
|
* data from buf into it.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
2016-07-14 00:17:41 +03:00
|
|
|
if (arc_can_share(hdr, lastbuf)) {
|
|
|
|
arc_share_buf(hdr, lastbuf);
|
|
|
|
} else {
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_hdr_alloc_abd(hdr, 0);
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_copy_from_buf(hdr->b_l1hdr.b_pabd,
|
|
|
|
buf->b_data, psize);
|
2016-07-11 20:45:52 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3P(lastbuf->b_data, !=, NULL);
|
|
|
|
} else if (HDR_SHARED_DATA(hdr)) {
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* Uncompressed shared buffers are always at the end
|
|
|
|
* of the list. Compressed buffers don't have the
|
|
|
|
* same requirements. This makes it hard to
|
|
|
|
* simply assert that the lastbuf is shared so
|
|
|
|
* we rely on the hdr's compression flags to determine
|
|
|
|
* if we have a compressed, shared buffer.
|
|
|
|
*/
|
|
|
|
ASSERT(arc_buf_is_shared(lastbuf) ||
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(state, !=, arc_l2c_only);
|
2015-06-27 01:14:45 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
(void) zfs_refcount_remove_many(&state->arcs_size[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
if (zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(state, !=, arc_l2c_only);
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(
|
|
|
|
&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2012-12-22 02:57:09 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_cksum_verify(buf);
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_unwatch(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-05-10 20:25:27 +03:00
|
|
|
/* if this is the last uncompressed buf free the checksum */
|
|
|
|
if (!arc_hdr_has_uncompressed_buf(hdr))
|
|
|
|
arc_cksum_free(hdr);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
nhdr = arc_hdr_alloc(spa, psize, lsize, protected,
|
2021-08-17 19:15:54 +03:00
|
|
|
compress, hdr->b_complevel, type);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT0(zfs_refcount_count(&nhdr->b_l1hdr.b_refcnt));
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3U(nhdr->b_type, ==, type);
|
|
|
|
ASSERT(!HDR_SHARED_DATA(nhdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
nhdr->b_l1hdr.b_buf = buf;
|
2018-09-26 20:29:26 +03:00
|
|
|
(void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf->b_hdr = nhdr;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
(void) zfs_refcount_add_many(&arc_anon->arcs_size[type],
|
2018-10-09 00:59:34 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
|
2015-01-13 06:52:19 +03:00
|
|
|
/* protected by hash lock, or hdr is on arc_anon */
|
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_mru_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mru_ghost_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_ghost_hits = 0;
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_anon, hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
mutex_exit(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
buf_discard_identity(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_thaw(buf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
arc_released(arc_buf_t *buf)
|
|
|
|
{
|
2023-01-09 21:45:17 +03:00
|
|
|
return (buf->b_data != NULL &&
|
2014-12-30 06:12:23 +03:00
|
|
|
buf->b_hdr->b_l1hdr.b_state == arc_anon);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
int
|
|
|
|
arc_referenced(arc_buf_t *buf)
|
|
|
|
{
|
2023-01-09 21:45:17 +03:00
|
|
|
return (zfs_refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_write_ready(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *callback = zio->io_private;
|
|
|
|
arc_buf_t *buf = callback->awcb_buf;
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
blkptr_t *bp = zio->io_bp;
|
|
|
|
uint64_t psize = BP_IS_HOLE(bp) ? 0 : BP_GET_PSIZE(bp);
|
2016-07-22 18:52:49 +03:00
|
|
|
fstrans_cookie_t cookie = spl_fstrans_mark();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(!zfs_refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));
|
2023-10-06 18:56:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* If we're reexecuting this zio because the pool suspended, then
|
|
|
|
* cleanup any state that was previously set the first time the
|
2016-07-11 20:45:52 +03:00
|
|
|
* callback was invoked.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (zio->io_flags & ZIO_FLAG_REEXECUTED) {
|
|
|
|
arc_cksum_free(hdr);
|
|
|
|
arc_buf_unwatch(buf);
|
2016-07-22 18:52:49 +03:00
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL) {
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_unshare_buf(hdr, buf);
|
|
|
|
} else {
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_abd(hdr, B_FALSE);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
if (HDR_HAS_RABD(hdr))
|
|
|
|
arc_hdr_free_abd(hdr, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(!HDR_HAS_RABD(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!HDR_SHARED_DATA(hdr));
|
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
|
|
|
|
|
|
|
callback->awcb_ready(zio, buf, callback->awcb_private);
|
|
|
|
|
2022-12-22 23:10:24 +03:00
|
|
|
if (HDR_IO_IN_PROGRESS(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);
|
2022-12-22 23:10:24 +03:00
|
|
|
} else {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
|
|
|
add_reference(hdr, hdr); /* For IO_IN_PROGRESS. */
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (BP_IS_PROTECTED(bp)) {
|
|
|
|
/* ZIL blocks are written through zio_rewrite */
|
|
|
|
ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);
|
|
|
|
|
2017-11-08 22:12:59 +03:00
|
|
|
if (BP_SHOULD_BYTESWAP(bp)) {
|
|
|
|
if (BP_GET_LEVEL(bp) > 0) {
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
|
|
|
|
} else {
|
|
|
|
hdr->b_l1hdr.b_byteswap =
|
|
|
|
DMU_OT_BYTESWAP(BP_GET_TYPE(bp));
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
|
|
|
|
}
|
|
|
|
|
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
2023-10-03 18:57:48 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);
|
|
|
|
hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;
|
|
|
|
zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,
|
|
|
|
hdr->b_crypt_hdr.b_iv);
|
|
|
|
zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac);
|
ARC: Drop different size headers for crypto
To reduce memory usage ZFS crypto allocated bigger by 56 bytes ARC
headers only when specific block was encrypted on disk. It was a
nice optimization, except in some cases the code reallocated them
on fly, that invalidated header pointers from the buffers. Since
the buffers use different locking, it created number of races, that
were originally covered (at least partially) by b_evict_lock, used
also to protection evictions. But it has gone as part of #14340.
As result, as was found in #15293, arc_hdr_realloc_crypt() ended
up unprotected and causing use-after-free.
Instead of introducing some even more elaborate locking, this patch
just drops the difference between normal and protected headers. It
cost us additional 56 bytes per header, but with couple patches
saving 24 bytes, the net growth is only 32 bytes with total header
size of 232 bytes on FreeBSD, that IMHO is acceptable price for
simplicity. Additional locking would also end up consuming space,
time or both.
Reviewe-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15293
Closes #15347
2023-10-03 18:57:48 +03:00
|
|
|
} else {
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_PROTECTED);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this block was written for raw encryption but the zio layer
|
|
|
|
* ended up only authenticating it, adjust the buffer flags now.
|
|
|
|
*/
|
|
|
|
if (BP_IS_AUTHENTICATED(bp) && ARC_BUF_ENCRYPTED(buf)) {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
|
|
|
|
if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF)
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
|
2018-02-21 23:28:52 +03:00
|
|
|
} else if (BP_IS_HOLE(bp) && ARC_BUF_ENCRYPTED(buf)) {
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* this must be done after the buffer flags are adjusted */
|
|
|
|
arc_cksum_compute(buf);
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
enum zio_compress compress;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
compress = ZIO_COMPRESS_OFF;
|
|
|
|
} else {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
|
|
|
|
compress = BP_GET_COMPRESS(bp);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
HDR_SET_PSIZE(hdr, psize);
|
|
|
|
arc_hdr_set_compress(hdr, compress);
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
hdr->b_complevel = zio->io_prop.zp_complevel;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
if (zio->io_error != 0 || psize == 0)
|
|
|
|
goto out;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* Fill the hdr with data. If the buffer is encrypted we have no choice
|
|
|
|
* but to copy the data into b_radb. If the hdr is compressed, the data
|
|
|
|
* we want is available from the zio, otherwise we can take it from
|
|
|
|
* the buf.
|
2016-07-22 18:52:49 +03:00
|
|
|
*
|
|
|
|
* We might be able to share the buf's data with the hdr here. However,
|
|
|
|
* doing so would cause the ARC to be full of linear ABDs if we write a
|
|
|
|
* lot of shareable data. As a compromise, we check whether scattered
|
|
|
|
* ABDs are allowed, and assume that if they are then the user wants
|
|
|
|
* the ARC to be primarily filled with them regardless of the data being
|
|
|
|
* written. Therefore, if they're allowed then we allocate one and copy
|
|
|
|
* the data into it; otherwise, we share the data directly if we can.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ARC_BUF_ENCRYPTED(buf)) {
|
2017-09-12 23:15:11 +03:00
|
|
|
ASSERT3U(psize, >, 0);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(ARC_BUF_COMPRESSED(buf));
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA |
|
2021-08-17 19:15:54 +03:00
|
|
|
ARC_HDR_USE_RESERVE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
} else if (!(HDR_UNCACHED(hdr) ||
|
|
|
|
abd_size_alloc_linear(arc_buf_size(buf))) ||
|
2021-07-28 02:05:47 +03:00
|
|
|
!arc_can_share(hdr, buf)) {
|
2016-07-22 18:52:49 +03:00
|
|
|
/*
|
|
|
|
* Ideally, we would always copy the io_abd into b_pabd, but the
|
|
|
|
* user may have disabled compressed ARC, thus we must check the
|
|
|
|
* hdr's compression setting rather than the io_bp's.
|
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (BP_IS_ENCRYPTED(bp)) {
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3U(psize, >, 0);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA |
|
|
|
|
ARC_HDR_USE_RESERVE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);
|
|
|
|
} else if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
!ARC_BUF_COMPRESSED(buf)) {
|
|
|
|
ASSERT3U(psize, >, 0);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE);
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr));
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE);
|
2016-07-22 18:52:49 +03:00
|
|
|
abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data,
|
|
|
|
arc_buf_size(buf));
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd));
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));
|
2023-10-06 18:56:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);
|
|
|
|
ASSERT(ARC_BUF_LAST(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
arc_share_buf(hdr, buf);
|
|
|
|
}
|
2016-07-22 18:52:49 +03:00
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
out:
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_verify(hdr, bp);
|
2016-07-22 18:52:49 +03:00
|
|
|
spl_fstrans_unmark(cookie);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-05-15 18:02:28 +03:00
|
|
|
static void
|
|
|
|
arc_write_children_ready(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *callback = zio->io_private;
|
|
|
|
arc_buf_t *buf = callback->awcb_buf;
|
|
|
|
|
|
|
|
callback->awcb_children_ready(zio, buf, callback->awcb_private);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_write_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *callback = zio->io_private;
|
|
|
|
arc_buf_t *buf = callback->awcb_buf;
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (zio->io_error == 0) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_verify(hdr, zio->io_bp);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
|
2013-12-09 22:37:51 +04:00
|
|
|
buf_discard_identity(hdr);
|
|
|
|
} else {
|
|
|
|
hdr->b_dva = *BP_IDENTITY(zio->io_bp);
|
|
|
|
hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2014-06-06 01:19:08 +04:00
|
|
|
* If the block to be written was all-zero or compressed enough to be
|
|
|
|
* embedded in the BP, no write was performed so there will be no
|
|
|
|
* dva/birth/checksum. The buffer must therefore remain anonymous
|
|
|
|
* (and uncached).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!HDR_EMPTY(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_hdr_t *exists;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT3U(zio->io_error, ==, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_cksum_verify(buf);
|
|
|
|
|
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
2014-12-30 06:12:23 +03:00
|
|
|
if (exists != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This can only happen if we overwrite for
|
|
|
|
* sync-to-convergence, because we remove
|
|
|
|
* buffers from the hash table when we arc_free().
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
|
|
|
|
if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
|
|
|
|
panic("bad overwrite, hdr=%p exists=%p",
|
|
|
|
(void *)hdr, (void *)exists);
|
2018-10-01 20:42:05 +03:00
|
|
|
ASSERT(zfs_refcount_is_zero(
|
2014-12-30 06:12:23 +03:00
|
|
|
&exists->b_l1hdr.b_refcnt));
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_anon, exists);
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_hdr_destroy(exists);
|
2019-03-16 00:17:38 +03:00
|
|
|
mutex_exit(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
|
|
|
ASSERT3P(exists, ==, NULL);
|
2013-05-10 23:47:54 +04:00
|
|
|
} else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
|
|
|
|
/* nopwrite */
|
|
|
|
ASSERT(zio->io_prop.zp_nopwrite);
|
|
|
|
if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
|
|
|
|
panic("bad nopwrite, hdr=%p exists=%p",
|
|
|
|
(void *)hdr, (void *)exists);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
/* Dedup */
|
2023-10-06 18:56:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);
|
|
|
|
ASSERT(ARC_BUF_LAST(hdr->b_l1hdr.b_buf));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_state == arc_anon);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(BP_GET_DEDUP(zio->io_bp));
|
|
|
|
ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
VERIFY3S(remove_reference(hdr, hdr), >, 0);
|
2008-12-03 23:09:06 +03:00
|
|
|
/* if it's not anon, we are doing a scrub */
|
2014-12-30 06:12:23 +03:00
|
|
|
if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_access(hdr, 0, B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
VERIFY3S(remove_reference(hdr, hdr), >, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
callback->awcb_done(zio, buf, callback->awcb_private);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-01-20 22:24:37 +03:00
|
|
|
abd_free(zio->io_abd);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(callback, sizeof (arc_write_callback_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
zio_t *
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
blkptr_t *bp, arc_buf_t *buf, boolean_t uncached, boolean_t l2arc,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
const zio_prop_t *zp, arc_write_done_func_t *ready,
|
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smoothen the write throttling within a txg. They allow to
account completion of individual physical writes within a logical
one, improving cases when some of physical writes complete much
sooner than others, gradually opening the write throttle.
Few years after that ZFS got allocation throttling, working on a
level of logical writes and limiting number of writes queued to
vdevs at any point, and so limiting latency distribution between
the physical writes and especially writes of multiple copies.
The addition of scheduling deadline I proposed in #14925 should
further reduce the latency distribution. Grown memory sizes over
the past 10 years should also reduce importance of the smoothing.
While the use of physdone callback may still in theory provide
some smoother throttling, there are cases where we simply can not
afford it. Since dirty data accounting is protected by pool-wide
lock, in case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show radical reduction of the lock spinning
time on workloads when smaller blocks are written to RAIDZ pools,
when each of the disks receives 8-16KB chunks, but the total rate
reaching 100K+ blocks per second. Same time attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, remove also io_child_count/io_parent_count counters.
They are used only for couple assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
2023-06-15 20:49:03 +03:00
|
|
|
arc_write_done_func_t *children_ready, arc_write_done_func_t *done,
|
|
|
|
void *private, zio_priority_t priority, int zio_flags,
|
|
|
|
const zbookmark_phys_t *zb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
arc_write_callback_t *callback;
|
2008-12-03 23:09:06 +03:00
|
|
|
zio_t *zio;
|
2017-03-23 19:07:27 +03:00
|
|
|
zio_prop_t localprop = *zp;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(ready, !=, NULL);
|
|
|
|
ASSERT3P(done, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IO_ERROR(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
2023-10-06 18:56:17 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
if (uncached)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);
|
|
|
|
else if (l2arc)
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
|
2017-03-23 19:07:27 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ARC_BUF_ENCRYPTED(buf)) {
|
|
|
|
ASSERT(ARC_BUF_COMPRESSED(buf));
|
|
|
|
localprop.zp_encrypt = B_TRUE;
|
|
|
|
localprop.zp_compress = HDR_GET_COMPRESS(hdr);
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
localprop.zp_complevel = hdr->b_complevel;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
localprop.zp_byteorder =
|
|
|
|
(hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?
|
|
|
|
ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(localprop.zp_salt, hdr->b_crypt_hdr.b_salt,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ZIO_DATA_SALT_LEN);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(localprop.zp_iv, hdr->b_crypt_hdr.b_iv,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ZIO_DATA_IV_LEN);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(localprop.zp_mac, hdr->b_crypt_hdr.b_mac,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ZIO_DATA_MAC_LEN);
|
|
|
|
if (DMU_OT_IS_ENCRYPTED(localprop.zp_type)) {
|
|
|
|
localprop.zp_nopwrite = B_FALSE;
|
|
|
|
localprop.zp_copies =
|
|
|
|
MIN(localprop.zp_copies, SPA_DVAS_PER_BP - 1);
|
|
|
|
}
|
2016-07-11 20:45:52 +03:00
|
|
|
zio_flags |= ZIO_FLAG_RAW;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
} else if (ARC_BUF_COMPRESSED(buf)) {
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf));
|
|
|
|
localprop.zp_compress = HDR_GET_COMPRESS(hdr);
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
localprop.zp_complevel = hdr->b_complevel;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
zio_flags |= ZIO_FLAG_RAW_COMPRESS;
|
2016-07-11 20:45:52 +03:00
|
|
|
}
|
2014-11-21 03:09:39 +03:00
|
|
|
callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
callback->awcb_ready = ready;
|
2016-05-15 18:02:28 +03:00
|
|
|
callback->awcb_children_ready = children_ready;
|
2008-11-20 23:01:55 +03:00
|
|
|
callback->awcb_done = done;
|
|
|
|
callback->awcb_private = private;
|
|
|
|
callback->awcb_buf = buf;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
2016-07-22 18:52:49 +03:00
|
|
|
* The hdr's b_pabd is now stale, free it now. A new data block
|
2016-06-02 07:04:53 +03:00
|
|
|
* will be allocated when the zio pipeline calls arc_write_ready().
|
|
|
|
*/
|
2016-07-22 18:52:49 +03:00
|
|
|
if (hdr->b_l1hdr.b_pabd != NULL) {
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* If the buf is currently sharing the data block with
|
|
|
|
* the hdr then we need to break that relationship here.
|
|
|
|
* The hdr will remain with a NULL data pointer and the
|
|
|
|
* buf will take sole ownership of the block.
|
|
|
|
*/
|
2023-10-20 22:38:37 +03:00
|
|
|
if (ARC_BUF_SHARED(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_unshare_buf(hdr, buf);
|
|
|
|
} else {
|
2023-10-20 22:38:37 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_free_abd(hdr, B_FALSE);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
VERIFY3P(buf->b_data, !=, NULL);
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
if (HDR_HAS_RABD(hdr))
|
|
|
|
arc_hdr_free_abd(hdr, B_TRUE);
|
|
|
|
|
2017-11-09 00:32:15 +03:00
|
|
|
if (!(zio_flags & ZIO_FLAG_RAW))
|
|
|
|
arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
zio = zio_write(pio, spa, txg, bp,
|
|
|
|
abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
|
2017-03-23 19:07:27 +03:00
|
|
|
HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
|
2016-05-15 18:02:28 +03:00
|
|
|
(children_ready != NULL) ? arc_write_children_ready : NULL,
|
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smoothen the write throttling within a txg. They allow to
account completion of individual physical writes within a logical
one, improving cases when some of physical writes complete much
sooner than others, gradually opening the write throttle.
Few years after that ZFS got allocation throttling, working on a
level of logical writes and limiting number of writes queued to
vdevs at any point, and so limiting latency distribution between
the physical writes and especially writes of multiple copies.
The addition of scheduling deadline I proposed in #14925 should
further reduce the latency distribution. Grown memory sizes over
the past 10 years should also reduce importance of the smoothing.
While the use of physdone callback may still in theory provide
some smoother throttling, there are cases where we simply can not
afford it. Since dirty data accounting is protected by pool-wide
lock, in case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show radical reduction of the lock spinning
time on workloads when smaller blocks are written to RAIDZ pools,
when each of the disks receives 8-16KB chunks, but the total rate
reaching 100K+ blocks per second. Same time attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, remove also io_child_count/io_parent_count counters.
They are used only for couple assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
2023-06-15 20:49:03 +03:00
|
|
|
arc_write_done, callback, priority, zio_flags, zb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (zio);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_tempreserve_clear(uint64_t reserve)
|
|
|
|
{
|
|
|
|
atomic_add_64(&arc_tempreserve, -reserve);
|
|
|
|
ASSERT((int64_t)arc_tempreserve >= 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
2017-09-27 04:45:19 +03:00
|
|
|
arc_tempreserve_space(spa_t *spa, uint64_t reserve, uint64_t txg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint64_t anon_size;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-01-22 16:37:37 +03:00
|
|
|
if (!arc_no_grow &&
|
|
|
|
reserve > arc_c/4 &&
|
|
|
|
reserve * 4 > (2ULL << SPA_MAXBLOCKSHIFT))
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_c = MIN(arc_c_max, reserve * 4);
|
2014-04-29 00:56:47 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Throttle when the calculated memory footprint for the TXG
|
|
|
|
* exceeds the target ARC size.
|
|
|
|
*/
|
2012-01-20 22:58:57 +04:00
|
|
|
if (reserve > arc_c) {
|
|
|
|
DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);
|
2014-04-29 00:56:47 +04:00
|
|
|
return (SET_ERROR(ERESTART));
|
2012-01-20 22:58:57 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Don't count loaned bufs as in flight dirty data to prevent long
|
|
|
|
* network delays from blocking transactions that are ready to be
|
|
|
|
* assigned to a txg.
|
|
|
|
*/
|
2017-04-12 00:56:54 +03:00
|
|
|
|
|
|
|
/* assert that it has not wrapped around */
|
|
|
|
ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
|
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
anon_size = MAX((int64_t)
|
|
|
|
(zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]) +
|
|
|
|
zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]) -
|
2015-06-27 01:14:45 +03:00
|
|
|
arc_loaned_bytes), 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Writes will, almost always, require additional memory allocations
|
2013-06-11 21:12:34 +04:00
|
|
|
* in order to compress/encrypt/etc the data. We therefore need to
|
2008-11-20 23:01:55 +03:00
|
|
|
* make sure that there is sufficient available memory for this.
|
|
|
|
*/
|
2017-09-27 04:45:19 +03:00
|
|
|
error = arc_memory_throttle(spa, reserve, txg);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
if (error != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Throttle writes when the amount of dirty data in the cache
|
|
|
|
* gets too large. We try to keep the cache less than half full
|
|
|
|
* of dirty blocks so that our sync times don't grow too large.
|
2017-09-27 04:45:19 +03:00
|
|
|
*
|
|
|
|
* In the case of one pool being built on another pool, we want
|
|
|
|
* to make sure we don't end up throttling the lower (backing)
|
|
|
|
* pool when the upper pool is the majority contributor to dirty
|
|
|
|
* data. To insure we make forward progress during throttling, we
|
|
|
|
* also check the current pool's net dirty data and only throttle
|
|
|
|
* if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
|
|
|
|
* data in the cache.
|
|
|
|
*
|
2008-11-20 23:01:55 +03:00
|
|
|
* Note: if two requests come in concurrently, we might let them
|
|
|
|
* both succeed, when one of them should fail. Not a huge deal.
|
|
|
|
*/
|
2017-09-27 04:45:19 +03:00
|
|
|
uint64_t total_dirty = reserve + arc_tempreserve + anon_size;
|
|
|
|
uint64_t spa_dirty_anon = spa_dirty_data(spa);
|
2020-11-10 21:39:26 +03:00
|
|
|
uint64_t rarc_c = arc_warm ? arc_c : arc_c_max;
|
|
|
|
if (total_dirty > rarc_c * zfs_arc_dirty_limit_percent / 100 &&
|
|
|
|
anon_size > rarc_c * zfs_arc_anon_limit_percent / 100 &&
|
2017-09-27 04:45:19 +03:00
|
|
|
spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) {
|
2018-03-22 01:37:32 +03:00
|
|
|
#ifdef ZFS_DEBUG
|
2018-10-01 20:42:05 +03:00
|
|
|
uint64_t meta_esize = zfs_refcount_count(
|
|
|
|
&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t data_esize =
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
|
2008-11-20 23:01:55 +03:00
|
|
|
dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
|
2020-11-10 21:39:26 +03:00
|
|
|
"anon_data=%lluK tempreserve=%lluK rarc_c=%lluK\n",
|
2021-06-23 07:53:45 +03:00
|
|
|
(u_longlong_t)arc_tempreserve >> 10,
|
|
|
|
(u_longlong_t)meta_esize >> 10,
|
|
|
|
(u_longlong_t)data_esize >> 10,
|
|
|
|
(u_longlong_t)reserve >> 10,
|
|
|
|
(u_longlong_t)rarc_c >> 10);
|
2018-03-22 01:37:32 +03:00
|
|
|
#endif
|
2012-01-20 22:58:57 +04:00
|
|
|
DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ERESTART));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
atomic_add_64(&arc_tempreserve, reserve);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2012-01-31 01:28:40 +04:00
|
|
|
static void
|
|
|
|
arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
kstat_named_t *data, kstat_named_t *metadata,
|
2012-01-31 01:28:40 +04:00
|
|
|
kstat_named_t *evict_data, kstat_named_t *evict_metadata)
|
|
|
|
{
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
data->value.ui64 =
|
|
|
|
zfs_refcount_count(&state->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
metadata->value.ui64 =
|
|
|
|
zfs_refcount_count(&state->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
size->value.ui64 = data->value.ui64 + metadata->value.ui64;
|
2016-06-02 07:04:53 +03:00
|
|
|
evict_data->value.ui64 =
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
|
2016-06-02 07:04:53 +03:00
|
|
|
evict_metadata->value.ui64 =
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
|
2012-01-31 01:28:40 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
arc_kstat_update(kstat_t *ksp, int rw)
|
|
|
|
{
|
|
|
|
arc_stats_t *as = ksp->ks_data;
|
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
if (rw == KSTAT_WRITE)
|
2017-08-03 07:16:12 +03:00
|
|
|
return (SET_ERROR(EACCES));
|
2021-06-17 03:19:34 +03:00
|
|
|
|
|
|
|
as->arcstat_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_iohits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_misses.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_misses);
|
|
|
|
as->arcstat_demand_data_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_data_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_demand_data_iohits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_data_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_demand_data_misses.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_data_misses);
|
|
|
|
as->arcstat_demand_metadata_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_metadata_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_demand_metadata_iohits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_metadata_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_demand_metadata_misses.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_metadata_misses);
|
|
|
|
as->arcstat_prefetch_data_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prefetch_data_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_prefetch_data_iohits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prefetch_data_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_prefetch_data_misses.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prefetch_data_misses);
|
|
|
|
as->arcstat_prefetch_metadata_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prefetch_metadata_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_prefetch_metadata_iohits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prefetch_metadata_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_prefetch_metadata_misses.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prefetch_metadata_misses);
|
|
|
|
as->arcstat_mru_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_mru_hits);
|
|
|
|
as->arcstat_mru_ghost_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_mru_ghost_hits);
|
|
|
|
as->arcstat_mfu_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_mfu_hits);
|
|
|
|
as->arcstat_mfu_ghost_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_mfu_ghost_hits);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
as->arcstat_uncached_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_uncached_hits);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_deleted.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_deleted);
|
|
|
|
as->arcstat_mutex_miss.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_mutex_miss);
|
|
|
|
as->arcstat_access_skip.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_access_skip);
|
|
|
|
as->arcstat_evict_skip.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_skip);
|
|
|
|
as->arcstat_evict_not_enough.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_not_enough);
|
|
|
|
as->arcstat_evict_l2_cached.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_l2_cached);
|
|
|
|
as->arcstat_evict_l2_eligible.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_l2_eligible);
|
|
|
|
as->arcstat_evict_l2_eligible_mfu.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mfu);
|
|
|
|
as->arcstat_evict_l2_eligible_mru.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mru);
|
|
|
|
as->arcstat_evict_l2_ineligible.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_l2_ineligible);
|
|
|
|
as->arcstat_evict_l2_skip.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_evict_l2_skip);
|
|
|
|
as->arcstat_hash_collisions.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_hash_collisions);
|
|
|
|
as->arcstat_hash_chains.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_hash_chains);
|
|
|
|
as->arcstat_size.value.ui64 =
|
|
|
|
aggsum_value(&arc_sums.arcstat_size);
|
|
|
|
as->arcstat_compressed_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_compressed_size);
|
|
|
|
as->arcstat_uncompressed_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_uncompressed_size);
|
|
|
|
as->arcstat_overhead_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_overhead_size);
|
|
|
|
as->arcstat_hdr_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_hdr_size);
|
|
|
|
as->arcstat_data_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_data_size);
|
|
|
|
as->arcstat_metadata_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_metadata_size);
|
|
|
|
as->arcstat_dbuf_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_dbuf_size);
|
2020-08-20 20:55:02 +03:00
|
|
|
#if defined(COMPAT_FREEBSD11)
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_other_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_bonus_size) +
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_value(&arc_sums.arcstat_dnode_size) +
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_value(&arc_sums.arcstat_dbuf_size);
|
2020-08-20 20:55:02 +03:00
|
|
|
#endif
|
2017-05-25 21:32:40 +03:00
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
arc_kstat_update_state(arc_anon,
|
|
|
|
&as->arcstat_anon_size,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&as->arcstat_anon_data,
|
|
|
|
&as->arcstat_anon_metadata,
|
2021-06-17 03:19:34 +03:00
|
|
|
&as->arcstat_anon_evictable_data,
|
|
|
|
&as->arcstat_anon_evictable_metadata);
|
|
|
|
arc_kstat_update_state(arc_mru,
|
|
|
|
&as->arcstat_mru_size,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&as->arcstat_mru_data,
|
|
|
|
&as->arcstat_mru_metadata,
|
2021-06-17 03:19:34 +03:00
|
|
|
&as->arcstat_mru_evictable_data,
|
|
|
|
&as->arcstat_mru_evictable_metadata);
|
|
|
|
arc_kstat_update_state(arc_mru_ghost,
|
|
|
|
&as->arcstat_mru_ghost_size,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&as->arcstat_mru_ghost_data,
|
|
|
|
&as->arcstat_mru_ghost_metadata,
|
2021-06-17 03:19:34 +03:00
|
|
|
&as->arcstat_mru_ghost_evictable_data,
|
|
|
|
&as->arcstat_mru_ghost_evictable_metadata);
|
|
|
|
arc_kstat_update_state(arc_mfu,
|
|
|
|
&as->arcstat_mfu_size,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&as->arcstat_mfu_data,
|
|
|
|
&as->arcstat_mfu_metadata,
|
2021-06-17 03:19:34 +03:00
|
|
|
&as->arcstat_mfu_evictable_data,
|
|
|
|
&as->arcstat_mfu_evictable_metadata);
|
|
|
|
arc_kstat_update_state(arc_mfu_ghost,
|
|
|
|
&as->arcstat_mfu_ghost_size,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&as->arcstat_mfu_ghost_data,
|
|
|
|
&as->arcstat_mfu_ghost_metadata,
|
2021-06-17 03:19:34 +03:00
|
|
|
&as->arcstat_mfu_ghost_evictable_data,
|
|
|
|
&as->arcstat_mfu_ghost_evictable_metadata);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
arc_kstat_update_state(arc_uncached,
|
|
|
|
&as->arcstat_uncached_size,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
&as->arcstat_uncached_data,
|
|
|
|
&as->arcstat_uncached_metadata,
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
&as->arcstat_uncached_evictable_data,
|
|
|
|
&as->arcstat_uncached_evictable_metadata);
|
2021-06-17 03:19:34 +03:00
|
|
|
|
|
|
|
as->arcstat_dnode_size.value.ui64 =
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_value(&arc_sums.arcstat_dnode_size);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_bonus_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_bonus_size);
|
|
|
|
as->arcstat_l2_hits.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_hits);
|
|
|
|
as->arcstat_l2_misses.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_misses);
|
|
|
|
as->arcstat_l2_prefetch_asize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_prefetch_asize);
|
|
|
|
as->arcstat_l2_mru_asize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_mru_asize);
|
|
|
|
as->arcstat_l2_mfu_asize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_mfu_asize);
|
|
|
|
as->arcstat_l2_bufc_data_asize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_bufc_data_asize);
|
|
|
|
as->arcstat_l2_bufc_metadata_asize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_bufc_metadata_asize);
|
|
|
|
as->arcstat_l2_feeds.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_feeds);
|
|
|
|
as->arcstat_l2_rw_clash.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rw_clash);
|
|
|
|
as->arcstat_l2_read_bytes.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_read_bytes);
|
|
|
|
as->arcstat_l2_write_bytes.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_write_bytes);
|
|
|
|
as->arcstat_l2_writes_sent.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_writes_sent);
|
|
|
|
as->arcstat_l2_writes_done.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_writes_done);
|
|
|
|
as->arcstat_l2_writes_error.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_writes_error);
|
|
|
|
as->arcstat_l2_writes_lock_retry.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_writes_lock_retry);
|
|
|
|
as->arcstat_l2_evict_lock_retry.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_evict_lock_retry);
|
|
|
|
as->arcstat_l2_evict_reading.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_evict_reading);
|
|
|
|
as->arcstat_l2_evict_l1cached.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_evict_l1cached);
|
|
|
|
as->arcstat_l2_free_on_write.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_free_on_write);
|
|
|
|
as->arcstat_l2_abort_lowmem.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_abort_lowmem);
|
|
|
|
as->arcstat_l2_cksum_bad.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_cksum_bad);
|
|
|
|
as->arcstat_l2_io_error.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_io_error);
|
|
|
|
as->arcstat_l2_lsize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_lsize);
|
|
|
|
as->arcstat_l2_psize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_psize);
|
|
|
|
as->arcstat_l2_hdr_size.value.ui64 =
|
|
|
|
aggsum_value(&arc_sums.arcstat_l2_hdr_size);
|
|
|
|
as->arcstat_l2_log_blk_writes.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_log_blk_writes);
|
|
|
|
as->arcstat_l2_log_blk_asize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_log_blk_asize);
|
|
|
|
as->arcstat_l2_log_blk_count.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_log_blk_count);
|
|
|
|
as->arcstat_l2_rebuild_success.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_success);
|
|
|
|
as->arcstat_l2_rebuild_abort_unsupported.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_unsupported);
|
|
|
|
as->arcstat_l2_rebuild_abort_io_errors.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_io_errors);
|
|
|
|
as->arcstat_l2_rebuild_abort_dh_errors.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);
|
|
|
|
as->arcstat_l2_rebuild_abort_cksum_lb_errors.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);
|
|
|
|
as->arcstat_l2_rebuild_abort_lowmem.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_lowmem);
|
|
|
|
as->arcstat_l2_rebuild_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_size);
|
|
|
|
as->arcstat_l2_rebuild_asize.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_asize);
|
|
|
|
as->arcstat_l2_rebuild_bufs.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs);
|
|
|
|
as->arcstat_l2_rebuild_bufs_precached.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs_precached);
|
|
|
|
as->arcstat_l2_rebuild_log_blks.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_l2_rebuild_log_blks);
|
|
|
|
as->arcstat_memory_throttle_count.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_memory_throttle_count);
|
|
|
|
as->arcstat_memory_direct_count.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_memory_direct_count);
|
|
|
|
as->arcstat_memory_indirect_count.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_memory_indirect_count);
|
|
|
|
|
|
|
|
as->arcstat_memory_all_bytes.value.ui64 =
|
|
|
|
arc_all_memory();
|
|
|
|
as->arcstat_memory_free_bytes.value.ui64 =
|
|
|
|
arc_free_memory();
|
|
|
|
as->arcstat_memory_available_bytes.value.i64 =
|
|
|
|
arc_available_memory();
|
|
|
|
|
|
|
|
as->arcstat_prune.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prune);
|
|
|
|
as->arcstat_meta_used.value.ui64 =
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_value(&arc_sums.arcstat_meta_used);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_async_upgrade_sync.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_async_upgrade_sync);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_predictive_prefetch.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_predictive_prefetch);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_demand_hit_predictive_prefetch.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_hit_predictive_prefetch);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_demand_iohit_predictive_prefetch.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_iohit_predictive_prefetch);
|
|
|
|
as->arcstat_prescient_prefetch.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_prescient_prefetch);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_demand_hit_prescient_prefetch.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_hit_prescient_prefetch);
|
2022-12-22 23:10:24 +03:00
|
|
|
as->arcstat_demand_iohit_prescient_prefetch.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_demand_iohit_prescient_prefetch);
|
2021-06-17 03:19:34 +03:00
|
|
|
as->arcstat_raw_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_raw_size);
|
|
|
|
as->arcstat_cached_only_in_progress.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_cached_only_in_progress);
|
|
|
|
as->arcstat_abd_chunk_waste_size.value.ui64 =
|
|
|
|
wmsum_value(&arc_sums.arcstat_abd_chunk_waste_size);
|
2012-01-31 01:28:40 +04:00
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* This function *must* return indices evenly distributed between all
|
|
|
|
* sublists of the multilist. This is needed due to how the ARC eviction
|
|
|
|
* code is laid out; arc_evict_state() assumes ARC buffers are evenly
|
|
|
|
* distributed between all sublists and uses this assumption when
|
|
|
|
* deciding which sublist to evict from and how much to evict from it.
|
|
|
|
*/
|
2020-06-15 21:30:37 +03:00
|
|
|
static unsigned int
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_state_multilist_index_func(multilist_t *ml, void *obj)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = obj;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We rely on b_dva to generate evenly distributed index
|
|
|
|
* numbers using buf_hash below. So, as an added precaution,
|
|
|
|
* let's make sure we never add empty buffers to the arc lists.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!HDR_EMPTY(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The assumption here, is the hash value for a given
|
|
|
|
* arc_buf_hdr_t will remain constant throughout its lifetime
|
|
|
|
* (i.e. its b_spa, b_dva, and b_birth fields don't change).
|
|
|
|
* Thus, we don't need to store the header's sublist index
|
|
|
|
* on insertion, as this index can be recalculated on removal.
|
|
|
|
*
|
|
|
|
* Also, the low order bits of the hash value are thought to be
|
|
|
|
* distributed evenly. Otherwise, in the case that the multilist
|
|
|
|
* has a power of two number of sublists, each sublists' usage
|
2021-06-29 15:59:14 +03:00
|
|
|
* would not be evenly distributed. In this context full 64bit
|
|
|
|
* division would be a waste of time, so limit it to 32 bits.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2021-06-29 15:59:14 +03:00
|
|
|
return ((unsigned int)buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_get_num_sublists(ml));
|
|
|
|
}
|
|
|
|
|
2021-08-17 18:55:34 +03:00
|
|
|
static unsigned int
|
|
|
|
arc_state_l2c_multilist_index_func(multilist_t *ml, void *obj)
|
|
|
|
{
|
|
|
|
panic("Header %p insert into arc_l2c_only %p", obj, ml);
|
|
|
|
}
|
|
|
|
|
2020-04-10 01:39:48 +03:00
|
|
|
#define WARN_IF_TUNING_IGNORED(tuning, value, do_warn) do { \
|
|
|
|
if ((do_warn) && (tuning) && ((tuning) != (value))) { \
|
|
|
|
cmn_err(CE_WARN, \
|
|
|
|
"ignoring tunable %s (using %llu instead)", \
|
2021-06-05 14:14:12 +03:00
|
|
|
(#tuning), (u_longlong_t)(value)); \
|
2020-04-10 01:39:48 +03:00
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* Called during module initialization and periodically thereafter to
|
2019-10-27 01:22:19 +03:00
|
|
|
* apply reasonable changes to the exposed performance tunings. Can also be
|
|
|
|
* called explicitly by param_set_arc_*() functions when ARC tunables are
|
|
|
|
* updated manually. Non-zero zfs_* values which differ from the currently set
|
|
|
|
* values will be applied.
|
2015-06-26 21:28:18 +03:00
|
|
|
*/
|
2019-10-27 01:22:19 +03:00
|
|
|
void
|
2020-04-10 01:39:48 +03:00
|
|
|
arc_tuning_update(boolean_t verbose)
|
2015-06-26 21:28:18 +03:00
|
|
|
{
|
2017-06-29 19:57:27 +03:00
|
|
|
uint64_t allmem = arc_all_memory();
|
2016-10-31 22:24:54 +03:00
|
|
|
|
2020-04-10 01:39:48 +03:00
|
|
|
/* Valid range: 32M - <arc_c_max> */
|
|
|
|
if ((zfs_arc_min) && (zfs_arc_min != arc_c_min) &&
|
|
|
|
(zfs_arc_min >= 2ULL << SPA_MAXBLOCKSHIFT) &&
|
|
|
|
(zfs_arc_min <= arc_c_max)) {
|
|
|
|
arc_c_min = zfs_arc_min;
|
|
|
|
arc_c = MAX(arc_c, arc_c_min);
|
|
|
|
}
|
|
|
|
WARN_IF_TUNING_IGNORED(zfs_arc_min, arc_c_min, verbose);
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Valid range: 64M - <all physical memory> */
|
|
|
|
if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) &&
|
2021-08-16 18:35:19 +03:00
|
|
|
(zfs_arc_max >= MIN_ARC_MAX) && (zfs_arc_max < allmem) &&
|
2015-06-26 21:28:18 +03:00
|
|
|
(zfs_arc_max > arc_c_min)) {
|
|
|
|
arc_c_max = zfs_arc_max;
|
Set initial arc_c to arc_c_min instead of arc_c_max
For at least 15 years since OpenSolaris arc_c was set by default to
arc_c_max, later decreased under memory pressure. I've noticed that
if arc_c was set high enough to cause memory pressure as considered
by ZFS, setting of arc_no_grow to TRUE in arc_reap_cb_check() makes
no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms)
return. All that time ZFS can continue increasing its effective ARC
size, causing more memory pressure, potentially up to the point when
OS low memory handler activates and reduces arc_c, requesting fast
reclamation of just allocated memory.
The problem seems to be more serious on FreeBSD and I guess Linux,
since neither of them implement/use asynchronous kmem reclamation,
so arc_kmem_reap_soon() can take more time. On older FreeBSD 11 not
supporting multiple memory domains system with lots of RAM can get
completely unresponsive for minutes due to heavy lock congestion
between ARC reclamation and page daemon kmem reclamation threads.
With this change to more conservative arc_c value ARC stops growing
just it time and does not need later reclamation.
Also while there, since now growing arc_c is a more often situation,
use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt()
to reduce lock congestion. It is also getting in sync with code in
arc_get_data_impl().
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10437
2020-06-18 00:27:04 +03:00
|
|
|
arc_c = MIN(arc_c, arc_c_max);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
if (arc_dnode_limit > arc_c_max)
|
|
|
|
arc_dnode_limit = arc_c_max;
|
2015-06-26 21:28:18 +03:00
|
|
|
}
|
2020-04-10 01:39:48 +03:00
|
|
|
WARN_IF_TUNING_IGNORED(zfs_arc_max, arc_c_max, verbose);
|
2015-06-26 21:28:18 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
/* Valid range: 0 - <all physical memory> */
|
|
|
|
arc_dnode_limit = zfs_arc_dnode_limit ? zfs_arc_dnode_limit :
|
|
|
|
MIN(zfs_arc_dnode_limit_percent, 100) * arc_c_max / 100;
|
|
|
|
WARN_IF_TUNING_IGNORED(zfs_arc_dnode_limit, arc_dnode_limit, verbose);
|
2016-07-13 15:42:40 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Valid range: 1 - N */
|
|
|
|
if (zfs_arc_grow_retry)
|
|
|
|
arc_grow_retry = zfs_arc_grow_retry;
|
|
|
|
|
|
|
|
/* Valid range: 1 - N */
|
|
|
|
if (zfs_arc_shrink_shift) {
|
|
|
|
arc_shrink_shift = zfs_arc_shrink_shift;
|
|
|
|
arc_no_grow_shift = MIN(arc_no_grow_shift, arc_shrink_shift -1);
|
|
|
|
}
|
|
|
|
|
2017-11-16 04:27:01 +03:00
|
|
|
/* Valid range: 1 - N ms */
|
|
|
|
if (zfs_arc_min_prefetch_ms)
|
|
|
|
arc_min_prefetch_ms = zfs_arc_min_prefetch_ms;
|
|
|
|
|
|
|
|
/* Valid range: 1 - N ms */
|
|
|
|
if (zfs_arc_min_prescient_prefetch_ms) {
|
|
|
|
arc_min_prescient_prefetch_ms =
|
|
|
|
zfs_arc_min_prescient_prefetch_ms;
|
|
|
|
}
|
2015-07-27 23:17:32 +03:00
|
|
|
|
2015-07-28 21:30:00 +03:00
|
|
|
/* Valid range: 0 - 100 */
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
if (zfs_arc_lotsfree_percent <= 100)
|
2015-07-28 21:30:00 +03:00
|
|
|
arc_lotsfree_percent = zfs_arc_lotsfree_percent;
|
2020-04-10 01:39:48 +03:00
|
|
|
WARN_IF_TUNING_IGNORED(zfs_arc_lotsfree_percent, arc_lotsfree_percent,
|
|
|
|
verbose);
|
2015-07-28 21:30:00 +03:00
|
|
|
|
2015-07-27 23:17:32 +03:00
|
|
|
/* Valid range: 0 - <all physical memory> */
|
|
|
|
if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free))
|
2022-09-27 03:02:38 +03:00
|
|
|
arc_sys_free = MIN(zfs_arc_sys_free, allmem);
|
2020-04-10 01:39:48 +03:00
|
|
|
WARN_IF_TUNING_IGNORED(zfs_arc_sys_free, arc_sys_free, verbose);
|
2015-06-26 21:28:18 +03:00
|
|
|
}
|
|
|
|
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
static void
|
|
|
|
arc_state_multilist_init(multilist_t *ml,
|
|
|
|
multilist_sublist_index_func_t *index_func, int *maxcountp)
|
|
|
|
{
|
|
|
|
multilist_create(ml, sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), index_func);
|
|
|
|
*maxcountp = MAX(*maxcountp, multilist_get_num_sublists(ml));
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
|
|
|
arc_state_init(void)
|
|
|
|
{
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
int num_sublists = 0;
|
|
|
|
|
|
|
|
arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_DATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_DATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_DATA],
|
|
|
|
arc_state_multilist_index_func, &num_sublists);
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
|
2021-08-17 18:55:34 +03:00
|
|
|
/*
|
|
|
|
* L2 headers should never be on the L2 state list since they don't
|
|
|
|
* have L1 headers allocated. Special index function asserts that.
|
|
|
|
*/
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
arc_state_l2c_multilist_index_func, &num_sublists);
|
|
|
|
arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
|
|
|
|
arc_state_l2c_multilist_index_func, &num_sublists);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Keep track of the number of markers needed to reclaim buffers from
|
|
|
|
* any ARC state. The markers will be pre-allocated so as to minimize
|
|
|
|
* the number of memory allocations performed by the eviction thread.
|
|
|
|
*/
|
|
|
|
arc_state_evict_marker_count = num_sublists;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);
|
2018-10-01 20:42:05 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
|
|
|
|
wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA], 0);
|
|
|
|
wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA], 0);
|
|
|
|
wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA], 0);
|
|
|
|
wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA], 0);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_hits, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_iohits, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_misses, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_demand_data_hits, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_data_iohits, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_data_misses, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_demand_metadata_hits, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_metadata_iohits, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_metadata_misses, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_prefetch_data_hits, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_prefetch_data_iohits, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_prefetch_data_misses, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_prefetch_metadata_hits, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_prefetch_metadata_iohits, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_prefetch_metadata_misses, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_mru_hits, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_mru_ghost_hits, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_mfu_hits, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_mfu_ghost_hits, 0);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_uncached_hits, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_deleted, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_mutex_miss, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_access_skip, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_skip, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_not_enough, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_l2_cached, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_l2_eligible, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mfu, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mru, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_l2_ineligible, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_evict_l2_skip, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_hash_collisions, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_hash_chains, 0);
|
|
|
|
aggsum_init(&arc_sums.arcstat_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_compressed_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_uncompressed_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_overhead_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_hdr_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_data_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_metadata_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_dbuf_size, 0);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_dnode_size, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_bonus_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_hits, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_misses, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_prefetch_asize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_mru_asize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_mfu_asize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_bufc_data_asize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_bufc_metadata_asize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_feeds, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rw_clash, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_read_bytes, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_write_bytes, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_writes_sent, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_writes_done, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_writes_error, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_writes_lock_retry, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_evict_lock_retry, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_evict_reading, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_evict_l1cached, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_free_on_write, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_abort_lowmem, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_cksum_bad, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_io_error, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_lsize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_psize, 0);
|
|
|
|
aggsum_init(&arc_sums.arcstat_l2_hdr_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_log_blk_writes, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_log_blk_asize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_log_blk_count, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_success, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_unsupported, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_io_errors, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_dh_errors, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_lowmem, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_asize, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs_precached, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_l2_rebuild_log_blks, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_memory_throttle_count, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_memory_direct_count, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_memory_indirect_count, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_prune, 0);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_meta_used, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_async_upgrade_sync, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_predictive_prefetch, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_hit_predictive_prefetch, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_iohit_predictive_prefetch, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_prescient_prefetch, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_hit_prescient_prefetch, 0);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_demand_iohit_prescient_prefetch, 0);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_init(&arc_sums.arcstat_raw_size, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_cached_only_in_progress, 0);
|
|
|
|
wmsum_init(&arc_sums.arcstat_abd_chunk_waste_size, 0);
|
2017-05-25 21:32:40 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_anon->arcs_state = ARC_STATE_ANON;
|
|
|
|
arc_mru->arcs_state = ARC_STATE_MRU;
|
|
|
|
arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST;
|
|
|
|
arc_mfu->arcs_state = ARC_STATE_MFU;
|
|
|
|
arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST;
|
|
|
|
arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY;
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
arc_uncached->arcs_state = ARC_STATE_UNCACHED;
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_state_fini(void)
|
|
|
|
{
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);
|
2018-10-01 20:42:05 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]);
|
|
|
|
zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_DATA]);
|
|
|
|
zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_METADATA]);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2021-06-10 19:42:31 +03:00
|
|
|
multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_DATA]);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_DATA]);
|
2017-05-25 21:32:40 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_fini(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA]);
|
|
|
|
wmsum_fini(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]);
|
|
|
|
wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]);
|
|
|
|
wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]);
|
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_misses);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_data_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_data_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_data_misses);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_metadata_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_metadata_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_metadata_misses);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_prefetch_data_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_prefetch_data_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_prefetch_data_misses);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_prefetch_metadata_hits);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_prefetch_metadata_iohits);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_prefetch_metadata_misses);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_mru_hits);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_mru_ghost_hits);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_mfu_hits);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_mfu_ghost_hits);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_uncached_hits);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_deleted);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_mutex_miss);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_access_skip);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_skip);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_not_enough);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_l2_cached);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_l2_eligible);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mfu);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mru);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_l2_ineligible);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_evict_l2_skip);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_hash_collisions);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_hash_chains);
|
|
|
|
aggsum_fini(&arc_sums.arcstat_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_compressed_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_uncompressed_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_overhead_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_hdr_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_data_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_metadata_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_dbuf_size);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_dnode_size);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_bonus_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_hits);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_misses);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_prefetch_asize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_mru_asize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_mfu_asize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_bufc_data_asize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_bufc_metadata_asize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_feeds);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rw_clash);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_read_bytes);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_write_bytes);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_writes_sent);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_writes_done);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_writes_error);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_writes_lock_retry);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_evict_lock_retry);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_evict_reading);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_evict_l1cached);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_free_on_write);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_abort_lowmem);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_cksum_bad);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_io_error);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_lsize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_psize);
|
|
|
|
aggsum_fini(&arc_sums.arcstat_l2_hdr_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_log_blk_writes);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_log_blk_asize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_log_blk_count);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_success);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_unsupported);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_io_errors);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_lowmem);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_asize);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs_precached);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_l2_rebuild_log_blks);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_memory_throttle_count);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_memory_direct_count);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_memory_indirect_count);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_prune);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_meta_used);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_async_upgrade_sync);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_predictive_prefetch);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_hit_predictive_prefetch);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_iohit_predictive_prefetch);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_prescient_prefetch);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_hit_prescient_prefetch);
|
2022-12-22 23:10:24 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_demand_iohit_prescient_prefetch);
|
2021-06-17 03:19:34 +03:00
|
|
|
wmsum_fini(&arc_sums.arcstat_raw_size);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_cached_only_in_progress);
|
|
|
|
wmsum_fini(&arc_sums.arcstat_abd_chunk_waste_size);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
2017-09-30 01:49:19 +03:00
|
|
|
arc_target_bytes(void)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
2017-09-30 01:49:19 +03:00
|
|
|
return (arc_c);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
2020-12-11 01:09:23 +03:00
|
|
|
void
|
|
|
|
arc_set_limits(uint64_t allmem)
|
|
|
|
{
|
|
|
|
/* Set min cache to 1/32 of all memory, or 32MB, whichever is more. */
|
|
|
|
arc_c_min = MAX(allmem / 32, 2ULL << SPA_MAXBLOCKSHIFT);
|
|
|
|
|
|
|
|
/* How to set default max varies by platform. */
|
|
|
|
arc_c_max = arc_default_max(arc_c_min, allmem);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
arc_init(void)
|
|
|
|
{
|
2016-10-31 22:24:54 +03:00
|
|
|
uint64_t percent, allmem = arc_all_memory();
|
2020-07-22 19:51:47 +03:00
|
|
|
mutex_init(&arc_evict_lock, NULL, MUTEX_DEFAULT, NULL);
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
list_create(&arc_evict_waiters, sizeof (arc_evict_waiter_t),
|
|
|
|
offsetof(arc_evict_waiter_t, aew_node));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2018-02-06 03:57:53 +03:00
|
|
|
arc_min_prefetch_ms = 1000;
|
|
|
|
arc_min_prescient_prefetch_ms = 6000;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-10-18 20:23:19 +03:00
|
|
|
#if defined(_KERNEL)
|
|
|
|
arc_lowmem_init();
|
2008-11-20 23:01:55 +03:00
|
|
|
#endif
|
|
|
|
|
2020-12-11 01:09:23 +03:00
|
|
|
arc_set_limits(allmem);
|
2020-03-27 19:14:46 +03:00
|
|
|
|
2021-08-16 18:35:19 +03:00
|
|
|
#ifdef _KERNEL
|
|
|
|
/*
|
|
|
|
* If zfs_arc_max is non-zero at init, meaning it was set in the kernel
|
|
|
|
* environment before the module was loaded, don't block setting the
|
|
|
|
* maximum because it is less than arc_c_min, instead, reset arc_c_min
|
|
|
|
* to a lower value.
|
|
|
|
* zfs_arc_min will be handled by arc_tuning_update().
|
|
|
|
*/
|
|
|
|
if (zfs_arc_max != 0 && zfs_arc_max >= MIN_ARC_MAX &&
|
|
|
|
zfs_arc_max < allmem) {
|
|
|
|
arc_c_max = zfs_arc_max;
|
|
|
|
if (arc_c_min >= arc_c_max) {
|
|
|
|
arc_c_min = MAX(zfs_arc_max / 2,
|
|
|
|
2ULL << SPA_MAXBLOCKSHIFT);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#else
|
2016-01-12 00:52:17 +03:00
|
|
|
/*
|
|
|
|
* In userland, there's only the memory pressure that we artificially
|
|
|
|
* create (see arc_available_memory()). Don't let arc_c get too
|
|
|
|
* small, because it can cause transactions to be larger than
|
|
|
|
* arc_c, causing arc_tempreserve_space() to fail.
|
|
|
|
*/
|
2016-01-24 22:11:15 +03:00
|
|
|
arc_c_min = MAX(arc_c_max / 2, 2ULL << SPA_MAXBLOCKSHIFT);
|
2016-01-12 00:52:17 +03:00
|
|
|
#endif
|
|
|
|
|
Set initial arc_c to arc_c_min instead of arc_c_max
For at least 15 years since OpenSolaris arc_c was set by default to
arc_c_max, later decreased under memory pressure. I've noticed that
if arc_c was set high enough to cause memory pressure as considered
by ZFS, setting of arc_no_grow to TRUE in arc_reap_cb_check() makes
no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms)
return. All that time ZFS can continue increasing its effective ARC
size, causing more memory pressure, potentially up to the point when
OS low memory handler activates and reduces arc_c, requesting fast
reclamation of just allocated memory.
The problem seems to be more serious on FreeBSD and I guess Linux,
since neither of them implement/use asynchronous kmem reclamation,
so arc_kmem_reap_soon() can take more time. On older FreeBSD 11 not
supporting multiple memory domains system with lots of RAM can get
completely unresponsive for minutes due to heavy lock congestion
between ARC reclamation and page daemon kmem reclamation threads.
With this change to more conservative arc_c value ARC stops growing
just it time and does not need later reclamation.
Also while there, since now growing arc_c is a more often situation,
use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt()
to reduce lock congestion. It is also getting in sync with code in
arc_get_data_impl().
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10437
2020-06-18 00:27:04 +03:00
|
|
|
arc_c = arc_c_min;
|
2016-08-11 06:15:37 +03:00
|
|
|
/*
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
* 32-bit fixed point fractions of metadata from total ARC size,
|
|
|
|
* MRU data from all data and MRU metadata from all metadata.
|
2016-08-11 06:15:37 +03:00
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_meta = (1ULL << 32) / 4; /* Metadata is 25% of arc_c. */
|
|
|
|
arc_pd = (1ULL << 32) / 2; /* Data MRU is 50% of data. */
|
|
|
|
arc_pm = (1ULL << 32) / 2; /* Metadata MRU is 50% of metadata. */
|
|
|
|
|
2016-08-11 06:15:37 +03:00
|
|
|
percent = MIN(zfs_arc_dnode_limit_percent, 100);
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_dnode_limit = arc_c_max * percent / 100;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Apply user specified tunings */
|
2020-04-10 01:39:48 +03:00
|
|
|
arc_tuning_update(B_TRUE);
|
2015-06-25 01:49:08 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* if kmem_flags are set, lets try to use less memory */
|
|
|
|
if (kmem_debugging())
|
|
|
|
arc_c = arc_c / 2;
|
|
|
|
if (arc_c < arc_c_min)
|
|
|
|
arc_c = arc_c_min;
|
|
|
|
|
2020-12-11 01:09:23 +03:00
|
|
|
arc_register_hotplug();
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_state_init();
|
2017-03-16 02:41:52 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_init();
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
list_create(&arc_prune_list, sizeof (arc_prune_t),
|
|
|
|
offsetof(arc_prune_t, p_node));
|
|
|
|
mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-12-23 04:07:13 +03:00
|
|
|
arc_prune_taskq = taskq_create("arc_prune", zfs_arc_prune_task_threads,
|
|
|
|
defclsyspri, 100, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
|
2015-05-30 17:57:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
|
|
|
|
sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
|
|
|
|
|
|
|
|
if (arc_ksp != NULL) {
|
|
|
|
arc_ksp->ks_data = &arc_stats;
|
2012-01-31 01:28:40 +04:00
|
|
|
arc_ksp->ks_update = arc_kstat_update;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_install(arc_ksp);
|
|
|
|
}
|
|
|
|
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
arc_state_evict_markers =
|
|
|
|
arc_state_alloc_markers(arc_state_evict_marker_count);
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
arc_evict_zthr = zthr_create_timer("arc_evict",
|
|
|
|
arc_evict_cb_check, arc_evict_cb, NULL, SEC2NSEC(1), defclsyspri);
|
2020-07-29 19:43:33 +03:00
|
|
|
arc_reap_zthr = zthr_create_timer("arc_reap",
|
2021-08-10 20:36:26 +03:00
|
|
|
arc_reap_cb_check, arc_reap_cb, NULL, SEC2NSEC(1), minclsyspri);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_warm = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
|
|
|
* Calculate maximum amount of dirty data per pool.
|
|
|
|
*
|
|
|
|
* If it has been set by a module parameter, take that.
|
|
|
|
* Otherwise, use a percentage of physical memory defined by
|
|
|
|
* zfs_dirty_data_max_percent (default 10%) with a cap at
|
2017-05-01 23:01:39 +03:00
|
|
|
* zfs_dirty_data_max_max (default 4G or 25% of physical memory).
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
*/
|
2020-08-01 07:30:31 +03:00
|
|
|
#ifdef __LP64__
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
if (zfs_dirty_data_max_max == 0)
|
2017-05-01 23:01:39 +03:00
|
|
|
zfs_dirty_data_max_max = MIN(4ULL * 1024 * 1024 * 1024,
|
|
|
|
allmem * zfs_dirty_data_max_max_percent / 100);
|
2020-08-01 07:30:31 +03:00
|
|
|
#else
|
|
|
|
if (zfs_dirty_data_max_max == 0)
|
|
|
|
zfs_dirty_data_max_max = MIN(1ULL * 1024 * 1024 * 1024,
|
|
|
|
allmem * zfs_dirty_data_max_max_percent / 100);
|
|
|
|
#endif
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
|
|
|
if (zfs_dirty_data_max == 0) {
|
2016-10-31 22:24:54 +03:00
|
|
|
zfs_dirty_data_max = allmem *
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
zfs_dirty_data_max_percent / 100;
|
|
|
|
zfs_dirty_data_max = MIN(zfs_dirty_data_max,
|
|
|
|
zfs_dirty_data_max_max);
|
|
|
|
}
|
2021-07-20 18:40:24 +03:00
|
|
|
|
|
|
|
if (zfs_wrlog_data_max == 0) {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* dp_wrlog_total is reduced for each txg at the end of
|
|
|
|
* spa_sync(). However, dp_dirty_total is reduced every time
|
|
|
|
* a block is written out. Thus under normal operation,
|
|
|
|
* dp_wrlog_total could grow 2 times as big as
|
|
|
|
* zfs_dirty_data_max.
|
|
|
|
*/
|
|
|
|
zfs_wrlog_data_max = zfs_dirty_data_max * 2;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_fini(void)
|
|
|
|
{
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_prune_t *p;
|
|
|
|
|
2011-03-30 05:08:59 +04:00
|
|
|
#ifdef _KERNEL
|
2019-10-18 20:23:19 +03:00
|
|
|
arc_lowmem_fini();
|
2011-03-30 05:08:59 +04:00
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* Use B_TRUE to ensure *all* buffers are evicted */
|
|
|
|
arc_flush(NULL, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (arc_ksp != NULL) {
|
|
|
|
kstat_delete(arc_ksp);
|
|
|
|
arc_ksp = NULL;
|
|
|
|
}
|
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
taskq_wait(arc_prune_taskq);
|
|
|
|
taskq_destroy(arc_prune_taskq);
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_enter(&arc_prune_mtx);
|
2023-06-09 20:12:52 +03:00
|
|
|
while ((p = list_remove_head(&arc_prune_list)) != NULL) {
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_remove(&p->p_refcnt, &arc_prune_list);
|
|
|
|
zfs_refcount_destroy(&p->p_refcnt);
|
2011-12-23 00:20:43 +04:00
|
|
|
kmem_free(p, sizeof (*p));
|
|
|
|
}
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
|
|
|
|
list_destroy(&arc_prune_list);
|
|
|
|
mutex_destroy(&arc_prune_mtx);
|
2017-03-16 02:41:52 +03:00
|
|
|
|
2020-07-22 19:51:47 +03:00
|
|
|
(void) zthr_cancel(arc_evict_zthr);
|
2017-03-16 02:41:52 +03:00
|
|
|
(void) zthr_cancel(arc_reap_zthr);
|
Avoid memory allocations in the ARC eviction thread
When the eviction thread goes to shrink an ARC state, it allocates a set
of marker buffers used to hold its place in the state's sublists.
This can be problematic in low memory conditions, since
1) the allocation can be substantial, as we allocate NCPU markers;
2) on at least FreeBSD, page reclamation can block in
arc_wait_for_eviction()
In particular, in stress tests it's possible to hit a deadlock on
FreeBSD when the number of free pages is very low, wherein the system is
waiting for the page daemon to reclaim memory, the page daemon is
waiting for the ARC eviction thread to finish, and the ARC eviction
thread is blocked waiting for more memory.
Try to reduce the likelihood of such deadlocks by pre-allocating markers
for the eviction thread at ARC initialization time. When evicting
buffers from an ARC state, check to see if the current thread is the ARC
eviction thread, and use the pre-allocated markers for that purpose
rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #12985
2022-01-21 21:28:13 +03:00
|
|
|
arc_state_free_markers(arc_state_evict_markers,
|
|
|
|
arc_state_evict_marker_count);
|
2017-03-16 02:41:52 +03:00
|
|
|
|
2020-07-22 19:51:47 +03:00
|
|
|
mutex_destroy(&arc_evict_lock);
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
list_destroy(&arc_evict_waiters);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2020-08-19 08:11:34 +03:00
|
|
|
/*
|
|
|
|
* Free any buffers that were tagged for destruction. This needs
|
|
|
|
* to occur before arc_state_fini() runs and destroys the aggsum
|
|
|
|
* values which are updated when freeing scatter ABDs.
|
|
|
|
*/
|
|
|
|
l2arc_do_free_on_write();
|
|
|
|
|
2018-05-23 20:32:31 +03:00
|
|
|
/*
|
|
|
|
* buf_fini() must proceed arc_state_fini() because buf_fin() may
|
|
|
|
* trigger the release of kmem magazines, which can callback to
|
|
|
|
* arc_space_return() which accesses aggsums freed in act_state_fini().
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_fini();
|
2018-05-23 20:32:31 +03:00
|
|
|
arc_state_fini();
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2020-12-11 01:09:23 +03:00
|
|
|
arc_unregister_hotplug();
|
|
|
|
|
2019-07-18 22:55:29 +03:00
|
|
|
/*
|
|
|
|
* We destroy the zthrs after all the ARC state has been
|
|
|
|
* torn down to avoid the case of them receiving any
|
|
|
|
* wakeup() signals after they are destroyed.
|
|
|
|
*/
|
2020-07-22 19:51:47 +03:00
|
|
|
zthr_destroy(arc_evict_zthr);
|
2019-07-18 22:55:29 +03:00
|
|
|
zthr_destroy(arc_reap_zthr);
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT0(arc_loaned_bytes);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Level 2 ARC
|
|
|
|
*
|
|
|
|
* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
|
|
|
|
* It uses dedicated storage devices to hold cached data, which are populated
|
|
|
|
* using large infrequent writes. The main role of this cache is to boost
|
|
|
|
* the performance of random read workloads. The intended L2ARC devices
|
|
|
|
* include short-stroked disks, solid state disks, and other media with
|
|
|
|
* substantially faster read latency than disk.
|
|
|
|
*
|
|
|
|
* +-----------------------+
|
|
|
|
* | ARC |
|
|
|
|
* +-----------------------+
|
|
|
|
* | ^ ^
|
|
|
|
* | | |
|
|
|
|
* l2arc_feed_thread() arc_read()
|
|
|
|
* | | |
|
|
|
|
* | l2arc read |
|
|
|
|
* V | |
|
|
|
|
* +---------------+ |
|
|
|
|
* | L2ARC | |
|
|
|
|
* +---------------+ |
|
|
|
|
* | ^ |
|
|
|
|
* l2arc_write() | |
|
|
|
|
* | | |
|
|
|
|
* V | |
|
|
|
|
* +-------+ +-------+
|
|
|
|
* | vdev | | vdev |
|
|
|
|
* | cache | | cache |
|
|
|
|
* +-------+ +-------+
|
|
|
|
* +=========+ .-----.
|
|
|
|
* : L2ARC : |-_____-|
|
|
|
|
* : devices : | Disks |
|
|
|
|
* +=========+ `-_____-'
|
|
|
|
*
|
|
|
|
* Read requests are satisfied from the following sources, in order:
|
|
|
|
*
|
|
|
|
* 1) ARC
|
|
|
|
* 2) vdev cache of L2ARC devices
|
|
|
|
* 3) L2ARC devices
|
|
|
|
* 4) vdev cache of disks
|
|
|
|
* 5) disks
|
|
|
|
*
|
|
|
|
* Some L2ARC device types exhibit extremely slow write performance.
|
|
|
|
* To accommodate for this there are some significant differences between
|
|
|
|
* the L2ARC and traditional cache design:
|
|
|
|
*
|
|
|
|
* 1. There is no eviction path from the ARC to the L2ARC. Evictions from
|
|
|
|
* the ARC behave as usual, freeing buffers and placing headers on ghost
|
|
|
|
* lists. The ARC does not send buffers to the L2ARC during eviction as
|
|
|
|
* this would add inflated write latencies for all ARC memory pressure.
|
|
|
|
*
|
|
|
|
* 2. The L2ARC attempts to cache data from the ARC before it is evicted.
|
|
|
|
* It does this by periodically scanning buffers from the eviction-end of
|
|
|
|
* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
|
2013-08-02 00:02:10 +04:00
|
|
|
* not already there. It scans until a headroom of buffers is satisfied,
|
|
|
|
* which itself is a buffer for ARC eviction. If a compressible buffer is
|
|
|
|
* found during scanning and selected for writing to an L2ARC device, we
|
|
|
|
* temporarily boost scanning headroom during the next scan cycle to make
|
|
|
|
* sure we adapt to compression effects (which might significantly reduce
|
|
|
|
* the data volume we write to L2ARC). The thread that does this is
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_feed_thread(), illustrated below; example sizes are included to
|
|
|
|
* provide a better sense of ratio than this diagram:
|
|
|
|
*
|
|
|
|
* head --> tail
|
|
|
|
* +---------------------+----------+
|
|
|
|
* ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC
|
|
|
|
* +---------------------+----------+ | o L2ARC eligible
|
|
|
|
* ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer
|
|
|
|
* +---------------------+----------+ |
|
|
|
|
* 15.9 Gbytes ^ 32 Mbytes |
|
|
|
|
* headroom |
|
|
|
|
* l2arc_feed_thread()
|
|
|
|
* |
|
|
|
|
* l2arc write hand <--[oooo]--'
|
|
|
|
* | 8 Mbyte
|
|
|
|
* | write max
|
|
|
|
* V
|
|
|
|
* +==============================+
|
|
|
|
* L2ARC dev |####|#|###|###| |####| ... |
|
|
|
|
* +==============================+
|
|
|
|
* 32 Gbytes
|
|
|
|
*
|
|
|
|
* 3. If an ARC buffer is copied to the L2ARC but then hit instead of
|
|
|
|
* evicted, then the L2ARC has cached a buffer much sooner than it probably
|
|
|
|
* needed to, potentially wasting L2ARC device bandwidth and storage. It is
|
|
|
|
* safe to say that this is an uncommon case, since buffers at the end of
|
|
|
|
* the ARC lists have moved there due to inactivity.
|
|
|
|
*
|
|
|
|
* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
|
|
|
|
* then the L2ARC simply misses copying some buffers. This serves as a
|
|
|
|
* pressure valve to prevent heavy read workloads from both stalling the ARC
|
|
|
|
* with waits and clogging the L2ARC with writes. This also helps prevent
|
|
|
|
* the potential for the L2ARC to churn if it attempts to cache content too
|
|
|
|
* quickly, such as during backups of the entire pool.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 5. After system boot and before the ARC has filled main memory, there are
|
|
|
|
* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
|
|
|
|
* lists can remain mostly static. Instead of searching from tail of these
|
|
|
|
* lists as pictured, the l2arc_feed_thread() will search from the list heads
|
|
|
|
* for eligible buffers, greatly increasing its chance of finding them.
|
|
|
|
*
|
|
|
|
* The L2ARC device write speed is also boosted during this time so that
|
|
|
|
* the L2ARC warms up faster. Since there have been no ARC evictions yet,
|
|
|
|
* there are no L2ARC reads, and no fear of degrading read performance
|
|
|
|
* through increased writes.
|
|
|
|
*
|
|
|
|
* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
|
2008-11-20 23:01:55 +03:00
|
|
|
* the vdev queue can aggregate them into larger and fewer writes. Each
|
|
|
|
* device is written to in a rotor fashion, sweeping writes through
|
|
|
|
* available space then repeating.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 7. The L2ARC does not store dirty content. It never needs to flush
|
2008-11-20 23:01:55 +03:00
|
|
|
* write buffers back to disk based storage.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 8. If an ARC buffer is written (and dirtied) which also exists in the
|
2008-11-20 23:01:55 +03:00
|
|
|
* L2ARC, the now stale L2ARC buffer is immediately dropped.
|
|
|
|
*
|
|
|
|
* The performance of the L2ARC can be tweaked by a number of tunables, which
|
|
|
|
* may be necessary for different workloads:
|
|
|
|
*
|
|
|
|
* l2arc_write_max max write bytes per interval
|
2008-12-03 23:09:06 +03:00
|
|
|
* l2arc_write_boost extra write bytes during device warmup
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_noprefetch skip caching prefetched buffers
|
|
|
|
* l2arc_headroom number of max device writes to precache
|
2013-08-02 00:02:10 +04:00
|
|
|
* l2arc_headroom_boost when we find compressed buffers during ARC
|
|
|
|
* scanning, we multiply headroom by this
|
|
|
|
* percentage factor for the next scan cycle,
|
|
|
|
* since more compressed buffers are likely to
|
|
|
|
* be present
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_feed_secs seconds between L2ARC writing
|
|
|
|
*
|
|
|
|
* Tunables may be removed or added as future performance improvements are
|
|
|
|
* integrated, and also may become zpool properties.
|
2009-02-18 23:51:31 +03:00
|
|
|
*
|
|
|
|
* There are three key functions that control how the L2ARC warms up:
|
|
|
|
*
|
|
|
|
* l2arc_write_eligible() check if a buffer is eligible to cache
|
|
|
|
* l2arc_write_size() calculate how much to write
|
|
|
|
* l2arc_write_interval() calculate sleep delay between writes
|
|
|
|
*
|
|
|
|
* These three functions determine what to write, how much, and how quickly
|
|
|
|
* to send writes.
|
2020-04-10 20:33:35 +03:00
|
|
|
*
|
|
|
|
* L2ARC persistence:
|
|
|
|
*
|
|
|
|
* When writing buffers to L2ARC, we periodically add some metadata to
|
|
|
|
* make sure we can pick them up after reboot, thus dramatically reducing
|
|
|
|
* the impact that any downtime has on the performance of storage systems
|
|
|
|
* with large caches.
|
|
|
|
*
|
|
|
|
* The implementation works fairly simply by integrating the following two
|
|
|
|
* modifications:
|
|
|
|
*
|
|
|
|
* *) When writing to the L2ARC, we occasionally write a "l2arc log block",
|
|
|
|
* which is an additional piece of metadata which describes what's been
|
|
|
|
* written. This allows us to rebuild the arc_buf_hdr_t structures of the
|
|
|
|
* main ARC buffers. There are 2 linked-lists of log blocks headed by
|
|
|
|
* dh_start_lbps[2]. We alternate which chain we append to, so they are
|
|
|
|
* time-wise and offset-wise interleaved, but that is an optimization rather
|
|
|
|
* than for correctness. The log block also includes a pointer to the
|
|
|
|
* previous block in its chain.
|
|
|
|
*
|
|
|
|
* *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device
|
|
|
|
* for our header bookkeeping purposes. This contains a device header,
|
|
|
|
* which contains our top-level reference structures. We update it each
|
|
|
|
* time we write a new log block, so that we're able to locate it in the
|
|
|
|
* L2ARC device. If this write results in an inconsistent device header
|
|
|
|
* (e.g. due to power failure), we detect this by verifying the header's
|
|
|
|
* checksum and simply fail to reconstruct the L2ARC after reboot.
|
|
|
|
*
|
|
|
|
* Implementation diagram:
|
|
|
|
*
|
|
|
|
* +=== L2ARC device (not to scale) ======================================+
|
|
|
|
* | ___two newest log block pointers__.__________ |
|
|
|
|
* | / \dh_start_lbps[1] |
|
|
|
|
* | / \ \dh_start_lbps[0]|
|
|
|
|
* |.___/__. V V |
|
|
|
|
* ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|
|
|
|
|
* || hdr| ^ /^ /^ / / |
|
|
|
|
* |+------+ ...--\-------/ \-----/--\------/ / |
|
|
|
|
* | \--------------/ \--------------/ |
|
|
|
|
* +======================================================================+
|
|
|
|
*
|
|
|
|
* As can be seen on the diagram, rather than using a simple linked list,
|
|
|
|
* we use a pair of linked lists with alternating elements. This is a
|
|
|
|
* performance enhancement due to the fact that we only find out the
|
|
|
|
* address of the next log block access once the current block has been
|
|
|
|
* completely read in. Obviously, this hurts performance, because we'd be
|
|
|
|
* keeping the device's I/O queue at only a 1 operation deep, thus
|
|
|
|
* incurring a large amount of I/O round-trip latency. Having two lists
|
|
|
|
* allows us to fetch two log blocks ahead of where we are currently
|
|
|
|
* rebuilding L2ARC buffers.
|
|
|
|
*
|
|
|
|
* On-device data structures:
|
|
|
|
*
|
|
|
|
* L2ARC device header: l2arc_dev_hdr_phys_t
|
|
|
|
* L2ARC log block: l2arc_log_blk_phys_t
|
|
|
|
*
|
|
|
|
* L2ARC reconstruction:
|
|
|
|
*
|
|
|
|
* When writing data, we simply write in the standard rotary fashion,
|
|
|
|
* evicting buffers as we go and simply writing new data over them (writing
|
|
|
|
* a new log block every now and then). This obviously means that once we
|
|
|
|
* loop around the end of the device, we will start cutting into an already
|
|
|
|
* committed log block (and its referenced data buffers), like so:
|
|
|
|
*
|
|
|
|
* current write head__ __old tail
|
|
|
|
* \ /
|
|
|
|
* V V
|
|
|
|
* <--|bufs |lb |bufs |lb | |bufs |lb |bufs |lb |-->
|
|
|
|
* ^ ^^^^^^^^^___________________________________
|
|
|
|
* | \
|
|
|
|
* <<nextwrite>> may overwrite this blk and/or its bufs --'
|
|
|
|
*
|
|
|
|
* When importing the pool, we detect this situation and use it to stop
|
|
|
|
* our scanning process (see l2arc_rebuild).
|
|
|
|
*
|
|
|
|
* There is one significant caveat to consider when rebuilding ARC contents
|
|
|
|
* from an L2ARC device: what about invalidated buffers? Given the above
|
|
|
|
* construction, we cannot update blocks which we've already written to amend
|
|
|
|
* them to remove buffers which were invalidated. Thus, during reconstruction,
|
|
|
|
* we might be populating the cache with buffers for data that's not on the
|
|
|
|
* main pool anymore, or may have been overwritten!
|
|
|
|
*
|
|
|
|
* As it turns out, this isn't a problem. Every arc_read request includes
|
|
|
|
* both the DVA and, crucially, the birth TXG of the BP the caller is
|
|
|
|
* looking for. So even if the cache were populated by completely rotten
|
|
|
|
* blocks for data that had been long deleted and/or overwritten, we'll
|
|
|
|
* never actually return bad data from the cache, since the DVA with the
|
|
|
|
* birth TXG uniquely identify a block in space and time - once created,
|
|
|
|
* a block is immutable on disk. The worst thing we have done is wasted
|
|
|
|
* some time and memory at l2arc rebuild to reconstruct outdated ARC
|
|
|
|
* entries that will get dropped from the l2arc as it is being updated
|
|
|
|
* with new blocks.
|
|
|
|
*
|
|
|
|
* L2ARC buffers that have been evicted by l2arc_evict() ahead of the write
|
|
|
|
* hand are not restored. This is done by saving the offset (in bytes)
|
|
|
|
* l2arc_evict() has evicted to in the L2ARC device header and taking it
|
|
|
|
* into account when restoring buffers.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
static boolean_t
|
2014-12-06 20:24:32 +03:00
|
|
|
l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
|
2009-02-18 23:51:31 +03:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* A buffer is *not* eligible for the L2ARC if it:
|
|
|
|
* 1. belongs to a different spa.
|
2010-05-29 00:45:14 +04:00
|
|
|
* 2. is already cached on the L2ARC.
|
|
|
|
* 3. has an I/O in progress (it may be an incomplete read).
|
|
|
|
* 4. is flagged not eligible (zfs property).
|
2009-02-18 23:51:31 +03:00
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
|
2020-09-23 02:08:05 +03:00
|
|
|
HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
|
2009-02-18 23:51:31 +03:00
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
2020-03-31 20:46:48 +03:00
|
|
|
l2arc_write_size(l2arc_dev_t *dev)
|
2009-02-18 23:51:31 +03:00
|
|
|
{
|
2023-06-06 22:32:37 +03:00
|
|
|
uint64_t size;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/*
|
|
|
|
* Make sure our globals have meaningful values in case the user
|
|
|
|
* altered them.
|
|
|
|
*/
|
|
|
|
size = l2arc_write_max;
|
|
|
|
if (size == 0) {
|
2023-11-15 00:47:57 +03:00
|
|
|
cmn_err(CE_NOTE, "l2arc_write_max must be greater than zero, "
|
|
|
|
"resetting it to the default (%d)", L2ARC_WRITE_SIZE);
|
2013-08-02 00:02:10 +04:00
|
|
|
size = l2arc_write_max = L2ARC_WRITE_SIZE;
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
if (arc_warm == B_FALSE)
|
2013-08-02 00:02:10 +04:00
|
|
|
size += l2arc_write_boost;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2023-05-09 18:54:41 +03:00
|
|
|
/* We need to add in the worst case scenario of log block overhead. */
|
2023-06-06 22:32:37 +03:00
|
|
|
size += l2arc_log_blk_overhead(size, dev);
|
2023-05-09 18:54:41 +03:00
|
|
|
if (dev->l2ad_vdev->vdev_has_trim && l2arc_trim_ahead > 0) {
|
|
|
|
/*
|
|
|
|
* Trim ahead of the write size 64MB or (l2arc_trim_ahead/100)
|
|
|
|
* times the writesize, whichever is greater.
|
|
|
|
*/
|
2023-06-06 22:32:37 +03:00
|
|
|
size += MAX(64 * 1024 * 1024,
|
|
|
|
(size * l2arc_trim_ahead) / 100);
|
2023-05-09 18:54:41 +03:00
|
|
|
}
|
2020-06-09 20:15:08 +03:00
|
|
|
|
2023-06-06 22:32:37 +03:00
|
|
|
/*
|
|
|
|
* Make sure the write size does not exceed the size of the cache
|
|
|
|
* device. This is important in l2arc_evict(), otherwise infinite
|
|
|
|
* iteration can occur.
|
|
|
|
*/
|
2023-11-15 00:47:57 +03:00
|
|
|
size = MIN(size, (dev->l2ad_end - dev->l2ad_start) / 4);
|
2023-06-06 22:32:37 +03:00
|
|
|
|
2023-11-15 00:47:57 +03:00
|
|
|
size = P2ROUNDUP(size, 1ULL << dev->l2ad_vdev->vdev_ashift);
|
2020-03-31 20:46:48 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
return (size);
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
static clock_t
|
|
|
|
l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
clock_t interval, next, now;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the ARC lists are busy, increase our write rate; if the
|
|
|
|
* lists are stale, idle back. This is achieved by checking
|
|
|
|
* how much we previously wrote - if it was more than half of
|
|
|
|
* what we wanted, schedule the next write much sooner.
|
|
|
|
*/
|
|
|
|
if (l2arc_feed_again && wrote > (wanted / 2))
|
|
|
|
interval = (hz * l2arc_feed_min_ms) / 1000;
|
|
|
|
else
|
|
|
|
interval = hz * l2arc_feed_secs;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
now = ddi_get_lbolt();
|
|
|
|
next = MAX(now, MIN(now + interval, began + interval));
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
return (next);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Cycle through L2ARC devices. This is how L2ARC load balances.
|
2008-12-03 23:09:06 +03:00
|
|
|
* If a device is returned, this also returns holding the spa config lock.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static l2arc_dev_t *
|
|
|
|
l2arc_dev_get_next(void)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_dev_t *first, *next = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Lock out the removal of spas (spa_namespace_lock), then removal
|
|
|
|
* of cache devices (l2arc_dev_mtx). Once a device has been selected,
|
|
|
|
* both locks will be dropped and a spa config lock held instead.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
/* if there are no vdevs, there is nothing to do */
|
|
|
|
if (l2arc_ndev == 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
first = NULL;
|
|
|
|
next = l2arc_dev_last;
|
|
|
|
do {
|
|
|
|
/* loop around the list looking for a non-faulted vdev */
|
|
|
|
if (next == NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
next = list_head(l2arc_dev_list);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
next = list_next(l2arc_dev_list, next);
|
|
|
|
if (next == NULL)
|
|
|
|
next = list_head(l2arc_dev_list);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* if we have come back to the start, bail out */
|
|
|
|
if (first == NULL)
|
|
|
|
first = next;
|
|
|
|
else if (next == first)
|
|
|
|
break;
|
|
|
|
|
2022-10-12 21:25:18 +03:00
|
|
|
ASSERT3P(next, !=, NULL);
|
2020-06-09 20:15:08 +03:00
|
|
|
} while (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild ||
|
|
|
|
next->l2ad_trim_all);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/* if we were unable to find any usable vdevs, return NULL */
|
2020-06-09 20:15:08 +03:00
|
|
|
if (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild ||
|
|
|
|
next->l2ad_trim_all)
|
2008-12-03 23:09:06 +03:00
|
|
|
next = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
l2arc_dev_last = next;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
out:
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Grab the config lock to prevent the 'next' device from being
|
|
|
|
* removed while we are writing to it.
|
|
|
|
*/
|
|
|
|
if (next != NULL)
|
|
|
|
spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (next);
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Free buffers that were tagged for destruction.
|
|
|
|
*/
|
|
|
|
static void
|
2010-08-26 20:52:41 +04:00
|
|
|
l2arc_do_free_on_write(void)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
2023-06-09 20:12:52 +03:00
|
|
|
l2arc_data_free_t *df;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
mutex_enter(&l2arc_free_on_write_mtx);
|
2023-06-09 20:12:52 +03:00
|
|
|
while ((df = list_remove_head(l2arc_free_on_write)) != NULL) {
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(df->l2df_abd, !=, NULL);
|
|
|
|
abd_free(df->l2df_abd);
|
2008-12-03 23:09:06 +03:00
|
|
|
kmem_free(df, sizeof (l2arc_data_free_t));
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_free_on_write_mtx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* A write to a cache device has completed. Update all headers to allow
|
|
|
|
* reads from these buffers to begin.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_write_done(zio_t *zio)
|
|
|
|
{
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_write_callback_t *cb;
|
|
|
|
l2arc_lb_abd_buf_t *abd_buf;
|
|
|
|
l2arc_lb_ptr_buf_t *lb_ptr_buf;
|
|
|
|
l2arc_dev_t *dev;
|
2020-05-08 02:34:03 +03:00
|
|
|
l2arc_dev_hdr_phys_t *l2dhdr;
|
2020-04-10 20:33:35 +03:00
|
|
|
list_t *buflist;
|
|
|
|
arc_buf_hdr_t *head, *hdr, *hdr_prev;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
int64_t bytes_dropped = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = zio->io_private;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(cb, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
dev = cb->l2wcb_dev;
|
2020-05-08 02:34:03 +03:00
|
|
|
l2dhdr = dev->l2ad_dev_hdr;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(dev, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
head = cb->l2wcb_head;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(head, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
buflist = &dev->l2ad_buflist;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(buflist, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
|
|
|
|
l2arc_write_callback_t *, cb);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All writes completed, or an error was hit.
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
top:
|
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-06 20:24:32 +03:00
|
|
|
for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {
|
|
|
|
hdr_prev = list_prev(buflist, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We cannot use mutex_enter or else we can deadlock
|
|
|
|
* with l2arc_write_buffers (due to swapping the order
|
|
|
|
* the hash lock and l2ad_mtx are taken).
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Missed the hash lock. We must retry so we
|
|
|
|
* don't leave the ARC_FLAG_L2_WRITING bit set.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to rescan the headers we've
|
|
|
|
* already marked as having been written out, so
|
|
|
|
* we reinsert the head node so we can pick up
|
|
|
|
* where we left off.
|
|
|
|
*/
|
|
|
|
list_remove(buflist, head);
|
|
|
|
list_insert_after(buflist, hdr, head);
|
|
|
|
|
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We wait for the hash lock to become available
|
|
|
|
* to try and prevent busy waiting, and increase
|
|
|
|
* the chance we'll be able to acquire the lock
|
|
|
|
* the next time around.
|
|
|
|
*/
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
goto top;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* We could not have been moved into the arc_l2c_only
|
|
|
|
* state while in-flight due to our ARC_FLAG_L2_WRITING
|
|
|
|
* bit being set. Let's just ensure that's being enforced.
|
|
|
|
*/
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
2016-02-10 21:42:01 +03:00
|
|
|
/*
|
|
|
|
* Skipped - drop L2ARC entry and mark the header as no
|
|
|
|
* longer L2 eligibile.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (zio->io_error != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Error - drop L2ARC entry.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
list_remove(buflist, hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2019-01-31 20:16:39 +03:00
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
l2arc_hdr_arcstats_decrement(hdr);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2019-01-31 20:16:39 +03:00
|
|
|
bytes_dropped +=
|
|
|
|
vdev_psize_to_asize(dev->l2ad_vdev, psize);
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_remove_many(&dev->l2ad_alloc,
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_size(hdr), hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Allow ARC to begin reads and ghost list evictions to
|
|
|
|
* this L2ARC entry.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Free the allocated abd buffers for writing the log blocks.
|
|
|
|
* If the zio failed reclaim the allocated space and remove the
|
|
|
|
* pointers to these log blocks from the log block pointer list
|
|
|
|
* of the L2ARC device.
|
|
|
|
*/
|
|
|
|
while ((abd_buf = list_remove_tail(&cb->l2wcb_abd_list)) != NULL) {
|
|
|
|
abd_free(abd_buf->abd);
|
|
|
|
zio_buf_free(abd_buf, sizeof (*abd_buf));
|
|
|
|
if (zio->io_error != 0) {
|
|
|
|
lb_ptr_buf = list_remove_head(&dev->l2ad_lbptr_list);
|
2020-05-08 02:34:03 +03:00
|
|
|
/*
|
|
|
|
* L2BLK_GET_PSIZE returns aligned size for log
|
|
|
|
* blocks.
|
|
|
|
*/
|
|
|
|
uint64_t asize =
|
2020-04-10 20:33:35 +03:00
|
|
|
L2BLK_GET_PSIZE((lb_ptr_buf->lb_ptr)->lbp_prop);
|
2020-05-08 02:34:03 +03:00
|
|
|
bytes_dropped += asize;
|
|
|
|
ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);
|
|
|
|
zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,
|
|
|
|
lb_ptr_buf);
|
|
|
|
zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf);
|
2020-04-10 20:33:35 +03:00
|
|
|
kmem_free(lb_ptr_buf->lb_ptr,
|
|
|
|
sizeof (l2arc_log_blkptr_t));
|
|
|
|
kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
list_destroy(&cb->l2wcb_abd_list);
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
if (zio->io_error != 0) {
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_error);
|
|
|
|
|
2020-07-11 00:10:03 +03:00
|
|
|
/*
|
|
|
|
* Restore the lbps array in the header to its previous state.
|
|
|
|
* If the list of log block pointers is empty, zero out the
|
|
|
|
* log block pointers in the device header.
|
|
|
|
*/
|
2020-05-08 02:34:03 +03:00
|
|
|
lb_ptr_buf = list_head(&dev->l2ad_lbptr_list);
|
|
|
|
for (int i = 0; i < 2; i++) {
|
2020-07-11 00:10:03 +03:00
|
|
|
if (lb_ptr_buf == NULL) {
|
|
|
|
/*
|
|
|
|
* If the list is empty zero out the device
|
|
|
|
* header. Otherwise zero out the second log
|
|
|
|
* block pointer in the header.
|
|
|
|
*/
|
|
|
|
if (i == 0) {
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(l2dhdr, 0,
|
|
|
|
dev->l2ad_dev_hdr_asize);
|
2020-07-11 00:10:03 +03:00
|
|
|
} else {
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(&l2dhdr->dh_start_lbps[i], 0,
|
2020-07-11 00:10:03 +03:00
|
|
|
sizeof (l2arc_log_blkptr_t));
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(&l2dhdr->dh_start_lbps[i], lb_ptr_buf->lb_ptr,
|
2020-05-08 02:34:03 +03:00
|
|
|
sizeof (l2arc_log_blkptr_t));
|
|
|
|
lb_ptr_buf = list_next(&dev->l2ad_lbptr_list,
|
|
|
|
lb_ptr_buf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-06-17 03:19:34 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_done);
|
2008-11-20 23:01:55 +03:00
|
|
|
list_remove(buflist, head);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!HDR_HAS_L1HDR(head));
|
|
|
|
kmem_cache_free(hdr_l2only_cache, head);
|
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
ASSERT(dev->l2ad_vdev != NULL);
|
2014-05-22 13:11:57 +04:00
|
|
|
vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_do_free_on_write();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
kmem_free(cb, sizeof (l2arc_write_callback_t));
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
static int
|
|
|
|
l2arc_untransform(zio_t *zio, l2arc_read_callback_t *cb)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
spa_t *spa = zio->io_spa;
|
|
|
|
arc_buf_hdr_t *hdr = cb->l2rcb_hdr;
|
|
|
|
blkptr_t *bp = zio->io_bp;
|
|
|
|
uint8_t salt[ZIO_DATA_SALT_LEN];
|
|
|
|
uint8_t iv[ZIO_DATA_IV_LEN];
|
|
|
|
uint8_t mac[ZIO_DATA_MAC_LEN];
|
|
|
|
boolean_t no_crypt = B_FALSE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ZIL data is never be written to the L2ARC, so we don't need
|
|
|
|
* special handling for its unique MAC storage.
|
|
|
|
*/
|
|
|
|
ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
|
2017-09-28 18:49:13 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2017-09-28 18:49:13 +03:00
|
|
|
/*
|
|
|
|
* If the data was encrypted, decrypt it now. Note that
|
|
|
|
* we must check the bp here and not the hdr, since the
|
|
|
|
* hdr does not have its encryption parameters updated
|
|
|
|
* until arc_read_done().
|
|
|
|
*/
|
|
|
|
if (BP_IS_ENCRYPTED(bp)) {
|
2020-08-12 20:03:24 +03:00
|
|
|
abd_t *eabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
ARC_HDR_USE_RESERVE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
zio_crypt_decode_params_bp(bp, salt, iv);
|
|
|
|
zio_crypt_decode_mac_bp(bp, mac);
|
|
|
|
|
2018-05-03 01:36:20 +03:00
|
|
|
ret = spa_do_crypt_abd(B_FALSE, spa, &cb->l2rcb_zb,
|
|
|
|
BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp),
|
|
|
|
salt, iv, mac, HDR_GET_PSIZE(hdr), eabd,
|
|
|
|
hdr->b_l1hdr.b_pabd, &no_crypt);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ret != 0) {
|
|
|
|
arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we actually performed decryption, replace b_pabd
|
|
|
|
* with the decrypted data. Otherwise we can just throw
|
|
|
|
* our decryption buffer away.
|
|
|
|
*/
|
|
|
|
if (!no_crypt) {
|
|
|
|
arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
|
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
hdr->b_l1hdr.b_pabd = eabd;
|
|
|
|
zio->io_abd = eabd;
|
|
|
|
} else {
|
|
|
|
arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the L2ARC block was compressed, but ARC compression
|
|
|
|
* is disabled we decompress the data into a new buffer and
|
|
|
|
* replace the existing data.
|
|
|
|
*/
|
|
|
|
if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
!HDR_COMPRESSION_ENABLED(hdr)) {
|
2020-08-12 20:03:24 +03:00
|
|
|
abd_t *cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
ARC_HDR_USE_RESERVE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
void *tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr));
|
|
|
|
|
|
|
|
ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),
|
|
|
|
hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr),
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
HDR_GET_LSIZE(hdr), &hdr->b_complevel);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ret != 0) {
|
|
|
|
abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr));
|
|
|
|
arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr);
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
|
|
|
|
abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr));
|
|
|
|
arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,
|
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
hdr->b_l1hdr.b_pabd = cabd;
|
|
|
|
zio->io_abd = cabd;
|
|
|
|
zio->io_size = HDR_GET_LSIZE(hdr);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
error:
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* A read to a cache device completed. Validate buffer contents before
|
|
|
|
* handing over to the regular ARC routines.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_read_done(zio_t *zio)
|
|
|
|
{
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
int tfm_error = 0;
|
2018-06-06 20:17:50 +03:00
|
|
|
l2arc_read_callback_t *cb = zio->io_private;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
kmutex_t *hash_lock;
|
2018-06-06 20:17:50 +03:00
|
|
|
boolean_t valid_cksum;
|
|
|
|
boolean_t using_rdata = (BP_IS_ENCRYPTED(&cb->l2rcb_bp) &&
|
|
|
|
(cb->l2rcb_flags & ZIO_FLAG_RAW_ENCRYPT));
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(zio->io_vd, !=, NULL);
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
|
|
|
|
|
|
|
|
spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(cb, !=, NULL);
|
|
|
|
hdr = cb->l2rcb_hdr;
|
|
|
|
ASSERT3P(hdr, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-06-27 03:32:43 +03:00
|
|
|
/*
|
|
|
|
* If the data was read into a temporary buffer,
|
|
|
|
* move it and free the buffer.
|
|
|
|
*/
|
|
|
|
if (cb->l2rcb_abd != NULL) {
|
|
|
|
ASSERT3U(arc_hdr_size(hdr), <, zio->io_size);
|
|
|
|
if (zio->io_error == 0) {
|
2018-06-06 20:17:50 +03:00
|
|
|
if (using_rdata) {
|
|
|
|
abd_copy(hdr->b_crypt_hdr.b_rabd,
|
|
|
|
cb->l2rcb_abd, arc_hdr_size(hdr));
|
|
|
|
} else {
|
|
|
|
abd_copy(hdr->b_l1hdr.b_pabd,
|
|
|
|
cb->l2rcb_abd, arc_hdr_size(hdr));
|
|
|
|
}
|
2017-06-27 03:32:43 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The following must be done regardless of whether
|
|
|
|
* there was an error:
|
|
|
|
* - free the temporary buffer
|
|
|
|
* - point zio to the real ARC buffer
|
|
|
|
* - set zio size accordingly
|
|
|
|
* These are required because zio is either re-used for
|
|
|
|
* an I/O of the block in the case of the error
|
|
|
|
* or the zio is passed to arc_read_done() and it
|
|
|
|
* needs real data.
|
|
|
|
*/
|
|
|
|
abd_free(cb->l2rcb_abd);
|
|
|
|
zio->io_size = zio->io_orig_size = arc_hdr_size(hdr);
|
2017-09-28 18:49:13 +03:00
|
|
|
|
2018-06-06 20:17:50 +03:00
|
|
|
if (using_rdata) {
|
2017-09-28 18:49:13 +03:00
|
|
|
ASSERT(HDR_HAS_RABD(hdr));
|
|
|
|
zio->io_abd = zio->io_orig_abd =
|
|
|
|
hdr->b_crypt_hdr.b_rabd;
|
|
|
|
} else {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
|
|
|
|
zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd;
|
|
|
|
}
|
2017-06-27 03:32:43 +03:00
|
|
|
}
|
|
|
|
|
2016-07-22 18:52:49 +03:00
|
|
|
ASSERT3P(zio->io_abd, !=, NULL);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Check this survived the L2ARC journey.
|
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(zio->io_abd == hdr->b_l1hdr.b_pabd ||
|
|
|
|
(HDR_HAS_RABD(hdr) && zio->io_abd == hdr->b_crypt_hdr.b_rabd));
|
2016-06-02 07:04:53 +03:00
|
|
|
zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
|
|
|
|
zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
zio->io_prop.zp_complevel = hdr->b_complevel;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
valid_cksum = arc_cksum_is_equal(hdr, zio);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* b_rabd will always match the data as it exists on disk if it is
|
|
|
|
* being used. Therefore if we are reading into b_rabd we do not
|
|
|
|
* attempt to untransform the data.
|
|
|
|
*/
|
|
|
|
if (valid_cksum && !using_rdata)
|
|
|
|
tfm_error = l2arc_untransform(zio, cb);
|
|
|
|
|
|
|
|
if (valid_cksum && tfm_error == 0 && zio->io_error == 0 &&
|
|
|
|
!HDR_L2_EVICTED(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
zio->io_private = hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_read_done(zio);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Buffer didn't survive caching. Increment stats and
|
|
|
|
* reissue to the original storage device.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (zio->io_error != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_io_error);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2013-03-08 22:41:28 +04:00
|
|
|
zio->io_error = SET_ERROR(EIO);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (!valid_cksum || tfm_error != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_cksum_bad);
|
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* If there's no waiter, issue an async i/o to the primary
|
|
|
|
* storage now. If there *is* a waiter, the caller must
|
|
|
|
* issue the i/o in a context where it's OK to block.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
if (zio->io_waiter == NULL) {
|
|
|
|
zio_t *pio = zio_unique_parent(zio);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
void *abd = (using_rdata) ?
|
|
|
|
hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
|
|
|
|
|
2019-12-03 20:59:30 +03:00
|
|
|
zio = zio_read(pio, zio->io_spa, zio->io_bp,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
abd, zio->io_size, arc_read_done,
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr, zio->io_priority, cb->l2rcb_flags,
|
2019-12-03 20:59:30 +03:00
|
|
|
&cb->l2rcb_zb);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Original ZIO will be freed, so we need to update
|
|
|
|
* ARC header with the new ZIO pointer to be used
|
|
|
|
* by zio_change_priority() in arc_read().
|
|
|
|
*/
|
|
|
|
for (struct arc_callback *acb = hdr->b_l1hdr.b_acb;
|
|
|
|
acb != NULL; acb = acb->acb_next)
|
|
|
|
acb->acb_zio_head = zio;
|
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
zio_nowait(zio);
|
|
|
|
} else {
|
|
|
|
mutex_exit(hash_lock);
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
kmem_free(cb, sizeof (l2arc_read_callback_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is the list priority from which the L2ARC will search for pages to
|
|
|
|
* cache. This is used within loops (0..3) to cycle through lists in the
|
|
|
|
* desired order. This order can have a significant effect on cache
|
|
|
|
* performance.
|
|
|
|
*
|
|
|
|
* Currently the metadata lists are hit first, MFU then MRU, followed by
|
|
|
|
* the data lists. This function returns a locked list, and also returns
|
|
|
|
* the lock pointer.
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
static multilist_sublist_t *
|
|
|
|
l2arc_sublist_lock(int list_num)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_t *ml = NULL;
|
|
|
|
unsigned int idx;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-11-01 02:04:01 +03:00
|
|
|
ASSERT(list_num >= 0 && list_num < L2ARC_FEED_TYPES);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
switch (list_num) {
|
|
|
|
case 0:
|
2021-06-10 19:42:31 +03:00
|
|
|
ml = &arc_mfu->arcs_list[ARC_BUFC_METADATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 1:
|
2021-06-10 19:42:31 +03:00
|
|
|
ml = &arc_mru->arcs_list[ARC_BUFC_METADATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 2:
|
2021-06-10 19:42:31 +03:00
|
|
|
ml = &arc_mfu->arcs_list[ARC_BUFC_DATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 3:
|
2021-06-10 19:42:31 +03:00
|
|
|
ml = &arc_mru->arcs_list[ARC_BUFC_DATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
2016-11-01 02:04:01 +03:00
|
|
|
default:
|
|
|
|
return (NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Return a randomly-selected sublist. This is acceptable
|
|
|
|
* because the caller feeds only a little bit of data for each
|
|
|
|
* call (8MB). Subsequent calls will result in different
|
|
|
|
* sublists being selected.
|
|
|
|
*/
|
|
|
|
idx = multilist_get_random_index(ml);
|
|
|
|
return (multilist_sublist_lock(ml, idx));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Calculates the maximum overhead of L2ARC metadata log blocks for a given
|
2020-05-08 02:34:03 +03:00
|
|
|
* L2ARC write size. l2arc_evict and l2arc_write_size need to include this
|
2020-04-10 20:33:35 +03:00
|
|
|
* overhead in processing to make sure there is enough headroom available
|
|
|
|
* when writing buffers.
|
|
|
|
*/
|
|
|
|
static inline uint64_t
|
|
|
|
l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev)
|
|
|
|
{
|
2020-05-08 02:34:03 +03:00
|
|
|
if (dev->l2ad_log_entries == 0) {
|
2020-04-10 20:33:35 +03:00
|
|
|
return (0);
|
|
|
|
} else {
|
|
|
|
uint64_t log_entries = write_sz >> SPA_MINBLOCKSHIFT;
|
|
|
|
|
|
|
|
uint64_t log_blocks = (log_entries +
|
2020-05-08 02:34:03 +03:00
|
|
|
dev->l2ad_log_entries - 1) /
|
|
|
|
dev->l2ad_log_entries;
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
return (vdev_psize_to_asize(dev->l2ad_vdev,
|
|
|
|
sizeof (l2arc_log_blk_phys_t)) * log_blocks);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Evict buffers from the device write hand to the distance specified in
|
2020-04-10 20:33:35 +03:00
|
|
|
* bytes. This distance may span populated buffers, it may span nothing.
|
2008-11-20 23:01:55 +03:00
|
|
|
* This is clearing a region on the L2ARC device ready for writing.
|
|
|
|
* If the 'all' boolean is set, every buffer is evicted.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
|
|
|
|
{
|
|
|
|
list_t *buflist;
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr, *hdr_prev;
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock;
|
|
|
|
uint64_t taddr;
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_lb_ptr_buf_t *lb_ptr_buf, *lb_ptr_buf_prev;
|
2020-06-09 20:15:08 +03:00
|
|
|
vdev_t *vd = dev->l2ad_vdev;
|
|
|
|
boolean_t rerun;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
buflist = &dev->l2ad_buflist;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-03-31 20:46:48 +03:00
|
|
|
top:
|
|
|
|
rerun = B_FALSE;
|
2023-06-10 03:05:47 +03:00
|
|
|
if (dev->l2ad_hand + distance > dev->l2ad_end) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2020-06-10 07:24:09 +03:00
|
|
|
* When there is no space to accommodate upcoming writes,
|
2020-04-10 20:33:35 +03:00
|
|
|
* evict to the end. Then bump the write and evict hands
|
|
|
|
* to the start and iterate. This iteration does not
|
|
|
|
* happen indefinitely as we make sure in
|
|
|
|
* l2arc_write_size() that when the write hand is reset,
|
|
|
|
* the write size does not exceed the end of the device.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2020-03-31 20:46:48 +03:00
|
|
|
rerun = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
taddr = dev->l2ad_end;
|
|
|
|
} else {
|
|
|
|
taddr = dev->l2ad_hand + distance;
|
|
|
|
}
|
|
|
|
DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
|
|
|
|
uint64_t, taddr, boolean_t, all);
|
|
|
|
|
2020-06-09 20:15:08 +03:00
|
|
|
if (!all) {
|
2020-03-31 20:46:48 +03:00
|
|
|
/*
|
2020-06-09 20:15:08 +03:00
|
|
|
* This check has to be placed after deciding whether to
|
|
|
|
* iterate (rerun).
|
2020-03-31 20:46:48 +03:00
|
|
|
*/
|
2020-06-09 20:15:08 +03:00
|
|
|
if (dev->l2ad_first) {
|
|
|
|
/*
|
|
|
|
* This is the first sweep through the device. There is
|
|
|
|
* nothing to evict. We have already trimmmed the
|
|
|
|
* whole device.
|
|
|
|
*/
|
|
|
|
goto out;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Trim the space to be evicted.
|
|
|
|
*/
|
|
|
|
if (vd->vdev_has_trim && dev->l2ad_evict < taddr &&
|
|
|
|
l2arc_trim_ahead > 0) {
|
|
|
|
/*
|
|
|
|
* We have to drop the spa_config lock because
|
|
|
|
* vdev_trim_range() will acquire it.
|
|
|
|
* l2ad_evict already accounts for the label
|
|
|
|
* size. To prevent vdev_trim_ranges() from
|
|
|
|
* adding it again, we subtract it from
|
|
|
|
* l2ad_evict.
|
|
|
|
*/
|
|
|
|
spa_config_exit(dev->l2ad_spa, SCL_L2ARC, dev);
|
|
|
|
vdev_trim_simple(vd,
|
|
|
|
dev->l2ad_evict - VDEV_LABEL_START_SIZE,
|
|
|
|
taddr - dev->l2ad_evict);
|
|
|
|
spa_config_enter(dev->l2ad_spa, SCL_L2ARC, dev,
|
|
|
|
RW_READER);
|
|
|
|
}
|
2020-03-31 20:46:48 +03:00
|
|
|
|
2020-06-09 20:15:08 +03:00
|
|
|
/*
|
|
|
|
* When rebuilding L2ARC we retrieve the evict hand
|
|
|
|
* from the header of the device. Of note, l2arc_evict()
|
|
|
|
* does not actually delete buffers from the cache
|
|
|
|
* device, but trimming may do so depending on the
|
|
|
|
* hardware implementation. Thus keeping track of the
|
|
|
|
* evict hand is useful.
|
|
|
|
*/
|
|
|
|
dev->l2ad_evict = MAX(dev->l2ad_evict, taddr);
|
|
|
|
}
|
|
|
|
}
|
2020-04-10 20:33:35 +03:00
|
|
|
|
2020-03-31 20:46:48 +03:00
|
|
|
retry:
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* We have to account for evicted log blocks. Run vdev_space_update()
|
|
|
|
* on log blocks whose offset (in bytes) is before the evicted offset
|
|
|
|
* (in bytes) by searching in the list of pointers to log blocks
|
|
|
|
* present in the L2ARC device.
|
|
|
|
*/
|
|
|
|
for (lb_ptr_buf = list_tail(&dev->l2ad_lbptr_list); lb_ptr_buf;
|
|
|
|
lb_ptr_buf = lb_ptr_buf_prev) {
|
|
|
|
|
|
|
|
lb_ptr_buf_prev = list_prev(&dev->l2ad_lbptr_list, lb_ptr_buf);
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
/* L2BLK_GET_PSIZE returns aligned size for log blocks */
|
|
|
|
uint64_t asize = L2BLK_GET_PSIZE(
|
|
|
|
(lb_ptr_buf->lb_ptr)->lbp_prop);
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* We don't worry about log blocks left behind (ie
|
2020-05-08 02:34:03 +03:00
|
|
|
* lbp_payload_start < l2ad_hand) because l2arc_write_buffers()
|
2020-04-10 20:33:35 +03:00
|
|
|
* will never write more than l2arc_evict() evicts.
|
|
|
|
*/
|
|
|
|
if (!all && l2arc_log_blkptr_valid(dev, lb_ptr_buf->lb_ptr)) {
|
|
|
|
break;
|
|
|
|
} else {
|
2020-06-09 20:15:08 +03:00
|
|
|
vdev_space_update(vd, -asize, 0, 0);
|
2020-05-08 02:34:03 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);
|
|
|
|
zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,
|
|
|
|
lb_ptr_buf);
|
|
|
|
zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf);
|
2020-04-10 20:33:35 +03:00
|
|
|
list_remove(&dev->l2ad_lbptr_list, lb_ptr_buf);
|
|
|
|
kmem_free(lb_ptr_buf->lb_ptr,
|
|
|
|
sizeof (l2arc_log_blkptr_t));
|
|
|
|
kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {
|
|
|
|
hdr_prev = list_prev(buflist, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-03-16 00:17:38 +03:00
|
|
|
ASSERT(!HDR_EMPTY(hdr));
|
2014-12-06 20:24:32 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We cannot use mutex_enter or else we can deadlock
|
|
|
|
* with l2arc_write_buffers (due to swapping the order
|
|
|
|
* the hash lock and l2ad_mtx are taken).
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
|
|
|
/*
|
|
|
|
* Missed the hash lock. Retry.
|
|
|
|
*/
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
2020-03-31 20:46:48 +03:00
|
|
|
goto retry;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-03-01 00:32:55 +03:00
|
|
|
/*
|
|
|
|
* A header can't be on this list if it doesn't have L2 header.
|
|
|
|
*/
|
|
|
|
ASSERT(HDR_HAS_L2HDR(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-01 00:32:55 +03:00
|
|
|
/* Ensure this header has finished being written. */
|
|
|
|
ASSERT(!HDR_L2_WRITING(hdr));
|
|
|
|
ASSERT(!HDR_L2_WRITE_HEAD(hdr));
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
if (!all && (hdr->b_l2hdr.b_daddr >= dev->l2ad_evict ||
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We've evicted to the target address,
|
|
|
|
* or the end of the device.
|
|
|
|
*/
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (!HDR_HAS_L1HDR(hdr)) {
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(!HDR_L2_READING(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This doesn't exist in the ARC. Destroy.
|
|
|
|
* arc_hdr_destroy() will call list_remove()
|
2017-03-11 20:48:35 +03:00
|
|
|
* and decrement arcstat_l2_lsize.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2022-12-22 23:10:24 +03:00
|
|
|
arc_change_state(arc_anon, hdr);
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_hdr_destroy(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_l1cached);
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Invalidate issued or about to be issued
|
|
|
|
* reads, since we may be about to write
|
|
|
|
* over this location.
|
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
if (HDR_L2_READING(hdr)) {
|
2008-12-03 23:09:06 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_reading);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2015-06-16 02:12:19 +03:00
|
|
|
arc_hdr_l2hdr_destroy(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2020-03-31 20:46:48 +03:00
|
|
|
|
|
|
|
out:
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* We need to check if we evict all buffers, otherwise we may iterate
|
|
|
|
* unnecessarily.
|
|
|
|
*/
|
|
|
|
if (!all && rerun) {
|
2020-03-31 20:46:48 +03:00
|
|
|
/*
|
|
|
|
* Bump device hand to the device start if it is approaching the
|
|
|
|
* end. l2arc_evict() has already evicted ahead for this case.
|
|
|
|
*/
|
|
|
|
dev->l2ad_hand = dev->l2ad_start;
|
2020-04-10 20:33:35 +03:00
|
|
|
dev->l2ad_evict = dev->l2ad_start;
|
2020-03-31 20:46:48 +03:00
|
|
|
dev->l2ad_first = B_FALSE;
|
|
|
|
goto top;
|
|
|
|
}
|
2020-05-08 02:34:03 +03:00
|
|
|
|
2020-11-16 20:08:11 +03:00
|
|
|
if (!all) {
|
|
|
|
/*
|
|
|
|
* In case of cache device removal (all) the following
|
|
|
|
* assertions may be violated without functional consequences
|
|
|
|
* as the device is about to be removed.
|
|
|
|
*/
|
|
|
|
ASSERT3U(dev->l2ad_hand + distance, <, dev->l2ad_end);
|
|
|
|
if (!dev->l2ad_first)
|
2023-06-10 03:05:47 +03:00
|
|
|
ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);
|
2020-11-16 20:08:11 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Handle any abd transforms that might be required for writing to the L2ARC.
|
|
|
|
* If successful, this function will always return an abd with the data
|
|
|
|
* transformed as it is on disk in a new abd of asize bytes.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
l2arc_apply_transforms(spa_t *spa, arc_buf_hdr_t *hdr, uint64_t asize,
|
|
|
|
abd_t **abd_out)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
void *tmp = NULL;
|
|
|
|
abd_t *cabd = NULL, *eabd = NULL, *to_write = hdr->b_l1hdr.b_pabd;
|
|
|
|
enum zio_compress compress = HDR_GET_COMPRESS(hdr);
|
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
|
|
|
uint64_t size = arc_hdr_size(hdr);
|
|
|
|
boolean_t ismd = HDR_ISTYPE_METADATA(hdr);
|
|
|
|
boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);
|
|
|
|
dsl_crypto_key_t *dck = NULL;
|
|
|
|
uint8_t mac[ZIO_DATA_MAC_LEN] = { 0 };
|
2017-09-12 23:15:11 +03:00
|
|
|
boolean_t no_crypt = B_FALSE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
ASSERT((HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
!HDR_COMPRESSION_ENABLED(hdr)) ||
|
|
|
|
HDR_ENCRYPTED(hdr) || HDR_SHARED_DATA(hdr) || psize != asize);
|
|
|
|
ASSERT3U(psize, <=, asize);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this data simply needs its own buffer, we simply allocate it
|
2019-06-06 23:14:48 +03:00
|
|
|
* and copy the data. This may be done to eliminate a dependency on a
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* shared buffer or to reallocate the buffer to match asize.
|
|
|
|
*/
|
2017-09-12 23:15:11 +03:00
|
|
|
if (HDR_HAS_RABD(hdr) && asize != psize) {
|
2018-04-01 01:14:21 +03:00
|
|
|
ASSERT3U(asize, >=, psize);
|
2017-09-12 23:15:11 +03:00
|
|
|
to_write = abd_alloc_for_io(asize, ismd);
|
2018-04-01 01:14:21 +03:00
|
|
|
abd_copy(to_write, hdr->b_crypt_hdr.b_rabd, psize);
|
|
|
|
if (psize != asize)
|
|
|
|
abd_zero_off(to_write, psize, asize - psize);
|
2017-09-12 23:15:11 +03:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if ((compress == ZIO_COMPRESS_OFF || HDR_COMPRESSION_ENABLED(hdr)) &&
|
|
|
|
!HDR_ENCRYPTED(hdr)) {
|
|
|
|
ASSERT3U(size, ==, psize);
|
|
|
|
to_write = abd_alloc_for_io(asize, ismd);
|
|
|
|
abd_copy(to_write, hdr->b_l1hdr.b_pabd, size);
|
|
|
|
if (size != asize)
|
|
|
|
abd_zero_off(to_write, size, asize - size);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (compress != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) {
|
2022-05-04 21:59:30 +03:00
|
|
|
/*
|
|
|
|
* In some cases, we can wind up with size > asize, so
|
|
|
|
* we need to opt for the larger allocation option here.
|
|
|
|
*
|
|
|
|
* (We also need abd_return_buf_copy in all cases because
|
|
|
|
* it's an ASSERT() to modify the buffer before returning it
|
|
|
|
* with arc_return_buf(), and all the compressors
|
|
|
|
* write things before deciding to fail compression in nearly
|
|
|
|
* every case.)
|
|
|
|
*/
|
2023-09-19 18:58:14 +03:00
|
|
|
uint64_t bufsize = MAX(size, asize);
|
|
|
|
cabd = abd_alloc_for_io(bufsize, ismd);
|
|
|
|
tmp = abd_borrow_buf(cabd, bufsize);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2023-02-28 01:41:02 +03:00
|
|
|
psize = zio_compress_data(compress, to_write, &tmp, size,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
hdr->b_complevel);
|
|
|
|
|
2022-05-04 21:59:30 +03:00
|
|
|
if (psize >= asize) {
|
|
|
|
psize = HDR_GET_PSIZE(hdr);
|
2023-09-19 18:58:14 +03:00
|
|
|
abd_return_buf_copy(cabd, tmp, bufsize);
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
|
|
|
|
to_write = cabd;
|
2022-05-04 21:59:30 +03:00
|
|
|
abd_copy(to_write, hdr->b_l1hdr.b_pabd, psize);
|
|
|
|
if (psize != asize)
|
|
|
|
abd_zero_off(to_write, psize, asize - psize);
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
goto encrypt;
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT3U(psize, <=, HDR_GET_PSIZE(hdr));
|
|
|
|
if (psize < asize)
|
2023-09-19 18:58:14 +03:00
|
|
|
memset((char *)tmp + psize, 0, bufsize - psize);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
psize = HDR_GET_PSIZE(hdr);
|
2023-09-19 18:58:14 +03:00
|
|
|
abd_return_buf_copy(cabd, tmp, bufsize);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
to_write = cabd;
|
|
|
|
}
|
|
|
|
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
encrypt:
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (HDR_ENCRYPTED(hdr)) {
|
|
|
|
eabd = abd_alloc_for_io(asize, ismd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the dataset was disowned before the buffer
|
|
|
|
* made it to this point, the key to re-encrypt
|
|
|
|
* it won't be available. In this case we simply
|
|
|
|
* won't write the buffer to the L2ARC.
|
|
|
|
*/
|
|
|
|
ret = spa_keystore_lookup_key(spa, hdr->b_crypt_hdr.b_dsobj,
|
|
|
|
FTAG, &dck);
|
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
|
|
|
|
2019-10-24 20:17:33 +03:00
|
|
|
ret = zio_do_crypt_abd(B_TRUE, &dck->dck_key,
|
2018-05-03 01:36:20 +03:00
|
|
|
hdr->b_crypt_hdr.b_ot, bswap, hdr->b_crypt_hdr.b_salt,
|
|
|
|
hdr->b_crypt_hdr.b_iv, mac, psize, to_write, eabd,
|
|
|
|
&no_crypt);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ret != 0)
|
|
|
|
goto error;
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
if (no_crypt)
|
|
|
|
abd_copy(eabd, to_write, psize);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
if (psize != asize)
|
|
|
|
abd_zero_off(eabd, psize, asize - psize);
|
|
|
|
|
|
|
|
/* assert that the MAC we got here matches the one we saved */
|
2022-02-25 16:26:54 +03:00
|
|
|
ASSERT0(memcmp(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN));
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
spa_keystore_dsl_key_rele(spa, dck, FTAG);
|
|
|
|
|
|
|
|
if (to_write == cabd)
|
|
|
|
abd_free(cabd);
|
|
|
|
|
|
|
|
to_write = eabd;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
ASSERT3P(to_write, !=, hdr->b_l1hdr.b_pabd);
|
|
|
|
*abd_out = to_write;
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
error:
|
|
|
|
if (dck != NULL)
|
|
|
|
spa_keystore_dsl_key_rele(spa, dck, FTAG);
|
|
|
|
if (cabd != NULL)
|
|
|
|
abd_free(cabd);
|
|
|
|
if (eabd != NULL)
|
|
|
|
abd_free(eabd);
|
|
|
|
|
|
|
|
*abd_out = NULL;
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
static void
|
|
|
|
l2arc_blk_fetch_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
l2arc_read_callback_t *cb;
|
|
|
|
|
|
|
|
cb = zio->io_private;
|
|
|
|
if (cb->l2rcb_abd != NULL)
|
2021-01-20 22:24:37 +03:00
|
|
|
abd_free(cb->l2rcb_abd);
|
2020-04-10 20:33:35 +03:00
|
|
|
kmem_free(cb, sizeof (l2arc_read_callback_t));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Find and write ARC buffers to the L2ARC device.
|
|
|
|
*
|
2014-12-06 20:24:32 +03:00
|
|
|
* An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
|
2008-11-20 23:01:55 +03:00
|
|
|
* for reading until they have completed writing.
|
2013-08-02 00:02:10 +04:00
|
|
|
* The headroom_boost is an in-out parameter used to maintain headroom boost
|
|
|
|
* state between calls to this function.
|
|
|
|
*
|
|
|
|
* Returns the number of bytes actually written (which may be smaller than
|
2020-04-10 20:33:35 +03:00
|
|
|
* the delta by which the device hand has changed due to alignment and the
|
|
|
|
* writing of log blocks).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
static uint64_t
|
2016-06-02 07:04:53 +03:00
|
|
|
l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2020-04-10 20:33:35 +03:00
|
|
|
arc_buf_hdr_t *hdr, *hdr_prev, *head;
|
|
|
|
uint64_t write_asize, write_psize, write_lsize, headroom;
|
|
|
|
boolean_t full;
|
|
|
|
l2arc_write_callback_t *cb = NULL;
|
|
|
|
zio_t *pio, *wzio;
|
|
|
|
uint64_t guid = spa_load_guid(spa);
|
2021-01-28 20:20:03 +03:00
|
|
|
l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(dev->l2ad_vdev, !=, NULL);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
pio = NULL;
|
2017-03-11 20:48:35 +03:00
|
|
|
write_lsize = write_asize = write_psize = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
full = B_FALSE;
|
2014-12-30 06:12:23 +03:00
|
|
|
head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Copy buffers for L2ARC writing.
|
|
|
|
*/
|
2021-01-24 02:49:32 +03:00
|
|
|
for (int pass = 0; pass < L2ARC_FEED_TYPES; pass++) {
|
2020-09-08 21:44:37 +03:00
|
|
|
/*
|
2021-01-24 02:49:32 +03:00
|
|
|
* If pass == 1 or 3, we cache MRU metadata and data
|
2020-09-08 21:44:37 +03:00
|
|
|
* respectively.
|
|
|
|
*/
|
|
|
|
if (l2arc_mfuonly) {
|
2021-01-24 02:49:32 +03:00
|
|
|
if (pass == 1 || pass == 3)
|
2020-09-08 21:44:37 +03:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2021-01-24 02:49:32 +03:00
|
|
|
multilist_sublist_t *mls = l2arc_sublist_lock(pass);
|
2013-08-02 00:02:10 +04:00
|
|
|
uint64_t passed_sz = 0;
|
|
|
|
|
2016-11-01 02:04:01 +03:00
|
|
|
VERIFY3P(mls, !=, NULL);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* L2ARC fast warmup.
|
|
|
|
*
|
|
|
|
* Until the ARC is warm and starts to evict, read from the
|
|
|
|
* head of the ARC lists rather than the tail.
|
|
|
|
*/
|
|
|
|
if (arc_warm == B_FALSE)
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr = multilist_sublist_head(mls);
|
2008-12-03 23:09:06 +03:00
|
|
|
else
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr = multilist_sublist_tail(mls);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
headroom = target_sz * l2arc_headroom;
|
2016-06-02 07:04:53 +03:00
|
|
|
if (zfs_compressed_arc_enabled)
|
2013-08-02 00:02:10 +04:00
|
|
|
headroom = (headroom * l2arc_headroom_boost) / 100;
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
for (; hdr; hdr = hdr_prev) {
|
2013-08-02 00:02:10 +04:00
|
|
|
kmutex_t *hash_lock;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
abd_t *to_write = NULL;
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (arc_warm == B_FALSE)
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr_prev = multilist_sublist_next(mls, hdr);
|
2008-12-03 23:09:06 +03:00
|
|
|
else
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr_prev = multilist_sublist_prev(mls, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2013-08-02 00:02:10 +04:00
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Skip this buffer rather than waiting.
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
passed_sz += HDR_GET_LSIZE(hdr);
|
2020-04-10 20:33:35 +03:00
|
|
|
if (l2arc_headroom != 0 && passed_sz > headroom) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Searched too far.
|
|
|
|
*/
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (!l2arc_write_eligible(guid, hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2017-03-11 20:48:35 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
|
|
|
ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);
|
|
|
|
ASSERT3U(arc_hdr_size(hdr), >, 0);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_pabd != NULL ||
|
|
|
|
HDR_HAS_RABD(hdr));
|
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
2017-03-11 20:48:35 +03:00
|
|
|
uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
|
|
|
|
psize);
|
|
|
|
|
2023-06-10 03:05:47 +03:00
|
|
|
/*
|
|
|
|
* If the allocated size of this buffer plus the max
|
|
|
|
* size for the pending log block exceeds the evicted
|
|
|
|
* target size, terminate writing buffers for this run.
|
|
|
|
*/
|
|
|
|
if (write_asize + asize +
|
|
|
|
sizeof (l2arc_log_blk_phys_t) > target_sz) {
|
2008-11-20 23:01:55 +03:00
|
|
|
full = B_TRUE;
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* We rely on the L1 portion of the header below, so
|
|
|
|
* it's invalid for this header to have been evicted out
|
|
|
|
* of the ghost cache, prior to being written out. The
|
|
|
|
* ARC_FLAG_L2_WRITING bit ensures this won't happen.
|
|
|
|
*/
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2_WRITING);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this header has b_rabd, we can use this since it
|
|
|
|
* must always match the data exactly as it exists on
|
2020-08-11 23:16:57 +03:00
|
|
|
* disk. Otherwise, the L2ARC can normally use the
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* hdr's data, but if we're sharing data between the
|
|
|
|
* hdr and one of its bufs, L2ARC needs its own copy of
|
|
|
|
* the data so that the ZIO below can't race with the
|
|
|
|
* buf consumer. To ensure that this copy will be
|
|
|
|
* available for the lifetime of the ZIO and be cleaned
|
|
|
|
* up afterwards, we add it to the l2arc_free_on_write
|
|
|
|
* queue. If we need to apply any transforms to the
|
|
|
|
* data (compression, encryption) we will also need the
|
|
|
|
* extra buffer.
|
|
|
|
*/
|
|
|
|
if (HDR_HAS_RABD(hdr) && psize == asize) {
|
|
|
|
to_write = hdr->b_crypt_hdr.b_rabd;
|
|
|
|
} else if ((HDR_COMPRESSION_ENABLED(hdr) ||
|
|
|
|
HDR_GET_COMPRESS(hdr) == ZIO_COMPRESS_OFF) &&
|
|
|
|
!HDR_ENCRYPTED(hdr) && !HDR_SHARED_DATA(hdr) &&
|
|
|
|
psize == asize) {
|
|
|
|
to_write = hdr->b_l1hdr.b_pabd;
|
|
|
|
} else {
|
|
|
|
int ret;
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
|
|
|
ret = l2arc_apply_transforms(spa, hdr, asize,
|
|
|
|
&to_write);
|
|
|
|
if (ret != 0) {
|
|
|
|
arc_hdr_clear_flags(hdr,
|
|
|
|
ARC_FLAG_L2_WRITING);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
l2arc_free_abd_on_write(to_write, asize, type);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (pio == NULL) {
|
|
|
|
/*
|
|
|
|
* Insert a dummy header on the buflist so
|
|
|
|
* l2arc_write_done() can find where the
|
|
|
|
* write buffers begin without searching.
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-30 06:12:23 +03:00
|
|
|
list_insert_head(&dev->l2ad_buflist, head);
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-29 20:02:03 +03:00
|
|
|
cb = kmem_alloc(
|
|
|
|
sizeof (l2arc_write_callback_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
cb->l2wcb_dev = dev;
|
|
|
|
cb->l2wcb_head = head;
|
2020-05-08 02:34:03 +03:00
|
|
|
/*
|
|
|
|
* Create a list to save allocated abd buffers
|
|
|
|
* for l2arc_log_blk_commit().
|
|
|
|
*/
|
2020-04-10 20:33:35 +03:00
|
|
|
list_create(&cb->l2wcb_abd_list,
|
|
|
|
sizeof (l2arc_lb_abd_buf_t),
|
|
|
|
offsetof(l2arc_lb_abd_buf_t, node));
|
2008-11-20 23:01:55 +03:00
|
|
|
pio = zio_root(spa, l2arc_write_done, cb,
|
|
|
|
ZIO_FLAG_CANFAIL);
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l2hdr.b_dev = dev;
|
|
|
|
hdr->b_l2hdr.b_hits = 0;
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
hdr->b_l2hdr.b_arcs_state =
|
|
|
|
hdr->b_l1hdr.b_state->arcs_state;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_HAS_L2HDR);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-30 06:12:23 +03:00
|
|
|
list_insert_head(&dev->l2ad_buflist, hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-10-01 20:42:05 +03:00
|
|
|
(void) zfs_refcount_add_many(&dev->l2ad_alloc,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
arc_hdr_size(hdr), hdr);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
wzio = zio_write_phys(pio, dev->l2ad_vdev,
|
2017-06-27 03:32:43 +03:00
|
|
|
hdr->b_l2hdr.b_daddr, asize, to_write,
|
2016-06-02 07:04:53 +03:00
|
|
|
ZIO_CHECKSUM_OFF, NULL, hdr,
|
|
|
|
ZIO_PRIORITY_ASYNC_WRITE,
|
2008-11-20 23:01:55 +03:00
|
|
|
ZIO_FLAG_CANFAIL, B_FALSE);
|
|
|
|
|
2017-03-11 20:48:35 +03:00
|
|
|
write_lsize += HDR_GET_LSIZE(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
|
|
|
|
zio_t *, wzio);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2017-03-11 20:48:35 +03:00
|
|
|
write_psize += psize;
|
|
|
|
write_asize += asize;
|
2016-06-02 07:04:53 +03:00
|
|
|
dev->l2ad_hand += asize;
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
l2arc_hdr_arcstats_increment(hdr);
|
2019-01-31 20:16:39 +03:00
|
|
|
vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Append buf info to current log and commit if full.
|
|
|
|
* arcstat_l2_{size,asize} kstats are updated
|
|
|
|
* internally.
|
|
|
|
*/
|
2023-06-06 22:32:37 +03:00
|
|
|
if (l2arc_log_blk_insert(dev, hdr)) {
|
|
|
|
/*
|
2023-06-10 03:05:47 +03:00
|
|
|
* l2ad_hand will be adjusted in
|
2023-06-06 22:32:37 +03:00
|
|
|
* l2arc_log_blk_commit().
|
|
|
|
*/
|
|
|
|
write_asize +=
|
|
|
|
l2arc_log_blk_commit(dev, pio, cb);
|
|
|
|
}
|
2020-04-10 20:33:35 +03:00
|
|
|
|
2020-02-29 01:49:44 +03:00
|
|
|
zio_nowait(wzio);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
multilist_sublist_unlock(mls);
|
|
|
|
|
|
|
|
if (full == B_TRUE)
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* No buffers selected for writing? */
|
|
|
|
if (pio == NULL) {
|
2017-03-11 20:48:35 +03:00
|
|
|
ASSERT0(write_lsize);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!HDR_HAS_L1HDR(head));
|
|
|
|
kmem_cache_free(hdr_l2only_cache, head);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Although we did not write any buffers l2ad_evict may
|
|
|
|
* have advanced.
|
|
|
|
*/
|
2021-01-28 20:20:03 +03:00
|
|
|
if (dev->l2ad_evict != l2dhdr->dh_evict)
|
|
|
|
l2arc_dev_hdr_update(dev);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
if (!dev->l2ad_first)
|
|
|
|
ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
ASSERT3U(write_asize, <=, target_sz);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_sent);
|
2017-03-11 20:48:35 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
dev->l2ad_writing = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) zio_wait(pio);
|
2009-02-18 23:51:31 +03:00
|
|
|
dev->l2ad_writing = B_FALSE;
|
|
|
|
|
2020-07-11 00:10:03 +03:00
|
|
|
/*
|
|
|
|
* Update the device header after the zio completes as
|
|
|
|
* l2arc_write_done() may have updated the memory holding the log block
|
|
|
|
* pointers in the device header.
|
|
|
|
*/
|
|
|
|
l2arc_dev_hdr_update(dev);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
return (write_asize);
|
|
|
|
}
|
|
|
|
|
2020-08-26 00:33:36 +03:00
|
|
|
static boolean_t
|
|
|
|
l2arc_hdr_limit_reached(void)
|
|
|
|
{
|
2021-06-17 03:19:34 +03:00
|
|
|
int64_t s = aggsum_upper_bound(&arc_sums.arcstat_l2_hdr_size);
|
2020-08-26 00:33:36 +03:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
return (arc_reclaim_needed() ||
|
2020-08-26 00:33:36 +03:00
|
|
|
(s > (arc_warm ? arc_c : arc_c_max) * l2arc_meta_percent / 100));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This thread feeds the L2ARC at regular intervals. This is the beating
|
|
|
|
* heart of the L2ARC.
|
|
|
|
*/
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547
Closes #5503
Closes #5523
Closes #6377
Closes #6495
2017-08-11 18:51:44 +03:00
|
|
|
l2arc_feed_thread(void *unused)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) unused;
|
2008-11-20 23:01:55 +03:00
|
|
|
callb_cpr_t cpr;
|
|
|
|
l2arc_dev_t *dev;
|
|
|
|
spa_t *spa;
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t size, wrote;
|
2010-05-29 00:45:14 +04:00
|
|
|
clock_t begin, next = ddi_get_lbolt();
|
2015-03-31 06:43:29 +03:00
|
|
|
fstrans_cookie_t cookie;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_feed_thr_lock);
|
|
|
|
|
2015-03-31 06:43:29 +03:00
|
|
|
cookie = spl_fstrans_mark();
|
2008-11-20 23:01:55 +03:00
|
|
|
while (l2arc_thread_exit == 0) {
|
|
|
|
CALLB_CPR_SAFE_BEGIN(&cpr);
|
2020-09-04 06:04:09 +03:00
|
|
|
(void) cv_timedwait_idle(&l2arc_feed_thr_cv,
|
2010-12-10 23:00:00 +03:00
|
|
|
&l2arc_feed_thr_lock, next);
|
2008-11-20 23:01:55 +03:00
|
|
|
CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
next = ddi_get_lbolt() + hz;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Quick check for L2ARC devices.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
if (l2arc_ndev == 0) {
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
continue;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
2010-05-29 00:45:14 +04:00
|
|
|
begin = ddi_get_lbolt();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* This selects the next l2arc device to write to, and in
|
|
|
|
* doing so the next spa to feed from: dev->l2ad_spa. This
|
|
|
|
* will return NULL if there are now no l2arc devices or if
|
|
|
|
* they are all faulted.
|
|
|
|
*
|
|
|
|
* If a device is returned, its spa's config lock is also
|
|
|
|
* held to prevent device removal. l2arc_dev_get_next()
|
|
|
|
* will grab and release l2arc_dev_mtx.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if ((dev = l2arc_dev_get_next()) == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
spa = dev->l2ad_spa;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(spa, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
|
|
|
* If the pool is read-only then force the feed thread to
|
|
|
|
* sleep a little longer.
|
|
|
|
*/
|
|
|
|
if (!spa_writeable(spa)) {
|
|
|
|
next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Avoid contributing to memory pressure.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2020-08-26 00:33:36 +03:00
|
|
|
if (l2arc_hdr_limit_reached()) {
|
2008-12-03 23:09:06 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_feeds);
|
|
|
|
|
2020-03-31 20:46:48 +03:00
|
|
|
size = l2arc_write_size(dev);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Evict L2ARC buffers that will be overwritten.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_evict(dev, size, B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Write ARC buffers.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
wrote = l2arc_write_buffers(spa, dev, size);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate interval between writes.
|
|
|
|
*/
|
|
|
|
next = l2arc_write_interval(begin, size, wrote);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-03-31 06:43:29 +03:00
|
|
|
spl_fstrans_unmark(cookie);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
l2arc_thread_exit = 0;
|
|
|
|
cv_broadcast(&l2arc_feed_thr_cv);
|
|
|
|
CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
boolean_t
|
|
|
|
l2arc_vdev_present(vdev_t *vd)
|
|
|
|
{
|
2020-04-10 20:33:35 +03:00
|
|
|
return (l2arc_vdev_get(vd) != NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Returns the l2arc_dev_t associated with a particular vdev_t or NULL if
|
|
|
|
* the vdev_t isn't an L2ARC device.
|
|
|
|
*/
|
2020-06-09 20:15:08 +03:00
|
|
|
l2arc_dev_t *
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_vdev_get(vdev_t *vd)
|
|
|
|
{
|
|
|
|
l2arc_dev_t *dev;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
for (dev = list_head(l2arc_dev_list); dev != NULL;
|
|
|
|
dev = list_next(l2arc_dev_list, dev)) {
|
|
|
|
if (dev->l2ad_vdev == vd)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
return (dev);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2021-07-26 22:30:24 +03:00
|
|
|
static void
|
|
|
|
l2arc_rebuild_dev(l2arc_dev_t *dev, boolean_t reopen)
|
|
|
|
{
|
|
|
|
l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
|
|
|
|
uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
|
|
|
|
spa_t *spa = dev->l2ad_spa;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The L2ARC has to hold at least the payload of one log block for
|
|
|
|
* them to be restored (persistent L2ARC). The payload of a log block
|
|
|
|
* depends on the amount of its log entries. We always write log blocks
|
|
|
|
* with 1022 entries. How many of them are committed or restored depends
|
|
|
|
* on the size of the L2ARC device. Thus the maximum payload of
|
|
|
|
* one log block is 1022 * SPA_MAXBLOCKSIZE = 16GB. If the L2ARC device
|
|
|
|
* is less than that, we reduce the amount of committed and restored
|
|
|
|
* log entries per block so as to enable persistence.
|
|
|
|
*/
|
|
|
|
if (dev->l2ad_end < l2arc_rebuild_blocks_min_l2size) {
|
|
|
|
dev->l2ad_log_entries = 0;
|
|
|
|
} else {
|
|
|
|
dev->l2ad_log_entries = MIN((dev->l2ad_end -
|
|
|
|
dev->l2ad_start) >> SPA_MAXBLOCKSHIFT,
|
|
|
|
L2ARC_LOG_BLK_MAX_ENTRIES);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the device header, if an error is returned do not rebuild L2ARC.
|
|
|
|
*/
|
|
|
|
if (l2arc_dev_hdr_read(dev) == 0 && dev->l2ad_log_entries > 0) {
|
|
|
|
/*
|
|
|
|
* If we are onlining a cache device (vdev_reopen) that was
|
|
|
|
* still present (l2arc_vdev_present()) and rebuild is enabled,
|
|
|
|
* we should evict all ARC buffers and pointers to log blocks
|
|
|
|
* and reclaim their space before restoring its contents to
|
|
|
|
* L2ARC.
|
|
|
|
*/
|
|
|
|
if (reopen) {
|
|
|
|
if (!l2arc_rebuild_enabled) {
|
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
l2arc_evict(dev, 0, B_TRUE);
|
|
|
|
/* start a new log block */
|
|
|
|
dev->l2ad_log_ent_idx = 0;
|
|
|
|
dev->l2ad_log_blk_payload_asize = 0;
|
|
|
|
dev->l2ad_log_blk_payload_start = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Just mark the device as pending for a rebuild. We won't
|
|
|
|
* be starting a rebuild in line here as it would block pool
|
|
|
|
* import. Instead spa_load_impl will hand that off to an
|
|
|
|
* async task which will call l2arc_spa_rebuild_start.
|
|
|
|
*/
|
|
|
|
dev->l2ad_rebuild = B_TRUE;
|
|
|
|
} else if (spa_writeable(spa)) {
|
|
|
|
/*
|
|
|
|
* In this case TRIM the whole device if l2arc_trim_ahead > 0,
|
|
|
|
* otherwise create a new header. We zero out the memory holding
|
|
|
|
* the header to reset dh_start_lbps. If we TRIM the whole
|
|
|
|
* device the new header will be written by
|
|
|
|
* vdev_trim_l2arc_thread() at the end of the TRIM to update the
|
|
|
|
* trim_state in the header too. When reading the header, if
|
|
|
|
* trim_state is not VDEV_TRIM_COMPLETE and l2arc_trim_ahead > 0
|
|
|
|
* we opt to TRIM the whole device again.
|
|
|
|
*/
|
|
|
|
if (l2arc_trim_ahead > 0) {
|
|
|
|
dev->l2ad_trim_all = B_TRUE;
|
|
|
|
} else {
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(l2dhdr, 0, l2dhdr_asize);
|
2021-07-26 22:30:24 +03:00
|
|
|
l2arc_dev_hdr_update(dev);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Add a vdev for use by the L2ARC. By this point the spa has already
|
|
|
|
* validated the vdev and opened it.
|
|
|
|
*/
|
|
|
|
void
|
2009-07-03 02:44:48 +04:00
|
|
|
l2arc_add_vdev(spa_t *spa, vdev_t *vd)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_dev_t *adddev;
|
|
|
|
uint64_t l2dhdr_asize;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(!l2arc_vdev_present(vd));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Create a new l2arc device entry.
|
|
|
|
*/
|
2020-04-10 20:33:35 +03:00
|
|
|
adddev = vmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
adddev->l2ad_spa = spa;
|
|
|
|
adddev->l2ad_vdev = vd;
|
2020-04-10 20:33:35 +03:00
|
|
|
/* leave extra size for an l2arc device header */
|
|
|
|
l2dhdr_asize = adddev->l2ad_dev_hdr_asize =
|
|
|
|
MAX(sizeof (*adddev->l2ad_dev_hdr), 1 << vd->vdev_ashift);
|
|
|
|
adddev->l2ad_start = VDEV_LABEL_START_SIZE + l2dhdr_asize;
|
2009-07-03 02:44:48 +04:00
|
|
|
adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
|
2020-04-10 20:33:35 +03:00
|
|
|
ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);
|
2008-11-20 23:01:55 +03:00
|
|
|
adddev->l2ad_hand = adddev->l2ad_start;
|
2020-04-10 20:33:35 +03:00
|
|
|
adddev->l2ad_evict = adddev->l2ad_start;
|
2008-11-20 23:01:55 +03:00
|
|
|
adddev->l2ad_first = B_TRUE;
|
2009-02-18 23:51:31 +03:00
|
|
|
adddev->l2ad_writing = B_FALSE;
|
2020-06-09 20:15:08 +03:00
|
|
|
adddev->l2ad_trim_all = B_FALSE;
|
2010-08-26 21:26:44 +04:00
|
|
|
list_link_init(&adddev->l2ad_node);
|
2020-04-10 20:33:35 +03:00
|
|
|
adddev->l2ad_dev_hdr = kmem_zalloc(l2dhdr_asize, KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This is a list of all ARC buffers that are still valid on the
|
|
|
|
* device.
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* This is a list of pointers to log blocks that are still present
|
|
|
|
* on the device.
|
|
|
|
*/
|
|
|
|
list_create(&adddev->l2ad_lbptr_list, sizeof (l2arc_lb_ptr_buf_t),
|
|
|
|
offsetof(l2arc_lb_ptr_buf_t, node));
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_create(&adddev->l2ad_alloc);
|
2020-05-08 02:34:03 +03:00
|
|
|
zfs_refcount_create(&adddev->l2ad_lb_asize);
|
|
|
|
zfs_refcount_create(&adddev->l2ad_lb_count);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-07-26 22:30:24 +03:00
|
|
|
/*
|
|
|
|
* Decide if dev is eligible for L2ARC rebuild or whole device
|
|
|
|
* trimming. This has to happen before the device is added in the
|
|
|
|
* cache device list and l2arc_dev_mtx is released. Otherwise
|
|
|
|
* l2arc_feed_thread() might already start writing on the
|
|
|
|
* device.
|
|
|
|
*/
|
|
|
|
l2arc_rebuild_dev(adddev, B_FALSE);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Add device to global list
|
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
list_insert_head(l2arc_dev_list, adddev);
|
|
|
|
atomic_inc_64(&l2arc_ndev);
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
2020-04-10 20:33:35 +03:00
|
|
|
}
|
|
|
|
|
2021-07-26 22:30:24 +03:00
|
|
|
/*
|
|
|
|
* Decide if a vdev is eligible for L2ARC rebuild, called from vdev_reopen()
|
|
|
|
* in case of onlining a cache device.
|
|
|
|
*/
|
2020-04-10 20:33:35 +03:00
|
|
|
void
|
|
|
|
l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen)
|
|
|
|
{
|
|
|
|
l2arc_dev_t *dev = NULL;
|
|
|
|
|
|
|
|
dev = l2arc_vdev_get(vd);
|
|
|
|
ASSERT3P(dev, !=, NULL);
|
|
|
|
|
|
|
|
/*
|
2021-07-26 22:30:24 +03:00
|
|
|
* In contrast to l2arc_add_vdev() we do not have to worry about
|
|
|
|
* l2arc_feed_thread() invalidating previous content when onlining a
|
|
|
|
* cache device. The device parameters (l2ad*) are not cleared when
|
|
|
|
* offlining the device and writing new buffers will not invalidate
|
|
|
|
* all previous content. In worst case only buffers that have not had
|
|
|
|
* their log block written to the device will be lost.
|
|
|
|
* When onlining the cache device (ie offline->online without exporting
|
|
|
|
* the pool in between) this happens:
|
|
|
|
* vdev_reopen() -> vdev_open() -> l2arc_rebuild_vdev()
|
|
|
|
* | |
|
|
|
|
* vdev_is_dead() = B_FALSE l2ad_rebuild = B_TRUE
|
|
|
|
* During the time where vdev_is_dead = B_FALSE and until l2ad_rebuild
|
|
|
|
* is set to B_TRUE we might write additional buffers to the device.
|
|
|
|
*/
|
|
|
|
l2arc_rebuild_dev(dev, reopen);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a vdev from the L2ARC.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
l2arc_remove_vdev(vdev_t *vd)
|
|
|
|
{
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_dev_t *remdev = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the device by vdev
|
|
|
|
*/
|
2020-04-10 20:33:35 +03:00
|
|
|
remdev = l2arc_vdev_get(vd);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(remdev, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Cancel any ongoing or scheduled rebuild.
|
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_rebuild_thr_lock);
|
|
|
|
if (remdev->l2ad_rebuild_began == B_TRUE) {
|
|
|
|
remdev->l2ad_rebuild_cancel = B_TRUE;
|
|
|
|
while (remdev->l2ad_rebuild == B_TRUE)
|
|
|
|
cv_wait(&l2arc_rebuild_thr_cv, &l2arc_rebuild_thr_lock);
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_rebuild_thr_lock);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Remove device from global list
|
|
|
|
*/
|
2020-04-10 20:33:35 +03:00
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
list_remove(l2arc_dev_list, remdev);
|
|
|
|
l2arc_dev_last = NULL; /* may have been invalidated */
|
2008-12-03 23:09:06 +03:00
|
|
|
atomic_dec_64(&l2arc_ndev);
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Clear all buflists and ARC references. L2ARC device flush.
|
|
|
|
*/
|
|
|
|
l2arc_evict(remdev, 0, B_TRUE);
|
2014-12-30 06:12:23 +03:00
|
|
|
list_destroy(&remdev->l2ad_buflist);
|
2020-04-10 20:33:35 +03:00
|
|
|
ASSERT(list_is_empty(&remdev->l2ad_lbptr_list));
|
|
|
|
list_destroy(&remdev->l2ad_lbptr_list);
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_destroy(&remdev->l2ad_mtx);
|
2018-10-01 20:42:05 +03:00
|
|
|
zfs_refcount_destroy(&remdev->l2ad_alloc);
|
2020-05-08 02:34:03 +03:00
|
|
|
zfs_refcount_destroy(&remdev->l2ad_lb_asize);
|
|
|
|
zfs_refcount_destroy(&remdev->l2ad_lb_count);
|
2020-04-10 20:33:35 +03:00
|
|
|
kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);
|
|
|
|
vmem_free(remdev, sizeof (l2arc_dev_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_init(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
l2arc_thread_exit = 0;
|
|
|
|
l2arc_ndev = 0;
|
|
|
|
|
|
|
|
mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
|
2020-04-10 20:33:35 +03:00
|
|
|
mutex_init(&l2arc_rebuild_thr_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&l2arc_rebuild_thr_cv, NULL, CV_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
|
|
|
|
l2arc_dev_list = &L2ARC_dev_list;
|
|
|
|
l2arc_free_on_write = &L2ARC_free_on_write;
|
|
|
|
list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
|
|
|
|
offsetof(l2arc_dev_t, l2ad_node));
|
|
|
|
list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
|
|
|
|
offsetof(l2arc_data_free_t, l2df_list_node));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_fini(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
mutex_destroy(&l2arc_feed_thr_lock);
|
|
|
|
cv_destroy(&l2arc_feed_thr_cv);
|
2020-04-10 20:33:35 +03:00
|
|
|
mutex_destroy(&l2arc_rebuild_thr_lock);
|
|
|
|
cv_destroy(&l2arc_rebuild_thr_cv);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_destroy(&l2arc_dev_mtx);
|
|
|
|
mutex_destroy(&l2arc_free_on_write_mtx);
|
|
|
|
|
|
|
|
list_destroy(l2arc_dev_list);
|
|
|
|
list_destroy(l2arc_free_on_write);
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
void
|
|
|
|
l2arc_start(void)
|
|
|
|
{
|
2019-11-21 20:32:57 +03:00
|
|
|
if (!(spa_mode_global & SPA_MODE_WRITE))
|
2008-12-03 23:09:06 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
|
2015-07-24 20:08:31 +03:00
|
|
|
TS_RUN, defclsyspri);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
l2arc_stop(void)
|
|
|
|
{
|
2019-11-21 20:32:57 +03:00
|
|
|
if (!(spa_mode_global & SPA_MODE_WRITE))
|
2008-12-03 23:09:06 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_feed_thr_lock);
|
|
|
|
cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
|
|
|
|
l2arc_thread_exit = 1;
|
|
|
|
while (l2arc_thread_exit != 0)
|
|
|
|
cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
|
|
|
|
mutex_exit(&l2arc_feed_thr_lock);
|
|
|
|
}
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Punches out rebuild threads for the L2ARC devices in a spa. This should
|
|
|
|
* be called after pool import from the spa async thread, since starting
|
|
|
|
* these threads directly from spa_import() will make them part of the
|
|
|
|
* "zpool import" context and delay process exit (and thus pool import).
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
l2arc_spa_rebuild_start(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Locate the spa's l2arc devices and kick off rebuild threads.
|
|
|
|
*/
|
|
|
|
for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {
|
|
|
|
l2arc_dev_t *dev =
|
|
|
|
l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);
|
|
|
|
if (dev == NULL) {
|
|
|
|
/* Don't attempt a rebuild if the vdev is UNAVAIL */
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
mutex_enter(&l2arc_rebuild_thr_lock);
|
|
|
|
if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {
|
|
|
|
dev->l2ad_rebuild_began = B_TRUE;
|
2020-08-17 21:02:32 +03:00
|
|
|
(void) thread_create(NULL, 0, l2arc_dev_rebuild_thread,
|
2020-04-10 20:33:35 +03:00
|
|
|
dev, 0, &p0, TS_RUN, minclsyspri);
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_rebuild_thr_lock);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Main entry point for L2ARC rebuilding.
|
|
|
|
*/
|
2022-03-23 18:51:00 +03:00
|
|
|
static __attribute__((noreturn)) void
|
2020-08-17 21:02:32 +03:00
|
|
|
l2arc_dev_rebuild_thread(void *arg)
|
2020-04-10 20:33:35 +03:00
|
|
|
{
|
2020-08-17 21:02:32 +03:00
|
|
|
l2arc_dev_t *dev = arg;
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
VERIFY(!dev->l2ad_rebuild_cancel);
|
|
|
|
VERIFY(dev->l2ad_rebuild);
|
|
|
|
(void) l2arc_rebuild(dev);
|
|
|
|
mutex_enter(&l2arc_rebuild_thr_lock);
|
|
|
|
dev->l2ad_rebuild_began = B_FALSE;
|
|
|
|
dev->l2ad_rebuild = B_FALSE;
|
|
|
|
mutex_exit(&l2arc_rebuild_thr_lock);
|
|
|
|
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function implements the actual L2ARC metadata rebuild. It:
|
|
|
|
* starts reading the log block chain and restores each block's contents
|
|
|
|
* to memory (reconstructing arc_buf_hdr_t's).
|
|
|
|
*
|
|
|
|
* Operation stops under any of the following conditions:
|
|
|
|
*
|
|
|
|
* 1) We reach the end of the log block chain.
|
|
|
|
* 2) We encounter *any* error condition (cksum errors, io errors)
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
l2arc_rebuild(l2arc_dev_t *dev)
|
|
|
|
{
|
|
|
|
vdev_t *vd = dev->l2ad_vdev;
|
|
|
|
spa_t *spa = vd->vdev_spa;
|
2020-05-08 02:34:03 +03:00
|
|
|
int err = 0;
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
|
|
|
|
l2arc_log_blk_phys_t *this_lb, *next_lb;
|
|
|
|
zio_t *this_io = NULL, *next_io = NULL;
|
|
|
|
l2arc_log_blkptr_t lbps[2];
|
|
|
|
l2arc_lb_ptr_buf_t *lb_ptr_buf;
|
|
|
|
boolean_t lock_held;
|
|
|
|
|
|
|
|
this_lb = vmem_zalloc(sizeof (*this_lb), KM_SLEEP);
|
|
|
|
next_lb = vmem_zalloc(sizeof (*next_lb), KM_SLEEP);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We prevent device removal while issuing reads to the device,
|
|
|
|
* then during the rebuilding phases we drop this lock again so
|
|
|
|
* that a spa_unload or device remove can be initiated - this is
|
|
|
|
* safe, because the spa will signal us to stop before removing
|
|
|
|
* our device and wait for us to stop.
|
|
|
|
*/
|
|
|
|
spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);
|
|
|
|
lock_held = B_TRUE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Retrieve the persistent L2ARC device state.
|
2020-05-08 02:34:03 +03:00
|
|
|
* L2BLK_GET_PSIZE returns aligned size for log blocks.
|
2020-04-10 20:33:35 +03:00
|
|
|
*/
|
|
|
|
dev->l2ad_evict = MAX(l2dhdr->dh_evict, dev->l2ad_start);
|
|
|
|
dev->l2ad_hand = MAX(l2dhdr->dh_start_lbps[0].lbp_daddr +
|
|
|
|
L2BLK_GET_PSIZE((&l2dhdr->dh_start_lbps[0])->lbp_prop),
|
|
|
|
dev->l2ad_start);
|
|
|
|
dev->l2ad_first = !!(l2dhdr->dh_flags & L2ARC_DEV_HDR_EVICT_FIRST);
|
|
|
|
|
2020-06-09 20:15:08 +03:00
|
|
|
vd->vdev_trim_action_time = l2dhdr->dh_trim_action_time;
|
|
|
|
vd->vdev_trim_state = l2dhdr->dh_trim_state;
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* In case the zfs module parameter l2arc_rebuild_enabled is false
|
|
|
|
* we do not start the rebuild process.
|
|
|
|
*/
|
|
|
|
if (!l2arc_rebuild_enabled)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Prepare the rebuild process */
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(lbps, l2dhdr->dh_start_lbps, sizeof (lbps));
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
/* Start the rebuild process */
|
|
|
|
for (;;) {
|
|
|
|
if (!l2arc_log_blkptr_valid(dev, &lbps[0]))
|
|
|
|
break;
|
|
|
|
|
|
|
|
if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],
|
|
|
|
this_lb, next_lb, this_io, &next_io)) != 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Our memory pressure valve. If the system is running low
|
|
|
|
* on memory, rather than swamping memory with new ARC buf
|
|
|
|
* hdrs, we opt not to rebuild the L2ARC. At this point,
|
|
|
|
* however, we have already set up our L2ARC dev to chain in
|
|
|
|
* new metadata log blocks, so the user may choose to offline/
|
|
|
|
* online the L2ARC dev at a later time (or re-import the pool)
|
|
|
|
* to reconstruct it (when there's less memory pressure).
|
|
|
|
*/
|
2020-08-26 00:33:36 +03:00
|
|
|
if (l2arc_hdr_limit_reached()) {
|
2020-04-10 20:33:35 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);
|
|
|
|
cmn_err(CE_NOTE, "System running low on memory, "
|
|
|
|
"aborting L2ARC rebuild.");
|
|
|
|
err = SET_ERROR(ENOMEM);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
|
|
|
lock_held = B_FALSE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that we know that the next_lb checks out alright, we
|
|
|
|
* can start reconstruction from this log block.
|
2020-05-08 02:34:03 +03:00
|
|
|
* L2BLK_GET_PSIZE returns aligned size for log blocks.
|
2020-04-10 20:33:35 +03:00
|
|
|
*/
|
2020-05-08 02:34:03 +03:00
|
|
|
uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop);
|
2020-10-06 01:29:05 +03:00
|
|
|
l2arc_log_blk_restore(dev, this_lb, asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* log block restored, include its pointer in the list of
|
|
|
|
* pointers to log blocks present in the L2ARC device.
|
|
|
|
*/
|
|
|
|
lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
|
|
|
|
lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t),
|
|
|
|
KM_SLEEP);
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(lb_ptr_buf->lb_ptr, &lbps[0],
|
2020-04-10 20:33:35 +03:00
|
|
|
sizeof (l2arc_log_blkptr_t));
|
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
|
|
|
list_insert_tail(&dev->l2ad_lbptr_list, lb_ptr_buf);
|
2020-05-08 02:34:03 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_log_blk_count);
|
|
|
|
zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
|
|
|
|
zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
|
2020-04-10 20:33:35 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2020-05-08 02:34:03 +03:00
|
|
|
vdev_space_update(vd, asize, 0, 0);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Protection against loops of log blocks:
|
|
|
|
*
|
|
|
|
* l2ad_hand l2ad_evict
|
|
|
|
* V V
|
|
|
|
* l2ad_start |=======================================| l2ad_end
|
|
|
|
* -----|||----|||---|||----|||
|
|
|
|
* (3) (2) (1) (0)
|
|
|
|
* ---|||---|||----|||---|||
|
|
|
|
* (7) (6) (5) (4)
|
|
|
|
*
|
|
|
|
* In this situation the pointer of log block (4) passes
|
|
|
|
* l2arc_log_blkptr_valid() but the log block should not be
|
|
|
|
* restored as it is overwritten by the payload of log block
|
|
|
|
* (0). Only log blocks (0)-(3) should be restored. We check
|
2020-05-08 02:34:03 +03:00
|
|
|
* whether l2ad_evict lies in between the payload starting
|
|
|
|
* offset of the next log block (lbps[1].lbp_payload_start)
|
|
|
|
* and the payload starting offset of the present log block
|
|
|
|
* (lbps[0].lbp_payload_start). If true and this isn't the
|
|
|
|
* first pass, we are looping from the beginning and we should
|
|
|
|
* stop.
|
2020-04-10 20:33:35 +03:00
|
|
|
*/
|
2020-05-08 02:34:03 +03:00
|
|
|
if (l2arc_range_check_overlap(lbps[1].lbp_payload_start,
|
|
|
|
lbps[0].lbp_payload_start, dev->l2ad_evict) &&
|
|
|
|
!dev->l2ad_first)
|
2020-04-10 20:33:35 +03:00
|
|
|
goto out;
|
|
|
|
|
Cleanup: Use OpenSolaris functions to call scheduler
In our codebase, `cond_resched() and `schedule()` are Linux kernel
functions that have replaced the OpenSolaris `kpreempt()` functions in
the codebase to such an extent that `kpreempt()` in zfs_context.h was
broken. Nobody noticed because we did not actually use it. The header
had defined `kpreempt()` as `yield()`, which works on OpenSolaris and
Illumos where `sched_yield()` is a wrapper for `yield()`, but that does
not work on any other platform.
The FreeBSD platform specific code implemented shims for these, but the
shim for `schedule()` forced us to wait, which is different than merely
rescheduling to another thread as the original Linux code does, while
the shim for `cond_resched()` had the same definition as its kernel
kpreempt() shim.
After studying this, I have concluded that we should reintroduce the
kpreempt() function in platform independent code with the following
definitions:
- In the Linux kernel:
kpreempt(unused) -> cond_resched()
- In the FreeBSD kernel:
kpreempt(unused) -> kern_yield(PRI_USER)
- In userspace:
kpreempt(unused) -> sched_yield()
In userspace, nothing changes from this cleanup. In the kernels, the
function `fm_fini()` will now call `kern_yield(PRI_USER)` on FreeBSD and
`cond_resched()` on Linux. This is instead of `pause("schedule", 1)` on
FreeBSD and `schedule()` on Linux. This makes our behavior consistent
across platforms.
Note that Linux's SPL continues to use `cond_resched()` and
`schedule()`. However, those functions have been removed from both the
FreeBSD code and userspace code.
This should have the benefit of making it slightly easier to port the
code to new platforms by making how things should be mapped less
confusing.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13845
2022-09-12 19:55:37 +03:00
|
|
|
kpreempt(KPREEMPT_SYNC);
|
2020-04-10 20:33:35 +03:00
|
|
|
for (;;) {
|
|
|
|
mutex_enter(&l2arc_rebuild_thr_lock);
|
|
|
|
if (dev->l2ad_rebuild_cancel) {
|
|
|
|
dev->l2ad_rebuild = B_FALSE;
|
|
|
|
cv_signal(&l2arc_rebuild_thr_cv);
|
|
|
|
mutex_exit(&l2arc_rebuild_thr_lock);
|
|
|
|
err = SET_ERROR(ECANCELED);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_rebuild_thr_lock);
|
|
|
|
if (spa_config_tryenter(spa, SCL_L2ARC, vd,
|
|
|
|
RW_READER)) {
|
|
|
|
lock_held = B_TRUE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* L2ARC config lock held by somebody in writer,
|
|
|
|
* possibly due to them trying to remove us. They'll
|
|
|
|
* likely to want us to shut down, so after a little
|
|
|
|
* delay, we check l2ad_rebuild_cancel and retry
|
|
|
|
* the lock again.
|
|
|
|
*/
|
|
|
|
delay(1);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Continue with the next log block.
|
|
|
|
*/
|
|
|
|
lbps[0] = lbps[1];
|
|
|
|
lbps[1] = this_lb->lb_prev_lbp;
|
|
|
|
PTR_SWAP(this_lb, next_lb);
|
|
|
|
this_io = next_io;
|
|
|
|
next_io = NULL;
|
2020-10-06 01:29:05 +03:00
|
|
|
}
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
if (this_io != NULL)
|
|
|
|
l2arc_log_blk_fetch_abort(this_io);
|
|
|
|
out:
|
|
|
|
if (next_io != NULL)
|
|
|
|
l2arc_log_blk_fetch_abort(next_io);
|
|
|
|
vmem_free(this_lb, sizeof (*this_lb));
|
|
|
|
vmem_free(next_lb, sizeof (*next_lb));
|
|
|
|
|
|
|
|
if (!l2arc_rebuild_enabled) {
|
2020-05-08 02:34:03 +03:00
|
|
|
spa_history_log_internal(spa, "L2ARC rebuild", NULL,
|
|
|
|
"disabled");
|
|
|
|
} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) > 0) {
|
2020-04-10 20:33:35 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_success);
|
2020-05-08 02:34:03 +03:00
|
|
|
spa_history_log_internal(spa, "L2ARC rebuild", NULL,
|
|
|
|
"successful, restored %llu blocks",
|
|
|
|
(u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
|
|
|
|
} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) == 0) {
|
|
|
|
/*
|
|
|
|
* No error but also nothing restored, meaning the lbps array
|
|
|
|
* in the device header points to invalid/non-present log
|
|
|
|
* blocks. Reset the header.
|
|
|
|
*/
|
|
|
|
spa_history_log_internal(spa, "L2ARC rebuild", NULL,
|
|
|
|
"no valid log blocks");
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(l2dhdr, 0, dev->l2ad_dev_hdr_asize);
|
2020-05-08 02:34:03 +03:00
|
|
|
l2arc_dev_hdr_update(dev);
|
2020-08-01 21:17:18 +03:00
|
|
|
} else if (err == ECANCELED) {
|
|
|
|
/*
|
|
|
|
* In case the rebuild was canceled do not log to spa history
|
|
|
|
* log as the pool may be in the process of being removed.
|
|
|
|
*/
|
|
|
|
zfs_dbgmsg("L2ARC rebuild aborted, restored %llu blocks",
|
2021-06-23 07:53:45 +03:00
|
|
|
(u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
|
2020-04-10 20:33:35 +03:00
|
|
|
} else if (err != 0) {
|
2020-05-08 02:34:03 +03:00
|
|
|
spa_history_log_internal(spa, "L2ARC rebuild", NULL,
|
|
|
|
"aborted, restored %llu blocks",
|
|
|
|
(u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
|
2020-04-10 20:33:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (lock_held)
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
|
|
|
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Attempts to read the device header on the provided L2ARC device and writes
|
|
|
|
* it to `hdr'. On success, this function returns 0, otherwise the appropriate
|
|
|
|
* error code is returned.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
l2arc_dev_hdr_read(l2arc_dev_t *dev)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
uint64_t guid;
|
|
|
|
l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
|
|
|
|
const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
|
|
|
|
abd_t *abd;
|
|
|
|
|
|
|
|
guid = spa_guid(dev->l2ad_vdev->vdev_spa);
|
|
|
|
|
|
|
|
abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);
|
|
|
|
|
|
|
|
err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,
|
|
|
|
VDEV_LABEL_START_SIZE, l2dhdr_asize, abd,
|
2020-10-06 01:29:05 +03:00
|
|
|
ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_SYNC_READ,
|
2023-06-09 22:40:55 +03:00
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |
|
2020-04-10 20:33:35 +03:00
|
|
|
ZIO_FLAG_SPECULATIVE, B_FALSE));
|
|
|
|
|
2021-01-20 22:24:37 +03:00
|
|
|
abd_free(abd);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
if (err != 0) {
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_abort_dh_errors);
|
|
|
|
zfs_dbgmsg("L2ARC IO error (%d) while reading device header, "
|
2021-06-23 07:53:45 +03:00
|
|
|
"vdev guid: %llu", err,
|
|
|
|
(u_longlong_t)dev->l2ad_vdev->vdev_guid);
|
2020-04-10 20:33:35 +03:00
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (l2dhdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC))
|
|
|
|
byteswap_uint64_array(l2dhdr, sizeof (*l2dhdr));
|
|
|
|
|
|
|
|
if (l2dhdr->dh_magic != L2ARC_DEV_HDR_MAGIC ||
|
|
|
|
l2dhdr->dh_spa_guid != guid ||
|
|
|
|
l2dhdr->dh_vdev_guid != dev->l2ad_vdev->vdev_guid ||
|
|
|
|
l2dhdr->dh_version != L2ARC_PERSISTENT_VERSION ||
|
2020-05-08 02:34:03 +03:00
|
|
|
l2dhdr->dh_log_entries != dev->l2ad_log_entries ||
|
2020-04-10 20:33:35 +03:00
|
|
|
l2dhdr->dh_end != dev->l2ad_end ||
|
|
|
|
!l2arc_range_check_overlap(dev->l2ad_start, dev->l2ad_end,
|
2020-06-09 20:15:08 +03:00
|
|
|
l2dhdr->dh_evict) ||
|
|
|
|
(l2dhdr->dh_trim_state != VDEV_TRIM_COMPLETE &&
|
|
|
|
l2arc_trim_ahead > 0)) {
|
2020-04-10 20:33:35 +03:00
|
|
|
/*
|
|
|
|
* Attempt to rebuild a device containing no actual dev hdr
|
|
|
|
* or containing a header from some other pool or from another
|
|
|
|
* version of persistent L2ARC.
|
|
|
|
*/
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);
|
|
|
|
return (SET_ERROR(ENOTSUP));
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reads L2ARC log blocks from storage and validates their contents.
|
|
|
|
*
|
|
|
|
* This function implements a simple fetcher to make sure that while
|
|
|
|
* we're processing one buffer the L2ARC is already fetching the next
|
|
|
|
* one in the chain.
|
|
|
|
*
|
|
|
|
* The arguments this_lp and next_lp point to the current and next log block
|
|
|
|
* address in the block chain. Similarly, this_lb and next_lb hold the
|
|
|
|
* l2arc_log_blk_phys_t's of the current and next L2ARC blk.
|
|
|
|
*
|
|
|
|
* The `this_io' and `next_io' arguments are used for block fetching.
|
|
|
|
* When issuing the first blk IO during rebuild, you should pass NULL for
|
|
|
|
* `this_io'. This function will then issue a sync IO to read the block and
|
|
|
|
* also issue an async IO to fetch the next block in the block chain. The
|
|
|
|
* fetched IO is returned in `next_io'. On subsequent calls to this
|
|
|
|
* function, pass the value returned in `next_io' from the previous call
|
|
|
|
* as `this_io' and a fresh `next_io' pointer to hold the next fetch IO.
|
|
|
|
* Prior to the call, you should initialize your `next_io' pointer to be
|
|
|
|
* NULL. If no fetch IO was issued, the pointer is left set at NULL.
|
|
|
|
*
|
|
|
|
* On success, this function returns 0, otherwise it returns an appropriate
|
|
|
|
* error code. On error the fetching IO is aborted and cleared before
|
|
|
|
* returning from this function. Therefore, if we return `success', the
|
|
|
|
* caller can assume that we have taken care of cleanup of fetch IOs.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
l2arc_log_blk_read(l2arc_dev_t *dev,
|
|
|
|
const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
|
|
|
|
l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
|
|
|
|
zio_t *this_io, zio_t **next_io)
|
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
zio_cksum_t cksum;
|
|
|
|
abd_t *abd = NULL;
|
2020-05-08 02:34:03 +03:00
|
|
|
uint64_t asize;
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
ASSERT(this_lbp != NULL && next_lbp != NULL);
|
|
|
|
ASSERT(this_lb != NULL && next_lb != NULL);
|
|
|
|
ASSERT(next_io != NULL && *next_io == NULL);
|
|
|
|
ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check to see if we have issued the IO for this log block in a
|
|
|
|
* previous run. If not, this is the first call, so issue it now.
|
|
|
|
*/
|
|
|
|
if (this_io == NULL) {
|
|
|
|
this_io = l2arc_log_blk_fetch(dev->l2ad_vdev, this_lbp,
|
|
|
|
this_lb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Peek to see if we can start issuing the next IO immediately.
|
|
|
|
*/
|
|
|
|
if (l2arc_log_blkptr_valid(dev, next_lbp)) {
|
|
|
|
/*
|
|
|
|
* Start issuing IO for the next log block early - this
|
|
|
|
* should help keep the L2ARC device busy while we
|
|
|
|
* decompress and restore this log block.
|
|
|
|
*/
|
|
|
|
*next_io = l2arc_log_blk_fetch(dev->l2ad_vdev, next_lbp,
|
|
|
|
next_lb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Wait for the IO to read this log block to complete */
|
|
|
|
if ((err = zio_wait(this_io)) != 0) {
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
|
|
|
|
zfs_dbgmsg("L2ARC IO error (%d) while reading log block, "
|
2021-06-23 07:53:45 +03:00
|
|
|
"offset: %llu, vdev guid: %llu", err,
|
|
|
|
(u_longlong_t)this_lbp->lbp_daddr,
|
|
|
|
(u_longlong_t)dev->l2ad_vdev->vdev_guid);
|
2020-04-10 20:33:35 +03:00
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
/*
|
|
|
|
* Make sure the buffer checks out.
|
|
|
|
* L2BLK_GET_PSIZE returns aligned size for log blocks.
|
|
|
|
*/
|
|
|
|
asize = L2BLK_GET_PSIZE((this_lbp)->lbp_prop);
|
|
|
|
fletcher_4_native(this_lb, asize, NULL, &cksum);
|
2020-04-10 20:33:35 +03:00
|
|
|
if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_lb_errors);
|
|
|
|
zfs_dbgmsg("L2ARC log block cksum failed, offset: %llu, "
|
|
|
|
"vdev guid: %llu, l2ad_hand: %llu, l2ad_evict: %llu",
|
2021-06-23 07:53:45 +03:00
|
|
|
(u_longlong_t)this_lbp->lbp_daddr,
|
|
|
|
(u_longlong_t)dev->l2ad_vdev->vdev_guid,
|
|
|
|
(u_longlong_t)dev->l2ad_hand,
|
|
|
|
(u_longlong_t)dev->l2ad_evict);
|
2020-04-10 20:33:35 +03:00
|
|
|
err = SET_ERROR(ECKSUM);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Now we can take our time decoding this buffer */
|
|
|
|
switch (L2BLK_GET_COMPRESS((this_lbp)->lbp_prop)) {
|
|
|
|
case ZIO_COMPRESS_OFF:
|
|
|
|
break;
|
|
|
|
case ZIO_COMPRESS_LZ4:
|
2020-05-08 02:34:03 +03:00
|
|
|
abd = abd_alloc_for_io(asize, B_TRUE);
|
|
|
|
abd_copy_from_buf_off(abd, this_lb, 0, asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
if ((err = zio_decompress_data(
|
|
|
|
L2BLK_GET_COMPRESS((this_lbp)->lbp_prop),
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
abd, this_lb, asize, sizeof (*this_lb), NULL)) != 0) {
|
2020-04-10 20:33:35 +03:00
|
|
|
err = SET_ERROR(EINVAL);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
err = SET_ERROR(EINVAL);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
|
|
|
|
byteswap_uint64_array(this_lb, sizeof (*this_lb));
|
|
|
|
if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
|
|
|
|
err = SET_ERROR(EINVAL);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
cleanup:
|
|
|
|
/* Abort an in-flight fetch I/O in case of error */
|
|
|
|
if (err != 0 && *next_io != NULL) {
|
|
|
|
l2arc_log_blk_fetch_abort(*next_io);
|
|
|
|
*next_io = NULL;
|
|
|
|
}
|
|
|
|
if (abd != NULL)
|
|
|
|
abd_free(abd);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Restores the payload of a log block to ARC. This creates empty ARC hdr
|
|
|
|
* entries which only contain an l2arc hdr, essentially restoring the
|
|
|
|
* buffers to their L2ARC evicted state. This function also updates space
|
|
|
|
* usage on the L2ARC vdev to make sure it tracks restored buffers.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb,
|
2020-10-06 01:29:05 +03:00
|
|
|
uint64_t lb_asize)
|
2020-04-10 20:33:35 +03:00
|
|
|
{
|
2020-05-08 02:34:03 +03:00
|
|
|
uint64_t size = 0, asize = 0;
|
|
|
|
uint64_t log_entries = dev->l2ad_log_entries;
|
2020-04-10 20:33:35 +03:00
|
|
|
|
2020-08-26 00:33:36 +03:00
|
|
|
/*
|
|
|
|
* Usually arc_adapt() is called only for data, not headers, but
|
|
|
|
* since we may allocate significant amount of memory here, let ARC
|
|
|
|
* grow its arc_c.
|
|
|
|
*/
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
arc_adapt(log_entries * HDR_L2ONLY_SIZE);
|
2020-08-26 00:33:36 +03:00
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
for (int i = log_entries - 1; i >= 0; i--) {
|
|
|
|
/*
|
|
|
|
* Restore goes in the reverse temporal direction to preserve
|
|
|
|
* correct temporal ordering of buffers in the l2ad_buflist.
|
|
|
|
* l2arc_hdr_restore also does a list_insert_tail instead of
|
|
|
|
* list_insert_head on the l2ad_buflist:
|
|
|
|
*
|
|
|
|
* LIST l2ad_buflist LIST
|
|
|
|
* HEAD <------ (time) ------ TAIL
|
|
|
|
* direction +-----+-----+-----+-----+-----+ direction
|
|
|
|
* of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
|
|
|
|
* fill +-----+-----+-----+-----+-----+
|
|
|
|
* ^ ^
|
|
|
|
* | |
|
|
|
|
* | |
|
2020-05-08 02:34:03 +03:00
|
|
|
* l2arc_feed_thread l2arc_rebuild
|
|
|
|
* will place new bufs here restores bufs here
|
2020-04-10 20:33:35 +03:00
|
|
|
*
|
2020-05-08 02:34:03 +03:00
|
|
|
* During l2arc_rebuild() the device is not used by
|
|
|
|
* l2arc_feed_thread() as dev->l2ad_rebuild is set to true.
|
2020-04-10 20:33:35 +03:00
|
|
|
*/
|
|
|
|
size += L2BLK_GET_LSIZE((&lb->lb_entries[i])->le_prop);
|
2020-05-08 02:34:03 +03:00
|
|
|
asize += vdev_psize_to_asize(dev->l2ad_vdev,
|
|
|
|
L2BLK_GET_PSIZE((&lb->lb_entries[i])->le_prop));
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_hdr_restore(&lb->lb_entries[i], dev);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Record rebuild stats:
|
|
|
|
* size Logical size of restored buffers in the L2ARC
|
2020-05-08 02:34:03 +03:00
|
|
|
* asize Aligned size of restored buffers in the L2ARC
|
2020-04-10 20:33:35 +03:00
|
|
|
*/
|
|
|
|
ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
|
2020-05-08 02:34:03 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_rebuild_asize, asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_rebuild_bufs, log_entries);
|
2020-05-08 02:34:03 +03:00
|
|
|
ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, lb_asize);
|
|
|
|
ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, asize / lb_asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Restores a single ARC buf hdr from a log entry. The ARC buffer is put
|
|
|
|
* into a state indicating that it has been evicted to L2ARC.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr, *exists;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
arc_buf_contents_t type = L2BLK_GET_TYPE((le)->le_prop);
|
|
|
|
uint64_t asize;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do all the allocation before grabbing any locks, this lets us
|
|
|
|
* sleep if memory is full and we don't have to deal with failed
|
|
|
|
* allocations.
|
|
|
|
*/
|
|
|
|
hdr = arc_buf_alloc_l2only(L2BLK_GET_LSIZE((le)->le_prop), type,
|
|
|
|
dev, le->le_dva, le->le_daddr,
|
|
|
|
L2BLK_GET_PSIZE((le)->le_prop), le->le_birth,
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
L2BLK_GET_COMPRESS((le)->le_prop), le->le_complevel,
|
2020-04-10 20:33:35 +03:00
|
|
|
L2BLK_GET_PROTECTED((le)->le_prop),
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
L2BLK_GET_PREFETCH((le)->le_prop),
|
|
|
|
L2BLK_GET_STATE((le)->le_prop));
|
2020-04-10 20:33:35 +03:00
|
|
|
asize = vdev_psize_to_asize(dev->l2ad_vdev,
|
|
|
|
L2BLK_GET_PSIZE((le)->le_prop));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* vdev_space_update() has to be called before arc_hdr_destroy() to
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
* avoid underflow since the latter also calls vdev_space_update().
|
2020-04-10 20:33:35 +03:00
|
|
|
*/
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
l2arc_hdr_arcstats_increment(hdr);
|
2020-04-10 20:33:35 +03:00
|
|
|
vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
|
|
|
|
|
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
|
|
|
list_insert_tail(&dev->l2ad_buflist, hdr);
|
|
|
|
(void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
|
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
|
|
|
|
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
|
|
|
if (exists) {
|
|
|
|
/* Buffer was already cached, no need to restore it. */
|
|
|
|
arc_hdr_destroy(hdr);
|
|
|
|
/*
|
|
|
|
* If the buffer is already cached, check whether it has
|
|
|
|
* L2ARC metadata. If not, enter them and update the flag.
|
|
|
|
* This is important is case of onlining a cache device, since
|
|
|
|
* we previously evicted all L2ARC metadata from ARC.
|
|
|
|
*/
|
|
|
|
if (!HDR_HAS_L2HDR(exists)) {
|
|
|
|
arc_hdr_set_flags(exists, ARC_FLAG_HAS_L2HDR);
|
|
|
|
exists->b_l2hdr.b_dev = dev;
|
|
|
|
exists->b_l2hdr.b_daddr = le->le_daddr;
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
exists->b_l2hdr.b_arcs_state =
|
|
|
|
L2BLK_GET_STATE((le)->le_prop);
|
2020-04-10 20:33:35 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
|
|
|
list_insert_tail(&dev->l2ad_buflist, exists);
|
|
|
|
(void) zfs_refcount_add_many(&dev->l2ad_alloc,
|
|
|
|
arc_hdr_size(exists), exists);
|
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
l2arc_hdr_arcstats_increment(exists);
|
2020-04-10 20:33:35 +03:00
|
|
|
vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
|
|
|
|
}
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Starts an asynchronous read IO to read a log block. This is used in log
|
|
|
|
* block reconstruction to start reading the next block before we are done
|
|
|
|
* decoding and reconstructing the current block, to keep the l2arc device
|
|
|
|
* nice and hot with read IO to process.
|
|
|
|
* The returned zio will contain a newly allocated memory buffers for the IO
|
|
|
|
* data which should then be freed by the caller once the zio is no longer
|
|
|
|
* needed (i.e. due to it having completed). If you wish to abort this
|
|
|
|
* zio, you should do so using l2arc_log_blk_fetch_abort, which takes
|
|
|
|
* care of disposing of the allocated buffers correctly.
|
|
|
|
*/
|
|
|
|
static zio_t *
|
|
|
|
l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
|
|
|
|
l2arc_log_blk_phys_t *lb)
|
|
|
|
{
|
2020-05-08 02:34:03 +03:00
|
|
|
uint32_t asize;
|
2020-04-10 20:33:35 +03:00
|
|
|
zio_t *pio;
|
|
|
|
l2arc_read_callback_t *cb;
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
/* L2BLK_GET_PSIZE returns aligned size for log blocks */
|
|
|
|
asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
|
|
|
|
ASSERT(asize <= sizeof (l2arc_log_blk_phys_t));
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP);
|
2020-05-08 02:34:03 +03:00
|
|
|
cb->l2rcb_abd = abd_get_from_buf(lb, asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
pio = zio_root(vd->vdev_spa, l2arc_blk_fetch_done, cb,
|
2023-06-09 22:40:55 +03:00
|
|
|
ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY);
|
2020-05-08 02:34:03 +03:00
|
|
|
(void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, asize,
|
2020-04-10 20:33:35 +03:00
|
|
|
cb->l2rcb_abd, ZIO_CHECKSUM_OFF, NULL, NULL,
|
2023-06-09 22:40:55 +03:00
|
|
|
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL |
|
2020-04-10 20:33:35 +03:00
|
|
|
ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));
|
|
|
|
|
|
|
|
return (pio);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Aborts a zio returned from l2arc_log_blk_fetch and frees the data
|
|
|
|
* buffers allocated for it.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_log_blk_fetch_abort(zio_t *zio)
|
|
|
|
{
|
|
|
|
(void) zio_wait(zio);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2020-07-11 00:10:03 +03:00
|
|
|
* Creates a zio to update the device header on an l2arc device.
|
2020-04-10 20:33:35 +03:00
|
|
|
*/
|
2020-06-09 20:15:08 +03:00
|
|
|
void
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_dev_hdr_update(l2arc_dev_t *dev)
|
|
|
|
{
|
|
|
|
l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
|
|
|
|
const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
|
|
|
|
abd_t *abd;
|
|
|
|
int err;
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
VERIFY(spa_config_held(dev->l2ad_spa, SCL_STATE_ALL, RW_READER));
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
l2dhdr->dh_magic = L2ARC_DEV_HDR_MAGIC;
|
|
|
|
l2dhdr->dh_version = L2ARC_PERSISTENT_VERSION;
|
|
|
|
l2dhdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
|
|
|
|
l2dhdr->dh_vdev_guid = dev->l2ad_vdev->vdev_guid;
|
2020-05-08 02:34:03 +03:00
|
|
|
l2dhdr->dh_log_entries = dev->l2ad_log_entries;
|
2020-04-10 20:33:35 +03:00
|
|
|
l2dhdr->dh_evict = dev->l2ad_evict;
|
|
|
|
l2dhdr->dh_start = dev->l2ad_start;
|
|
|
|
l2dhdr->dh_end = dev->l2ad_end;
|
2020-05-08 02:34:03 +03:00
|
|
|
l2dhdr->dh_lb_asize = zfs_refcount_count(&dev->l2ad_lb_asize);
|
|
|
|
l2dhdr->dh_lb_count = zfs_refcount_count(&dev->l2ad_lb_count);
|
2020-04-10 20:33:35 +03:00
|
|
|
l2dhdr->dh_flags = 0;
|
2020-06-09 20:15:08 +03:00
|
|
|
l2dhdr->dh_trim_action_time = dev->l2ad_vdev->vdev_trim_action_time;
|
|
|
|
l2dhdr->dh_trim_state = dev->l2ad_vdev->vdev_trim_state;
|
2020-04-10 20:33:35 +03:00
|
|
|
if (dev->l2ad_first)
|
|
|
|
l2dhdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;
|
|
|
|
|
|
|
|
abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);
|
|
|
|
|
|
|
|
err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev,
|
|
|
|
VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL,
|
|
|
|
NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));
|
|
|
|
|
2021-01-20 22:24:37 +03:00
|
|
|
abd_free(abd);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
if (err != 0) {
|
|
|
|
zfs_dbgmsg("L2ARC IO error (%d) while writing device header, "
|
2021-06-23 07:53:45 +03:00
|
|
|
"vdev guid: %llu", err,
|
|
|
|
(u_longlong_t)dev->l2ad_vdev->vdev_guid);
|
2020-04-10 20:33:35 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commits a log block to the L2ARC device. This routine is invoked from
|
|
|
|
* l2arc_write_buffers when the log block fills up.
|
|
|
|
* This function allocates some memory to temporarily hold the serialized
|
|
|
|
* buffer to be written. This is then released in l2arc_write_done.
|
|
|
|
*/
|
2023-06-06 22:32:37 +03:00
|
|
|
static uint64_t
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb)
|
|
|
|
{
|
|
|
|
l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
|
|
|
|
l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
|
|
|
|
uint64_t psize, asize;
|
|
|
|
zio_t *wzio;
|
|
|
|
l2arc_lb_abd_buf_t *abd_buf;
|
2023-02-28 01:41:02 +03:00
|
|
|
uint8_t *tmpbuf = NULL;
|
2020-04-10 20:33:35 +03:00
|
|
|
l2arc_lb_ptr_buf_t *lb_ptr_buf;
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
VERIFY3S(dev->l2ad_log_ent_idx, ==, dev->l2ad_log_entries);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
abd_buf = zio_buf_alloc(sizeof (*abd_buf));
|
|
|
|
abd_buf->abd = abd_get_from_buf(lb, sizeof (*lb));
|
|
|
|
lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
|
|
|
|
lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP);
|
|
|
|
|
|
|
|
/* link the buffer into the block chain */
|
|
|
|
lb->lb_prev_lbp = l2dhdr->dh_start_lbps[1];
|
|
|
|
lb->lb_magic = L2ARC_LOG_BLK_MAGIC;
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
/*
|
|
|
|
* l2arc_log_blk_commit() may be called multiple times during a single
|
|
|
|
* l2arc_write_buffers() call. Save the allocated abd buffers in a list
|
|
|
|
* so we can free them in l2arc_write_done() later on.
|
|
|
|
*/
|
2020-04-10 20:33:35 +03:00
|
|
|
list_insert_tail(&cb->l2wcb_abd_list, abd_buf);
|
2020-05-08 02:34:03 +03:00
|
|
|
|
|
|
|
/* try to compress the buffer */
|
2020-04-10 20:33:35 +03:00
|
|
|
psize = zio_compress_data(ZIO_COMPRESS_LZ4,
|
2023-02-28 01:41:02 +03:00
|
|
|
abd_buf->abd, (void **) &tmpbuf, sizeof (*lb), 0);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
/* a log block is never entirely zero */
|
|
|
|
ASSERT(psize != 0);
|
|
|
|
asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
|
|
|
|
ASSERT(asize <= sizeof (*lb));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Update the start log block pointer in the device header to point
|
|
|
|
* to the log block we're about to write.
|
|
|
|
*/
|
|
|
|
l2dhdr->dh_start_lbps[1] = l2dhdr->dh_start_lbps[0];
|
|
|
|
l2dhdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
|
|
|
|
l2dhdr->dh_start_lbps[0].lbp_payload_asize =
|
|
|
|
dev->l2ad_log_blk_payload_asize;
|
|
|
|
l2dhdr->dh_start_lbps[0].lbp_payload_start =
|
|
|
|
dev->l2ad_log_blk_payload_start;
|
|
|
|
L2BLK_SET_LSIZE(
|
|
|
|
(&l2dhdr->dh_start_lbps[0])->lbp_prop, sizeof (*lb));
|
|
|
|
L2BLK_SET_PSIZE(
|
|
|
|
(&l2dhdr->dh_start_lbps[0])->lbp_prop, asize);
|
|
|
|
L2BLK_SET_CHECKSUM(
|
|
|
|
(&l2dhdr->dh_start_lbps[0])->lbp_prop,
|
|
|
|
ZIO_CHECKSUM_FLETCHER_4);
|
|
|
|
if (asize < sizeof (*lb)) {
|
|
|
|
/* compression succeeded */
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(tmpbuf + psize, 0, asize - psize);
|
2020-04-10 20:33:35 +03:00
|
|
|
L2BLK_SET_COMPRESS(
|
|
|
|
(&l2dhdr->dh_start_lbps[0])->lbp_prop,
|
|
|
|
ZIO_COMPRESS_LZ4);
|
|
|
|
} else {
|
|
|
|
/* compression failed */
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(tmpbuf, lb, sizeof (*lb));
|
2020-04-10 20:33:35 +03:00
|
|
|
L2BLK_SET_COMPRESS(
|
|
|
|
(&l2dhdr->dh_start_lbps[0])->lbp_prop,
|
|
|
|
ZIO_COMPRESS_OFF);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* checksum what we're about to write */
|
|
|
|
fletcher_4_native(tmpbuf, asize, NULL,
|
|
|
|
&l2dhdr->dh_start_lbps[0].lbp_cksum);
|
|
|
|
|
2021-01-20 22:24:37 +03:00
|
|
|
abd_free(abd_buf->abd);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
/* perform the write itself */
|
|
|
|
abd_buf->abd = abd_get_from_buf(tmpbuf, sizeof (*lb));
|
|
|
|
abd_take_ownership_of_buf(abd_buf->abd, B_TRUE);
|
|
|
|
wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
|
|
|
|
asize, abd_buf->abd, ZIO_CHECKSUM_OFF, NULL, NULL,
|
|
|
|
ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
|
|
|
|
DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
|
|
|
|
(void) zio_nowait(wzio);
|
|
|
|
|
|
|
|
dev->l2ad_hand += asize;
|
|
|
|
/*
|
|
|
|
* Include the committed log block's pointer in the list of pointers
|
|
|
|
* to log blocks present in the L2ARC device.
|
|
|
|
*/
|
2022-02-25 16:26:54 +03:00
|
|
|
memcpy(lb_ptr_buf->lb_ptr, &l2dhdr->dh_start_lbps[0],
|
2020-04-10 20:33:35 +03:00
|
|
|
sizeof (l2arc_log_blkptr_t));
|
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
|
|
|
list_insert_head(&dev->l2ad_lbptr_list, lb_ptr_buf);
|
2020-05-08 02:34:03 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_log_blk_count);
|
|
|
|
zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
|
|
|
|
zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
|
2020-04-10 20:33:35 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
|
|
|
vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
|
|
|
|
|
|
|
|
/* bump the kstats */
|
|
|
|
ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
|
2020-05-08 02:34:03 +03:00
|
|
|
ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
|
|
|
|
dev->l2ad_log_blk_payload_asize / asize);
|
|
|
|
|
|
|
|
/* start a new log block */
|
|
|
|
dev->l2ad_log_ent_idx = 0;
|
|
|
|
dev->l2ad_log_blk_payload_asize = 0;
|
|
|
|
dev->l2ad_log_blk_payload_start = 0;
|
2023-06-06 22:32:37 +03:00
|
|
|
|
|
|
|
return (asize);
|
2020-04-10 20:33:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Validates an L2ARC log block address to make sure that it can be read
|
|
|
|
* from the provided L2ARC device.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
|
|
|
|
{
|
2020-05-08 02:34:03 +03:00
|
|
|
/* L2BLK_GET_PSIZE returns aligned size for log blocks */
|
|
|
|
uint64_t asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
|
|
|
|
uint64_t end = lbp->lbp_daddr + asize - 1;
|
2020-04-10 20:33:35 +03:00
|
|
|
uint64_t start = lbp->lbp_payload_start;
|
|
|
|
boolean_t evicted = B_FALSE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A log block is valid if all of the following conditions are true:
|
|
|
|
* - it fits entirely (including its payload) between l2ad_start and
|
|
|
|
* l2ad_end
|
|
|
|
* - it has a valid size
|
|
|
|
* - neither the log block itself nor part of its payload was evicted
|
|
|
|
* by l2arc_evict():
|
|
|
|
*
|
|
|
|
* l2ad_hand l2ad_evict
|
|
|
|
* | | lbp_daddr
|
|
|
|
* | start | | end
|
|
|
|
* | | | | |
|
|
|
|
* V V V V V
|
|
|
|
* l2ad_start ============================================ l2ad_end
|
|
|
|
* --------------------------||||
|
|
|
|
* ^ ^
|
|
|
|
* | log block
|
|
|
|
* payload
|
|
|
|
*/
|
|
|
|
|
|
|
|
evicted =
|
|
|
|
l2arc_range_check_overlap(start, end, dev->l2ad_hand) ||
|
|
|
|
l2arc_range_check_overlap(start, end, dev->l2ad_evict) ||
|
|
|
|
l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, start) ||
|
|
|
|
l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, end);
|
|
|
|
|
|
|
|
return (start >= dev->l2ad_start && end <= dev->l2ad_end &&
|
2020-05-08 02:34:03 +03:00
|
|
|
asize > 0 && asize <= sizeof (l2arc_log_blk_phys_t) &&
|
2020-04-10 20:33:35 +03:00
|
|
|
(!evicted || dev->l2ad_first));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inserts ARC buffer header `hdr' into the current L2ARC log block on
|
|
|
|
* the device. The buffer being inserted must be present in L2ARC.
|
|
|
|
* Returns B_TRUE if the L2ARC log block is full and needs to be committed
|
|
|
|
* to L2ARC, or B_FALSE if it still has room for more ARC buffers.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
|
|
|
|
l2arc_log_ent_phys_t *le;
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
if (dev->l2ad_log_entries == 0)
|
2020-04-10 20:33:35 +03:00
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
int index = dev->l2ad_log_ent_idx++;
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
ASSERT3S(index, <, dev->l2ad_log_entries);
|
2020-04-10 20:33:35 +03:00
|
|
|
ASSERT(HDR_HAS_L2HDR(hdr));
|
|
|
|
|
|
|
|
le = &lb->lb_entries[index];
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(le, 0, sizeof (*le));
|
2020-04-10 20:33:35 +03:00
|
|
|
le->le_dva = hdr->b_dva;
|
|
|
|
le->le_birth = hdr->b_birth;
|
|
|
|
le->le_daddr = hdr->b_l2hdr.b_daddr;
|
|
|
|
if (index == 0)
|
|
|
|
dev->l2ad_log_blk_payload_start = le->le_daddr;
|
|
|
|
L2BLK_SET_LSIZE((le)->le_prop, HDR_GET_LSIZE(hdr));
|
|
|
|
L2BLK_SET_PSIZE((le)->le_prop, HDR_GET_PSIZE(hdr));
|
|
|
|
L2BLK_SET_COMPRESS((le)->le_prop, HDR_GET_COMPRESS(hdr));
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
le->le_complevel = hdr->b_complevel;
|
2020-04-10 20:33:35 +03:00
|
|
|
L2BLK_SET_TYPE((le)->le_prop, hdr->b_type);
|
|
|
|
L2BLK_SET_PROTECTED((le)->le_prop, !!(HDR_PROTECTED(hdr)));
|
|
|
|
L2BLK_SET_PREFETCH((le)->le_prop, !!(HDR_PREFETCH(hdr)));
|
Add L2ARC arcstats for MFU/MRU buffers and buffer content type
Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their
content type is unknown. Knowing this information may prove beneficial
in adjusting the L2ARC caching policy.
This commit adds L2ARC arcstats that display the aligned size
(in bytes) of L2ARC buffers according to their content type
(data/metadata) and according to their ARC state (MRU/MFU or
prefetch). It also expands the existing evict_l2_eligible arcstat to
differentiate between MFU and MRU buffers.
L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a
buffer, its ARC state (MRU/MFU) is stored in the L2 header
(b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size
(in bytes) of L2ARC buffers according to their ARC state (based on
b_arcs_state). We also account for the case where an L2ARC and ARC
cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize
reflects the alinged size (in bytes) of L2ARC buffers that were cached
while they had the prefetch flag set in ARC. This is dynamically updated
as the prefetch flag of L2ARC buffers changes.
When buffers are evicted from ARC, if they are determined to be L2ARC
eligible then their logical size is recorded in
evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon
eviction.
Persistent L2ARC:
When committing an L2ARC buffer to a log block (L2ARC metadata) its
b_arcs_state and prefetch flag is also stored. If the buffer changes
its arcstate or prefetch flag this is reflected in the above arcstats.
However, the L2ARC metadata cannot currently be updated to reflect this
change.
Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count
this as an MRU buffer. The buffer transitions to MFU. The arcstats are
updated to reflect this. Upon pool re-import or on/offlining the L2ARC
device the arcstats are cleared and the buffer will now be counted as an
MRU buffer, as the L2ARC metadata were not updated.
Bug fix:
- If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of
an ARC buffer. However, prefetches may be issued in a way that
arc_read_done() is bypassed. Instead, move the related code in
l2arc_write_eligible() to account for those cases too.
Also add a test and update manpages for l2arc_mfuonly module parameter,
and update the manpages and code comments for l2arc_noprefetch.
Move persist_l2arc tests to l2arc.
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10743
2020-09-14 20:10:44 +03:00
|
|
|
L2BLK_SET_STATE((le)->le_prop, hdr->b_l1hdr.b_state->arcs_state);
|
2020-04-10 20:33:35 +03:00
|
|
|
|
|
|
|
dev->l2ad_log_blk_payload_asize += vdev_psize_to_asize(dev->l2ad_vdev,
|
|
|
|
HDR_GET_PSIZE(hdr));
|
|
|
|
|
2020-05-08 02:34:03 +03:00
|
|
|
return (dev->l2ad_log_ent_idx == dev->l2ad_log_entries);
|
2020-04-10 20:33:35 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Checks whether a given L2ARC device address sits in a time-sequential
|
|
|
|
* range. The trick here is that the L2ARC is a rotary buffer, so we can't
|
|
|
|
* just do a range comparison, we need to handle the situation in which the
|
|
|
|
* range wraps around the end of the L2ARC device. Arguments:
|
|
|
|
* bottom -- Lower end of the range to check (written to earlier).
|
|
|
|
* top -- Upper end of the range to check (written to later).
|
|
|
|
* check -- The address for which we want to determine if it sits in
|
|
|
|
* between the top and bottom.
|
|
|
|
*
|
|
|
|
* The 3-way conditional below represents the following cases:
|
|
|
|
*
|
|
|
|
* bottom < top : Sequentially ordered case:
|
|
|
|
* <check>--------+-------------------+
|
|
|
|
* | (overlap here?) |
|
|
|
|
* L2ARC dev V V
|
|
|
|
* |---------------<bottom>============<top>--------------|
|
|
|
|
*
|
|
|
|
* bottom > top: Looped-around case:
|
|
|
|
* <check>--------+------------------+
|
|
|
|
* | (overlap here?) |
|
|
|
|
* L2ARC dev V V
|
|
|
|
* |===============<top>---------------<bottom>===========|
|
|
|
|
* ^ ^
|
|
|
|
* | (or here?) |
|
|
|
|
* +---------------+---------<check>
|
|
|
|
*
|
|
|
|
* top == bottom : Just a single address comparison.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
|
|
|
|
{
|
|
|
|
if (bottom < top)
|
|
|
|
return (bottom <= check && check <= top);
|
|
|
|
else if (bottom > top)
|
|
|
|
return (check <= top || bottom <= check);
|
|
|
|
else
|
|
|
|
return (check == top);
|
|
|
|
}
|
|
|
|
|
2014-11-13 21:09:05 +03:00
|
|
|
EXPORT_SYMBOL(arc_buf_size);
|
|
|
|
EXPORT_SYMBOL(arc_write);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(arc_read);
|
2013-10-03 04:11:19 +04:00
|
|
|
EXPORT_SYMBOL(arc_buf_info);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(arc_getbuf_func);
|
2011-12-23 00:20:43 +04:00
|
|
|
EXPORT_SYMBOL(arc_add_prune_callback);
|
|
|
|
EXPORT_SYMBOL(arc_remove_prune_callback);
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2021-08-16 18:35:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min, param_set_arc_min,
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
spl_param_get_u64, ZMOD_RW, "Minimum ARC size in bytes");
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2021-08-16 18:35:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, max, param_set_arc_max,
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
spl_param_get_u64, ZMOD_RW, "Maximum ARC size in bytes");
|
2010-08-26 22:49:16 +04:00
|
|
|
|
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 22:17:23 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_balance, UINT, ZMOD_RW,
|
|
|
|
"Balance between metadata and data on ghost hits.");
|
2015-05-30 17:57:53 +03:00
|
|
|
|
2019-10-27 01:22:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, grow_retry, param_set_arc_int,
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
param_get_uint, ZMOD_RW, "Seconds before growing ARC size");
|
2011-05-04 02:09:28 +04:00
|
|
|
|
2019-10-27 01:22:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, shrink_shift, param_set_arc_int,
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
param_get_uint, ZMOD_RW, "log2(fraction of ARC to reclaim)");
|
2011-05-04 02:09:28 +04:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, pc_percent, UINT, ZMOD_RW,
|
2022-04-20 23:16:25 +03:00
|
|
|
"Percent of pagecache to reclaim ARC to");
|
2017-03-16 04:34:56 +03:00
|
|
|
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, average_blocksize, UINT, ZMOD_RD,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Target average block size");
|
2014-08-20 21:09:40 +04:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs, zfs_, compressed_arc_enabled, INT, ZMOD_RW,
|
2022-04-20 23:16:25 +03:00
|
|
|
"Disable compressed ARC buffers");
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2019-10-27 01:22:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prefetch_ms, param_set_arc_int,
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
param_get_uint, ZMOD_RW, "Min life of prefetch block in ms");
|
2017-11-16 04:27:01 +03:00
|
|
|
|
2019-10-27 01:22:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prescient_prefetch_ms,
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
param_set_arc_int, param_get_uint, ZMOD_RW,
|
2017-11-16 04:27:01 +03:00
|
|
|
"Min life of prescient prefetched block in ms");
|
2013-07-24 21:14:11 +04:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_max, U64, ZMOD_RW,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Max write bytes per interval");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_boost, U64, ZMOD_RW,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Extra write bytes during device warmup");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom, U64, ZMOD_RW,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Number of max device writes to precache");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom_boost, U64, ZMOD_RW,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Compressed l2arc_headroom multiplier");
|
2013-08-02 00:02:10 +04:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, trim_ahead, U64, ZMOD_RW,
|
2020-06-09 20:15:08 +03:00
|
|
|
"TRIM ahead L2ARC write size multiplier");
|
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_secs, U64, ZMOD_RW,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Seconds between L2ARC writing");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_min_ms, U64, ZMOD_RW,
|
2019-09-06 00:49:49 +03:00
|
|
|
"Min feed interval in milliseconds");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, noprefetch, INT, ZMOD_RW,
|
|
|
|
"Skip caching prefetched buffers");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_again, INT, ZMOD_RW,
|
|
|
|
"Turbo L2ARC warmup");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
2019-09-06 00:49:49 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, norw, INT, ZMOD_RW,
|
|
|
|
"No reads during writes");
|
2011-07-08 23:41:57 +04:00
|
|
|
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, meta_percent, UINT, ZMOD_RW,
|
2020-08-26 00:33:36 +03:00
|
|
|
"Percent of ARC size allowed for L2ARC-only headers");
|
|
|
|
|
2020-04-10 20:33:35 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_enabled, INT, ZMOD_RW,
|
|
|
|
"Rebuild the L2ARC when importing a pool");
|
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_blocks_min_l2size, U64, ZMOD_RW,
|
2020-04-10 20:33:35 +03:00
|
|
|
"Min size in bytes to write rebuild log blocks in L2ARC");
|
|
|
|
|
2020-09-08 21:44:37 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, mfuonly, INT, ZMOD_RW,
|
|
|
|
"Cache only MFU data from ARC into L2ARC");
|
|
|
|
|
2021-11-11 23:52:16 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, exclude_special, INT, ZMOD_RW,
|
2022-01-21 19:07:15 +03:00
|
|
|
"Exclude dbufs on special vdevs from being cached to L2ARC if set.");
|
2021-11-11 23:52:16 +03:00
|
|
|
|
2019-10-27 01:22:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, lotsfree_percent, param_set_arc_int,
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
param_get_uint, ZMOD_RW, "System free memory I/O throttle in bytes");
|
2015-07-28 21:30:00 +03:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, sys_free, param_set_arc_u64,
|
|
|
|
spl_param_get_u64, ZMOD_RW, "System free memory target size in bytes");
|
2015-07-27 23:17:32 +03:00
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit, param_set_arc_u64,
|
|
|
|
spl_param_get_u64, ZMOD_RW, "Minimum bytes of dnodes in ARC");
|
2016-07-13 15:42:40 +03:00
|
|
|
|
2019-10-27 01:22:19 +03:00
|
|
|
ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit_percent,
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
param_set_arc_int, param_get_uint, ZMOD_RW,
|
2016-08-11 06:15:37 +03:00
|
|
|
"Percent of ARC meta buffers for dnodes");
|
|
|
|
|
Cleanup: 64-bit kernel module parameters should use fixed width types
Various module parameters such as `zfs_arc_max` were originally
`uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long`
for Linux compatibility because Linux's kernel default module parameter
implementation did not support 64-bit types on 32-bit platforms. This
caused problems when porting OpenZFS to Windows because its LLP64 memory
model made `unsigned long` a 32-bit type on 64-bit, which created the
undesireable situation that parameters that should accept 64-bit values
could not on 64-bit Windows.
Upon inspection, it turns out that the Linux kernel module parameter
interface is extensible, such that we are allowed to define our own
types. Rather than maintaining the original type change via hacks to to
continue shrinking module parameters on 32-bit Linux, we implement
support for 64-bit module parameters on Linux.
After doing a review of all 64-bit kernel parameters (found via the man
page and also proposed changes by Andrew Innes), the kernel module
parameters fell into a few groups:
Parameters that were originally 64-bit on Illumos:
* dbuf_cache_max_bytes
* dbuf_metadata_cache_max_bytes
* l2arc_feed_min_ms
* l2arc_feed_secs
* l2arc_headroom
* l2arc_headroom_boost
* l2arc_write_boost
* l2arc_write_max
* metaslab_aliquot
* metaslab_force_ganging
* zfetch_array_rd_sz
* zfs_arc_max
* zfs_arc_meta_limit
* zfs_arc_meta_min
* zfs_arc_min
* zfs_async_block_max_blocks
* zfs_condense_max_obsolete_bytes
* zfs_condense_min_mapping_bytes
* zfs_deadman_checktime_ms
* zfs_deadman_synctime_ms
* zfs_initialize_chunk_size
* zfs_initialize_value
* zfs_lua_max_instrlimit
* zfs_lua_max_memlimit
* zil_slog_bulk
Parameters that were originally 32-bit on Illumos:
* zfs_per_txg_dirty_frees_percent
Parameters that were originally `ssize_t` on Illumos:
* zfs_immediate_write_sz
Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It
has been upgraded to 64-bit.
Parameters that were `long`/`unsigned long` because of Linux/FreeBSD
influence:
* l2arc_rebuild_blocks_min_l2size
* zfs_key_max_salt_uses
* zfs_max_log_walking
* zfs_max_logsm_summary_length
* zfs_metaslab_max_size_cache_sec
* zfs_min_metaslabs_to_flush
* zfs_multihost_interval
* zfs_unflushed_log_block_max
* zfs_unflushed_log_block_min
* zfs_unflushed_log_block_pct
* zfs_unflushed_max_mem_amt
* zfs_unflushed_max_mem_ppm
New parameters that do not exist in Illumos:
* l2arc_trim_ahead
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_arc_sys_free
* zfs_deadman_ziotime_ms
* zfs_delete_blocks
* zfs_history_output_max
* zfs_livelist_max_entries
* zfs_max_async_dedup_frees
* zfs_max_nvlist_src_size
* zfs_rebuild_max_segment
* zfs_rebuild_vdev_limit
* zfs_unflushed_log_txg_max
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
* zfs_vnops_read_chunk_size
* zvol_max_discard_blocks
Rather than clutter the lists with commentary, the module parameters
that need comments are repeated below.
A few parameters were defined in Linux/FreeBSD specific code, where the
use of ulong/long is not an issue for portability, so we leave them
alone:
* zfs_delete_blocks
* zfs_key_max_salt_uses
* zvol_max_discard_blocks
The documentation for a few parameters was found to be incorrect:
* zfs_deadman_checktime_ms - incorrectly documented as int
* zfs_delete_blocks - not documented as Linux only
* zfs_history_output_max - incorrectly documented as int
* zfs_vnops_read_chunk_size - incorrectly documented as long
* zvol_max_discard_blocks - incorrectly documented as ulong
The documentation for these has been fixed, alongside the changes to
document the switch to fixed width types.
In addition, several kernel module parameters were percentages or held
ashift values, so being 64-bit never made sense for them. They have been
downgraded to 32-bit:
* vdev_file_logical_ashift
* vdev_file_physical_ashift
* zfs_arc_dnode_limit_percent
* zfs_arc_dnode_reduce_percent
* zfs_arc_meta_limit_percent
* zfs_per_txg_dirty_frees_percent
* zfs_unflushed_log_block_pct
* zfs_vdev_max_auto_ashift
* zfs_vdev_min_auto_ashift
Of special note are `zfs_vdev_max_auto_ashift` and
`zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`,
and passed to the kernel as `ulong`. This is inherently buggy on big
endian 32-bit Linux, since the values would not be written to the
correct locations. 32-bit FreeBSD was unaffected because its sysctl code
correctly treated this as a `uint64_t`.
Lastly, a code comment suggests that `zfs_arc_sys_free` is
Linux-specific, but there is nothing to indicate to me that it is
Linux-specific. Nothing was done about that.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Original-patch-by: Andrew Innes <andrew.c12@gmail.com>
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13984
Closes #14004
2022-10-03 22:06:54 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, dnode_reduce_percent, UINT, ZMOD_RW,
|
2016-07-13 15:42:40 +03:00
|
|
|
"Percentage of excess dnodes to try to unpin");
|
Revise ARC shrinker algorithm
The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc5 ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #10600
2020-08-01 07:10:52 +03:00
|
|
|
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, eviction_pct, UINT, ZMOD_RW,
|
2020-10-22 20:18:26 +03:00
|
|
|
"When full, ARC allocation waits for eviction of this % of alloc size");
|
|
|
|
|
Cleanup: Specify unsignedness on things that should not be signed
In #13871, zfs_vdev_aggregation_limit_non_rotating and
zfs_vdev_aggregation_limit being signed was pointed out as a possible
reason not to eliminate an unnecessary MAX(unsigned, 0) since the
unsigned value was assigned from them.
There is no reason for these module parameters to be signed and upon
inspection, it was found that there are a number of other module
parameters that are signed, but should not be, so we make them unsigned.
Making them unsigned made it clear that some other variables in the code
should also be unsigned, so we also make those unsigned. This prevents
users from setting negative values that could potentially cause bad
behaviors. It also makes the code slightly easier to understand.
Mostly module parameters that deal with timeouts, limits, bitshifts and
percentages are made unsigned by this. Any that are boolean are left
signed, since whether booleans should be considered signed or unsigned
does not matter.
Making zfs_arc_lotsfree_percent unsigned caused a
`zfs_arc_lotsfree_percent >= 0` check to become redundant, so it was
removed. Removing the check was also necessary to prevent a compiler
error from -Werror=type-limits.
Several end of line comments had to be moved to their own lines because
replacing int with uint_t caused us to exceed the 80 character limit
enforced by cstyle.pl.
The following were kept signed because they are passed to
taskq_create(), which expects signed values and modifying the
OpenSolaris/Illumos DDI is out of scope of this patch:
* metaslab_load_pct
* zfs_sync_taskq_batch_pct
* zfs_zil_clean_taskq_nthr_pct
* zfs_zil_clean_taskq_minalloc
* zfs_zil_clean_taskq_maxalloc
* zfs_arc_prune_task_threads
Also, negative values in those parameters was found to be harmless.
The following were left signed because either negative values make
sense, or more analysis was needed to determine whether negative values
should be disallowed:
* zfs_metaslab_switch_threshold
* zfs_pd_bytes_max
* zfs_livelist_min_percent_shared
zfs_multihost_history was made static to be consistent with other
parameters.
A number of module parameters were marked as signed, but in reality
referenced unsigned variables. upgrade_errlog_limit is one of the
numerous examples. In the case of zfs_vdev_async_read_max_active, it was
already uint32_t, but zdb had an extern int declaration for it.
Interestingly, the documentation in zfs.4 was right for
upgrade_errlog_limit despite the module parameter being wrongly marked,
while the documentation for zfs_vdev_async_read_max_active (and friends)
was wrong. It was also wrong for zstd_abort_size, which was unsigned,
but was documented as signed.
Also, the documentation in zfs.4 incorrectly described the following
parameters as ulong when they were int:
* zfs_arc_meta_adjust_restarts
* zfs_override_estimate_recordsize
They are now uint_t as of this patch and thus the man page has been
updated to describe them as uint.
dbuf_state_index was left alone since it does nothing and perhaps should
be removed in another patch.
If any module parameters were missed, they were not found by `grep -r
'ZFS_MODULE_PARAM' | grep ', INT'`. I did find a few that grep missed,
but only because they were in files that had hits.
This patch intentionally did not attempt to address whether some of
these module parameters should be elevated to 64-bit parameters, because
the length of a long on 32-bit is 32-bit.
Lastly, it was pointed out during review that uint_t is a better match
for these variables than uint32_t because FreeBSD kernel parameter
definitions are designed for uint_t, whose bit width can change in
future memory models. As a result, we change the existing parameters
that are uint32_t to use uint_t.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes #13875
2022-09-28 02:42:41 +03:00
|
|
|
ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_batch_limit, UINT, ZMOD_RW,
|
2020-10-22 20:18:26 +03:00
|
|
|
"The number of headers to evict per sublist before moving to the next");
|
2021-12-23 04:07:13 +03:00
|
|
|
|
|
|
|
ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, prune_task_threads, INT, ZMOD_RW,
|
|
|
|
"Number of arc_prune threads");
|