2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
|
|
|
* or http://www.opensolaris.org/os/licensing.
|
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
2015-06-27 01:14:45 +03:00
|
|
|
* Copyright (c) 2012, Joyent, Inc. All rights reserved.
|
2016-05-15 18:02:28 +03:00
|
|
|
* Copyright (c) 2011, 2016 by Delphix. All rights reserved.
|
2015-06-27 01:14:45 +03:00
|
|
|
* Copyright (c) 2014 by Saso Kiselkov. All rights reserved.
|
2016-06-02 07:04:53 +03:00
|
|
|
* Copyright 2015 Nexenta Systems, Inc. All rights reserved.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* DVA-based Adjustable Replacement Cache
|
|
|
|
*
|
|
|
|
* While much of the theory of operation used here is
|
|
|
|
* based on the self-tuning, low overhead replacement cache
|
|
|
|
* presented by Megiddo and Modha at FAST 2003, there are some
|
|
|
|
* significant differences:
|
|
|
|
*
|
|
|
|
* 1. The Megiddo and Modha model assumes any page is evictable.
|
|
|
|
* Pages in its cache cannot be "locked" into memory. This makes
|
|
|
|
* the eviction algorithm simple: evict the last page in the list.
|
|
|
|
* This also make the performance characteristics easy to reason
|
|
|
|
* about. Our cache is not so simple. At any given moment, some
|
|
|
|
* subset of the blocks in the cache are un-evictable because we
|
|
|
|
* have handed out a reference to them. Blocks are only evictable
|
|
|
|
* when there are no external references active. This makes
|
|
|
|
* eviction far more problematic: we choose to evict the evictable
|
|
|
|
* blocks that are the "lowest" in the list.
|
|
|
|
*
|
|
|
|
* There are times when it is not possible to evict the requested
|
|
|
|
* space. In these circumstances we are unable to adjust the cache
|
|
|
|
* size. To prevent the cache growing unbounded at these times we
|
|
|
|
* implement a "cache throttle" that slows the flow of new data
|
|
|
|
* into the cache until we can make space available.
|
|
|
|
*
|
|
|
|
* 2. The Megiddo and Modha model assumes a fixed cache size.
|
|
|
|
* Pages are evicted when the cache is full and there is a cache
|
|
|
|
* miss. Our model has a variable sized cache. It grows with
|
|
|
|
* high use, but also tries to react to memory pressure from the
|
|
|
|
* operating system: decreasing its size when system memory is
|
|
|
|
* tight.
|
|
|
|
*
|
|
|
|
* 3. The Megiddo and Modha model assumes a fixed page size. All
|
2013-06-11 21:12:34 +04:00
|
|
|
* elements of the cache are therefore exactly the same size. So
|
2008-11-20 23:01:55 +03:00
|
|
|
* when adjusting the cache size following a cache miss, its simply
|
|
|
|
* a matter of choosing a single page to evict. In our model, we
|
|
|
|
* have variable sized cache blocks (rangeing from 512 bytes to
|
2013-06-11 21:12:34 +04:00
|
|
|
* 128K bytes). We therefore choose a set of blocks to evict to make
|
2008-11-20 23:01:55 +03:00
|
|
|
* space for a cache miss that approximates as closely as possible
|
|
|
|
* the space used by the new block.
|
|
|
|
*
|
|
|
|
* See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
|
|
|
|
* by N. Megiddo & D. Modha, FAST 2003
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The locking model:
|
|
|
|
*
|
|
|
|
* A new reference to a cache buffer can be obtained in two
|
|
|
|
* ways: 1) via a hash table lookup using the DVA as a key,
|
|
|
|
* or 2) via one of the ARC lists. The arc_read() interface
|
2016-07-11 20:45:52 +03:00
|
|
|
* uses method 1, while the internal ARC algorithms for
|
2013-06-11 21:12:34 +04:00
|
|
|
* adjusting the cache use method 2. We therefore provide two
|
2008-11-20 23:01:55 +03:00
|
|
|
* types of locks: 1) the hash table lock array, and 2) the
|
2016-07-11 20:45:52 +03:00
|
|
|
* ARC list locks.
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
2013-01-11 20:54:18 +04:00
|
|
|
* Buffers do not have their own mutexes, rather they rely on the
|
|
|
|
* hash table mutexes for the bulk of their protection (i.e. most
|
|
|
|
* fields in the arc_buf_hdr_t are protected by these mutexes).
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
|
|
|
* buf_hash_find() returns the appropriate mutex (held) when it
|
|
|
|
* locates the requested buffer in the hash table. It returns
|
|
|
|
* NULL for the mutex if the buffer was not in the table.
|
|
|
|
*
|
|
|
|
* buf_hash_remove() expects the appropriate hash mutex to be
|
|
|
|
* already held before it is invoked.
|
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Each ARC state also has a mutex which is used to protect the
|
2008-11-20 23:01:55 +03:00
|
|
|
* buffer list associated with the state. When attempting to
|
2016-07-11 20:45:52 +03:00
|
|
|
* obtain a hash table lock while holding an ARC list lock you
|
2008-11-20 23:01:55 +03:00
|
|
|
* must use: mutex_tryenter() to avoid deadlock. Also note that
|
|
|
|
* the active state mutex must be held before the ghost state mutex.
|
|
|
|
*
|
2011-12-23 00:20:43 +04:00
|
|
|
* It as also possible to register a callback which is run when the
|
|
|
|
* arc_meta_limit is reached and no buffers can be safely evicted. In
|
|
|
|
* this case the arc user should drop a reference on some arc buffers so
|
|
|
|
* they can be reclaimed and the arc_meta_limit honored. For example,
|
|
|
|
* when using the ZPL each dentry holds a references on a znode. These
|
|
|
|
* dentries must be pruned before the arc buffer holding the znode can
|
|
|
|
* be safely evicted.
|
|
|
|
*
|
2008-11-20 23:01:55 +03:00
|
|
|
* Note that the majority of the performance stats are manipulated
|
|
|
|
* with atomic operations.
|
|
|
|
*
|
2014-12-30 06:12:23 +03:00
|
|
|
* The L2ARC uses the l2ad_mtx on each vdev for the following:
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
|
|
|
* - L2ARC buflist creation
|
|
|
|
* - L2ARC buflist eviction
|
|
|
|
* - L2ARC write completion, which walks L2ARC buflists
|
|
|
|
* - ARC header destruction, as it removes from L2ARC buflists
|
|
|
|
* - ARC header release, as it removes from L2ARC buflists
|
|
|
|
*/
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* ARC operation:
|
|
|
|
*
|
|
|
|
* Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.
|
|
|
|
* This structure can point either to a block that is still in the cache or to
|
|
|
|
* one that is only accessible in an L2 ARC device, or it can provide
|
|
|
|
* information about a block that was recently evicted. If a block is
|
|
|
|
* only accessible in the L2ARC, then the arc_buf_hdr_t only has enough
|
|
|
|
* information to retrieve it from the L2ARC device. This information is
|
|
|
|
* stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block
|
|
|
|
* that is in this state cannot access the data directly.
|
|
|
|
*
|
|
|
|
* Blocks that are actively being referenced or have not been evicted
|
|
|
|
* are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within
|
|
|
|
* the arc_buf_hdr_t that will point to the data block in memory. A block can
|
|
|
|
* only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC
|
2016-07-11 20:45:52 +03:00
|
|
|
* caches data in two ways -- in a list of ARC buffers (arc_buf_t) and
|
2016-06-02 07:04:53 +03:00
|
|
|
* also in the arc_buf_hdr_t's private physical data block pointer (b_pdata).
|
2016-07-11 20:45:52 +03:00
|
|
|
*
|
|
|
|
* The L1ARC's data pointer may or may not be uncompressed. The ARC has the
|
|
|
|
* ability to store the physical data (b_pdata) associated with the DVA of the
|
|
|
|
* arc_buf_hdr_t. Since the b_pdata is a copy of the on-disk physical block,
|
|
|
|
* it will match its on-disk compression characteristics. This behavior can be
|
|
|
|
* disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the
|
|
|
|
* compressed ARC functionality is disabled, the b_pdata will point to an
|
|
|
|
* uncompressed version of the on-disk data.
|
|
|
|
*
|
|
|
|
* Data in the L1ARC is not accessed by consumers of the ARC directly. Each
|
|
|
|
* arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.
|
|
|
|
* Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC
|
|
|
|
* consumer. The ARC will provide references to this data and will keep it
|
|
|
|
* cached until it is no longer in use. The ARC caches only the L1ARC's physical
|
|
|
|
* data block and will evict any arc_buf_t that is no longer referenced. The
|
|
|
|
* amount of memory consumed by the arc_buf_ts' data buffers can be seen via the
|
2016-06-02 07:04:53 +03:00
|
|
|
* "overhead_size" kstat.
|
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Depending on the consumer, an arc_buf_t can be requested in uncompressed or
|
|
|
|
* compressed form. The typical case is that consumers will want uncompressed
|
|
|
|
* data, and when that happens a new data buffer is allocated where the data is
|
|
|
|
* decompressed for them to use. Currently the only consumer who wants
|
|
|
|
* compressed arc_buf_t's is "zfs send", when it streams data exactly as it
|
|
|
|
* exists on disk. When this happens, the arc_buf_t's data buffer is shared
|
|
|
|
* with the arc_buf_hdr_t.
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The
|
|
|
|
* first one is owned by a compressed send consumer (and therefore references
|
|
|
|
* the same compressed data buffer as the arc_buf_hdr_t) and the second could be
|
|
|
|
* used by any other consumer (and has its own uncompressed copy of the data
|
|
|
|
* buffer).
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* arc_buf_hdr_t
|
|
|
|
* +-----------+
|
|
|
|
* | fields |
|
|
|
|
* | common to |
|
|
|
|
* | L1- and |
|
|
|
|
* | L2ARC |
|
|
|
|
* +-----------+
|
|
|
|
* | l2arc_buf_hdr_t
|
|
|
|
* | |
|
|
|
|
* +-----------+
|
|
|
|
* | l1arc_buf_hdr_t
|
|
|
|
* | | arc_buf_t
|
|
|
|
* | b_buf +------------>+-----------+ arc_buf_t
|
|
|
|
* | b_pdata +-+ |b_next +---->+-----------+
|
|
|
|
* +-----------+ | |-----------| |b_next +-->NULL
|
|
|
|
* | |b_comp = T | +-----------+
|
|
|
|
* | |b_data +-+ |b_comp = F |
|
|
|
|
* | +-----------+ | |b_data +-+
|
|
|
|
* +->+------+ | +-----------+ |
|
|
|
|
* compressed | | | |
|
|
|
|
* data | |<--------------+ | uncompressed
|
|
|
|
* +------+ compressed, | data
|
|
|
|
* shared +-->+------+
|
|
|
|
* data | |
|
|
|
|
* | |
|
|
|
|
* +------+
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
|
|
|
* When a consumer reads a block, the ARC must first look to see if the
|
2016-07-11 20:45:52 +03:00
|
|
|
* arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new
|
|
|
|
* arc_buf_t and either copies uncompressed data into a new data buffer from an
|
|
|
|
* existing uncompressed arc_buf_t, decompresses the hdr's b_pdata buffer into a
|
|
|
|
* new data buffer, or shares the hdr's b_pdata buffer, depending on whether the
|
|
|
|
* hdr is compressed and the desired compression characteristics of the
|
|
|
|
* arc_buf_t consumer. If the arc_buf_t ends up sharing data with the
|
|
|
|
* arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be
|
|
|
|
* the last buffer in the hdr's b_buf list, however a shared compressed buf can
|
|
|
|
* be anywhere in the hdr's list.
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
|
|
|
* The diagram below shows an example of an uncompressed ARC hdr that is
|
2016-07-11 20:45:52 +03:00
|
|
|
* sharing its data with an arc_buf_t (note that the shared uncompressed buf is
|
|
|
|
* the last element in the buf list):
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
|
|
|
* arc_buf_hdr_t
|
|
|
|
* +-----------+
|
|
|
|
* | |
|
|
|
|
* | |
|
|
|
|
* | |
|
|
|
|
* +-----------+
|
|
|
|
* l2arc_buf_hdr_t| |
|
|
|
|
* | |
|
|
|
|
* +-----------+
|
|
|
|
* l1arc_buf_hdr_t| |
|
|
|
|
* | | arc_buf_t (shared)
|
|
|
|
* | b_buf +------------>+---------+ arc_buf_t
|
|
|
|
* | | |b_next +---->+---------+
|
|
|
|
* | b_pdata +-+ |---------| |b_next +-->NULL
|
|
|
|
* +-----------+ | | | +---------+
|
|
|
|
* | |b_data +-+ | |
|
|
|
|
* | +---------+ | |b_data +-+
|
|
|
|
* +->+------+ | +---------+ |
|
|
|
|
* | | | |
|
|
|
|
* uncompressed | | | |
|
|
|
|
* data +------+ | |
|
|
|
|
* ^ +->+------+ |
|
|
|
|
* | uncompressed | | |
|
|
|
|
* | data | | |
|
|
|
|
* | +------+ |
|
|
|
|
* +---------------------------------+
|
|
|
|
*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Writing to the ARC requires that the ARC first discard the hdr's b_pdata
|
2016-06-02 07:04:53 +03:00
|
|
|
* since the physical block is about to be rewritten. The new data contents
|
2016-07-11 20:45:52 +03:00
|
|
|
* will be contained in the arc_buf_t. As the I/O pipeline performs the write,
|
|
|
|
* it may compress the data before writing it to disk. The ARC will be called
|
|
|
|
* with the transformed data and will bcopy the transformed on-disk block into
|
|
|
|
* a newly allocated b_pdata. Writes are always done into buffers which have
|
|
|
|
* either been loaned (and hence are new and don't have other readers) or
|
|
|
|
* buffers which have been released (and hence have their own hdr, if there
|
|
|
|
* were originally other readers of the buf's original hdr). This ensures that
|
|
|
|
* the ARC only needs to update a single buf and its hdr after a write occurs.
|
2016-06-02 07:04:53 +03:00
|
|
|
*
|
|
|
|
* When the L2ARC is in use, it will also take advantage of the b_pdata. The
|
|
|
|
* L2ARC will always write the contents of b_pdata to the L2ARC. This means
|
2016-07-11 20:45:52 +03:00
|
|
|
* that when compressed ARC is enabled that the L2ARC blocks are identical
|
2016-06-02 07:04:53 +03:00
|
|
|
* to the on-disk block in the main data pool. This provides a significant
|
|
|
|
* advantage since the ARC can leverage the bp's checksum when reading from the
|
|
|
|
* L2ARC to determine if the contents are valid. However, if the compressed
|
2016-07-11 20:45:52 +03:00
|
|
|
* ARC is disabled, then the L2ARC's block must be transformed to look
|
2016-06-02 07:04:53 +03:00
|
|
|
* like the physical block in the main data pool before comparing the
|
|
|
|
* checksum and determining its validity.
|
|
|
|
*/
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/spa.h>
|
|
|
|
#include <sys/zio.h>
|
2016-06-02 07:04:53 +03:00
|
|
|
#include <sys/spa_impl.h>
|
2013-08-02 00:02:10 +04:00
|
|
|
#include <sys/zio_compress.h>
|
2016-06-02 07:04:53 +03:00
|
|
|
#include <sys/zio_checksum.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/arc.h>
|
2015-06-27 01:14:45 +03:00
|
|
|
#include <sys/refcount.h>
|
2008-12-03 23:09:06 +03:00
|
|
|
#include <sys/vdev.h>
|
2009-07-03 02:44:48 +04:00
|
|
|
#include <sys/vdev_impl.h>
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
#include <sys/dsl_pool.h>
|
2015-01-13 06:52:19 +03:00
|
|
|
#include <sys/multilist.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#ifdef _KERNEL
|
|
|
|
#include <sys/vmsystm.h>
|
|
|
|
#include <vm/anon.h>
|
|
|
|
#include <sys/fs/swapnode.h>
|
2011-12-23 00:20:43 +04:00
|
|
|
#include <sys/zpl.h>
|
2014-11-14 21:21:53 +03:00
|
|
|
#include <linux/mm_compat.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#endif
|
|
|
|
#include <sys/callb.h>
|
|
|
|
#include <sys/kstat.h>
|
2012-01-20 22:58:57 +04:00
|
|
|
#include <sys/dmu_tx.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <zfs_fletcher.h>
|
2014-10-22 04:59:33 +04:00
|
|
|
#include <sys/arc_impl.h>
|
2014-12-13 05:07:39 +03:00
|
|
|
#include <sys/trace_arc.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-05-17 01:18:06 +04:00
|
|
|
#ifndef _KERNEL
|
|
|
|
/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
|
|
|
|
boolean_t arc_watch = B_FALSE;
|
|
|
|
#endif
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
static kmutex_t arc_reclaim_lock;
|
|
|
|
static kcondvar_t arc_reclaim_thread_cv;
|
|
|
|
static boolean_t arc_reclaim_thread_exit;
|
|
|
|
static kcondvar_t arc_reclaim_waiters_cv;
|
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* The number of headers to evict in arc_evict_state_impl() before
|
|
|
|
* dropping the sublist lock and evicting from another sublist. A lower
|
|
|
|
* value means we're more likely to evict the "correct" header (i.e. the
|
|
|
|
* oldest header in the arc state), but comes with higher overhead
|
|
|
|
* (i.e. more invocations of arc_evict_state_impl()).
|
|
|
|
*/
|
|
|
|
int zfs_arc_evict_batch_limit = 10;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The number of sublists used for each of the arc state lists. If this
|
|
|
|
* is not set to a suitable value by the user, it will be configured to
|
|
|
|
* the number of CPUs on the system in arc_init().
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
int zfs_arc_num_sublists_per_state = 0;
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* number of seconds before growing cache again */
|
2015-06-26 21:28:18 +03:00
|
|
|
static int arc_grow_retry = 5;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/* shift of arc_c for calculating overflow limit in arc_get_data_buf */
|
2015-06-26 21:28:18 +03:00
|
|
|
int zfs_arc_overflow_shift = 8;
|
2014-01-03 22:36:26 +04:00
|
|
|
|
2015-06-27 01:59:23 +03:00
|
|
|
/* shift of arc_c for calculating both min and max arc_p */
|
|
|
|
static int arc_p_min_shift = 4;
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
/* log2(fraction of arc to reclaim) */
|
2015-06-26 21:28:18 +03:00
|
|
|
static int arc_shrink_shift = 7;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2015-06-26 21:28:18 +03:00
|
|
|
* log2(fraction of ARC which must be free to allow growing).
|
|
|
|
* I.e. If there is less than arc_c >> arc_no_grow_shift free memory,
|
|
|
|
* when reading a new block into the ARC, we will evict an equal-sized block
|
|
|
|
* from the ARC.
|
|
|
|
*
|
|
|
|
* This must be less than arc_shrink_shift, so that when we shrink the ARC,
|
|
|
|
* we will still not allow it to grow.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-06-26 21:28:18 +03:00
|
|
|
int arc_no_grow_shift = 5;
|
2013-07-24 21:14:11 +04:00
|
|
|
|
2014-08-20 21:09:40 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* minimum lifespan of a prefetch block in clock ticks
|
|
|
|
* (initialized in arc_init())
|
|
|
|
*/
|
2015-06-26 21:28:18 +03:00
|
|
|
static int arc_min_prefetch_lifespan;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
|
|
|
* If this percent of memory is free, don't throttle.
|
|
|
|
*/
|
|
|
|
int arc_lotsfree_percent = 10;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static int arc_dead;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* The arc has filled available memory and has now warmed up.
|
|
|
|
*/
|
|
|
|
static boolean_t arc_warm;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* log2 fraction of the zio arena to keep free.
|
|
|
|
*/
|
|
|
|
int arc_zio_arena_free_shift = 2;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* These tunables are for performance analysis.
|
|
|
|
*/
|
2010-08-26 22:49:16 +04:00
|
|
|
unsigned long zfs_arc_max = 0;
|
|
|
|
unsigned long zfs_arc_min = 0;
|
|
|
|
unsigned long zfs_arc_meta_limit = 0;
|
2015-01-13 06:52:19 +03:00
|
|
|
unsigned long zfs_arc_meta_min = 0;
|
2016-07-13 15:42:40 +03:00
|
|
|
unsigned long zfs_arc_dnode_limit = 0;
|
|
|
|
unsigned long zfs_arc_dnode_reduce_percent = 10;
|
2015-06-26 21:28:18 +03:00
|
|
|
int zfs_arc_grow_retry = 0;
|
|
|
|
int zfs_arc_shrink_shift = 0;
|
2015-06-27 01:59:23 +03:00
|
|
|
int zfs_arc_p_min_shift = 0;
|
2015-06-26 21:28:18 +03:00
|
|
|
int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
int zfs_compressed_arc_enabled = B_TRUE;
|
|
|
|
|
2016-08-11 06:15:37 +03:00
|
|
|
/*
|
|
|
|
* ARC will evict meta buffers that exceed arc_meta_limit. This
|
|
|
|
* tunable make arc_meta_limit adjustable for different workloads.
|
|
|
|
*/
|
|
|
|
unsigned long zfs_arc_meta_limit_percent = 75;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Percentage that can be consumed by dnodes of ARC meta buffers.
|
|
|
|
*/
|
|
|
|
unsigned long zfs_arc_dnode_limit_percent = 10;
|
|
|
|
|
2015-03-18 01:08:22 +03:00
|
|
|
/*
|
2015-06-26 21:28:18 +03:00
|
|
|
* These tunables are Linux specific
|
2015-03-18 01:08:22 +03:00
|
|
|
*/
|
2015-07-27 23:17:32 +03:00
|
|
|
unsigned long zfs_arc_sys_free = 0;
|
2015-06-26 21:28:18 +03:00
|
|
|
int zfs_arc_min_prefetch_lifespan = 0;
|
|
|
|
int zfs_arc_p_aggressive_disable = 1;
|
|
|
|
int zfs_arc_p_dampener_disable = 1;
|
|
|
|
int zfs_arc_meta_prune = 10000;
|
|
|
|
int zfs_arc_meta_strategy = ARC_STRATEGY_META_BALANCED;
|
|
|
|
int zfs_arc_meta_adjust_restarts = 4096;
|
2015-07-28 21:30:00 +03:00
|
|
|
int zfs_arc_lotsfree_percent = 10;
|
2015-03-18 01:08:22 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* The 6 states: */
|
|
|
|
static arc_state_t ARC_anon;
|
|
|
|
static arc_state_t ARC_mru;
|
|
|
|
static arc_state_t ARC_mru_ghost;
|
|
|
|
static arc_state_t ARC_mfu;
|
|
|
|
static arc_state_t ARC_mfu_ghost;
|
|
|
|
static arc_state_t ARC_l2c_only;
|
|
|
|
|
|
|
|
typedef struct arc_stats {
|
|
|
|
kstat_named_t arcstat_hits;
|
|
|
|
kstat_named_t arcstat_misses;
|
|
|
|
kstat_named_t arcstat_demand_data_hits;
|
|
|
|
kstat_named_t arcstat_demand_data_misses;
|
|
|
|
kstat_named_t arcstat_demand_metadata_hits;
|
|
|
|
kstat_named_t arcstat_demand_metadata_misses;
|
|
|
|
kstat_named_t arcstat_prefetch_data_hits;
|
|
|
|
kstat_named_t arcstat_prefetch_data_misses;
|
|
|
|
kstat_named_t arcstat_prefetch_metadata_hits;
|
|
|
|
kstat_named_t arcstat_prefetch_metadata_misses;
|
|
|
|
kstat_named_t arcstat_mru_hits;
|
|
|
|
kstat_named_t arcstat_mru_ghost_hits;
|
|
|
|
kstat_named_t arcstat_mfu_hits;
|
|
|
|
kstat_named_t arcstat_mfu_ghost_hits;
|
|
|
|
kstat_named_t arcstat_deleted;
|
2013-06-11 21:12:34 +04:00
|
|
|
/*
|
|
|
|
* Number of buffers that could not be evicted because the hash lock
|
|
|
|
* was held by another thread. The lock may not necessarily be held
|
|
|
|
* by something using the same buffer, since hash locks are shared
|
|
|
|
* by multiple buffers.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_mutex_miss;
|
2013-06-11 21:12:34 +04:00
|
|
|
/*
|
|
|
|
* Number of buffers skipped because they have I/O in progress, are
|
|
|
|
* indrect prefetch buffers that have not lived long enough, or are
|
|
|
|
* not from the spa we're trying to evict from.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_evict_skip;
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Number of times arc_evict_state() was unable to evict enough
|
|
|
|
* buffers to reach its target amount.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_evict_not_enough;
|
2010-05-29 00:45:14 +04:00
|
|
|
kstat_named_t arcstat_evict_l2_cached;
|
|
|
|
kstat_named_t arcstat_evict_l2_eligible;
|
|
|
|
kstat_named_t arcstat_evict_l2_ineligible;
|
2015-01-13 06:52:19 +03:00
|
|
|
kstat_named_t arcstat_evict_l2_skip;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_hash_elements;
|
|
|
|
kstat_named_t arcstat_hash_elements_max;
|
|
|
|
kstat_named_t arcstat_hash_collisions;
|
|
|
|
kstat_named_t arcstat_hash_chains;
|
|
|
|
kstat_named_t arcstat_hash_chain_max;
|
|
|
|
kstat_named_t arcstat_p;
|
|
|
|
kstat_named_t arcstat_c;
|
|
|
|
kstat_named_t arcstat_c_min;
|
|
|
|
kstat_named_t arcstat_c_max;
|
|
|
|
kstat_named_t arcstat_size;
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Number of compressed bytes stored in the arc_buf_hdr_t's b_pdata.
|
|
|
|
* Note that the compressed bytes may match the uncompressed bytes
|
|
|
|
* if the block is either not compressed or compressed arc is disabled.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_compressed_size;
|
|
|
|
/*
|
|
|
|
* Uncompressed size of the data stored in b_pdata. If compressed
|
|
|
|
* arc is disabled then this value will be identical to the stat
|
|
|
|
* above.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_uncompressed_size;
|
|
|
|
/*
|
|
|
|
* Number of bytes stored in all the arc_buf_t's. This is classified
|
|
|
|
* as "overhead" since this data is typically short-lived and will
|
|
|
|
* be evicted from the arc when it becomes unreferenced unless the
|
|
|
|
* zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level
|
|
|
|
* values have been set (see comment in dbuf.c for more information).
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_overhead_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes consumed by internal ARC structures necessary
|
|
|
|
* for tracking purposes; these structures are not actually
|
|
|
|
* backed by ARC buffers. This includes arc_buf_hdr_t structures
|
|
|
|
* (allocated via arc_buf_hdr_t_full and arc_buf_hdr_t_l2only
|
|
|
|
* caches), and arc_buf_t structures (allocated via arc_buf_t
|
|
|
|
* cache).
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_hdr_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers of type equal to
|
|
|
|
* ARC_BUFC_DATA. This is generally consumed by buffers backing
|
|
|
|
* on disk user data (e.g. plain file contents).
|
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
kstat_named_t arcstat_data_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers of type equal to
|
|
|
|
* ARC_BUFC_METADATA. This is generally consumed by buffers
|
|
|
|
* backing on disk data that is used for internal ZFS
|
|
|
|
* structures (e.g. ZAP, dnode, indirect blocks, etc).
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_metadata_size;
|
|
|
|
/*
|
2016-07-13 15:42:40 +03:00
|
|
|
* Number of bytes consumed by dmu_buf_impl_t objects.
|
2015-06-27 00:54:17 +03:00
|
|
|
*/
|
2016-07-13 15:42:40 +03:00
|
|
|
kstat_named_t arcstat_dbuf_size;
|
|
|
|
/*
|
|
|
|
* Number of bytes consumed by dnode_t objects.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_dnode_size;
|
|
|
|
/*
|
|
|
|
* Number of bytes consumed by bonus buffers.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_bonus_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Total number of bytes consumed by ARC buffers residing in the
|
|
|
|
* arc_anon state. This includes *all* buffers in the arc_anon
|
|
|
|
* state; e.g. data, metadata, evictable, and unevictable buffers
|
|
|
|
* are all included in this value.
|
|
|
|
*/
|
2012-01-31 01:28:40 +04:00
|
|
|
kstat_named_t arcstat_anon_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers that meet the
|
|
|
|
* following criteria: backing buffers of type ARC_BUFC_DATA,
|
|
|
|
* residing in the arc_anon state, and are eligible for eviction
|
|
|
|
* (e.g. have no outstanding holds on the buffer).
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_anon_evictable_data;
|
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers that meet the
|
|
|
|
* following criteria: backing buffers of type ARC_BUFC_METADATA,
|
|
|
|
* residing in the arc_anon state, and are eligible for eviction
|
|
|
|
* (e.g. have no outstanding holds on the buffer).
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_anon_evictable_metadata;
|
|
|
|
/*
|
|
|
|
* Total number of bytes consumed by ARC buffers residing in the
|
|
|
|
* arc_mru state. This includes *all* buffers in the arc_mru
|
|
|
|
* state; e.g. data, metadata, evictable, and unevictable buffers
|
|
|
|
* are all included in this value.
|
|
|
|
*/
|
2012-01-31 01:28:40 +04:00
|
|
|
kstat_named_t arcstat_mru_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers that meet the
|
|
|
|
* following criteria: backing buffers of type ARC_BUFC_DATA,
|
|
|
|
* residing in the arc_mru state, and are eligible for eviction
|
|
|
|
* (e.g. have no outstanding holds on the buffer).
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mru_evictable_data;
|
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers that meet the
|
|
|
|
* following criteria: backing buffers of type ARC_BUFC_METADATA,
|
|
|
|
* residing in the arc_mru state, and are eligible for eviction
|
|
|
|
* (e.g. have no outstanding holds on the buffer).
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mru_evictable_metadata;
|
|
|
|
/*
|
|
|
|
* Total number of bytes that *would have been* consumed by ARC
|
|
|
|
* buffers in the arc_mru_ghost state. The key thing to note
|
|
|
|
* here, is the fact that this size doesn't actually indicate
|
|
|
|
* RAM consumption. The ghost lists only consist of headers and
|
|
|
|
* don't actually have ARC buffers linked off of these headers.
|
|
|
|
* Thus, *if* the headers had associated ARC buffers, these
|
|
|
|
* buffers *would have* consumed this number of bytes.
|
|
|
|
*/
|
2012-01-31 01:28:40 +04:00
|
|
|
kstat_named_t arcstat_mru_ghost_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes that *would have been* consumed by ARC
|
|
|
|
* buffers that are eligible for eviction, of type
|
|
|
|
* ARC_BUFC_DATA, and linked off the arc_mru_ghost state.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mru_ghost_evictable_data;
|
|
|
|
/*
|
|
|
|
* Number of bytes that *would have been* consumed by ARC
|
|
|
|
* buffers that are eligible for eviction, of type
|
|
|
|
* ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mru_ghost_evictable_metadata;
|
|
|
|
/*
|
|
|
|
* Total number of bytes consumed by ARC buffers residing in the
|
|
|
|
* arc_mfu state. This includes *all* buffers in the arc_mfu
|
|
|
|
* state; e.g. data, metadata, evictable, and unevictable buffers
|
|
|
|
* are all included in this value.
|
|
|
|
*/
|
2012-01-31 01:28:40 +04:00
|
|
|
kstat_named_t arcstat_mfu_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers that are eligible for
|
|
|
|
* eviction, of type ARC_BUFC_DATA, and reside in the arc_mfu
|
|
|
|
* state.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mfu_evictable_data;
|
|
|
|
/*
|
|
|
|
* Number of bytes consumed by ARC buffers that are eligible for
|
|
|
|
* eviction, of type ARC_BUFC_METADATA, and reside in the
|
|
|
|
* arc_mfu state.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mfu_evictable_metadata;
|
|
|
|
/*
|
|
|
|
* Total number of bytes that *would have been* consumed by ARC
|
|
|
|
* buffers in the arc_mfu_ghost state. See the comment above
|
|
|
|
* arcstat_mru_ghost_size for more details.
|
|
|
|
*/
|
2012-01-31 01:28:40 +04:00
|
|
|
kstat_named_t arcstat_mfu_ghost_size;
|
2015-06-27 00:54:17 +03:00
|
|
|
/*
|
|
|
|
* Number of bytes that *would have been* consumed by ARC
|
|
|
|
* buffers that are eligible for eviction, of type
|
|
|
|
* ARC_BUFC_DATA, and linked off the arc_mfu_ghost state.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mfu_ghost_evictable_data;
|
|
|
|
/*
|
|
|
|
* Number of bytes that *would have been* consumed by ARC
|
|
|
|
* buffers that are eligible for eviction, of type
|
|
|
|
* ARC_BUFC_METADATA, and linked off the arc_mru_ghost state.
|
|
|
|
*/
|
|
|
|
kstat_named_t arcstat_mfu_ghost_evictable_metadata;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_l2_hits;
|
|
|
|
kstat_named_t arcstat_l2_misses;
|
|
|
|
kstat_named_t arcstat_l2_feeds;
|
|
|
|
kstat_named_t arcstat_l2_rw_clash;
|
2009-02-18 23:51:31 +03:00
|
|
|
kstat_named_t arcstat_l2_read_bytes;
|
|
|
|
kstat_named_t arcstat_l2_write_bytes;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_l2_writes_sent;
|
|
|
|
kstat_named_t arcstat_l2_writes_done;
|
|
|
|
kstat_named_t arcstat_l2_writes_error;
|
2015-01-13 06:52:19 +03:00
|
|
|
kstat_named_t arcstat_l2_writes_lock_retry;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_l2_evict_lock_retry;
|
|
|
|
kstat_named_t arcstat_l2_evict_reading;
|
2014-12-30 06:12:23 +03:00
|
|
|
kstat_named_t arcstat_l2_evict_l1cached;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_l2_free_on_write;
|
|
|
|
kstat_named_t arcstat_l2_abort_lowmem;
|
|
|
|
kstat_named_t arcstat_l2_cksum_bad;
|
|
|
|
kstat_named_t arcstat_l2_io_error;
|
|
|
|
kstat_named_t arcstat_l2_size;
|
2013-08-02 00:02:10 +04:00
|
|
|
kstat_named_t arcstat_l2_asize;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_named_t arcstat_l2_hdr_size;
|
|
|
|
kstat_named_t arcstat_memory_throttle_count;
|
2011-03-30 05:08:59 +04:00
|
|
|
kstat_named_t arcstat_memory_direct_count;
|
|
|
|
kstat_named_t arcstat_memory_indirect_count;
|
2011-03-24 22:13:55 +03:00
|
|
|
kstat_named_t arcstat_no_grow;
|
|
|
|
kstat_named_t arcstat_tempreserve;
|
|
|
|
kstat_named_t arcstat_loaned_bytes;
|
2011-12-23 00:20:43 +04:00
|
|
|
kstat_named_t arcstat_prune;
|
2011-03-24 22:13:55 +03:00
|
|
|
kstat_named_t arcstat_meta_used;
|
|
|
|
kstat_named_t arcstat_meta_limit;
|
2016-07-13 15:42:40 +03:00
|
|
|
kstat_named_t arcstat_dnode_limit;
|
2011-03-24 22:13:55 +03:00
|
|
|
kstat_named_t arcstat_meta_max;
|
2015-01-13 06:52:19 +03:00
|
|
|
kstat_named_t arcstat_meta_min;
|
2015-12-27 00:10:31 +03:00
|
|
|
kstat_named_t arcstat_sync_wait_for_async;
|
|
|
|
kstat_named_t arcstat_demand_hit_predictive_prefetch;
|
2015-07-27 23:17:32 +03:00
|
|
|
kstat_named_t arcstat_need_free;
|
|
|
|
kstat_named_t arcstat_sys_free;
|
2008-11-20 23:01:55 +03:00
|
|
|
} arc_stats_t;
|
|
|
|
|
|
|
|
static arc_stats_t arc_stats = {
|
|
|
|
{ "hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "demand_data_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "demand_data_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "deleted", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mutex_miss", KSTAT_DATA_UINT64 },
|
|
|
|
{ "evict_skip", KSTAT_DATA_UINT64 },
|
2015-01-13 06:52:19 +03:00
|
|
|
{ "evict_not_enough", KSTAT_DATA_UINT64 },
|
2010-05-29 00:45:14 +04:00
|
|
|
{ "evict_l2_cached", KSTAT_DATA_UINT64 },
|
|
|
|
{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
|
|
|
|
{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
|
2015-01-13 06:52:19 +03:00
|
|
|
{ "evict_l2_skip", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "hash_elements", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_elements_max", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_collisions", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_chains", KSTAT_DATA_UINT64 },
|
|
|
|
{ "hash_chain_max", KSTAT_DATA_UINT64 },
|
|
|
|
{ "p", KSTAT_DATA_UINT64 },
|
|
|
|
{ "c", KSTAT_DATA_UINT64 },
|
|
|
|
{ "c_min", KSTAT_DATA_UINT64 },
|
|
|
|
{ "c_max", KSTAT_DATA_UINT64 },
|
|
|
|
{ "size", KSTAT_DATA_UINT64 },
|
2016-06-02 07:04:53 +03:00
|
|
|
{ "compressed_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "uncompressed_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "overhead_size", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "hdr_size", KSTAT_DATA_UINT64 },
|
2009-02-18 23:51:31 +03:00
|
|
|
{ "data_size", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "metadata_size", KSTAT_DATA_UINT64 },
|
2016-07-13 15:42:40 +03:00
|
|
|
{ "dbuf_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "dnode_size", KSTAT_DATA_UINT64 },
|
|
|
|
{ "bonus_size", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "anon_size", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "anon_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "anon_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mru_size", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mru_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mru_ghost_size", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mru_ghost_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mfu_size", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mfu_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_evictable_metadata", KSTAT_DATA_UINT64 },
|
2012-01-31 01:28:40 +04:00
|
|
|
{ "mfu_ghost_size", KSTAT_DATA_UINT64 },
|
2015-06-27 00:54:17 +03:00
|
|
|
{ "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 },
|
|
|
|
{ "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_hits", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_misses", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_feeds", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_rw_clash", KSTAT_DATA_UINT64 },
|
2009-02-18 23:51:31 +03:00
|
|
|
{ "l2_read_bytes", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_write_bytes", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_writes_sent", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_writes_done", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_writes_error", KSTAT_DATA_UINT64 },
|
2015-01-13 06:52:19 +03:00
|
|
|
{ "l2_writes_lock_retry", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_evict_reading", KSTAT_DATA_UINT64 },
|
2014-12-30 06:12:23 +03:00
|
|
|
{ "l2_evict_l1cached", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_free_on_write", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_io_error", KSTAT_DATA_UINT64 },
|
|
|
|
{ "l2_size", KSTAT_DATA_UINT64 },
|
2013-08-02 00:02:10 +04:00
|
|
|
{ "l2_asize", KSTAT_DATA_UINT64 },
|
2008-11-20 23:01:55 +03:00
|
|
|
{ "l2_hdr_size", KSTAT_DATA_UINT64 },
|
2011-03-24 22:13:55 +03:00
|
|
|
{ "memory_throttle_count", KSTAT_DATA_UINT64 },
|
2011-03-30 05:08:59 +04:00
|
|
|
{ "memory_direct_count", KSTAT_DATA_UINT64 },
|
|
|
|
{ "memory_indirect_count", KSTAT_DATA_UINT64 },
|
2011-03-24 22:13:55 +03:00
|
|
|
{ "arc_no_grow", KSTAT_DATA_UINT64 },
|
|
|
|
{ "arc_tempreserve", KSTAT_DATA_UINT64 },
|
|
|
|
{ "arc_loaned_bytes", KSTAT_DATA_UINT64 },
|
2011-12-23 00:20:43 +04:00
|
|
|
{ "arc_prune", KSTAT_DATA_UINT64 },
|
2011-03-24 22:13:55 +03:00
|
|
|
{ "arc_meta_used", KSTAT_DATA_UINT64 },
|
|
|
|
{ "arc_meta_limit", KSTAT_DATA_UINT64 },
|
2016-07-13 15:42:40 +03:00
|
|
|
{ "arc_dnode_limit", KSTAT_DATA_UINT64 },
|
2011-03-24 22:13:55 +03:00
|
|
|
{ "arc_meta_max", KSTAT_DATA_UINT64 },
|
2015-07-27 23:17:32 +03:00
|
|
|
{ "arc_meta_min", KSTAT_DATA_UINT64 },
|
2015-12-27 00:10:31 +03:00
|
|
|
{ "sync_wait_for_async", KSTAT_DATA_UINT64 },
|
|
|
|
{ "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },
|
2015-07-27 23:17:32 +03:00
|
|
|
{ "arc_need_free", KSTAT_DATA_UINT64 },
|
|
|
|
{ "arc_sys_free", KSTAT_DATA_UINT64 }
|
2008-11-20 23:01:55 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
#define ARCSTAT(stat) (arc_stats.stat.value.ui64)
|
|
|
|
|
|
|
|
#define ARCSTAT_INCR(stat, val) \
|
2013-06-11 21:12:34 +04:00
|
|
|
atomic_add_64(&arc_stats.stat.value.ui64, (val))
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
#define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1)
|
2008-11-20 23:01:55 +03:00
|
|
|
#define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1)
|
|
|
|
|
|
|
|
#define ARCSTAT_MAX(stat, val) { \
|
|
|
|
uint64_t m; \
|
|
|
|
while ((val) > (m = arc_stats.stat.value.ui64) && \
|
|
|
|
(m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
|
|
|
|
continue; \
|
|
|
|
}
|
|
|
|
|
|
|
|
#define ARCSTAT_MAXSTAT(stat) \
|
|
|
|
ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We define a macro to allow ARC hits/misses to be easily broken down by
|
|
|
|
* two separate conditions, giving a total of four different subtypes for
|
|
|
|
* each of hits and misses (so eight statistics total).
|
|
|
|
*/
|
|
|
|
#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
|
|
|
|
if (cond1) { \
|
|
|
|
if (cond2) { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
|
|
|
|
} else { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
|
|
|
|
} \
|
|
|
|
} else { \
|
|
|
|
if (cond2) { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
|
|
|
|
} else { \
|
|
|
|
ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
|
|
|
|
} \
|
|
|
|
}
|
|
|
|
|
|
|
|
kstat_t *arc_ksp;
|
2010-05-29 00:45:14 +04:00
|
|
|
static arc_state_t *arc_anon;
|
2008-11-20 23:01:55 +03:00
|
|
|
static arc_state_t *arc_mru;
|
|
|
|
static arc_state_t *arc_mru_ghost;
|
|
|
|
static arc_state_t *arc_mfu;
|
|
|
|
static arc_state_t *arc_mfu_ghost;
|
|
|
|
static arc_state_t *arc_l2c_only;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There are several ARC variables that are critical to export as kstats --
|
|
|
|
* but we don't want to have to grovel around in the kstat whenever we wish to
|
|
|
|
* manipulate them. For these variables, we therefore define them to be in
|
|
|
|
* terms of the statistic variable. This assures that we are not introducing
|
|
|
|
* the possibility of inconsistency by having shadow copies of the variables,
|
|
|
|
* while still allowing the code to be readable.
|
|
|
|
*/
|
|
|
|
#define arc_size ARCSTAT(arcstat_size) /* actual total arc size */
|
|
|
|
#define arc_p ARCSTAT(arcstat_p) /* target size of MRU */
|
|
|
|
#define arc_c ARCSTAT(arcstat_c) /* target size of cache */
|
|
|
|
#define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */
|
|
|
|
#define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */
|
2016-06-02 07:04:53 +03:00
|
|
|
#define arc_no_grow ARCSTAT(arcstat_no_grow) /* do not grow cache size */
|
2011-03-24 22:13:55 +03:00
|
|
|
#define arc_tempreserve ARCSTAT(arcstat_tempreserve)
|
|
|
|
#define arc_loaned_bytes ARCSTAT(arcstat_loaned_bytes)
|
2013-02-18 00:00:54 +04:00
|
|
|
#define arc_meta_limit ARCSTAT(arcstat_meta_limit) /* max size for metadata */
|
2016-07-13 15:42:40 +03:00
|
|
|
#define arc_dnode_limit ARCSTAT(arcstat_dnode_limit) /* max size for dnodes */
|
2015-01-13 06:52:19 +03:00
|
|
|
#define arc_meta_min ARCSTAT(arcstat_meta_min) /* min size for metadata */
|
2013-02-18 00:00:54 +04:00
|
|
|
#define arc_meta_used ARCSTAT(arcstat_meta_used) /* size of metadata */
|
|
|
|
#define arc_meta_max ARCSTAT(arcstat_meta_max) /* max size of metadata */
|
2016-07-13 15:42:40 +03:00
|
|
|
#define arc_dbuf_size ARCSTAT(arcstat_dbuf_size) /* dbuf metadata */
|
|
|
|
#define arc_dnode_size ARCSTAT(arcstat_dnode_size) /* dnode metadata */
|
|
|
|
#define arc_bonus_size ARCSTAT(arcstat_bonus_size) /* bonus buffer metadata */
|
2015-07-27 23:17:32 +03:00
|
|
|
#define arc_need_free ARCSTAT(arcstat_need_free) /* bytes to be freed */
|
|
|
|
#define arc_sys_free ARCSTAT(arcstat_sys_free) /* target system free bytes */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* compressed size of entire arc */
|
|
|
|
#define arc_compressed_size ARCSTAT(arcstat_compressed_size)
|
|
|
|
/* uncompressed size of entire arc */
|
|
|
|
#define arc_uncompressed_size ARCSTAT(arcstat_uncompressed_size)
|
|
|
|
/* number of bytes in the arc from arc_buf_t's */
|
|
|
|
#define arc_overhead_size ARCSTAT(arcstat_overhead_size)
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
static list_t arc_prune_list;
|
|
|
|
static kmutex_t arc_prune_mtx;
|
2015-05-30 17:57:53 +03:00
|
|
|
static taskq_t *arc_prune_taskq;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#define GHOST_STATE(state) \
|
|
|
|
((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
|
|
|
|
(state) == arc_l2c_only)
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
#define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)
|
|
|
|
#define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)
|
|
|
|
#define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_FLAG_IO_ERROR)
|
|
|
|
#define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_FLAG_PREFETCH)
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_COMPRESSION_ENABLED(hdr) \
|
|
|
|
((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
#define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_FLAG_L2CACHE)
|
|
|
|
#define HDR_L2_READING(hdr) \
|
2016-06-02 07:04:53 +03:00
|
|
|
(((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \
|
|
|
|
((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))
|
2014-12-06 20:24:32 +03:00
|
|
|
#define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITING)
|
|
|
|
#define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)
|
|
|
|
#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_SHARED_DATA(hdr) ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
#define HDR_ISTYPE_METADATA(hdr) \
|
2016-06-02 07:04:53 +03:00
|
|
|
((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)
|
2014-12-30 06:12:23 +03:00
|
|
|
#define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr))
|
|
|
|
|
|
|
|
#define HDR_HAS_L1HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)
|
|
|
|
#define HDR_HAS_L2HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* For storing compression mode in b_flags */
|
|
|
|
#define HDR_COMPRESS_OFFSET (highbit64(ARC_FLAG_COMPRESS_0) - 1)
|
|
|
|
|
|
|
|
#define HDR_GET_COMPRESS(hdr) ((enum zio_compress)BF32_GET((hdr)->b_flags, \
|
|
|
|
HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))
|
|
|
|
#define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \
|
|
|
|
HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));
|
|
|
|
|
|
|
|
#define ARC_BUF_LAST(buf) ((buf)->b_next == NULL)
|
2016-07-14 00:17:41 +03:00
|
|
|
#define ARC_BUF_SHARED(buf) ((buf)->b_flags & ARC_BUF_FLAG_SHARED)
|
|
|
|
#define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Other sizes
|
|
|
|
*/
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
#define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
|
|
|
|
#define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Hash table routines
|
|
|
|
*/
|
|
|
|
|
2010-08-26 22:46:09 +04:00
|
|
|
#define HT_LOCK_ALIGN 64
|
|
|
|
#define HT_LOCK_PAD (P2NPHASE(sizeof (kmutex_t), (HT_LOCK_ALIGN)))
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
struct ht_lock {
|
|
|
|
kmutex_t ht_lock;
|
|
|
|
#ifdef _KERNEL
|
2010-08-26 22:46:09 +04:00
|
|
|
unsigned char pad[HT_LOCK_PAD];
|
2008-11-20 23:01:55 +03:00
|
|
|
#endif
|
|
|
|
};
|
|
|
|
|
2014-10-24 03:00:41 +04:00
|
|
|
#define BUF_LOCKS 8192
|
2008-11-20 23:01:55 +03:00
|
|
|
typedef struct buf_hash_table {
|
|
|
|
uint64_t ht_mask;
|
|
|
|
arc_buf_hdr_t **ht_table;
|
|
|
|
struct ht_lock ht_locks[BUF_LOCKS];
|
|
|
|
} buf_hash_table_t;
|
|
|
|
|
|
|
|
static buf_hash_table_t buf_hash_table;
|
|
|
|
|
|
|
|
#define BUF_HASH_INDEX(spa, dva, birth) \
|
|
|
|
(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
|
|
|
|
#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
|
|
|
|
#define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
|
2010-05-29 00:45:14 +04:00
|
|
|
#define HDR_LOCK(hdr) \
|
|
|
|
(BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
uint64_t zfs_crc64_table[256];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Level 2 ARC
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
|
2013-08-02 00:02:10 +04:00
|
|
|
#define L2ARC_HEADROOM 2 /* num of writes */
|
2016-02-10 21:42:01 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/*
|
|
|
|
* If we discover during ARC scan any buffers to be compressed, we boost
|
|
|
|
* our headroom for the next scanning cycle by this percentage multiple.
|
|
|
|
*/
|
|
|
|
#define L2ARC_HEADROOM_BOOST 200
|
2009-02-18 23:51:31 +03:00
|
|
|
#define L2ARC_FEED_SECS 1 /* caching interval secs */
|
|
|
|
#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
#define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent)
|
|
|
|
#define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done)
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* L2ARC Performance Tunables */
|
2011-07-08 23:41:57 +04:00
|
|
|
unsigned long l2arc_write_max = L2ARC_WRITE_SIZE; /* def max write size */
|
|
|
|
unsigned long l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra warmup write */
|
|
|
|
unsigned long l2arc_headroom = L2ARC_HEADROOM; /* # of dev writes */
|
2013-08-02 00:02:10 +04:00
|
|
|
unsigned long l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
|
2011-07-08 23:41:57 +04:00
|
|
|
unsigned long l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
|
|
|
|
unsigned long l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval msecs */
|
|
|
|
int l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
|
|
|
|
int l2arc_feed_again = B_TRUE; /* turbo warmup */
|
2013-07-24 20:57:56 +04:00
|
|
|
int l2arc_norw = B_FALSE; /* no reads during writes */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* L2ARC Internals
|
|
|
|
*/
|
|
|
|
static list_t L2ARC_dev_list; /* device list */
|
|
|
|
static list_t *l2arc_dev_list; /* device list pointer */
|
|
|
|
static kmutex_t l2arc_dev_mtx; /* device list mutex */
|
|
|
|
static l2arc_dev_t *l2arc_dev_last; /* last device used */
|
|
|
|
static list_t L2ARC_free_on_write; /* free after write buf list */
|
|
|
|
static list_t *l2arc_free_on_write; /* free after write list ptr */
|
|
|
|
static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
|
|
|
|
static uint64_t l2arc_ndev; /* number of devices */
|
|
|
|
|
|
|
|
typedef struct l2arc_read_callback {
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_hdr_t *l2rcb_hdr; /* read header */
|
2013-08-02 00:02:10 +04:00
|
|
|
blkptr_t l2rcb_bp; /* original blkptr */
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t l2rcb_zb; /* original bookmark */
|
2013-08-02 00:02:10 +04:00
|
|
|
int l2rcb_flags; /* original flags */
|
2008-11-20 23:01:55 +03:00
|
|
|
} l2arc_read_callback_t;
|
|
|
|
|
|
|
|
typedef struct l2arc_data_free {
|
|
|
|
/* protected by l2arc_free_on_write_mtx */
|
|
|
|
void *l2df_data;
|
|
|
|
size_t l2df_size;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_contents_t l2df_type;
|
2008-11-20 23:01:55 +03:00
|
|
|
list_node_t l2df_list_node;
|
|
|
|
} l2arc_data_free_t;
|
|
|
|
|
|
|
|
static kmutex_t l2arc_feed_thr_lock;
|
|
|
|
static kcondvar_t l2arc_feed_thr_cv;
|
|
|
|
static uint8_t l2arc_thread_exit;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *);
|
|
|
|
static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *);
|
|
|
|
static void arc_hdr_free_pdata(arc_buf_hdr_t *hdr);
|
|
|
|
static void arc_hdr_alloc_pdata(arc_buf_hdr_t *);
|
2014-12-06 20:24:32 +03:00
|
|
|
static void arc_access(arc_buf_hdr_t *, kmutex_t *);
|
2015-01-13 06:52:19 +03:00
|
|
|
static boolean_t arc_is_overflowing(void);
|
2014-12-06 20:24:32 +03:00
|
|
|
static void arc_buf_watch(arc_buf_t *);
|
2015-06-26 21:28:18 +03:00
|
|
|
static void arc_tuning_update(void);
|
2016-07-13 15:42:40 +03:00
|
|
|
static void arc_prune_async(int64_t);
|
2014-12-06 20:24:32 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);
|
|
|
|
static uint32_t arc_bufc_to_flags(arc_buf_contents_t);
|
2016-06-02 07:04:53 +03:00
|
|
|
static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
|
|
|
|
static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);
|
|
|
|
static void l2arc_read_done(zio_t *);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static uint64_t
|
2009-02-18 23:51:31 +03:00
|
|
|
buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
uint8_t *vdva = (uint8_t *)dva;
|
|
|
|
uint64_t crc = -1ULL;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
|
|
|
|
|
|
|
|
for (i = 0; i < sizeof (dva_t); i++)
|
|
|
|
crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
crc ^= (spa>>8) ^ birth;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (crc);
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_EMPTY(hdr) \
|
|
|
|
((hdr)->b_dva.dva_word[0] == 0 && \
|
|
|
|
(hdr)->b_dva.dva_word[1] == 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
#define HDR_EQUAL(spa, dva, birth, hdr) \
|
|
|
|
((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
|
|
|
|
((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
|
|
|
|
((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
buf_discard_identity(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
hdr->b_dva.dva_word[0] = 0;
|
|
|
|
hdr->b_dva.dva_word[1] = 0;
|
|
|
|
hdr->b_birth = 0;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static arc_buf_hdr_t *
|
2014-06-06 01:19:08 +04:00
|
|
|
buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
const dva_t *dva = BP_IDENTITY(bp);
|
|
|
|
uint64_t birth = BP_PHYSICAL_BIRTH(bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
|
|
|
|
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(hash_lock);
|
2014-12-06 20:24:32 +03:00
|
|
|
for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
|
|
|
|
hdr = hdr->b_hash_next) {
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_EQUAL(spa, dva, birth, hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
*lockp = hash_lock;
|
2014-12-06 20:24:32 +03:00
|
|
|
return (hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
*lockp = NULL;
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Insert an entry into the hash table. If there is already an element
|
|
|
|
* equal to elem in the hash table, then the already existing element
|
|
|
|
* will be returned and the new element will not be inserted.
|
|
|
|
* Otherwise returns NULL.
|
2014-12-30 06:12:23 +03:00
|
|
|
* If lockp == NULL, the caller is assumed to already hold the hash lock.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static arc_buf_hdr_t *
|
2014-12-06 20:24:32 +03:00
|
|
|
buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-06 20:24:32 +03:00
|
|
|
uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *fhdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint32_t i;
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));
|
|
|
|
ASSERT(hdr->b_birth != 0);
|
|
|
|
ASSERT(!HDR_IN_HASH_TABLE(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
if (lockp != NULL) {
|
|
|
|
*lockp = hash_lock;
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
} else {
|
|
|
|
ASSERT(MUTEX_HELD(hash_lock));
|
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;
|
|
|
|
fhdr = fhdr->b_hash_next, i++) {
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))
|
2014-12-06 20:24:32 +03:00
|
|
|
return (fhdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hdr->b_hash_next = buf_hash_table.ht_table[idx];
|
|
|
|
buf_hash_table.ht_table[idx] = hdr;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* collect some hash table performance data */
|
|
|
|
if (i > 0) {
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_collisions);
|
|
|
|
if (i == 1)
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_chains);
|
|
|
|
|
|
|
|
ARCSTAT_MAX(arcstat_hash_chain_max, i);
|
|
|
|
}
|
|
|
|
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_elements);
|
|
|
|
ARCSTAT_MAXSTAT(arcstat_hash_elements);
|
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2014-12-06 20:24:32 +03:00
|
|
|
buf_hash_remove(arc_buf_hdr_t *hdr)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *fhdr, **hdrp;
|
|
|
|
uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(HDR_IN_HASH_TABLE(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hdrp = &buf_hash_table.ht_table[idx];
|
|
|
|
while ((fhdr = *hdrp) != hdr) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(fhdr, !=, NULL);
|
2014-12-06 20:24:32 +03:00
|
|
|
hdrp = &fhdr->b_hash_next;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-12-06 20:24:32 +03:00
|
|
|
*hdrp = hdr->b_hash_next;
|
|
|
|
hdr->b_hash_next = NULL;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/* collect some hash table performance data */
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_hash_elements);
|
|
|
|
|
|
|
|
if (buf_hash_table.ht_table[idx] &&
|
|
|
|
buf_hash_table.ht_table[idx]->b_hash_next == NULL)
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_hash_chains);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Global data structures and functions for the buf kmem cache.
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
static kmem_cache_t *hdr_full_cache;
|
|
|
|
static kmem_cache_t *hdr_l2only_cache;
|
2008-11-20 23:01:55 +03:00
|
|
|
static kmem_cache_t *buf_cache;
|
|
|
|
|
|
|
|
static void
|
|
|
|
buf_fini(void)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2010-08-26 22:46:09 +04:00
|
|
|
#if defined(_KERNEL) && defined(HAVE_SPL)
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* Large allocations which do not require contiguous pages
|
|
|
|
* should be using vmem_free() in the linux kernel\
|
|
|
|
*/
|
2010-08-26 22:46:09 +04:00
|
|
|
vmem_free(buf_hash_table.ht_table,
|
|
|
|
(buf_hash_table.ht_mask + 1) * sizeof (void *));
|
|
|
|
#else
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(buf_hash_table.ht_table,
|
|
|
|
(buf_hash_table.ht_mask + 1) * sizeof (void *));
|
2010-08-26 22:46:09 +04:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < BUF_LOCKS; i++)
|
|
|
|
mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
|
2014-12-30 06:12:23 +03:00
|
|
|
kmem_cache_destroy(hdr_full_cache);
|
|
|
|
kmem_cache_destroy(hdr_l2only_cache);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_cache_destroy(buf_cache);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Constructor callback - called when the cache is empty
|
|
|
|
* and a new buf is requested.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_full_cons(void *vbuf, void *unused, int kmflag)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = vbuf;
|
|
|
|
|
|
|
|
bzero(hdr, HDR_FULL_SIZE);
|
|
|
|
cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
refcount_create(&hdr->b_l1hdr.b_refcnt);
|
|
|
|
mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
list_link_init(&hdr->b_l1hdr.b_arc_node);
|
|
|
|
list_link_init(&hdr->b_l2hdr.b_l2node);
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_link_init(&hdr->b_l1hdr.b_arc_node);
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
hdr_l2only_cons(void *vbuf, void *unused, int kmflag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr = vbuf;
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
bzero(hdr, HDR_L2ONLY_SIZE);
|
|
|
|
arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
buf_cons(void *vbuf, void *unused, int kmflag)
|
|
|
|
{
|
|
|
|
arc_buf_t *buf = vbuf;
|
|
|
|
|
|
|
|
bzero(buf, sizeof (arc_buf_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL);
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Destructor callback - called when a cached buf is
|
|
|
|
* no longer required.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_full_dest(void *vbuf, void *unused)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr = vbuf;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
cv_destroy(&hdr->b_l1hdr.b_cv);
|
|
|
|
refcount_destroy(&hdr->b_l1hdr.b_refcnt);
|
|
|
|
mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
hdr_l2only_dest(void *vbuf, void *unused)
|
|
|
|
{
|
|
|
|
ASSERTV(arc_buf_hdr_t *hdr = vbuf);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
buf_dest(void *vbuf, void *unused)
|
|
|
|
{
|
|
|
|
arc_buf_t *buf = vbuf;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_destroy(&buf->b_evict_lock);
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2015-06-29 20:34:47 +03:00
|
|
|
/*
|
|
|
|
* Reclaim callback -- invoked when memory is low.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
hdr_recl(void *unused)
|
|
|
|
{
|
|
|
|
dprintf("hdr_recl called\n");
|
|
|
|
/*
|
|
|
|
* umem calls the reclaim func when we destroy the buf cache,
|
|
|
|
* which is after we do arc_fini().
|
|
|
|
*/
|
|
|
|
if (!arc_dead)
|
|
|
|
cv_signal(&arc_reclaim_thread_cv);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
buf_init(void)
|
|
|
|
{
|
2016-10-01 01:04:21 +03:00
|
|
|
uint64_t *ct = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t hsize = 1ULL << 12;
|
|
|
|
int i, j;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The hash table is big enough to fill all of physical memory
|
2014-08-20 21:09:40 +04:00
|
|
|
* with an average block size of zfs_arc_average_blocksize (default 8K).
|
|
|
|
* By default, the table will take up
|
|
|
|
* totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-08-20 21:09:40 +04:00
|
|
|
while (hsize * zfs_arc_average_blocksize < physmem * PAGESIZE)
|
2008-11-20 23:01:55 +03:00
|
|
|
hsize <<= 1;
|
|
|
|
retry:
|
|
|
|
buf_hash_table.ht_mask = hsize - 1;
|
2010-08-26 22:46:09 +04:00
|
|
|
#if defined(_KERNEL) && defined(HAVE_SPL)
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* Large allocations which do not require contiguous pages
|
|
|
|
* should be using vmem_alloc() in the linux kernel
|
|
|
|
*/
|
2010-08-26 22:46:09 +04:00
|
|
|
buf_hash_table.ht_table =
|
|
|
|
vmem_zalloc(hsize * sizeof (void*), KM_SLEEP);
|
|
|
|
#else
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_hash_table.ht_table =
|
|
|
|
kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
|
2010-08-26 22:46:09 +04:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
if (buf_hash_table.ht_table == NULL) {
|
|
|
|
ASSERT(hsize > (1ULL << 8));
|
|
|
|
hsize >>= 1;
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,
|
2015-06-29 20:34:47 +03:00
|
|
|
0, hdr_full_cons, hdr_full_dest, hdr_recl, NULL, NULL, 0);
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",
|
2015-06-29 20:34:47 +03:00
|
|
|
HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, hdr_recl,
|
2014-12-30 06:12:23 +03:00
|
|
|
NULL, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
|
2008-12-03 23:09:06 +03:00
|
|
|
0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < 256; i++)
|
|
|
|
for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
|
|
|
|
*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
|
|
|
|
|
|
|
|
for (i = 0; i < BUF_LOCKS; i++) {
|
|
|
|
mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
|
2015-03-31 06:43:29 +03:00
|
|
|
NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
#define ARC_MINTIME (hz>>4) /* 62 ms */
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* This is the size that the buf occupies in memory. If the buf is compressed,
|
|
|
|
* it will correspond to the compressed size. You should use this method of
|
|
|
|
* getting the buf size unless you explicitly need the logical size.
|
|
|
|
*/
|
|
|
|
uint64_t
|
|
|
|
arc_buf_size(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (ARC_BUF_COMPRESSED(buf) ?
|
|
|
|
HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
arc_buf_lsize(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (HDR_GET_LSIZE(buf->b_hdr));
|
|
|
|
}
|
|
|
|
|
|
|
|
enum zio_compress
|
|
|
|
arc_get_compression(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (ARC_BUF_COMPRESSED(buf) ?
|
|
|
|
HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static inline boolean_t
|
|
|
|
arc_buf_is_shared(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
boolean_t shared = (buf->b_data != NULL &&
|
|
|
|
buf->b_data == buf->b_hdr->b_l1hdr.b_pdata);
|
|
|
|
IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
IMPLY(shared, ARC_BUF_SHARED(buf));
|
|
|
|
IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* It would be nice to assert arc_can_share() too, but the "hdr isn't
|
|
|
|
* already being shared" requirement prevents us from doing that.
|
|
|
|
*/
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (shared);
|
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static inline void
|
|
|
|
arc_cksum_free(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
|
|
|
|
if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
|
|
|
|
kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));
|
|
|
|
hdr->b_l1hdr.b_freeze_cksum = NULL;
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data
|
|
|
|
* matches the checksum that is stored in the hdr. If there is no checksum,
|
|
|
|
* or if the buf is compressed, this is a no-op.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_cksum_verify(arc_buf_t *buf)
|
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_cksum_t zc;
|
|
|
|
|
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
|
|
|
return;
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
if (ARC_BUF_COMPRESSED(buf)) {
|
|
|
|
ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
|
|
|
|
hdr->b_l1hdr.b_bufcnt > 1);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
|
|
|
mutex_enter(&hdr->b_l1hdr.b_freeze_lock);
|
|
|
|
if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {
|
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))
|
2008-11-20 23:01:55 +03:00
|
|
|
panic("buffer modified while frozen!");
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static boolean_t
|
|
|
|
arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
enum zio_compress compress = BP_GET_COMPRESS(zio->io_bp);
|
|
|
|
boolean_t valid_cksum;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(zio->io_bp));
|
|
|
|
VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* We rely on the blkptr's checksum to determine if the block
|
|
|
|
* is valid or not. When compressed arc is enabled, the l2arc
|
|
|
|
* writes the block to the l2arc just as it appears in the pool.
|
|
|
|
* This allows us to use the blkptr's checksum to validate the
|
|
|
|
* data that we just read off of the l2arc without having to store
|
|
|
|
* a separate checksum in the arc_buf_hdr_t. However, if compressed
|
|
|
|
* arc is disabled, then the data written to the l2arc is always
|
|
|
|
* uncompressed and won't match the block as it exists in the main
|
|
|
|
* pool. When this is the case, we must first compress it if it is
|
|
|
|
* compressed on the main pool before we can validate the checksum.
|
|
|
|
*/
|
|
|
|
if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
|
|
|
|
uint64_t lsize;
|
|
|
|
uint64_t csize;
|
|
|
|
void *cbuf;
|
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
|
|
|
|
|
|
|
|
cbuf = zio_buf_alloc(HDR_GET_PSIZE(hdr));
|
|
|
|
lsize = HDR_GET_LSIZE(hdr);
|
|
|
|
csize = zio_compress_data(compress, zio->io_data, cbuf, lsize);
|
|
|
|
ASSERT3U(csize, <=, HDR_GET_PSIZE(hdr));
|
|
|
|
if (csize < HDR_GET_PSIZE(hdr)) {
|
|
|
|
/*
|
|
|
|
* Compressed blocks are always a multiple of the
|
|
|
|
* smallest ashift in the pool. Ideally, we would
|
|
|
|
* like to round up the csize to the next
|
|
|
|
* spa_min_ashift but that value may have changed
|
|
|
|
* since the block was last written. Instead,
|
|
|
|
* we rely on the fact that the hdr's psize
|
|
|
|
* was set to the psize of the block when it was
|
|
|
|
* last written. We set the csize to that value
|
|
|
|
* and zero out any part that should not contain
|
|
|
|
* data.
|
|
|
|
*/
|
|
|
|
bzero((char *)cbuf + csize, HDR_GET_PSIZE(hdr) - csize);
|
|
|
|
csize = HDR_GET_PSIZE(hdr);
|
|
|
|
}
|
|
|
|
zio_push_transform(zio, cbuf, csize, HDR_GET_PSIZE(hdr), NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Block pointers always store the checksum for the logical data.
|
|
|
|
* If the block pointer has the gang bit set, then the checksum
|
|
|
|
* it represents is for the reconstituted data and not for an
|
|
|
|
* individual gang member. The zio pipeline, however, must be able to
|
|
|
|
* determine the checksum of each of the gang constituents so it
|
|
|
|
* treats the checksum comparison differently than what we need
|
|
|
|
* for l2arc blocks. This prevents us from using the
|
|
|
|
* zio_checksum_error() interface directly. Instead we must call the
|
|
|
|
* zio_checksum_error_impl() so that we can ensure the checksum is
|
|
|
|
* generated using the correct checksum algorithm and accounts for the
|
|
|
|
* logical I/O size and not just a gang fragment.
|
|
|
|
*/
|
|
|
|
valid_cksum = (zio_checksum_error_impl(zio->io_spa, zio->io_bp,
|
|
|
|
BP_GET_CHECKSUM(zio->io_bp), zio->io_data, zio->io_size,
|
|
|
|
zio->io_offset, NULL) == 0);
|
|
|
|
zio_pop_transforms(zio);
|
|
|
|
return (valid_cksum);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a
|
|
|
|
* checksum and attaches it to the buf's hdr so that we can ensure that the buf
|
|
|
|
* isn't modified later on. If buf is compressed or there is already a checksum
|
|
|
|
* on the hdr, this is a no-op (we only checksum uncompressed bufs).
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_cksum_compute(arc_buf_t *buf)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
if (hdr->b_l1hdr.b_freeze_cksum != NULL) {
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT(!ARC_BUF_COMPRESSED(buf) || hdr->b_l1hdr.b_bufcnt > 1);
|
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
|
|
|
return;
|
|
|
|
} else if (ARC_BUF_COMPRESSED(buf)) {
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Since the checksum doesn't apply to compressed buffers, we
|
|
|
|
* only keep a checksum if there are uncompressed buffers.
|
|
|
|
* Therefore there must be another buffer, which is
|
|
|
|
* uncompressed.
|
|
|
|
*/
|
|
|
|
IMPLY(hdr->b_l1hdr.b_freeze_cksum != NULL,
|
|
|
|
hdr->b_l1hdr.b_bufcnt > 1);
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
ASSERT(!ARC_BUF_COMPRESSED(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
|
|
|
|
KM_SLEEP);
|
2016-06-16 01:47:05 +03:00
|
|
|
fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_freeze_cksum);
|
|
|
|
mutex_exit(&hdr->b_l1hdr.b_freeze_lock);
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_watch(buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef _KERNEL
|
|
|
|
void
|
|
|
|
arc_buf_sigsegv(int sig, siginfo_t *si, void *unused)
|
|
|
|
{
|
|
|
|
panic("Got SIGSEGV at address: 0x%lx\n", (long) si->si_addr);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
arc_buf_unwatch(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
#ifndef _KERNEL
|
|
|
|
if (arc_watch) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT0(mprotect(buf->b_data, HDR_GET_LSIZE(buf->b_hdr),
|
2013-05-17 01:18:06 +04:00
|
|
|
PROT_READ | PROT_WRITE));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
arc_buf_watch(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
#ifndef _KERNEL
|
|
|
|
if (arc_watch)
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),
|
2016-06-02 07:04:53 +03:00
|
|
|
PROT_READ));
|
2013-05-17 01:18:06 +04:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
static arc_buf_contents_t
|
|
|
|
arc_buf_type(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_contents_t type;
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_ISTYPE_METADATA(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
type = ARC_BUFC_METADATA;
|
2014-12-30 06:12:23 +03:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
type = ARC_BUFC_DATA;
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3U(hdr->b_type, ==, type);
|
|
|
|
return (type);
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
boolean_t
|
|
|
|
arc_is_metadata(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
static uint32_t
|
|
|
|
arc_bufc_to_flags(arc_buf_contents_t type)
|
|
|
|
{
|
|
|
|
switch (type) {
|
|
|
|
case ARC_BUFC_DATA:
|
|
|
|
/* metadata field is 0 if buffer contains normal data */
|
|
|
|
return (0);
|
|
|
|
case ARC_BUFC_METADATA:
|
|
|
|
return (ARC_FLAG_BUFC_METADATA);
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
panic("undefined ARC buffer type!");
|
|
|
|
return ((uint32_t)-1);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
arc_buf_thaw(arc_buf_t *buf)
|
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_cksum_verify(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* Compressed buffers do not manipulate the b_freeze_cksum or
|
|
|
|
* allocate b_thawed.
|
|
|
|
*/
|
|
|
|
if (ARC_BUF_COMPRESSED(buf)) {
|
|
|
|
ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
|
|
|
|
hdr->b_l1hdr.b_bufcnt > 1);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
arc_cksum_free(hdr);
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_unwatch(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_buf_freeze(arc_buf_t *buf)
|
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2010-05-29 00:45:14 +04:00
|
|
|
kmutex_t *hash_lock;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
|
|
|
return;
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
if (ARC_BUF_COMPRESSED(buf)) {
|
|
|
|
ASSERT(hdr->b_l1hdr.b_freeze_cksum == NULL ||
|
|
|
|
hdr->b_l1hdr.b_bufcnt > 1);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(hash_lock);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT(hdr->b_l1hdr.b_freeze_cksum != NULL ||
|
|
|
|
hdr->b_l1hdr.b_state == arc_anon);
|
|
|
|
arc_cksum_compute(buf);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
|
|
|
|
* the following functions should be used to ensure that the flags are
|
|
|
|
* updated in a thread-safe way. When manipulating the flags either
|
|
|
|
* the hash_lock must be held or the hdr must be undiscoverable. This
|
|
|
|
* ensures that we're not racing with any other threads when updating
|
|
|
|
* the flags.
|
|
|
|
*/
|
|
|
|
static inline void
|
|
|
|
arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
|
|
|
hdr->b_flags |= flags;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void
|
|
|
|
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
|
|
|
hdr->b_flags &= ~flags;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Setting the compression bits in the arc_buf_hdr_t's b_flags is
|
|
|
|
* done in a special way since we have to clear and set bits
|
|
|
|
* at the same time. Consumers that wish to set the compression bits
|
|
|
|
* must use this function to ensure that the flags are updated in
|
|
|
|
* thread-safe manner.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Holes and embedded blocks will always have a psize = 0 so
|
|
|
|
* we ignore the compression of the blkptr and set the
|
|
|
|
* want to uncompress them. Mark them as uncompressed.
|
|
|
|
*/
|
|
|
|
if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
|
|
|
|
HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
|
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
|
|
|
|
} else {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
|
|
|
|
HDR_SET_COMPRESS(hdr, cmp);
|
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
|
|
|
|
ASSERT(HDR_COMPRESSION_ENABLED(hdr));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Looks for another buf on the same hdr which has the data decompressed, copies
|
|
|
|
* from it, and returns true. If no such buf exists, returns false.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
arc_buf_t *from;
|
|
|
|
boolean_t copied = B_FALSE;
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
|
|
|
ASSERT(!ARC_BUF_COMPRESSED(buf));
|
|
|
|
|
|
|
|
for (from = hdr->b_l1hdr.b_buf; from != NULL;
|
|
|
|
from = from->b_next) {
|
|
|
|
/* can't use our own data buffer */
|
|
|
|
if (from == buf) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!ARC_BUF_COMPRESSED(from)) {
|
|
|
|
bcopy(from->b_data, buf->b_data, arc_buf_size(buf));
|
|
|
|
copied = B_TRUE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There were no decompressed bufs, so there should not be a
|
|
|
|
* checksum on the hdr either.
|
|
|
|
*/
|
|
|
|
EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
|
|
|
|
|
|
|
|
return (copied);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Given a buf that has a data buffer attached to it, this function will
|
|
|
|
* efficiently fill the buf with data of the specified compression setting from
|
|
|
|
* the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr
|
|
|
|
* are already sharing a data buf, no copy is performed.
|
|
|
|
*
|
|
|
|
* If the buf is marked as compressed but uncompressed data was requested, this
|
|
|
|
* will allocate a new data buffer for the buf, remove that flag, and fill the
|
|
|
|
* buf with uncompressed data. You can't request a compressed buf on a hdr with
|
|
|
|
* uncompressed data, and (since we haven't added support for it yet) if you
|
|
|
|
* want compressed data your buf must already be marked as compressed and have
|
|
|
|
* the correct-sized data buffer.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
arc_buf_fill(arc_buf_t *buf, boolean_t compressed)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2016-07-14 00:17:41 +03:00
|
|
|
boolean_t hdr_compressed = (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
|
2016-06-02 07:04:53 +03:00
|
|
|
dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
|
|
|
IMPLY(compressed, hdr_compressed);
|
|
|
|
IMPLY(compressed, ARC_BUF_COMPRESSED(buf));
|
|
|
|
|
|
|
|
if (hdr_compressed == compressed) {
|
2016-07-11 20:45:52 +03:00
|
|
|
if (!arc_buf_is_shared(buf)) {
|
|
|
|
bcopy(hdr->b_l1hdr.b_pdata, buf->b_data,
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_buf_size(buf));
|
2016-07-11 20:45:52 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
} else {
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT(hdr_compressed);
|
|
|
|
ASSERT(!compressed);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), !=, HDR_GET_PSIZE(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* If the buf is sharing its data with the hdr, unlink it and
|
|
|
|
* allocate a new data buffer for the buf.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
2016-07-14 00:17:41 +03:00
|
|
|
if (arc_buf_is_shared(buf)) {
|
|
|
|
ASSERT(ARC_BUF_COMPRESSED(buf));
|
|
|
|
|
|
|
|
/* We need to give the buf it's own b_data */
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
|
2016-07-11 20:45:52 +03:00
|
|
|
buf->b_data =
|
|
|
|
arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/* Previously overhead was 0; just add new overhead */
|
2016-07-11 20:45:52 +03:00
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));
|
2016-07-14 00:17:41 +03:00
|
|
|
} else if (ARC_BUF_COMPRESSED(buf)) {
|
|
|
|
/* We need to reallocate the buf's b_data */
|
|
|
|
arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),
|
|
|
|
buf);
|
|
|
|
buf->b_data =
|
|
|
|
arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);
|
|
|
|
|
|
|
|
/* We increased the size of b_data; update overhead */
|
|
|
|
ARCSTAT_INCR(arcstat_overhead_size,
|
|
|
|
HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* Regardless of the buf's previous compression settings, it
|
|
|
|
* should not be compressed at the end of this function.
|
|
|
|
*/
|
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try copying the data from another buf which already has a
|
|
|
|
* decompressed version. If that's not possible, it's time to
|
|
|
|
* bite the bullet and decompress the data from the hdr.
|
|
|
|
*/
|
|
|
|
if (arc_buf_try_copy_decompressed_data(buf)) {
|
|
|
|
/* Skip byteswapping and checksumming (already done) */
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, !=, NULL);
|
|
|
|
return (0);
|
|
|
|
} else {
|
|
|
|
int error = zio_decompress_data(HDR_GET_COMPRESS(hdr),
|
|
|
|
hdr->b_l1hdr.b_pdata, buf->b_data,
|
|
|
|
HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Absent hardware errors or software bugs, this should
|
|
|
|
* be impossible, but log it anyway so we can debug it.
|
|
|
|
*/
|
|
|
|
if (error != 0) {
|
|
|
|
zfs_dbgmsg(
|
|
|
|
"hdr %p, compress %d, psize %d, lsize %d",
|
|
|
|
hdr, HDR_GET_COMPRESS(hdr),
|
|
|
|
HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));
|
|
|
|
return (SET_ERROR(EIO));
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
}
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/* Byteswap the buf's data if necessary */
|
2016-06-02 07:04:53 +03:00
|
|
|
if (bswap != DMU_BSWAP_NUMFUNCS) {
|
|
|
|
ASSERT(!HDR_SHARED_DATA(hdr));
|
|
|
|
ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);
|
|
|
|
dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
/* Compute the hdr's checksum if necessary */
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_cksum_compute(buf);
|
2016-07-14 00:17:41 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
int
|
|
|
|
arc_decompress(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
return (arc_buf_fill(buf, B_FALSE));
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Return the size of the block, b_pdata, that is stored in the arc_buf_hdr_t.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_hdr_size(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
uint64_t size;
|
|
|
|
|
|
|
|
if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
HDR_GET_PSIZE(hdr) > 0) {
|
|
|
|
size = HDR_GET_PSIZE(hdr);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);
|
|
|
|
size = HDR_GET_LSIZE(hdr);
|
|
|
|
}
|
|
|
|
return (size);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Increment the amount of evictable space in the arc_state_t's refcount.
|
|
|
|
* We account for the space used by the hdr and the arc buf individually
|
|
|
|
* so that we can add and remove them from the refcount individually.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)
|
|
|
|
{
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
arc_buf_t *buf;
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
|
|
|
if (GHOST_STATE(state)) {
|
|
|
|
ASSERT0(hdr->b_l1hdr.b_bufcnt);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
2016-07-11 20:45:52 +03:00
|
|
|
(void) refcount_add_many(&state->arcs_esize[type],
|
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(!GHOST_STATE(state));
|
|
|
|
if (hdr->b_l1hdr.b_pdata != NULL) {
|
|
|
|
(void) refcount_add_many(&state->arcs_esize[type],
|
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
}
|
|
|
|
for (buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) {
|
2016-07-11 20:45:52 +03:00
|
|
|
if (arc_buf_is_shared(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
2016-07-11 20:45:52 +03:00
|
|
|
(void) refcount_add_many(&state->arcs_esize[type],
|
|
|
|
arc_buf_size(buf), buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Decrement the amount of evictable space in the arc_state_t's refcount.
|
|
|
|
* We account for the space used by the hdr and the arc buf individually
|
|
|
|
* so that we can add and remove them from the refcount individually.
|
|
|
|
*/
|
|
|
|
static void
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
arc_buf_t *buf;
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
|
|
|
if (GHOST_STATE(state)) {
|
|
|
|
ASSERT0(hdr->b_l1hdr.b_bufcnt);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
|
|
|
(void) refcount_remove_many(&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(!GHOST_STATE(state));
|
|
|
|
if (hdr->b_l1hdr.b_pdata != NULL) {
|
|
|
|
(void) refcount_remove_many(&state->arcs_esize[type],
|
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
}
|
|
|
|
for (buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) {
|
2016-07-11 20:45:52 +03:00
|
|
|
if (arc_buf_is_shared(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
|
|
|
(void) refcount_remove_many(&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add a reference to this hdr indicating that someone is actively
|
|
|
|
* referencing that memory. When the refcount transitions from 0 to 1,
|
|
|
|
* we remove it from the respective arc_state_t list to indicate that
|
|
|
|
* it is not evictable.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
add_reference(arc_buf_hdr_t *hdr, void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_state_t *state;
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!MUTEX_HELD(HDR_LOCK(hdr))) {
|
|
|
|
ASSERT(hdr->b_l1hdr.b_state == arc_anon);
|
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
state = hdr->b_l1hdr.b_state;
|
|
|
|
|
|
|
|
if ((refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&
|
|
|
|
(state != arc_anon)) {
|
|
|
|
/* We don't use the L2-only state list. */
|
|
|
|
if (state != arc_l2c_only) {
|
2016-06-02 07:04:53 +03:00
|
|
|
multilist_remove(&state->arcs_list[arc_buf_type(hdr)],
|
|
|
|
hdr);
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_evictable_space_decrement(hdr, state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
/* remove the prefetch flag if we get a reference */
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Remove a reference from this hdr. When the reference transitions from
|
|
|
|
* 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's
|
|
|
|
* list making it eligible for eviction.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static int
|
2014-12-06 20:24:32 +03:00
|
|
|
remove_reference(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int cnt;
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
|
|
|
|
ASSERT(!GHOST_STATE(state));
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
|
|
|
* arc_l2c_only counts as a ghost state so we don't need to explicitly
|
|
|
|
* check to prevent usage of the arc_l2c_only list.
|
|
|
|
*/
|
|
|
|
if (((cnt = refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) == 0) &&
|
2008-11-20 23:01:55 +03:00
|
|
|
(state != arc_anon)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
multilist_insert(&state->arcs_list[arc_buf_type(hdr)], hdr);
|
|
|
|
ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
|
|
|
|
arc_evictable_space_increment(hdr, state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
return (cnt);
|
|
|
|
}
|
|
|
|
|
2013-10-03 04:11:19 +04:00
|
|
|
/*
|
|
|
|
* Returns detailed information about a specific arc buffer. When the
|
|
|
|
* state_index argument is set the function will calculate the arc header
|
|
|
|
* list position for its arc state. Since this requires a linear traversal
|
|
|
|
* callers are strongly encourage not to do this. However, it can be helpful
|
|
|
|
* for targeted analysis so the functionality is provided.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = ab->b_hdr;
|
2014-12-30 06:12:23 +03:00
|
|
|
l1arc_buf_hdr_t *l1hdr = NULL;
|
|
|
|
l2arc_buf_hdr_t *l2hdr = NULL;
|
|
|
|
arc_state_t *state = NULL;
|
|
|
|
|
2016-07-10 17:09:02 +03:00
|
|
|
memset(abi, 0, sizeof (arc_buf_info_t));
|
|
|
|
|
|
|
|
if (hdr == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
abi->abi_flags = hdr->b_flags;
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
|
|
|
l1hdr = &hdr->b_l1hdr;
|
|
|
|
state = l1hdr->b_state;
|
|
|
|
}
|
|
|
|
if (HDR_HAS_L2HDR(hdr))
|
|
|
|
l2hdr = &hdr->b_l2hdr;
|
2013-10-03 04:11:19 +04:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (l1hdr) {
|
2016-06-02 07:04:53 +03:00
|
|
|
abi->abi_bufcnt = l1hdr->b_bufcnt;
|
2014-12-30 06:12:23 +03:00
|
|
|
abi->abi_access = l1hdr->b_arc_access;
|
|
|
|
abi->abi_mru_hits = l1hdr->b_mru_hits;
|
|
|
|
abi->abi_mru_ghost_hits = l1hdr->b_mru_ghost_hits;
|
|
|
|
abi->abi_mfu_hits = l1hdr->b_mfu_hits;
|
|
|
|
abi->abi_mfu_ghost_hits = l1hdr->b_mfu_ghost_hits;
|
|
|
|
abi->abi_holds = refcount_count(&l1hdr->b_refcnt);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (l2hdr) {
|
|
|
|
abi->abi_l2arc_dattr = l2hdr->b_daddr;
|
|
|
|
abi->abi_l2arc_hits = l2hdr->b_hits;
|
|
|
|
}
|
|
|
|
|
2013-10-03 04:11:19 +04:00
|
|
|
abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON;
|
2014-12-30 06:12:23 +03:00
|
|
|
abi->abi_state_contents = arc_buf_type(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
abi->abi_size = arc_hdr_size(hdr);
|
2013-10-03 04:11:19 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Move the supplied buffer to the indicated state. The hash lock
|
2008-11-20 23:01:55 +03:00
|
|
|
* for the buffer must be held by the caller.
|
|
|
|
*/
|
|
|
|
static void
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr,
|
|
|
|
kmutex_t *hash_lock)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_state_t *old_state;
|
|
|
|
int64_t refcnt;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint32_t bufcnt;
|
|
|
|
boolean_t update_old, update_new;
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_contents_t buftype = arc_buf_type(hdr);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We almost always have an L1 hdr here, since we call arc_hdr_realloc()
|
|
|
|
* in arc_read() when bringing a buffer out of the L2ARC. However, the
|
|
|
|
* L1 hdr doesn't always exist when we change state to arc_anon before
|
|
|
|
* destroying a header, in which case reallocating to add the L1 hdr is
|
|
|
|
* pointless.
|
|
|
|
*/
|
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
|
|
|
old_state = hdr->b_l1hdr.b_state;
|
|
|
|
refcnt = refcount_count(&hdr->b_l1hdr.b_refcnt);
|
2016-06-02 07:04:53 +03:00
|
|
|
bufcnt = hdr->b_l1hdr.b_bufcnt;
|
|
|
|
update_old = (bufcnt > 0 || hdr->b_l1hdr.b_pdata != NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
} else {
|
|
|
|
old_state = arc_l2c_only;
|
|
|
|
refcnt = 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
bufcnt = 0;
|
|
|
|
update_old = B_FALSE;
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
update_new = update_old;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(hash_lock));
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
ASSERT3P(new_state, !=, old_state);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!GHOST_STATE(new_state) || bufcnt == 0);
|
|
|
|
ASSERT(old_state != arc_anon || bufcnt <= 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this buffer is evictable, transfer it from the
|
|
|
|
* old state list to the new state list.
|
|
|
|
*/
|
|
|
|
if (refcnt == 0) {
|
2014-12-30 06:12:23 +03:00
|
|
|
if (old_state != arc_anon && old_state != arc_l2c_only) {
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_remove(&old_state->arcs_list[buftype], hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (GHOST_STATE(old_state)) {
|
|
|
|
ASSERT0(bufcnt);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
update_old = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_evictable_space_decrement(hdr, old_state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
if (new_state != arc_anon && new_state != arc_l2c_only) {
|
|
|
|
/*
|
|
|
|
* An L1 header always exists here, since if we're
|
|
|
|
* moving to some L1-cached state (i.e. not l2c_only or
|
|
|
|
* anonymous), we realloc the header to add an L1hdr
|
|
|
|
* beforehand.
|
|
|
|
*/
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_insert(&new_state->arcs_list[buftype], hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (GHOST_STATE(new_state)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT0(bufcnt);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
update_new = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_evictable_space_increment(hdr, new_state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!HDR_EMPTY(hdr));
|
2014-12-06 20:24:32 +03:00
|
|
|
if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))
|
|
|
|
buf_hash_remove(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/* adjust state sizes (ignore arc_l2c_only) */
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (update_new && new_state != arc_l2c_only) {
|
2015-06-27 01:14:45 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
if (GHOST_STATE(new_state)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT0(bufcnt);
|
2015-06-27 01:14:45 +03:00
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* When moving a header to a ghost state, we first
|
2015-06-27 01:14:45 +03:00
|
|
|
* remove all arc buffers. Thus, we'll have a
|
2016-06-02 07:04:53 +03:00
|
|
|
* bufcnt of zero, and no arc buffer to use for
|
2015-06-27 01:14:45 +03:00
|
|
|
* the reference. As a result, we use the arc
|
|
|
|
* header pointer for the reference.
|
|
|
|
*/
|
|
|
|
(void) refcount_add_many(&new_state->arcs_size,
|
2016-06-02 07:04:53 +03:00
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
2015-06-27 01:14:45 +03:00
|
|
|
} else {
|
|
|
|
arc_buf_t *buf;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint32_t buffers = 0;
|
2015-06-27 01:14:45 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Each individual buffer holds a unique reference,
|
|
|
|
* thus we must remove each of these references one
|
|
|
|
* at a time.
|
|
|
|
*/
|
|
|
|
for (buf = hdr->b_l1hdr.b_buf; buf != NULL;
|
|
|
|
buf = buf->b_next) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(bufcnt, !=, 0);
|
|
|
|
buffers++;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When the arc_buf_t is sharing the data
|
|
|
|
* block with the hdr, the owner of the
|
|
|
|
* reference belongs to the hdr. Only
|
|
|
|
* add to the refcount if the arc_buf_t is
|
|
|
|
* not shared.
|
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
if (arc_buf_is_shared(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
|
|
|
|
2015-06-27 01:14:45 +03:00
|
|
|
(void) refcount_add_many(&new_state->arcs_size,
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
ASSERT3U(bufcnt, ==, buffers);
|
|
|
|
|
|
|
|
if (hdr->b_l1hdr.b_pdata != NULL) {
|
|
|
|
(void) refcount_add_many(&new_state->arcs_size,
|
|
|
|
arc_hdr_size(hdr), hdr);
|
|
|
|
} else {
|
|
|
|
ASSERT(GHOST_STATE(old_state));
|
2015-06-27 01:14:45 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (update_old && old_state != arc_l2c_only) {
|
2015-06-27 01:14:45 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
if (GHOST_STATE(old_state)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT0(bufcnt);
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2015-06-27 01:14:45 +03:00
|
|
|
/*
|
|
|
|
* When moving a header off of a ghost state,
|
2016-06-02 07:04:53 +03:00
|
|
|
* the header will not contain any arc buffers.
|
|
|
|
* We use the arc header pointer for the reference
|
|
|
|
* which is exactly what we did when we put the
|
|
|
|
* header on the ghost state.
|
2015-06-27 01:14:45 +03:00
|
|
|
*/
|
|
|
|
|
|
|
|
(void) refcount_remove_many(&old_state->arcs_size,
|
2016-06-02 07:04:53 +03:00
|
|
|
HDR_GET_LSIZE(hdr), hdr);
|
2015-06-27 01:14:45 +03:00
|
|
|
} else {
|
|
|
|
arc_buf_t *buf;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint32_t buffers = 0;
|
2015-06-27 01:14:45 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Each individual buffer holds a unique reference,
|
|
|
|
* thus we must remove each of these references one
|
|
|
|
* at a time.
|
|
|
|
*/
|
|
|
|
for (buf = hdr->b_l1hdr.b_buf; buf != NULL;
|
|
|
|
buf = buf->b_next) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(bufcnt, !=, 0);
|
|
|
|
buffers++;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When the arc_buf_t is sharing the data
|
|
|
|
* block with the hdr, the owner of the
|
|
|
|
* reference belongs to the hdr. Only
|
|
|
|
* add to the refcount if the arc_buf_t is
|
|
|
|
* not shared.
|
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
if (arc_buf_is_shared(buf))
|
2016-06-02 07:04:53 +03:00
|
|
|
continue;
|
|
|
|
|
2015-06-27 01:14:45 +03:00
|
|
|
(void) refcount_remove_many(
|
2016-07-11 20:45:52 +03:00
|
|
|
&old_state->arcs_size, arc_buf_size(buf),
|
2016-06-02 07:04:53 +03:00
|
|
|
buf);
|
2015-06-27 01:14:45 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(bufcnt, ==, buffers);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
|
|
|
(void) refcount_remove_many(
|
|
|
|
&old_state->arcs_size, arc_hdr_size(hdr), hdr);
|
2015-06-27 01:14:45 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr))
|
|
|
|
hdr->b_l1hdr.b_state = new_state;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
|
|
|
* L2 headers should never be on the L2 state list since they don't
|
|
|
|
* have L1 headers allocated.
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(multilist_is_empty(&arc_l2c_only->arcs_list[ARC_BUFC_DATA]) &&
|
|
|
|
multilist_is_empty(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA]));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_consume(uint64_t space, arc_space_type_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2009-02-18 23:51:31 +03:00
|
|
|
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
|
|
|
|
|
|
|
|
switch (type) {
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2009-02-18 23:51:31 +03:00
|
|
|
case ARC_SPACE_DATA:
|
|
|
|
ARCSTAT_INCR(arcstat_data_size, space);
|
|
|
|
break;
|
2014-02-04 00:41:47 +04:00
|
|
|
case ARC_SPACE_META:
|
2015-06-27 00:54:17 +03:00
|
|
|
ARCSTAT_INCR(arcstat_metadata_size, space);
|
2014-02-04 00:41:47 +04:00
|
|
|
break;
|
2016-07-13 15:42:40 +03:00
|
|
|
case ARC_SPACE_BONUS:
|
|
|
|
ARCSTAT_INCR(arcstat_bonus_size, space);
|
|
|
|
break;
|
|
|
|
case ARC_SPACE_DNODE:
|
|
|
|
ARCSTAT_INCR(arcstat_dnode_size, space);
|
|
|
|
break;
|
|
|
|
case ARC_SPACE_DBUF:
|
|
|
|
ARCSTAT_INCR(arcstat_dbuf_size, space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_HDRS:
|
|
|
|
ARCSTAT_INCR(arcstat_hdr_size, space);
|
|
|
|
break;
|
|
|
|
case ARC_SPACE_L2HDRS:
|
|
|
|
ARCSTAT_INCR(arcstat_l2_hdr_size, space);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2015-06-27 00:54:17 +03:00
|
|
|
if (type != ARC_SPACE_DATA)
|
2014-02-04 00:41:47 +04:00
|
|
|
ARCSTAT_INCR(arcstat_meta_used, space);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
atomic_add_64(&arc_size, space);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_return(uint64_t space, arc_space_type_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2009-02-18 23:51:31 +03:00
|
|
|
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
|
|
|
|
|
|
|
|
switch (type) {
|
2010-08-26 20:52:41 +04:00
|
|
|
default:
|
|
|
|
break;
|
2009-02-18 23:51:31 +03:00
|
|
|
case ARC_SPACE_DATA:
|
|
|
|
ARCSTAT_INCR(arcstat_data_size, -space);
|
|
|
|
break;
|
2014-02-04 00:41:47 +04:00
|
|
|
case ARC_SPACE_META:
|
2015-06-27 00:54:17 +03:00
|
|
|
ARCSTAT_INCR(arcstat_metadata_size, -space);
|
2014-02-04 00:41:47 +04:00
|
|
|
break;
|
2016-07-13 15:42:40 +03:00
|
|
|
case ARC_SPACE_BONUS:
|
|
|
|
ARCSTAT_INCR(arcstat_bonus_size, -space);
|
|
|
|
break;
|
|
|
|
case ARC_SPACE_DNODE:
|
|
|
|
ARCSTAT_INCR(arcstat_dnode_size, -space);
|
|
|
|
break;
|
|
|
|
case ARC_SPACE_DBUF:
|
|
|
|
ARCSTAT_INCR(arcstat_dbuf_size, -space);
|
2009-02-18 23:51:31 +03:00
|
|
|
break;
|
|
|
|
case ARC_SPACE_HDRS:
|
|
|
|
ARCSTAT_INCR(arcstat_hdr_size, -space);
|
|
|
|
break;
|
|
|
|
case ARC_SPACE_L2HDRS:
|
|
|
|
ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2014-02-04 00:41:47 +04:00
|
|
|
if (type != ARC_SPACE_DATA) {
|
|
|
|
ASSERT(arc_meta_used >= space);
|
2015-06-27 00:54:17 +03:00
|
|
|
if (arc_meta_max < arc_meta_used)
|
|
|
|
arc_meta_max = arc_meta_used;
|
2014-02-04 00:41:47 +04:00
|
|
|
ARCSTAT_INCR(arcstat_meta_used, -space);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(arc_size >= space);
|
|
|
|
atomic_add_64(&arc_size, -space);
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* Given a hdr and a buf, returns whether that buf can share its b_data buffer
|
|
|
|
* with the hdr's b_pdata.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
2016-07-14 00:17:41 +03:00
|
|
|
static boolean_t
|
|
|
|
arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
boolean_t hdr_compressed, buf_compressed;
|
|
|
|
/*
|
|
|
|
* The criteria for sharing a hdr's data are:
|
|
|
|
* 1. the hdr's compression matches the buf's compression
|
|
|
|
* 2. the hdr doesn't need to be byteswapped
|
|
|
|
* 3. the hdr isn't already being shared
|
|
|
|
* 4. the buf is either compressed or it is the last buf in the hdr list
|
|
|
|
*
|
|
|
|
* Criterion #4 maintains the invariant that shared uncompressed
|
|
|
|
* bufs must be the final buf in the hdr's b_buf list. Reading this, you
|
|
|
|
* might ask, "if a compressed buf is allocated first, won't that be the
|
|
|
|
* last thing in the list?", but in that case it's impossible to create
|
|
|
|
* a shared uncompressed buf anyway (because the hdr must be compressed
|
|
|
|
* to have the compressed buf). You might also think that #3 is
|
|
|
|
* sufficient to make this guarantee, however it's possible
|
|
|
|
* (specifically in the rare L2ARC write race mentioned in
|
|
|
|
* arc_buf_alloc_impl()) there will be an existing uncompressed buf that
|
|
|
|
* is sharable, but wasn't at the time of its allocation. Rather than
|
|
|
|
* allow a new shared uncompressed buf to be created and then shuffle
|
|
|
|
* the list around to make it the last element, this simply disallows
|
|
|
|
* sharing if the new buf isn't the first to be added.
|
|
|
|
*/
|
|
|
|
ASSERT3P(buf->b_hdr, ==, hdr);
|
|
|
|
hdr_compressed = HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF;
|
|
|
|
buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
|
|
|
|
return (buf_compressed == hdr_compressed &&
|
|
|
|
hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
|
|
|
|
!HDR_SHARED_DATA(hdr) &&
|
|
|
|
(ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a buf for this hdr. If you care about the data that's in the hdr,
|
|
|
|
* or if you want a compressed buffer, pass those flags in. Returns 0 if the
|
|
|
|
* copy was made successfully, or an error code otherwise.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
arc_buf_alloc_impl(arc_buf_hdr_t *hdr, void *tag, boolean_t compressed,
|
|
|
|
boolean_t fill, arc_buf_t **ret)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
arc_buf_t *buf;
|
2016-07-14 00:17:41 +03:00
|
|
|
boolean_t can_share;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
|
|
|
|
VERIFY(hdr->b_type == ARC_BUFC_DATA ||
|
|
|
|
hdr->b_type == ARC_BUFC_METADATA);
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT3P(ret, !=, NULL);
|
|
|
|
ASSERT3P(*ret, ==, NULL);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_mru_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mru_ghost_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_ghost_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_l2_hits = 0;
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf->b_hdr = hdr;
|
|
|
|
buf->b_data = NULL;
|
2016-07-11 20:45:52 +03:00
|
|
|
buf->b_next = hdr->b_l1hdr.b_buf;
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags = 0;
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
add_reference(hdr, tag);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We're about to change the hdr's b_flags. We must either
|
|
|
|
* hold the hash_lock or be undiscoverable.
|
|
|
|
*/
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
|
|
|
|
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* Only honor requests for compressed bufs if the hdr is actually
|
|
|
|
* compressed.
|
|
|
|
*/
|
|
|
|
if (compressed && HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF)
|
|
|
|
buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Although the ARC should handle it correctly, levels above the ARC
|
|
|
|
* should prevent us from having multiple compressed bufs off the same
|
|
|
|
* hdr. To ensure we notice it if this behavior changes, we assert this
|
|
|
|
* here the best we can.
|
|
|
|
*/
|
|
|
|
IMPLY(ARC_BUF_COMPRESSED(buf), !HDR_SHARED_DATA(hdr));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the hdr's data can be shared then we share the data buffer and
|
|
|
|
* set the appropriate bit in the hdr's b_flags to indicate the hdr is
|
2016-07-11 20:45:52 +03:00
|
|
|
* allocate a new buffer to store the buf's data.
|
2016-07-14 00:17:41 +03:00
|
|
|
*
|
|
|
|
* There is one additional restriction here because we're sharing
|
|
|
|
* hdr -> buf instead of the usual buf -> hdr: the hdr can't be actively
|
|
|
|
* involved in an L2ARC write, because if this buf is used by an
|
|
|
|
* arc_write() then the hdr's data buffer will be released when the
|
|
|
|
* write completes, even though the L2ARC write might still be using it.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
2016-07-14 00:17:41 +03:00
|
|
|
can_share = arc_can_share(hdr, buf) && !HDR_L2_WRITING(hdr);
|
|
|
|
|
|
|
|
/* Set up b_data and sharing */
|
|
|
|
if (can_share) {
|
2016-06-02 07:04:53 +03:00
|
|
|
buf->b_data = hdr->b_l1hdr.b_pdata;
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags |= ARC_BUF_FLAG_SHARED;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
|
|
|
|
} else {
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_data =
|
|
|
|
arc_get_data_buf(hdr, arc_buf_size(buf), buf);
|
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
VERIFY3P(buf->b_data, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
hdr->b_l1hdr.b_buf = buf;
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_bufcnt += 1;
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* If the user wants the data from the hdr, we need to either copy or
|
|
|
|
* decompress the data.
|
|
|
|
*/
|
|
|
|
if (fill) {
|
|
|
|
return (arc_buf_fill(buf, ARC_BUF_COMPRESSED(buf) != 0));
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
static char *arc_onloan_tag = "onloan";
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Loan out an anonymous arc buffer. Loaned buffers are not counted as in
|
|
|
|
* flight data by arc_tempreserve_space() until they are "returned". Loaned
|
|
|
|
* buffers must be returned to the arc before they can be used by the DMU or
|
|
|
|
* freed.
|
|
|
|
*/
|
|
|
|
arc_buf_t *
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
|
|
|
|
is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
atomic_add_64(&arc_loaned_bytes, size);
|
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_t *
|
|
|
|
arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,
|
|
|
|
enum zio_compress compression_type)
|
|
|
|
{
|
|
|
|
arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,
|
|
|
|
psize, lsize, compression_type);
|
|
|
|
|
|
|
|
atomic_add_64(&arc_loaned_bytes, psize);
|
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Return a loaned arc buffer to the arc.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_return_buf(arc_buf_t *buf, void *tag)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
(void) refcount_add(&hdr->b_l1hdr.b_refcnt, tag);
|
|
|
|
(void) refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
atomic_add_64(&arc_loaned_bytes, -arc_buf_size(buf));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Detach an arc_buf from a dbuf (tag) */
|
|
|
|
void
|
|
|
|
arc_loan_inuse_buf(arc_buf_t *buf, void *tag)
|
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
(void) refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);
|
|
|
|
(void) refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
atomic_add_64(&arc_loaned_bytes, -arc_buf_size(buf));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
|
|
|
l2arc_free_data_on_write(void *data, size_t size, arc_buf_contents_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
df->l2df_data = data;
|
|
|
|
df->l2df_size = size;
|
|
|
|
df->l2df_type = type;
|
|
|
|
mutex_enter(&l2arc_free_on_write_mtx);
|
|
|
|
list_insert_head(l2arc_free_on_write, df);
|
|
|
|
mutex_exit(&l2arc_free_on_write_mtx);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
|
|
|
arc_hdr_free_on_write(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
uint64_t size = arc_hdr_size(hdr);
|
2012-12-22 02:57:09 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* protected by hash lock, if in the hash table */
|
|
|
|
if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
|
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
|
|
|
ASSERT(state != arc_anon && state != arc_l2c_only);
|
|
|
|
|
|
|
|
(void) refcount_remove_many(&state->arcs_esize[type],
|
|
|
|
size, hdr);
|
2012-12-22 02:57:09 +04:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_remove_many(&state->arcs_size, size, hdr);
|
|
|
|
|
|
|
|
l2arc_free_data_on_write(hdr->b_l1hdr.b_pdata, size, type);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Share the arc_buf_t's data with the hdr. Whenever we are sharing the
|
|
|
|
* data buffer, we transfer the refcount ownership to the hdr and update
|
|
|
|
* the appropriate kstats.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT(arc_can_share(hdr, buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Start sharing the data buffer. We transfer the
|
|
|
|
* refcount ownership to the hdr since it always owns
|
|
|
|
* the refcount whenever an arc_buf_t is shared.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
refcount_transfer_ownership(&hdr->b_l1hdr.b_state->arcs_size, buf, hdr);
|
|
|
|
hdr->b_l1hdr.b_pdata = buf->b_data;
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags |= ARC_BUF_FLAG_SHARED;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Since we've transferred ownership to the hdr we need
|
|
|
|
* to increment its compressed and uncompressed kstats and
|
|
|
|
* decrement the overhead size.
|
|
|
|
*/
|
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
|
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
static void
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
2015-01-13 06:52:19 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(arc_buf_is_shared(buf));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* We are no longer sharing this buffer so we need
|
|
|
|
* to transfer its ownership to the rightful owner.
|
|
|
|
*/
|
|
|
|
refcount_transfer_ownership(&hdr->b_l1hdr.b_state->arcs_size, hdr, buf);
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
|
|
|
|
hdr->b_l1hdr.b_pdata = NULL;
|
2016-07-14 00:17:41 +03:00
|
|
|
buf->b_flags &= ~ARC_BUF_FLAG_SHARED;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Since the buffer is no longer shared between
|
|
|
|
* the arc buf and the hdr, count it as overhead.
|
|
|
|
*/
|
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
|
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
|
2016-07-11 20:45:52 +03:00
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Remove an arc_buf_t from the hdr's buf list and return the last
|
|
|
|
* arc_buf_t on the list. If no buffers remain on the list then return
|
|
|
|
* NULL.
|
|
|
|
*/
|
|
|
|
static arc_buf_t *
|
|
|
|
arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;
|
|
|
|
arc_buf_t *lastbuf = NULL;
|
|
|
|
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove the buf from the hdr list and locate the last
|
|
|
|
* remaining buffer on the list.
|
|
|
|
*/
|
|
|
|
while (*bufp != NULL) {
|
|
|
|
if (*bufp == buf)
|
|
|
|
*bufp = buf->b_next;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we've removed a buffer in the middle of
|
|
|
|
* the list then update the lastbuf and update
|
|
|
|
* bufp.
|
|
|
|
*/
|
|
|
|
if (*bufp != NULL) {
|
|
|
|
lastbuf = *bufp;
|
|
|
|
bufp = &(*bufp)->b_next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
buf->b_next = NULL;
|
|
|
|
ASSERT3P(lastbuf, !=, buf);
|
|
|
|
IMPLY(hdr->b_l1hdr.b_bufcnt > 0, lastbuf != NULL);
|
|
|
|
IMPLY(hdr->b_l1hdr.b_bufcnt > 0, hdr->b_l1hdr.b_buf != NULL);
|
|
|
|
IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));
|
|
|
|
|
|
|
|
return (lastbuf);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free up buf->b_data and pull the arc_buf_t off of the the arc_buf_hdr_t's
|
|
|
|
* list and free it.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static void
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_destroy_impl(arc_buf_t *buf)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_t *lastbuf;
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* Free up the data associated with the buf but only if we're not
|
|
|
|
* sharing this with the hdr. If we are sharing it with the hdr, the
|
|
|
|
* hdr is responsible for doing the free.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (buf->b_data != NULL) {
|
|
|
|
/*
|
|
|
|
* We're about to change the hdr's b_flags. We must either
|
|
|
|
* hold the hash_lock or be undiscoverable.
|
|
|
|
*/
|
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)) || HDR_EMPTY(hdr));
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_cksum_verify(buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_unwatch(buf);
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
if (arc_buf_is_shared(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
|
|
|
|
} else {
|
2016-07-11 20:45:52 +03:00
|
|
|
uint64_t size = arc_buf_size(buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_free_data_buf(hdr, buf->b_data, size, buf);
|
|
|
|
ARCSTAT_INCR(arcstat_overhead_size, -size);
|
|
|
|
}
|
|
|
|
buf->b_data = NULL;
|
|
|
|
|
|
|
|
ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
|
|
|
|
hdr->b_l1hdr.b_bufcnt -= 1;
|
|
|
|
}
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
lastbuf = arc_buf_remove(hdr, buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* If the current arc_buf_t is sharing its data buffer with the
|
|
|
|
* hdr, then reassign the hdr's b_pdata to share it with the new
|
|
|
|
* buffer at the end of the list. The shared buffer is always
|
|
|
|
* the last one on the hdr's buffer list.
|
|
|
|
*
|
|
|
|
* There is an equivalent case for compressed bufs, but since
|
|
|
|
* they aren't guaranteed to be the last buf in the list and
|
|
|
|
* that is an exceedingly rare case, we just allow that space be
|
|
|
|
* wasted temporarily.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
|
|
|
if (lastbuf != NULL) {
|
2016-07-14 00:17:41 +03:00
|
|
|
/* Only one buf can be shared at once */
|
2016-07-11 20:45:52 +03:00
|
|
|
VERIFY(!arc_buf_is_shared(lastbuf));
|
2016-07-14 00:17:41 +03:00
|
|
|
/* hdr is uncompressed so can't have compressed buf */
|
|
|
|
VERIFY(!ARC_BUF_COMPRESSED(lastbuf));
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
|
|
|
arc_hdr_free_pdata(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* We must setup a new shared block between the
|
|
|
|
* last buffer and the hdr. The data would have
|
|
|
|
* been allocated by the arc buf so we need to transfer
|
|
|
|
* ownership to the hdr since it's now being shared.
|
|
|
|
*/
|
|
|
|
arc_share_buf(hdr, lastbuf);
|
|
|
|
}
|
|
|
|
} else if (HDR_SHARED_DATA(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
2016-07-11 20:45:52 +03:00
|
|
|
* Uncompressed shared buffers are always at the end
|
|
|
|
* of the list. Compressed buffers don't have the
|
|
|
|
* same requirements. This makes it hard to
|
|
|
|
* simply assert that the lastbuf is shared so
|
|
|
|
* we rely on the hdr's compression flags to determine
|
|
|
|
* if we have a compressed, shared buffer.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3P(lastbuf, !=, NULL);
|
|
|
|
ASSERT(arc_buf_is_shared(lastbuf) ||
|
|
|
|
HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (hdr->b_l1hdr.b_bufcnt == 0)
|
|
|
|
arc_cksum_free(hdr);
|
|
|
|
|
|
|
|
/* clean up the buf */
|
|
|
|
buf->b_hdr = NULL;
|
|
|
|
kmem_cache_free(buf_cache, buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_hdr_alloc_pdata(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT(!HDR_SHARED_DATA(hdr));
|
|
|
|
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
|
|
|
hdr->b_l1hdr.b_pdata = arc_get_data_buf(hdr, arc_hdr_size(hdr), hdr);
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
|
|
|
|
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));
|
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_hdr_free_pdata(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* If the hdr is currently being written to the l2arc then
|
|
|
|
* we defer freeing the data by adding it to the l2arc_free_on_write
|
|
|
|
* list. The l2arc will free the data once it's finished
|
|
|
|
* writing it to the l2arc device.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_L2_WRITING(hdr)) {
|
|
|
|
arc_hdr_free_on_write(hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_free_on_write);
|
|
|
|
} else {
|
|
|
|
arc_free_data_buf(hdr, hdr->b_l1hdr.b_pdata,
|
|
|
|
arc_hdr_size(hdr), hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_pdata = NULL;
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));
|
|
|
|
ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
|
|
|
|
|
|
|
static arc_buf_hdr_t *
|
|
|
|
arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,
|
2016-07-11 20:45:52 +03:00
|
|
|
enum zio_compress compression_type, arc_buf_contents_t type)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
|
|
|
|
VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);
|
|
|
|
|
|
|
|
hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);
|
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
|
|
|
HDR_SET_PSIZE(hdr, psize);
|
|
|
|
HDR_SET_LSIZE(hdr, lsize);
|
|
|
|
hdr->b_spa = spa;
|
|
|
|
hdr->b_type = type;
|
|
|
|
hdr->b_flags = 0;
|
|
|
|
arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_hdr_set_compress(hdr, compression_type);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_state = arc_anon;
|
|
|
|
hdr->b_l1hdr.b_arc_access = 0;
|
|
|
|
hdr->b_l1hdr.b_bufcnt = 0;
|
|
|
|
hdr->b_l1hdr.b_buf = NULL;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Allocate the hdr's buffer. This will contain either
|
|
|
|
* the compressed or uncompressed data depending on the block
|
|
|
|
* it references and compressed arc enablement.
|
|
|
|
*/
|
|
|
|
arc_hdr_alloc_pdata(hdr);
|
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
2014-07-15 11:43:18 +04:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Transition between the two allocation states for the arc_buf_hdr struct.
|
|
|
|
* The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without
|
|
|
|
* (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller
|
|
|
|
* version is used when a cache buffer is only in the L2ARC in order to reduce
|
|
|
|
* memory usage.
|
2014-07-15 11:43:18 +04:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
static arc_buf_hdr_t *
|
|
|
|
arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *nhdr;
|
|
|
|
l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L2HDR(hdr));
|
|
|
|
ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||
|
|
|
|
(old == hdr_l2only_cache && new == hdr_full_cache));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));
|
|
|
|
buf_hash_remove(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
bcopy(hdr, nhdr, HDR_L2ONLY_SIZE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (new == hdr_full_cache) {
|
|
|
|
arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);
|
|
|
|
/*
|
|
|
|
* arc_access and arc_change_state need to be aware that a
|
|
|
|
* header has just come out of L2ARC, so we set its state to
|
|
|
|
* l2c_only even though it's about to change.
|
|
|
|
*/
|
|
|
|
nhdr->b_l1hdr.b_state = arc_l2c_only;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* Verify previous threads set to NULL before freeing */
|
|
|
|
ASSERT3P(nhdr->b_l1hdr.b_pdata, ==, NULL);
|
|
|
|
} else {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
ASSERT0(hdr->b_l1hdr.b_bufcnt);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* If we've reached here, We must have been called from
|
|
|
|
* arc_evict_hdr(), as such we should have already been
|
|
|
|
* removed from any ghost list we were previously on
|
|
|
|
* (which protects us from racing with arc_evict_state),
|
|
|
|
* thus no locking is needed during this check.
|
|
|
|
*/
|
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2012-12-22 02:57:09 +04:00
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* A buffer must not be moved into the arc_l2c_only
|
|
|
|
* state if it's not finished being written out to the
|
|
|
|
* l2arc device. Otherwise, the b_l1hdr.b_pdata field
|
|
|
|
* might try to be accessed, even though it was removed.
|
2012-12-22 02:57:09 +04:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY(!HDR_L2_WRITING(hdr));
|
|
|
|
VERIFY3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
|
|
|
|
|
|
|
arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* The header has been reallocated so we need to re-insert it into any
|
|
|
|
* lists it was on.
|
|
|
|
*/
|
|
|
|
(void) buf_hash_insert(nhdr, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We must place the realloc'ed header back into the list at
|
|
|
|
* the same spot. Otherwise, if it's placed earlier in the list,
|
|
|
|
* l2arc_write_buffers() could find it during the function's
|
|
|
|
* write phase, and try to write it out to the l2arc.
|
|
|
|
*/
|
|
|
|
list_insert_after(&dev->l2ad_buflist, hdr, nhdr);
|
|
|
|
list_remove(&dev->l2ad_buflist, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Since we're using the pointer address as the tag when
|
|
|
|
* incrementing and decrementing the l2ad_alloc refcount, we
|
|
|
|
* must remove the old pointer (that we're about to destroy) and
|
|
|
|
* add the new pointer to the refcount. Otherwise we'd remove
|
|
|
|
* the wrong pointer address when calling arc_hdr_destroy() later.
|
|
|
|
*/
|
|
|
|
|
|
|
|
(void) refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
|
|
|
|
(void) refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(nhdr), nhdr);
|
|
|
|
|
|
|
|
buf_discard_identity(hdr);
|
|
|
|
kmem_cache_free(old, hdr);
|
|
|
|
|
|
|
|
return (nhdr);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.
|
|
|
|
* The buf is returned thawed since we expect the consumer to modify it.
|
|
|
|
*/
|
|
|
|
arc_buf_t *
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_alloc_buf(spa_t *spa, void *tag, arc_buf_contents_t type, int32_t size)
|
2016-06-02 07:04:53 +03:00
|
|
|
{
|
|
|
|
arc_buf_t *buf;
|
|
|
|
arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,
|
|
|
|
ZIO_COMPRESS_OFF, type);
|
|
|
|
ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
|
2016-07-11 20:45:52 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
buf = NULL;
|
|
|
|
VERIFY0(arc_buf_alloc_impl(hdr, tag, B_FALSE, B_FALSE, &buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_thaw(buf);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this
|
|
|
|
* for bufs containing metadata.
|
|
|
|
*/
|
|
|
|
arc_buf_t *
|
|
|
|
arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize,
|
|
|
|
enum zio_compress compression_type)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
arc_buf_t *buf;
|
|
|
|
ASSERT3U(lsize, >, 0);
|
|
|
|
ASSERT3U(lsize, >=, psize);
|
|
|
|
ASSERT(compression_type > ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT(compression_type < ZIO_COMPRESS_FUNCTIONS);
|
|
|
|
|
|
|
|
hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
|
|
|
|
compression_type, ARC_BUFC_DATA);
|
|
|
|
ASSERT(!MUTEX_HELD(HDR_LOCK(hdr)));
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
buf = NULL;
|
|
|
|
VERIFY0(arc_buf_alloc_impl(hdr, tag, B_TRUE, B_FALSE, &buf));
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_thaw(buf);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
return (buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-06-16 02:12:19 +03:00
|
|
|
static void
|
|
|
|
arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;
|
|
|
|
l2arc_dev_t *dev = l2hdr->b_dev;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t asize = arc_hdr_size(hdr);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&dev->l2ad_mtx));
|
|
|
|
ASSERT(HDR_HAS_L2HDR(hdr));
|
|
|
|
|
|
|
|
list_remove(&dev->l2ad_buflist, hdr);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_asize, -asize);
|
|
|
|
ARCSTAT_INCR(arcstat_l2_size, -HDR_GET_LSIZE(hdr));
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_remove_many(&dev->l2ad_alloc, asize, hdr);
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
|
2015-06-16 02:12:19 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_hdr_destroy(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
|
|
|
ASSERT(hdr->b_l1hdr.b_buf == NULL ||
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_bufcnt > 0);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!HDR_IN_HASH_TABLE(hdr));
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!HDR_EMPTY(hdr))
|
|
|
|
buf_discard_identity(hdr);
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
2015-06-16 02:12:19 +03:00
|
|
|
l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;
|
|
|
|
boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-06-16 02:12:19 +03:00
|
|
|
if (!buflist_held)
|
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
2015-06-16 02:12:19 +03:00
|
|
|
* Even though we checked this conditional above, we
|
|
|
|
* need to check this again now that we have the
|
|
|
|
* l2ad_mtx. This is because we could be racing with
|
|
|
|
* another thread calling l2arc_evict() which might have
|
|
|
|
* destroyed this header's L2 portion as we were waiting
|
|
|
|
* to acquire the l2ad_mtx. If that happens, we don't
|
|
|
|
* want to re-destroy the header's L2 portion.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2015-06-16 02:12:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr))
|
|
|
|
arc_hdr_l2hdr_destroy(hdr);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (!buflist_held)
|
2015-06-16 02:12:19 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
|
|
|
arc_cksum_free(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
while (hdr->b_l1hdr.b_buf != NULL)
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (hdr->b_l1hdr.b_pdata != NULL) {
|
|
|
|
arc_hdr_free_pdata(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT3P(hdr->b_hash_next, ==, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L1HDR(hdr)) {
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
|
|
|
kmem_cache_free(hdr_full_cache, hdr);
|
|
|
|
} else {
|
|
|
|
kmem_cache_free(hdr_l2only_cache, hdr);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(arc_buf_t *buf, void* tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2015-06-29 20:02:03 +03:00
|
|
|
kmutex_t *hash_lock = HDR_LOCK(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
|
|
|
VERIFY0(remove_reference(hdr, NULL, tag));
|
|
|
|
arc_hdr_destroy(hdr);
|
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_enter(hash_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr, ==, buf->b_hdr);
|
|
|
|
ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);
|
|
|
|
ASSERT3P(buf->b_data, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
(void) remove_reference(hdr, hash_lock, tag);
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_destroy_impl(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Evict the arc_buf_hdr that is provided as a parameter. The resultant
|
|
|
|
* state of the header is dependent on its state prior to entering this
|
|
|
|
* function. The following transitions are possible:
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
2015-01-13 06:52:19 +03:00
|
|
|
* - arc_mru -> arc_mru_ghost
|
|
|
|
* - arc_mfu -> arc_mfu_ghost
|
|
|
|
* - arc_mru_ghost -> arc_l2c_only
|
|
|
|
* - arc_mru_ghost -> deleted
|
|
|
|
* - arc_mfu_ghost -> arc_l2c_only
|
|
|
|
* - arc_mfu_ghost -> deleted
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
static int64_t
|
|
|
|
arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_state_t *evicted_state, *state;
|
|
|
|
int64_t bytes_evicted = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(MUTEX_HELD(hash_lock));
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
state = hdr->b_l1hdr.b_state;
|
|
|
|
if (GHOST_STATE(state)) {
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* l2arc_write_buffers() relies on a header's L1 portion
|
2016-06-02 07:04:53 +03:00
|
|
|
* (i.e. its b_pdata field) during its write phase.
|
2015-01-13 06:52:19 +03:00
|
|
|
* Thus, we cannot push a header onto the arc_l2c_only
|
|
|
|
* state (removing its L1 piece) until the header is
|
|
|
|
* done being written to the l2arc.
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {
|
|
|
|
ARCSTAT_BUMP(arcstat_evict_l2_skip);
|
|
|
|
return (bytes_evicted);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_deleted);
|
2016-06-02 07:04:53 +03:00
|
|
|
bytes_evicted += HDR_GET_LSIZE(hdr);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_pdata == NULL);
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* This buffer is cached on the 2nd Level ARC;
|
|
|
|
* don't destroy the header.
|
|
|
|
*/
|
|
|
|
arc_change_state(arc_l2c_only, hdr, hash_lock);
|
|
|
|
/*
|
|
|
|
* dropping from L1+L2 cached to L2-only,
|
|
|
|
* realloc to remove the L1 header.
|
|
|
|
*/
|
|
|
|
hdr = arc_hdr_realloc(hdr, hdr_full_cache,
|
|
|
|
hdr_l2only_cache);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_change_state(arc_anon, hdr, hash_lock);
|
|
|
|
arc_hdr_destroy(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
return (bytes_evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(state == arc_mru || state == arc_mfu);
|
|
|
|
evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/* prefetch buffers have a minimum lifespan */
|
|
|
|
if (HDR_IO_IN_PROGRESS(hdr) ||
|
|
|
|
((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
|
|
|
|
ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <
|
|
|
|
arc_min_prefetch_lifespan)) {
|
|
|
|
ARCSTAT_BUMP(arcstat_evict_skip);
|
|
|
|
return (bytes_evicted);
|
Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt));
|
|
|
|
while (hdr->b_l1hdr.b_buf) {
|
|
|
|
arc_buf_t *buf = hdr->b_l1hdr.b_buf;
|
|
|
|
if (!mutex_tryenter(&buf->b_evict_lock)) {
|
|
|
|
ARCSTAT_BUMP(arcstat_mutex_miss);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (buf->b_data != NULL)
|
2016-06-02 07:04:53 +03:00
|
|
|
bytes_evicted += HDR_GET_LSIZE(hdr);
|
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_destroy_impl(buf);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
if (l2arc_write_eligible(hdr->b_spa, hdr)) {
|
|
|
|
ARCSTAT_INCR(arcstat_evict_l2_eligible,
|
|
|
|
HDR_GET_LSIZE(hdr));
|
|
|
|
} else {
|
|
|
|
ARCSTAT_INCR(arcstat_evict_l2_ineligible,
|
|
|
|
HDR_GET_LSIZE(hdr));
|
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (hdr->b_l1hdr.b_bufcnt == 0) {
|
|
|
|
arc_cksum_free(hdr);
|
|
|
|
|
|
|
|
bytes_evicted += arc_hdr_size(hdr);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this hdr is being evicted and has a compressed
|
|
|
|
* buffer then we discard it here before we change states.
|
|
|
|
* This ensures that the accounting is updated correctly
|
|
|
|
* in arc_free_data_buf().
|
|
|
|
*/
|
|
|
|
arc_hdr_free_pdata(hdr);
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_change_state(evicted_state, hdr, hash_lock);
|
|
|
|
ASSERT(HDR_IN_HASH_TABLE(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);
|
2015-01-13 06:52:19 +03:00
|
|
|
DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
return (bytes_evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
static uint64_t
|
|
|
|
arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
|
|
|
|
uint64_t spa, int64_t bytes)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_sublist_t *mls;
|
|
|
|
uint64_t bytes_evicted = 0;
|
|
|
|
arc_buf_hdr_t *hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock;
|
2015-01-13 06:52:19 +03:00
|
|
|
int evict_count = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT3P(marker, !=, NULL);
|
2015-06-29 20:02:03 +03:00
|
|
|
IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
mls = multilist_sublist_lock(ml, idx);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
for (hdr = multilist_sublist_prev(mls, marker); hdr != NULL;
|
|
|
|
hdr = multilist_sublist_prev(mls, marker)) {
|
|
|
|
if ((bytes != ARC_EVICT_ALL && bytes_evicted >= bytes) ||
|
|
|
|
(evict_count >= zfs_arc_evict_batch_limit))
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* To keep our iteration location, move the marker
|
|
|
|
* forward. Since we're not holding hdr's hash lock, we
|
|
|
|
* must be very careful and not remove 'hdr' from the
|
|
|
|
* sublist. Otherwise, other consumers might mistake the
|
|
|
|
* 'hdr' as not being on a sublist when they call the
|
|
|
|
* multilist_link_active() function (they all rely on
|
|
|
|
* the hash lock protecting concurrent insertions and
|
|
|
|
* removals). multilist_sublist_move_forward() was
|
|
|
|
* specifically implemented to ensure this is the case
|
|
|
|
* (only 'marker' will be removed and re-inserted).
|
|
|
|
*/
|
|
|
|
multilist_sublist_move_forward(mls, marker);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The only case where the b_spa field should ever be
|
|
|
|
* zero, is the marker headers inserted by
|
|
|
|
* arc_evict_state(). It's possible for multiple threads
|
|
|
|
* to be calling arc_evict_state() concurrently (e.g.
|
|
|
|
* dsl_pool_close() and zio_inject_fault()), so we must
|
|
|
|
* skip any markers we see from these other threads.
|
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
if (hdr->b_spa == 0)
|
2010-08-27 01:24:34 +04:00
|
|
|
continue;
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/* we're only interested in evicting buffers of a certain spa */
|
|
|
|
if (spa != 0 && hdr->b_spa != spa) {
|
|
|
|
ARCSTAT_BUMP(arcstat_evict_skip);
|
2010-05-29 00:45:14 +04:00
|
|
|
continue;
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
hash_lock = HDR_LOCK(hdr);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* We aren't calling this function from any code path
|
|
|
|
* that would already be holding a hash lock, so we're
|
|
|
|
* asserting on this assumption to be defensive in case
|
|
|
|
* this ever changes. Without this check, it would be
|
|
|
|
* possible to incorrectly increment arcstat_mutex_miss
|
|
|
|
* below (e.g. if the code changed such that we called
|
|
|
|
* this function with a hash lock held).
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(!MUTEX_HELD(hash_lock));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (mutex_tryenter(hash_lock)) {
|
2015-01-13 06:52:19 +03:00
|
|
|
uint64_t evicted = arc_evict_hdr(hdr, hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
bytes_evicted += evicted;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* If evicted is zero, arc_evict_hdr() must have
|
|
|
|
* decided to skip this header, don't increment
|
|
|
|
* evict_count in this case.
|
2010-08-27 01:24:34 +04:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
if (evicted != 0)
|
|
|
|
evict_count++;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If arc_size isn't overflowing, signal any
|
|
|
|
* threads that might happen to be waiting.
|
|
|
|
*
|
|
|
|
* For each header evicted, we wake up a single
|
|
|
|
* thread. If we used cv_broadcast, we could
|
|
|
|
* wake up "too many" threads causing arc_size
|
|
|
|
* to significantly overflow arc_c; since
|
|
|
|
* arc_get_data_buf() doesn't check for overflow
|
|
|
|
* when it's woken up (it doesn't because it's
|
|
|
|
* possible for the ARC to be overflowing while
|
|
|
|
* full of un-evictable buffers, and the
|
|
|
|
* function should proceed in this case).
|
|
|
|
*
|
|
|
|
* If threads are left sleeping, due to not
|
|
|
|
* using cv_broadcast, they will be woken up
|
|
|
|
* just before arc_reclaim_thread() sleeps.
|
|
|
|
*/
|
|
|
|
mutex_enter(&arc_reclaim_lock);
|
|
|
|
if (!arc_is_overflowing())
|
|
|
|
cv_signal(&arc_reclaim_waiters_cv);
|
|
|
|
mutex_exit(&arc_reclaim_lock);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
} else {
|
2015-01-13 06:52:19 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mutex_miss);
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_sublist_unlock(mls);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
return (bytes_evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Evict buffers from the given arc state, until we've removed the
|
|
|
|
* specified number of bytes. Move the removed buffers to the
|
|
|
|
* appropriate evict state.
|
|
|
|
*
|
|
|
|
* This function makes a "best effort". It skips over any buffers
|
|
|
|
* it can't get a hash_lock on, and so, may not catch all candidates.
|
|
|
|
* It may also return without evicting as much space as requested.
|
|
|
|
*
|
|
|
|
* If bytes is specified using the special value ARC_EVICT_ALL, this
|
|
|
|
* will evict all available (i.e. unlocked and evictable) buffers from
|
|
|
|
* the given arc state; which is used by arc_flush().
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
|
|
|
|
arc_buf_contents_t type)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
uint64_t total_evicted = 0;
|
|
|
|
multilist_t *ml = &state->arcs_list[type];
|
|
|
|
int num_sublists;
|
|
|
|
arc_buf_hdr_t **markers;
|
|
|
|
int i;
|
|
|
|
|
2015-06-29 20:02:03 +03:00
|
|
|
IMPLY(bytes < 0, bytes == ARC_EVICT_ALL);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
num_sublists = multilist_get_num_sublists(ml);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* If we've tried to evict from each sublist, made some
|
|
|
|
* progress, but still have not hit the target number of bytes
|
|
|
|
* to evict, we want to keep trying. The markers allow us to
|
|
|
|
* pick up where we left off for each individual sublist, rather
|
|
|
|
* than starting from the tail each time.
|
2009-02-18 23:51:31 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
|
|
|
|
for (i = 0; i < num_sublists; i++) {
|
|
|
|
multilist_sublist_t *mls;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A b_spa of 0 is used to indicate that this header is
|
|
|
|
* a marker. This fact is used in arc_adjust_type() and
|
|
|
|
* arc_evict_state_impl().
|
|
|
|
*/
|
|
|
|
markers[i]->b_spa = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mls = multilist_sublist_lock(ml, i);
|
|
|
|
multilist_sublist_insert_tail(mls, markers[i]);
|
|
|
|
multilist_sublist_unlock(mls);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* While we haven't hit our target number of bytes to evict, or
|
|
|
|
* we're evicting all available buffers.
|
2009-02-18 23:51:31 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
while (total_evicted < bytes || bytes == ARC_EVICT_ALL) {
|
2016-07-13 15:42:40 +03:00
|
|
|
int sublist_idx = multilist_get_random_index(ml);
|
|
|
|
uint64_t scan_evicted = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to reduce pinned dnodes with a floor of arc_dnode_limit.
|
|
|
|
* Request that 10% of the LRUs be scanned by the superblock
|
|
|
|
* shrinker.
|
|
|
|
*/
|
|
|
|
if (type == ARC_BUFC_DATA && arc_dnode_size > arc_dnode_limit)
|
|
|
|
arc_prune_async((arc_dnode_size - arc_dnode_limit) /
|
|
|
|
sizeof (dnode_t) / zfs_arc_dnode_reduce_percent);
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Start eviction using a randomly selected sublist,
|
|
|
|
* this is to try and evenly balance eviction across all
|
|
|
|
* sublists. Always starting at the same sublist
|
|
|
|
* (e.g. index 0) would cause evictions to favor certain
|
|
|
|
* sublists over others.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < num_sublists; i++) {
|
|
|
|
uint64_t bytes_remaining;
|
|
|
|
uint64_t bytes_evicted;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
if (bytes == ARC_EVICT_ALL)
|
|
|
|
bytes_remaining = ARC_EVICT_ALL;
|
|
|
|
else if (total_evicted < bytes)
|
|
|
|
bytes_remaining = bytes - total_evicted;
|
|
|
|
else
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
bytes_evicted = arc_evict_state_impl(ml, sublist_idx,
|
|
|
|
markers[sublist_idx], spa, bytes_remaining);
|
|
|
|
|
|
|
|
scan_evicted += bytes_evicted;
|
|
|
|
total_evicted += bytes_evicted;
|
|
|
|
|
|
|
|
/* we've reached the end, wrap to the beginning */
|
|
|
|
if (++sublist_idx >= num_sublists)
|
|
|
|
sublist_idx = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we didn't evict anything during this scan, we have
|
|
|
|
* no reason to believe we'll evict more during another
|
|
|
|
* scan, so break the loop.
|
|
|
|
*/
|
|
|
|
if (scan_evicted == 0) {
|
|
|
|
/* This isn't possible, let's make that obvious */
|
|
|
|
ASSERT3S(bytes, !=, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* When bytes is ARC_EVICT_ALL, the only way to
|
|
|
|
* break the loop is when scan_evicted is zero.
|
|
|
|
* In that case, we actually have evicted enough,
|
|
|
|
* so we don't want to increment the kstat.
|
|
|
|
*/
|
|
|
|
if (bytes != ARC_EVICT_ALL) {
|
|
|
|
ASSERT3S(total_evicted, <, bytes);
|
|
|
|
ARCSTAT_BUMP(arcstat_evict_not_enough);
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
break;
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
for (i = 0; i < num_sublists; i++) {
|
|
|
|
multilist_sublist_t *mls = multilist_sublist_lock(ml, i);
|
|
|
|
multilist_sublist_remove(mls, markers[i]);
|
|
|
|
multilist_sublist_unlock(mls);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
kmem_cache_free(hdr_full_cache, markers[i]);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
kmem_free(markers, sizeof (*markers) * num_sublists);
|
|
|
|
|
|
|
|
return (total_evicted);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flush all "evictable" data of the given type from the arc state
|
|
|
|
* specified. This will not evict any "active" buffers (i.e. referenced).
|
|
|
|
*
|
2016-06-02 07:04:53 +03:00
|
|
|
* When 'retry' is set to B_FALSE, the function will make a single pass
|
2015-01-13 06:52:19 +03:00
|
|
|
* over the state and evict any buffers that it can. Since it doesn't
|
|
|
|
* continually retry the eviction, it might end up leaving some buffers
|
|
|
|
* in the ARC due to lock misses.
|
|
|
|
*
|
2016-06-02 07:04:53 +03:00
|
|
|
* When 'retry' is set to B_TRUE, the function will continually retry the
|
2015-01-13 06:52:19 +03:00
|
|
|
* eviction until *all* evictable buffers have been removed from the
|
|
|
|
* state. As a result, if concurrent insertions into the state are
|
|
|
|
* allowed (e.g. if the ARC isn't shutting down), this function might
|
|
|
|
* wind up in an infinite loop, continually trying to evict buffers.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,
|
|
|
|
boolean_t retry)
|
|
|
|
{
|
|
|
|
uint64_t evicted = 0;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
while (refcount_count(&state->arcs_esize[type]) != 0) {
|
2015-01-13 06:52:19 +03:00
|
|
|
evicted += arc_evict_state(state, spa, ARC_EVICT_ALL, type);
|
|
|
|
|
|
|
|
if (!retry)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (evicted);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
/*
|
2015-09-24 01:59:04 +03:00
|
|
|
* Helper function for arc_prune_async() it is responsible for safely
|
|
|
|
* handling the execution of a registered arc_prune_func_t.
|
2011-12-23 00:20:43 +04:00
|
|
|
*/
|
|
|
|
static void
|
2015-05-30 17:57:53 +03:00
|
|
|
arc_prune_task(void *ptr)
|
2011-12-23 00:20:43 +04:00
|
|
|
{
|
2015-05-30 17:57:53 +03:00
|
|
|
arc_prune_t *ap = (arc_prune_t *)ptr;
|
|
|
|
arc_prune_func_t *func = ap->p_pfunc;
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
if (func != NULL)
|
|
|
|
func(ap->p_adjust, ap->p_private);
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2016-05-23 21:58:21 +03:00
|
|
|
refcount_remove(&ap->p_refcnt, func);
|
2015-05-30 17:57:53 +03:00
|
|
|
}
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
/*
|
|
|
|
* Notify registered consumers they must drop holds on a portion of the ARC
|
|
|
|
* buffered they reference. This provides a mechanism to ensure the ARC can
|
|
|
|
* honor the arc_meta_limit and reclaim otherwise pinned ARC buffers. This
|
|
|
|
* is analogous to dnlc_reduce_cache() but more generic.
|
|
|
|
*
|
2015-09-24 01:59:04 +03:00
|
|
|
* This operation is performed asynchronously so it may be safely called
|
2015-06-26 21:28:18 +03:00
|
|
|
* in the context of the arc_reclaim_thread(). A reference is taken here
|
2015-05-30 17:57:53 +03:00
|
|
|
* for each registered arc_prune_t and the arc_prune_task() is responsible
|
|
|
|
* for releasing it once the registered arc_prune_func_t has completed.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_prune_async(int64_t adjust)
|
|
|
|
{
|
|
|
|
arc_prune_t *ap;
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
for (ap = list_head(&arc_prune_list); ap != NULL;
|
|
|
|
ap = list_next(&arc_prune_list, ap)) {
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
if (refcount_count(&ap->p_refcnt) >= 2)
|
|
|
|
continue;
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
refcount_add(&ap->p_refcnt, ap->p_pfunc);
|
|
|
|
ap->p_adjust = adjust;
|
|
|
|
taskq_dispatch(arc_prune_taskq, arc_prune_task, ap, TQ_SLEEP);
|
|
|
|
ARCSTAT_BUMP(arcstat_prune);
|
2011-12-23 00:20:43 +04:00
|
|
|
}
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Evict the specified number of bytes from the state specified,
|
|
|
|
* restricting eviction to the spa and type given. This function
|
|
|
|
* prevents us from trying to evict more from a state's list than
|
|
|
|
* is "evictable", and to skip evicting altogether when passed a
|
|
|
|
* negative value for "bytes". In contrast, arc_evict_state() will
|
|
|
|
* evict everything it can, when passed a negative value for "bytes".
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_adjust_impl(arc_state_t *state, uint64_t spa, int64_t bytes,
|
|
|
|
arc_buf_contents_t type)
|
|
|
|
{
|
|
|
|
int64_t delta;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (bytes > 0 && refcount_count(&state->arcs_esize[type]) > 0) {
|
|
|
|
delta = MIN(refcount_count(&state->arcs_esize[type]), bytes);
|
2015-01-13 06:52:19 +03:00
|
|
|
return (arc_evict_state(state, spa, delta, type));
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The goal of this function is to evict enough meta data buffers from the
|
|
|
|
* ARC in order to enforce the arc_meta_limit. Achieving this is slightly
|
|
|
|
* more complicated than it appears because it is common for data buffers
|
|
|
|
* to have holds on meta data buffers. In addition, dnode meta data buffers
|
|
|
|
* will be held by the dnodes in the block preventing them from being freed.
|
|
|
|
* This means we can't simply traverse the ARC and expect to always find
|
|
|
|
* enough unheld meta data buffer to release.
|
|
|
|
*
|
|
|
|
* Therefore, this function has been updated to make alternating passes
|
|
|
|
* over the ARC releasing data buffers and then newly unheld meta data
|
|
|
|
* buffers. This ensures forward progress is maintained and arc_meta_used
|
|
|
|
* will decrease. Normally this is sufficient, but if required the ARC
|
|
|
|
* will call the registered prune callbacks causing dentry and inodes to
|
|
|
|
* be dropped from the VFS cache. This will make dnode meta data buffers
|
|
|
|
* available for reclaim.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
2015-05-30 17:57:53 +03:00
|
|
|
arc_adjust_meta_balanced(void)
|
2015-01-13 06:52:19 +03:00
|
|
|
{
|
2016-09-19 19:28:35 +03:00
|
|
|
int64_t delta, prune = 0, adjustmnt;
|
|
|
|
uint64_t total_evicted = 0;
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_buf_contents_t type = ARC_BUFC_DATA;
|
2015-06-26 21:28:18 +03:00
|
|
|
int restarts = MAX(zfs_arc_meta_adjust_restarts, 0);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
restart:
|
|
|
|
/*
|
|
|
|
* This slightly differs than the way we evict from the mru in
|
|
|
|
* arc_adjust because we don't have a "target" value (i.e. no
|
|
|
|
* "meta" arc_p). As a result, I think we can completely
|
|
|
|
* cannibalize the metadata in the MRU before we evict the
|
|
|
|
* metadata from the MFU. I think we probably need to implement a
|
|
|
|
* "metadata arc_p" value to do this properly.
|
|
|
|
*/
|
|
|
|
adjustmnt = arc_meta_used - arc_meta_limit;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (adjustmnt > 0 && refcount_count(&arc_mru->arcs_esize[type]) > 0) {
|
|
|
|
delta = MIN(refcount_count(&arc_mru->arcs_esize[type]),
|
|
|
|
adjustmnt);
|
2015-01-13 06:52:19 +03:00
|
|
|
total_evicted += arc_adjust_impl(arc_mru, 0, delta, type);
|
|
|
|
adjustmnt -= delta;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We can't afford to recalculate adjustmnt here. If we do,
|
|
|
|
* new metadata buffers can sneak into the MRU or ANON lists,
|
|
|
|
* thus penalize the MFU metadata. Although the fudge factor is
|
|
|
|
* small, it has been empirically shown to be significant for
|
|
|
|
* certain workloads (e.g. creating many empty directories). As
|
|
|
|
* such, we use the original calculation for adjustmnt, and
|
|
|
|
* simply decrement the amount of data evicted from the MRU.
|
|
|
|
*/
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (adjustmnt > 0 && refcount_count(&arc_mfu->arcs_esize[type]) > 0) {
|
|
|
|
delta = MIN(refcount_count(&arc_mfu->arcs_esize[type]),
|
|
|
|
adjustmnt);
|
2015-01-13 06:52:19 +03:00
|
|
|
total_evicted += arc_adjust_impl(arc_mfu, 0, delta, type);
|
|
|
|
}
|
|
|
|
|
|
|
|
adjustmnt = arc_meta_used - arc_meta_limit;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (adjustmnt > 0 &&
|
|
|
|
refcount_count(&arc_mru_ghost->arcs_esize[type]) > 0) {
|
2015-01-13 06:52:19 +03:00
|
|
|
delta = MIN(adjustmnt,
|
2016-06-02 07:04:53 +03:00
|
|
|
refcount_count(&arc_mru_ghost->arcs_esize[type]));
|
2015-01-13 06:52:19 +03:00
|
|
|
total_evicted += arc_adjust_impl(arc_mru_ghost, 0, delta, type);
|
|
|
|
adjustmnt -= delta;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (adjustmnt > 0 &&
|
|
|
|
refcount_count(&arc_mfu_ghost->arcs_esize[type]) > 0) {
|
2015-01-13 06:52:19 +03:00
|
|
|
delta = MIN(adjustmnt,
|
2016-06-02 07:04:53 +03:00
|
|
|
refcount_count(&arc_mfu_ghost->arcs_esize[type]));
|
2015-01-13 06:52:19 +03:00
|
|
|
total_evicted += arc_adjust_impl(arc_mfu_ghost, 0, delta, type);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If after attempting to make the requested adjustment to the ARC
|
|
|
|
* the meta limit is still being exceeded then request that the
|
|
|
|
* higher layers drop some cached objects which have holds on ARC
|
|
|
|
* meta buffers. Requests to the upper layers will be made with
|
|
|
|
* increasingly large scan sizes until the ARC is below the limit.
|
|
|
|
*/
|
|
|
|
if (arc_meta_used > arc_meta_limit) {
|
|
|
|
if (type == ARC_BUFC_DATA) {
|
|
|
|
type = ARC_BUFC_METADATA;
|
|
|
|
} else {
|
|
|
|
type = ARC_BUFC_DATA;
|
|
|
|
|
|
|
|
if (zfs_arc_meta_prune) {
|
|
|
|
prune += zfs_arc_meta_prune;
|
2015-05-30 17:57:53 +03:00
|
|
|
arc_prune_async(prune);
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (restarts > 0) {
|
|
|
|
restarts--;
|
|
|
|
goto restart;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (total_evicted);
|
|
|
|
}
|
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
/*
|
|
|
|
* Evict metadata buffers from the cache, such that arc_meta_used is
|
|
|
|
* capped by the arc_meta_limit tunable.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_adjust_meta_only(void)
|
|
|
|
{
|
|
|
|
uint64_t total_evicted = 0;
|
|
|
|
int64_t target;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're over the meta limit, we want to evict enough
|
|
|
|
* metadata to get back under the meta limit. We don't want to
|
|
|
|
* evict so much that we drop the MRU below arc_p, though. If
|
|
|
|
* we're over the meta limit more than we're over arc_p, we
|
|
|
|
* evict some from the MRU here, and some from the MFU below.
|
|
|
|
*/
|
|
|
|
target = MIN((int64_t)(arc_meta_used - arc_meta_limit),
|
2015-06-27 01:14:45 +03:00
|
|
|
(int64_t)(refcount_count(&arc_anon->arcs_size) +
|
|
|
|
refcount_count(&arc_mru->arcs_size) - arc_p));
|
2015-05-30 17:57:53 +03:00
|
|
|
|
|
|
|
total_evicted += arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Similar to the above, we want to evict enough bytes to get us
|
|
|
|
* below the meta limit, but not so much as to drop us below the
|
2016-07-11 20:45:52 +03:00
|
|
|
* space allotted to the MFU (which is defined as arc_c - arc_p).
|
2015-05-30 17:57:53 +03:00
|
|
|
*/
|
|
|
|
target = MIN((int64_t)(arc_meta_used - arc_meta_limit),
|
2015-06-27 01:14:45 +03:00
|
|
|
(int64_t)(refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p)));
|
2015-05-30 17:57:53 +03:00
|
|
|
|
|
|
|
total_evicted += arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
|
|
|
|
|
|
|
|
return (total_evicted);
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
|
|
|
arc_adjust_meta(void)
|
|
|
|
{
|
|
|
|
if (zfs_arc_meta_strategy == ARC_STRATEGY_META_ONLY)
|
|
|
|
return (arc_adjust_meta_only());
|
|
|
|
else
|
|
|
|
return (arc_adjust_meta_balanced());
|
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Return the type of the oldest buffer in the given arc state
|
|
|
|
*
|
|
|
|
* This function will select a random sublist of type ARC_BUFC_DATA and
|
|
|
|
* a random sublist of type ARC_BUFC_METADATA. The tail of each sublist
|
|
|
|
* is compared, and the type which contains the "older" buffer will be
|
|
|
|
* returned.
|
|
|
|
*/
|
|
|
|
static arc_buf_contents_t
|
|
|
|
arc_adjust_type(arc_state_t *state)
|
|
|
|
{
|
|
|
|
multilist_t *data_ml = &state->arcs_list[ARC_BUFC_DATA];
|
|
|
|
multilist_t *meta_ml = &state->arcs_list[ARC_BUFC_METADATA];
|
|
|
|
int data_idx = multilist_get_random_index(data_ml);
|
|
|
|
int meta_idx = multilist_get_random_index(meta_ml);
|
|
|
|
multilist_sublist_t *data_mls;
|
|
|
|
multilist_sublist_t *meta_mls;
|
|
|
|
arc_buf_contents_t type;
|
|
|
|
arc_buf_hdr_t *data_hdr;
|
|
|
|
arc_buf_hdr_t *meta_hdr;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We keep the sublist lock until we're finished, to prevent
|
|
|
|
* the headers from being destroyed via arc_evict_state().
|
|
|
|
*/
|
|
|
|
data_mls = multilist_sublist_lock(data_ml, data_idx);
|
|
|
|
meta_mls = multilist_sublist_lock(meta_ml, meta_idx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* These two loops are to ensure we skip any markers that
|
|
|
|
* might be at the tail of the lists due to arc_evict_state().
|
|
|
|
*/
|
|
|
|
|
|
|
|
for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL;
|
|
|
|
data_hdr = multilist_sublist_prev(data_mls, data_hdr)) {
|
|
|
|
if (data_hdr->b_spa != 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL;
|
|
|
|
meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) {
|
|
|
|
if (meta_hdr->b_spa != 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (data_hdr == NULL && meta_hdr == NULL) {
|
|
|
|
type = ARC_BUFC_DATA;
|
|
|
|
} else if (data_hdr == NULL) {
|
|
|
|
ASSERT3P(meta_hdr, !=, NULL);
|
|
|
|
type = ARC_BUFC_METADATA;
|
|
|
|
} else if (meta_hdr == NULL) {
|
|
|
|
ASSERT3P(data_hdr, !=, NULL);
|
|
|
|
type = ARC_BUFC_DATA;
|
|
|
|
} else {
|
|
|
|
ASSERT3P(data_hdr, !=, NULL);
|
|
|
|
ASSERT3P(meta_hdr, !=, NULL);
|
|
|
|
|
|
|
|
/* The headers can't be on the sublist without an L1 header */
|
|
|
|
ASSERT(HDR_HAS_L1HDR(data_hdr));
|
|
|
|
ASSERT(HDR_HAS_L1HDR(meta_hdr));
|
|
|
|
|
|
|
|
if (data_hdr->b_l1hdr.b_arc_access <
|
|
|
|
meta_hdr->b_l1hdr.b_arc_access) {
|
|
|
|
type = ARC_BUFC_DATA;
|
|
|
|
} else {
|
|
|
|
type = ARC_BUFC_METADATA;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
multilist_sublist_unlock(meta_mls);
|
|
|
|
multilist_sublist_unlock(data_mls);
|
|
|
|
|
|
|
|
return (type);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Evict buffers from the cache, such that arc_size is capped by arc_c.
|
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
arc_adjust(void)
|
|
|
|
{
|
|
|
|
uint64_t total_evicted = 0;
|
|
|
|
uint64_t bytes;
|
|
|
|
int64_t target;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're over arc_meta_limit, we want to correct that before
|
|
|
|
* potentially evicting data buffers below.
|
|
|
|
*/
|
|
|
|
total_evicted += arc_adjust_meta();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust MRU size
|
|
|
|
*
|
|
|
|
* If we're over the target cache size, we want to evict enough
|
|
|
|
* from the list to get back to our target size. We don't want
|
|
|
|
* to evict too much from the MRU, such that it drops below
|
|
|
|
* arc_p. So, if we're over our target cache size more than
|
|
|
|
* the MRU is over arc_p, we'll evict enough to get back to
|
|
|
|
* arc_p here, and then evict more from the MFU below.
|
|
|
|
*/
|
|
|
|
target = MIN((int64_t)(arc_size - arc_c),
|
2015-06-27 01:14:45 +03:00
|
|
|
(int64_t)(refcount_count(&arc_anon->arcs_size) +
|
|
|
|
refcount_count(&arc_mru->arcs_size) + arc_meta_used - arc_p));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're below arc_meta_min, always prefer to evict data.
|
|
|
|
* Otherwise, try to satisfy the requested number of bytes to
|
|
|
|
* evict from the type which contains older buffers; in an
|
|
|
|
* effort to keep newer buffers in the cache regardless of their
|
|
|
|
* type. If we cannot satisfy the number of bytes from this
|
|
|
|
* type, spill over into the next type.
|
|
|
|
*/
|
|
|
|
if (arc_adjust_type(arc_mru) == ARC_BUFC_METADATA &&
|
|
|
|
arc_meta_used > arc_meta_min) {
|
|
|
|
bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
|
|
|
|
total_evicted += bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we couldn't evict our target number of bytes from
|
|
|
|
* metadata, we try to get the rest from data.
|
|
|
|
*/
|
|
|
|
target -= bytes;
|
|
|
|
|
|
|
|
total_evicted +=
|
|
|
|
arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
|
|
|
|
} else {
|
|
|
|
bytes = arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_DATA);
|
|
|
|
total_evicted += bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we couldn't evict our target number of bytes from
|
|
|
|
* data, we try to get the rest from metadata.
|
|
|
|
*/
|
|
|
|
target -= bytes;
|
|
|
|
|
|
|
|
total_evicted +=
|
|
|
|
arc_adjust_impl(arc_mru, 0, target, ARC_BUFC_METADATA);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust MFU size
|
|
|
|
*
|
|
|
|
* Now that we've tried to evict enough from the MRU to get its
|
|
|
|
* size back to arc_p, if we're still above the target cache
|
|
|
|
* size, we evict the rest from the MFU.
|
|
|
|
*/
|
|
|
|
target = arc_size - arc_c;
|
|
|
|
|
2015-07-01 18:18:08 +03:00
|
|
|
if (arc_adjust_type(arc_mfu) == ARC_BUFC_METADATA &&
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_meta_used > arc_meta_min) {
|
|
|
|
bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
|
|
|
|
total_evicted += bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we couldn't evict our target number of bytes from
|
|
|
|
* metadata, we try to get the rest from data.
|
|
|
|
*/
|
|
|
|
target -= bytes;
|
|
|
|
|
|
|
|
total_evicted +=
|
|
|
|
arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
|
|
|
|
} else {
|
|
|
|
bytes = arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA);
|
|
|
|
total_evicted += bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we couldn't evict our target number of bytes from
|
|
|
|
* data, we try to get the rest from data.
|
|
|
|
*/
|
|
|
|
target -= bytes;
|
|
|
|
|
|
|
|
total_evicted +=
|
|
|
|
arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_METADATA);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust ghost lists
|
|
|
|
*
|
|
|
|
* In addition to the above, the ARC also defines target values
|
|
|
|
* for the ghost lists. The sum of the mru list and mru ghost
|
|
|
|
* list should never exceed the target size of the cache, and
|
|
|
|
* the sum of the mru list, mfu list, mru ghost list, and mfu
|
|
|
|
* ghost list should never exceed twice the target size of the
|
|
|
|
* cache. The following logic enforces these limits on the ghost
|
|
|
|
* caches, and evicts from them as needed.
|
|
|
|
*/
|
2015-06-27 01:14:45 +03:00
|
|
|
target = refcount_count(&arc_mru->arcs_size) +
|
|
|
|
refcount_count(&arc_mru_ghost->arcs_size) - arc_c;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
bytes = arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA);
|
|
|
|
total_evicted += bytes;
|
|
|
|
|
|
|
|
target -= bytes;
|
|
|
|
|
|
|
|
total_evicted +=
|
|
|
|
arc_adjust_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We assume the sum of the mru list and mfu list is less than
|
|
|
|
* or equal to arc_c (we enforced this above), which means we
|
|
|
|
* can use the simpler of the two equations below:
|
|
|
|
*
|
|
|
|
* mru + mfu + mru ghost + mfu ghost <= 2 * arc_c
|
|
|
|
* mru ghost + mfu ghost <= arc_c
|
|
|
|
*/
|
2015-06-27 01:14:45 +03:00
|
|
|
target = refcount_count(&arc_mru_ghost->arcs_size) +
|
|
|
|
refcount_count(&arc_mfu_ghost->arcs_size) - arc_c;
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
bytes = arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA);
|
|
|
|
total_evicted += bytes;
|
|
|
|
|
|
|
|
target -= bytes;
|
|
|
|
|
|
|
|
total_evicted +=
|
|
|
|
arc_adjust_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA);
|
|
|
|
|
|
|
|
return (total_evicted);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_flush(spa_t *spa, boolean_t retry)
|
2011-12-23 00:20:43 +04:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
uint64_t guid = 0;
|
2014-01-03 23:40:52 +04:00
|
|
|
|
2015-03-18 01:08:22 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* If retry is B_TRUE, a spa must not be specified since we have
|
2015-01-13 06:52:19 +03:00
|
|
|
* no good way to determine if all of a spa's buffers have been
|
|
|
|
* evicted from an arc state.
|
2015-03-18 01:08:22 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(!retry || spa == 0);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (spa != NULL)
|
2011-11-12 02:07:54 +04:00
|
|
|
guid = spa_load_guid(spa);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
(void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);
|
|
|
|
|
|
|
|
(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);
|
|
|
|
|
|
|
|
(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
|
|
|
|
(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_shrink(int64_t to_free)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-01-22 16:37:37 +03:00
|
|
|
uint64_t c = arc_c;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-01-22 16:37:37 +03:00
|
|
|
if (c > to_free && c - to_free > arc_c_min) {
|
|
|
|
arc_c = c - to_free;
|
2015-06-26 21:28:18 +03:00
|
|
|
atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
|
2008-11-20 23:01:55 +03:00
|
|
|
if (arc_c > arc_size)
|
|
|
|
arc_c = MAX(arc_size, arc_c_min);
|
|
|
|
if (arc_p > arc_c)
|
|
|
|
arc_p = (arc_c >> 1);
|
|
|
|
ASSERT(arc_c >= arc_c_min);
|
|
|
|
ASSERT((int64_t)arc_p >= 0);
|
2016-01-22 16:37:37 +03:00
|
|
|
} else {
|
|
|
|
arc_c = arc_c_min;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (arc_size > arc_c)
|
2015-01-13 06:52:19 +03:00
|
|
|
(void) arc_adjust();
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
typedef enum free_memory_reason_t {
|
|
|
|
FMR_UNKNOWN,
|
|
|
|
FMR_NEEDFREE,
|
|
|
|
FMR_LOTSFREE,
|
|
|
|
FMR_SWAPFS_MINFREE,
|
|
|
|
FMR_PAGES_PP_MAXIMUM,
|
|
|
|
FMR_HEAP_ARENA,
|
|
|
|
FMR_ZIO_ARENA,
|
|
|
|
} free_memory_reason_t;
|
|
|
|
|
|
|
|
int64_t last_free_memory;
|
|
|
|
free_memory_reason_t last_free_reason;
|
|
|
|
|
|
|
|
#ifdef _KERNEL
|
|
|
|
/*
|
|
|
|
* Additional reserve of pages for pp_reserve.
|
|
|
|
*/
|
|
|
|
int64_t arc_pages_pp_reserve = 64;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Additional reserve of pages for swapfs.
|
|
|
|
*/
|
|
|
|
int64_t arc_swapfs_reserve = 64;
|
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the amount of memory that can be consumed before reclaim will be
|
|
|
|
* needed. Positive if there is sufficient free memory, negative indicates
|
|
|
|
* the amount of memory that needs to be freed up.
|
|
|
|
*/
|
|
|
|
static int64_t
|
|
|
|
arc_available_memory(void)
|
|
|
|
{
|
|
|
|
int64_t lowest = INT64_MAX;
|
|
|
|
free_memory_reason_t r = FMR_UNKNOWN;
|
|
|
|
#ifdef _KERNEL
|
|
|
|
int64_t n;
|
2015-07-27 23:17:32 +03:00
|
|
|
#ifdef __linux__
|
|
|
|
pgcnt_t needfree = btop(arc_need_free);
|
|
|
|
pgcnt_t lotsfree = btop(arc_sys_free);
|
|
|
|
pgcnt_t desfree = 0;
|
|
|
|
#endif
|
2015-06-26 21:28:18 +03:00
|
|
|
|
|
|
|
if (needfree > 0) {
|
|
|
|
n = PAGESIZE * (-needfree);
|
|
|
|
if (n < lowest) {
|
|
|
|
lowest = n;
|
|
|
|
r = FMR_NEEDFREE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* check that we're out of range of the pageout scanner. It starts to
|
|
|
|
* schedule paging if freemem is less than lotsfree and needfree.
|
|
|
|
* lotsfree is the high-water mark for pageout, and needfree is the
|
|
|
|
* number of needed free pages. We add extra pages here to make sure
|
|
|
|
* the scanner doesn't start up while we're freeing memory.
|
|
|
|
*/
|
|
|
|
n = PAGESIZE * (freemem - lotsfree - needfree - desfree);
|
|
|
|
if (n < lowest) {
|
|
|
|
lowest = n;
|
|
|
|
r = FMR_LOTSFREE;
|
|
|
|
}
|
|
|
|
|
2015-07-27 23:17:32 +03:00
|
|
|
#ifndef __linux__
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* check to make sure that swapfs has enough space so that anon
|
|
|
|
* reservations can still succeed. anon_resvmem() checks that the
|
|
|
|
* availrmem is greater than swapfs_minfree, and the number of reserved
|
|
|
|
* swap pages. We also add a bit of extra here just to prevent
|
|
|
|
* circumstances from getting really dire.
|
|
|
|
*/
|
|
|
|
n = PAGESIZE * (availrmem - swapfs_minfree - swapfs_reserve -
|
|
|
|
desfree - arc_swapfs_reserve);
|
|
|
|
if (n < lowest) {
|
|
|
|
lowest = n;
|
|
|
|
r = FMR_SWAPFS_MINFREE;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check that we have enough availrmem that memory locking (e.g., via
|
|
|
|
* mlock(3C) or memcntl(2)) can still succeed. (pages_pp_maximum
|
|
|
|
* stores the number of pages that cannot be locked; when availrmem
|
|
|
|
* drops below pages_pp_maximum, page locking mechanisms such as
|
|
|
|
* page_pp_lock() will fail.)
|
|
|
|
*/
|
|
|
|
n = PAGESIZE * (availrmem - pages_pp_maximum -
|
|
|
|
arc_pages_pp_reserve);
|
|
|
|
if (n < lowest) {
|
|
|
|
lowest = n;
|
|
|
|
r = FMR_PAGES_PP_MAXIMUM;
|
|
|
|
}
|
2015-07-27 23:17:32 +03:00
|
|
|
#endif
|
2015-06-26 21:28:18 +03:00
|
|
|
|
|
|
|
#if defined(__i386)
|
|
|
|
/*
|
|
|
|
* If we're on an i386 platform, it's possible that we'll exhaust the
|
|
|
|
* kernel heap space before we ever run out of available physical
|
|
|
|
* memory. Most checks of the size of the heap_area compare against
|
|
|
|
* tune.t_minarmem, which is the minimum available real memory that we
|
|
|
|
* can have in the system. However, this is generally fixed at 25 pages
|
|
|
|
* which is so low that it's useless. In this comparison, we seek to
|
|
|
|
* calculate the total heap-size, and reclaim if more than 3/4ths of the
|
|
|
|
* heap is allocated. (Or, in the calculation, if less than 1/4th is
|
|
|
|
* free)
|
|
|
|
*/
|
|
|
|
n = vmem_size(heap_arena, VMEM_FREE) -
|
|
|
|
(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC) >> 2);
|
|
|
|
if (n < lowest) {
|
|
|
|
lowest = n;
|
|
|
|
r = FMR_HEAP_ARENA;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If zio data pages are being allocated out of a separate heap segment,
|
|
|
|
* then enforce that the size of available vmem for this arena remains
|
2016-06-02 07:04:53 +03:00
|
|
|
* above about 1/4th (1/(2^arc_zio_arena_free_shift)) free.
|
2015-06-26 21:28:18 +03:00
|
|
|
*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Note that reducing the arc_zio_arena_free_shift keeps more virtual
|
|
|
|
* memory (in the zio_arena) free, which can avoid memory
|
|
|
|
* fragmentation issues.
|
2015-06-26 21:28:18 +03:00
|
|
|
*/
|
|
|
|
if (zio_arena != NULL) {
|
2016-06-02 07:04:53 +03:00
|
|
|
n = vmem_size(zio_arena, VMEM_FREE) - (vmem_size(zio_arena,
|
|
|
|
VMEM_ALLOC) >> arc_zio_arena_free_shift);
|
2015-06-26 21:28:18 +03:00
|
|
|
if (n < lowest) {
|
|
|
|
lowest = n;
|
|
|
|
r = FMR_ZIO_ARENA;
|
|
|
|
}
|
|
|
|
}
|
2015-07-27 23:17:32 +03:00
|
|
|
#else /* _KERNEL */
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Every 100 calls, free a small amount */
|
|
|
|
if (spa_get_random(100) == 0)
|
|
|
|
lowest = -1024;
|
2015-07-27 23:17:32 +03:00
|
|
|
#endif /* _KERNEL */
|
2015-06-26 21:28:18 +03:00
|
|
|
|
|
|
|
last_free_memory = lowest;
|
|
|
|
last_free_reason = r;
|
|
|
|
|
|
|
|
return (lowest);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine if the system is under memory pressure and is asking
|
2016-06-02 07:04:53 +03:00
|
|
|
* to reclaim memory. A return value of B_TRUE indicates that the system
|
2015-06-26 21:28:18 +03:00
|
|
|
* is under memory pressure and that the arc should adjust accordingly.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
arc_reclaim_needed(void)
|
|
|
|
{
|
|
|
|
return (arc_available_memory() < 0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_kmem_reap_now(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
size_t i;
|
|
|
|
kmem_cache_t *prev_cache = NULL;
|
|
|
|
kmem_cache_t *prev_data_cache = NULL;
|
|
|
|
extern kmem_cache_t *zio_buf_cache[];
|
|
|
|
extern kmem_cache_t *zio_data_buf_cache[];
|
2015-06-25 01:48:22 +03:00
|
|
|
extern kmem_cache_t *range_seg_cache;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
if ((arc_meta_used >= arc_meta_limit) && zfs_arc_meta_prune) {
|
|
|
|
/*
|
|
|
|
* We are exceeding our meta-data cache limit.
|
|
|
|
* Prune some entries to release holds on meta-data.
|
|
|
|
*/
|
2015-09-24 01:59:04 +03:00
|
|
|
arc_prune_async(zfs_arc_meta_prune);
|
2015-05-30 17:57:53 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
|
2015-10-31 00:34:22 +03:00
|
|
|
#ifdef _ILP32
|
|
|
|
/* reach upper limit of cache size on 32-bit */
|
|
|
|
if (zio_buf_cache[i] == NULL)
|
|
|
|
break;
|
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
if (zio_buf_cache[i] != prev_cache) {
|
|
|
|
prev_cache = zio_buf_cache[i];
|
|
|
|
kmem_cache_reap_now(zio_buf_cache[i]);
|
|
|
|
}
|
|
|
|
if (zio_data_buf_cache[i] != prev_data_cache) {
|
|
|
|
prev_data_cache = zio_data_buf_cache[i];
|
|
|
|
kmem_cache_reap_now(zio_data_buf_cache[i]);
|
|
|
|
}
|
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
kmem_cache_reap_now(buf_cache);
|
2014-12-30 06:12:23 +03:00
|
|
|
kmem_cache_reap_now(hdr_full_cache);
|
|
|
|
kmem_cache_reap_now(hdr_l2only_cache);
|
2015-06-25 01:48:22 +03:00
|
|
|
kmem_cache_reap_now(range_seg_cache);
|
2015-06-26 21:28:18 +03:00
|
|
|
|
|
|
|
if (zio_arena != NULL) {
|
|
|
|
/*
|
|
|
|
* Ask the vmem arena to reclaim unused memory from its
|
|
|
|
* quantum caches.
|
|
|
|
*/
|
|
|
|
vmem_qcache_reap(zio_arena);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Threads can block in arc_get_data_buf() waiting for this thread to evict
|
|
|
|
* enough data and signal them to proceed. When this happens, the threads in
|
|
|
|
* arc_get_data_buf() are sleeping while holding the hash lock for their
|
|
|
|
* particular arc header. Thus, we must be careful to never sleep on a
|
|
|
|
* hash lock in this thread. This is to prevent the following deadlock:
|
|
|
|
*
|
|
|
|
* - Thread A sleeps on CV in arc_get_data_buf() holding hash lock "L",
|
|
|
|
* waiting for the reclaim thread to signal it.
|
|
|
|
*
|
|
|
|
* - arc_reclaim_thread() tries to acquire hash lock "L" using mutex_enter,
|
|
|
|
* fails, and goes to sleep forever.
|
|
|
|
*
|
|
|
|
* This possible deadlock is avoided by always acquiring a hash lock
|
|
|
|
* using mutex_tryenter() from arc_reclaim_thread().
|
2012-03-14 01:29:16 +04:00
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_reclaim_thread(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-06-26 21:28:18 +03:00
|
|
|
fstrans_cookie_t cookie = spl_fstrans_mark();
|
2016-05-06 19:35:52 +03:00
|
|
|
hrtime_t growtime = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
callb_cpr_t cpr;
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
CALLB_CPR_INIT(&cpr, &arc_reclaim_lock, callb_generic_cpr, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_enter(&arc_reclaim_lock);
|
2015-06-26 21:28:18 +03:00
|
|
|
while (!arc_reclaim_thread_exit) {
|
|
|
|
int64_t to_free;
|
|
|
|
int64_t free_memory = arc_available_memory();
|
|
|
|
uint64_t evicted = 0;
|
2012-03-14 01:29:16 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_tuning_update();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* This is necessary in order for the mdb ::arc dcmd to
|
|
|
|
* show up to date information. Since the ::arc command
|
|
|
|
* does not call the kstat's update function, without
|
|
|
|
* this call, the command may show stale stats for the
|
|
|
|
* anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even
|
|
|
|
* with this change, the data might be up to 1 second
|
|
|
|
* out of date; but that should suffice. The arc_state_t
|
|
|
|
* structures can be queried directly if more accurate
|
|
|
|
* information is needed.
|
|
|
|
*/
|
|
|
|
#ifndef __linux__
|
|
|
|
if (arc_ksp != NULL)
|
|
|
|
arc_ksp->ks_update(arc_ksp, KSTAT_READ);
|
|
|
|
#endif
|
2015-06-26 21:28:18 +03:00
|
|
|
mutex_exit(&arc_reclaim_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
if (free_memory < 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_no_grow = B_TRUE;
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_warm = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* Wait at least zfs_grow_retry (default 5) seconds
|
|
|
|
* before considering growing.
|
|
|
|
*/
|
2016-05-06 19:35:52 +03:00
|
|
|
growtime = gethrtime() + SEC2NSEC(arc_grow_retry);
|
2011-03-31 05:59:17 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_kmem_reap_now();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* If we are still low on memory, shrink the ARC
|
|
|
|
* so that we have arc_shrink_min free space.
|
|
|
|
*/
|
|
|
|
free_memory = arc_available_memory();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
to_free = (arc_c >> arc_shrink_shift) - free_memory;
|
|
|
|
if (to_free > 0) {
|
|
|
|
#ifdef _KERNEL
|
2015-07-27 23:17:32 +03:00
|
|
|
to_free = MAX(to_free, arc_need_free);
|
2015-06-26 21:28:18 +03:00
|
|
|
#endif
|
|
|
|
arc_shrink(to_free);
|
|
|
|
}
|
|
|
|
} else if (free_memory < arc_c >> arc_no_grow_shift) {
|
|
|
|
arc_no_grow = B_TRUE;
|
2016-05-06 19:35:52 +03:00
|
|
|
} else if (gethrtime() >= growtime) {
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_no_grow = B_FALSE;
|
|
|
|
}
|
2013-07-24 21:14:11 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
evicted = arc_adjust();
|
2013-07-24 21:14:11 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
mutex_enter(&arc_reclaim_lock);
|
2013-07-24 21:14:11 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* If evicted is zero, we couldn't evict anything via
|
|
|
|
* arc_adjust(). This could be due to hash lock
|
|
|
|
* collisions, but more likely due to the majority of
|
|
|
|
* arc buffers being unevictable. Therefore, even if
|
|
|
|
* arc_size is above arc_c, another pass is unlikely to
|
|
|
|
* be helpful and could potentially cause us to enter an
|
|
|
|
* infinite loop.
|
|
|
|
*/
|
|
|
|
if (arc_size <= arc_c || evicted == 0) {
|
|
|
|
/*
|
|
|
|
* We're either no longer overflowing, or we
|
|
|
|
* can't evict anything more, so we should wake
|
2015-07-27 23:17:32 +03:00
|
|
|
* up any threads before we go to sleep and clear
|
|
|
|
* arc_need_free since nothing more can be done.
|
2015-06-26 21:28:18 +03:00
|
|
|
*/
|
|
|
|
cv_broadcast(&arc_reclaim_waiters_cv);
|
2015-07-27 23:17:32 +03:00
|
|
|
arc_need_free = 0;
|
2013-07-24 21:14:11 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* Block until signaled, or after one second (we
|
|
|
|
* might need to perform arc_kmem_reap_now()
|
|
|
|
* even if we aren't being signalled)
|
|
|
|
*/
|
|
|
|
CALLB_CPR_SAFE_BEGIN(&cpr);
|
2016-05-12 02:55:48 +03:00
|
|
|
(void) cv_timedwait_sig_hires(&arc_reclaim_thread_cv,
|
2016-05-06 19:35:52 +03:00
|
|
|
&arc_reclaim_lock, SEC2NSEC(1), MSEC2NSEC(1), 0);
|
2015-06-26 21:28:18 +03:00
|
|
|
CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_lock);
|
|
|
|
}
|
2015-01-13 06:52:19 +03:00
|
|
|
}
|
2013-07-24 21:14:11 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_reclaim_thread_exit = B_FALSE;
|
2015-01-13 06:52:19 +03:00
|
|
|
cv_broadcast(&arc_reclaim_thread_cv);
|
|
|
|
CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_lock */
|
|
|
|
spl_fstrans_unmark(cookie);
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
2011-03-30 05:08:59 +04:00
|
|
|
#ifdef _KERNEL
|
|
|
|
/*
|
2012-03-14 01:29:16 +04:00
|
|
|
* Determine the amount of memory eligible for eviction contained in the
|
|
|
|
* ARC. All clean data reported by the ghost lists can always be safely
|
|
|
|
* evicted. Due to arc_c_min, the same does not hold for all clean data
|
|
|
|
* contained by the regular mru and mfu lists.
|
|
|
|
*
|
|
|
|
* In the case of the regular mru and mfu lists, we need to report as
|
|
|
|
* much clean data as possible, such that evicting that same reported
|
|
|
|
* data will not bring arc_size below arc_c_min. Thus, in certain
|
|
|
|
* circumstances, the total amount of clean data in the mru and mfu
|
|
|
|
* lists might not actually be evictable.
|
|
|
|
*
|
|
|
|
* The following two distinct cases are accounted for:
|
|
|
|
*
|
|
|
|
* 1. The sum of the amount of dirty data contained by both the mru and
|
|
|
|
* mfu lists, plus the ARC's other accounting (e.g. the anon list),
|
|
|
|
* is greater than or equal to arc_c_min.
|
|
|
|
* (i.e. amount of dirty data >= arc_c_min)
|
|
|
|
*
|
|
|
|
* This is the easy case; all clean data contained by the mru and mfu
|
|
|
|
* lists is evictable. Evicting all clean data can only drop arc_size
|
|
|
|
* to the amount of dirty data, which is greater than arc_c_min.
|
|
|
|
*
|
|
|
|
* 2. The sum of the amount of dirty data contained by both the mru and
|
|
|
|
* mfu lists, plus the ARC's other accounting (e.g. the anon list),
|
|
|
|
* is less than arc_c_min.
|
|
|
|
* (i.e. arc_c_min > amount of dirty data)
|
|
|
|
*
|
|
|
|
* 2.1. arc_size is greater than or equal arc_c_min.
|
|
|
|
* (i.e. arc_size >= arc_c_min > amount of dirty data)
|
|
|
|
*
|
|
|
|
* In this case, not all clean data from the regular mru and mfu
|
|
|
|
* lists is actually evictable; we must leave enough clean data
|
|
|
|
* to keep arc_size above arc_c_min. Thus, the maximum amount of
|
|
|
|
* evictable data from the two lists combined, is exactly the
|
|
|
|
* difference between arc_size and arc_c_min.
|
|
|
|
*
|
|
|
|
* 2.2. arc_size is less than arc_c_min
|
|
|
|
* (i.e. arc_c_min > arc_size > amount of dirty data)
|
|
|
|
*
|
|
|
|
* In this case, none of the data contained in the mru and mfu
|
|
|
|
* lists is evictable, even if it's clean. Since arc_size is
|
|
|
|
* already below arc_c_min, evicting any more would only
|
|
|
|
* increase this negative difference.
|
2011-03-30 05:08:59 +04:00
|
|
|
*/
|
2012-03-14 01:29:16 +04:00
|
|
|
static uint64_t
|
|
|
|
arc_evictable_memory(void) {
|
|
|
|
uint64_t arc_clean =
|
2016-06-02 07:04:53 +03:00
|
|
|
refcount_count(&arc_mru->arcs_esize[ARC_BUFC_DATA]) +
|
|
|
|
refcount_count(&arc_mru->arcs_esize[ARC_BUFC_METADATA]) +
|
|
|
|
refcount_count(&arc_mfu->arcs_esize[ARC_BUFC_DATA]) +
|
|
|
|
refcount_count(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
|
2012-03-14 01:29:16 +04:00
|
|
|
uint64_t ghost_clean =
|
2016-06-02 07:04:53 +03:00
|
|
|
refcount_count(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]) +
|
|
|
|
refcount_count(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]) +
|
|
|
|
refcount_count(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]) +
|
|
|
|
refcount_count(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
2012-03-14 01:29:16 +04:00
|
|
|
uint64_t arc_dirty = MAX((int64_t)arc_size - (int64_t)arc_clean, 0);
|
|
|
|
|
|
|
|
if (arc_dirty >= arc_c_min)
|
|
|
|
return (ghost_clean + arc_clean);
|
|
|
|
|
|
|
|
return (ghost_clean + MAX((int64_t)arc_size - (int64_t)arc_c_min, 0));
|
|
|
|
}
|
|
|
|
|
2014-10-02 16:21:08 +04:00
|
|
|
/*
|
|
|
|
* If sc->nr_to_scan is zero, the caller is requesting a query of the
|
|
|
|
* number of objects which can potentially be freed. If it is nonzero,
|
|
|
|
* the request is to free that many objects.
|
|
|
|
*
|
|
|
|
* Linux kernels >= 3.12 have the count_objects and scan_objects callbacks
|
|
|
|
* in struct shrinker and also require the shrinker to return the number
|
|
|
|
* of objects freed.
|
|
|
|
*
|
|
|
|
* Older kernels require the shrinker to return the number of freeable
|
|
|
|
* objects following the freeing of nr_to_free.
|
|
|
|
*/
|
|
|
|
static spl_shrinker_t
|
2011-06-22 01:26:51 +04:00
|
|
|
__arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc)
|
2011-03-30 05:08:59 +04:00
|
|
|
{
|
2014-10-02 16:21:08 +04:00
|
|
|
int64_t pages;
|
2011-03-30 05:08:59 +04:00
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/* The arc is considered warm once reclaim has occurred */
|
|
|
|
if (unlikely(arc_warm == B_FALSE))
|
|
|
|
arc_warm = B_TRUE;
|
2011-03-30 05:08:59 +04:00
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/* Return the potential number of reclaimable pages */
|
2014-10-02 16:21:08 +04:00
|
|
|
pages = btop((int64_t)arc_evictable_memory());
|
2012-03-14 01:29:16 +04:00
|
|
|
if (sc->nr_to_scan == 0)
|
|
|
|
return (pages);
|
2011-05-09 23:18:46 +04:00
|
|
|
|
|
|
|
/* Not allowed to perform filesystem reclaim */
|
2011-06-22 01:26:51 +04:00
|
|
|
if (!(sc->gfp_mask & __GFP_FS))
|
2014-10-02 16:21:08 +04:00
|
|
|
return (SHRINK_STOP);
|
2011-05-09 23:18:46 +04:00
|
|
|
|
2011-03-30 05:08:59 +04:00
|
|
|
/* Reclaim in progress */
|
2015-01-13 06:52:19 +03:00
|
|
|
if (mutex_tryenter(&arc_reclaim_lock) == 0)
|
2014-10-02 16:21:08 +04:00
|
|
|
return (SHRINK_STOP);
|
2011-03-30 05:08:59 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_exit(&arc_reclaim_lock);
|
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/*
|
|
|
|
* Evict the requested number of pages by shrinking arc_c the
|
|
|
|
* requested amount. If there is nothing left to evict just
|
|
|
|
* reap whatever we can from the various arc slabs.
|
|
|
|
*/
|
|
|
|
if (pages > 0) {
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_shrink(ptob(sc->nr_to_scan));
|
|
|
|
arc_kmem_reap_now();
|
2014-10-02 16:21:08 +04:00
|
|
|
#ifdef HAVE_SPLIT_SHRINKER_CALLBACK
|
|
|
|
pages = MAX(pages - btop(arc_evictable_memory()), 0);
|
|
|
|
#else
|
2013-12-23 23:34:20 +04:00
|
|
|
pages = btop(arc_evictable_memory());
|
2014-10-02 16:21:08 +04:00
|
|
|
#endif
|
2012-03-14 01:29:16 +04:00
|
|
|
} else {
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_kmem_reap_now();
|
2014-10-02 16:21:08 +04:00
|
|
|
pages = SHRINK_STOP;
|
2012-03-14 01:29:16 +04:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* We've reaped what we can, wake up threads.
|
|
|
|
*/
|
|
|
|
cv_broadcast(&arc_reclaim_waiters_cv);
|
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/*
|
|
|
|
* When direct reclaim is observed it usually indicates a rapid
|
|
|
|
* increase in memory pressure. This occurs because the kswapd
|
|
|
|
* threads were unable to asynchronously keep enough free memory
|
|
|
|
* available. In this case set arc_no_grow to briefly pause arc
|
|
|
|
* growth to avoid compounding the memory pressure.
|
|
|
|
*/
|
2011-03-30 05:08:59 +04:00
|
|
|
if (current_is_kswapd()) {
|
2012-03-14 01:29:16 +04:00
|
|
|
ARCSTAT_BUMP(arcstat_memory_indirect_count);
|
2011-03-30 05:08:59 +04:00
|
|
|
} else {
|
2012-03-14 01:29:16 +04:00
|
|
|
arc_no_grow = B_TRUE;
|
2015-07-27 23:17:32 +03:00
|
|
|
arc_need_free = ptob(sc->nr_to_scan);
|
2012-03-14 01:29:16 +04:00
|
|
|
ARCSTAT_BUMP(arcstat_memory_direct_count);
|
2011-03-30 05:08:59 +04:00
|
|
|
}
|
|
|
|
|
2013-12-23 23:34:20 +04:00
|
|
|
return (pages);
|
2011-03-30 05:08:59 +04:00
|
|
|
}
|
2011-06-22 01:26:51 +04:00
|
|
|
SPL_SHRINKER_CALLBACK_WRAPPER(arc_shrinker_func);
|
2011-03-30 05:08:59 +04:00
|
|
|
|
|
|
|
SPL_SHRINKER_DECLARE(arc_shrinker, arc_shrinker_func, DEFAULT_SEEKS);
|
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Adapt arc info given the number of bytes we are trying to add and
|
|
|
|
* the state that we are comming from. This function is only called
|
|
|
|
* when we are adding new content to the cache.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_adapt(int bytes, arc_state_t *state)
|
|
|
|
{
|
|
|
|
int mult;
|
2015-06-27 01:59:23 +03:00
|
|
|
uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
|
2015-06-27 01:14:45 +03:00
|
|
|
int64_t mrug_size = refcount_count(&arc_mru_ghost->arcs_size);
|
|
|
|
int64_t mfug_size = refcount_count(&arc_mfu_ghost->arcs_size);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (state == arc_l2c_only)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT(bytes > 0);
|
|
|
|
/*
|
|
|
|
* Adapt the target size of the MRU list:
|
|
|
|
* - if we just hit in the MRU ghost list, then increase
|
|
|
|
* the target size of the MRU list.
|
|
|
|
* - if we just hit in the MFU ghost list, then increase
|
|
|
|
* the target size of the MFU list by decreasing the
|
|
|
|
* target size of the MRU list.
|
|
|
|
*/
|
|
|
|
if (state == arc_mru_ghost) {
|
2015-06-27 01:14:45 +03:00
|
|
|
mult = (mrug_size >= mfug_size) ? 1 : (mfug_size / mrug_size);
|
2014-01-03 22:36:26 +04:00
|
|
|
if (!zfs_arc_p_dampener_disable)
|
|
|
|
mult = MIN(mult, 10); /* avoid wild arc_p adjustment */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-27 01:59:23 +03:00
|
|
|
arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else if (state == arc_mfu_ghost) {
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t delta;
|
|
|
|
|
2015-06-27 01:14:45 +03:00
|
|
|
mult = (mfug_size >= mrug_size) ? 1 : (mrug_size / mfug_size);
|
2014-01-03 22:36:26 +04:00
|
|
|
if (!zfs_arc_p_dampener_disable)
|
|
|
|
mult = MIN(mult, 10);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
delta = MIN(bytes * mult, arc_p);
|
2015-06-27 01:59:23 +03:00
|
|
|
arc_p = MAX(arc_p_min, arc_p - delta);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
ASSERT((int64_t)arc_p >= 0);
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
if (arc_reclaim_needed()) {
|
|
|
|
cv_signal(&arc_reclaim_thread_cv);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (arc_no_grow)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (arc_c >= arc_c_max)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're within (2 * maxblocksize) bytes of the target
|
|
|
|
* cache size, increment the target cache size
|
|
|
|
*/
|
2015-10-13 19:17:01 +03:00
|
|
|
ASSERT3U(arc_c, >=, 2ULL << SPA_MAXBLOCKSHIFT);
|
2015-06-04 16:06:27 +03:00
|
|
|
if (arc_size >= arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
atomic_add_64(&arc_c, (int64_t)bytes);
|
|
|
|
if (arc_c > arc_c_max)
|
|
|
|
arc_c = arc_c_max;
|
|
|
|
else if (state == arc_anon)
|
|
|
|
atomic_add_64(&arc_p, (int64_t)bytes);
|
|
|
|
if (arc_p > arc_c)
|
|
|
|
arc_p = arc_c;
|
|
|
|
}
|
|
|
|
ASSERT((int64_t)arc_p >= 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Check if arc_size has grown past our upper threshold, determined by
|
|
|
|
* zfs_arc_overflow_shift.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
static boolean_t
|
|
|
|
arc_is_overflowing(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
/* Always allow at least one block of overflow */
|
|
|
|
uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
|
|
|
|
arc_c >> zfs_arc_overflow_shift);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
return (arc_size >= arc_c + overflow);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Allocate a block and return it to the caller. If we are hitting the
|
|
|
|
* hard limit for the cache size, we must sleep, waiting for the eviction
|
|
|
|
* thread to catch up. If we're past the target size but below the hard
|
|
|
|
* limit, we'll only signal the reclaim thread and continue on.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
static void *
|
|
|
|
arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
void *datap = NULL;
|
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
arc_adapt(size, state);
|
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* If arc_size is currently overflowing, and has grown past our
|
|
|
|
* upper limit, we must be adding data faster than the evict
|
|
|
|
* thread can evict. Thus, to ensure we don't compound the
|
|
|
|
* problem by adding more data and forcing arc_size to grow even
|
|
|
|
* further past it's target size, we halt and wait for the
|
|
|
|
* eviction thread to catch up.
|
|
|
|
*
|
|
|
|
* It's also possible that the reclaim thread is unable to evict
|
|
|
|
* enough buffers to get arc_size below the overflow limit (e.g.
|
|
|
|
* due to buffers being un-evictable, or hash lock collisions).
|
|
|
|
* In this case, we want to proceed regardless if we're
|
|
|
|
* overflowing; thus we don't use a while loop here.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
if (arc_is_overflowing()) {
|
|
|
|
mutex_enter(&arc_reclaim_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that we've acquired the lock, we may no longer be
|
|
|
|
* over the overflow limit, lets check.
|
|
|
|
*
|
|
|
|
* We're ignoring the case of spurious wake ups. If that
|
|
|
|
* were to happen, it'd let this thread consume an ARC
|
|
|
|
* buffer before it should have (i.e. before we're under
|
|
|
|
* the overflow limit and were signalled by the reclaim
|
|
|
|
* thread). As long as that is a rare occurrence, it
|
|
|
|
* shouldn't cause any harm.
|
|
|
|
*/
|
|
|
|
if (arc_is_overflowing()) {
|
|
|
|
cv_signal(&arc_reclaim_thread_cv);
|
|
|
|
cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_exit(&arc_reclaim_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3U(hdr->b_type, ==, type);
|
Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
|
|
|
if (type == ARC_BUFC_METADATA) {
|
2016-06-02 07:04:53 +03:00
|
|
|
datap = zio_buf_alloc(size);
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_space_consume(size, ARC_SPACE_META);
|
|
|
|
} else {
|
|
|
|
ASSERT(type == ARC_BUFC_DATA);
|
2016-06-02 07:04:53 +03:00
|
|
|
datap = zio_data_buf_alloc(size);
|
2015-01-13 06:52:19 +03:00
|
|
|
arc_space_consume(size, ARC_SPACE_DATA);
|
Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Update the state size. Note that ghost states have a
|
|
|
|
* "ghost size" and so don't need to be updated.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!GHOST_STATE(state)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_add_many(&state->arcs_size, size, tag);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this is reached via arc_read, the link is
|
|
|
|
* protected by the hash lock. If reached via
|
|
|
|
* arc_buf_alloc, the header should not be accessed by
|
|
|
|
* any other thread. And, if reached via arc_read_done,
|
|
|
|
* the hash lock will protect it if it's found in the
|
|
|
|
* hash table; otherwise no other thread should be
|
|
|
|
* trying to [add|remove]_reference it.
|
|
|
|
*/
|
|
|
|
if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_add_many(&state->arcs_esize[type],
|
|
|
|
size, tag);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If we are growing the cache, and we are adding anonymous
|
|
|
|
* data, and we have outgrown arc_p, update arc_p
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
if (arc_size < arc_c && hdr->b_l1hdr.b_state == arc_anon &&
|
2015-06-27 01:14:45 +03:00
|
|
|
(refcount_count(&arc_anon->arcs_size) +
|
|
|
|
refcount_count(&arc_mru->arcs_size) > arc_p))
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_p = MIN(arc_c, arc_p + size);
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
return (datap);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free the arc data buffer.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_free_data_buf(arc_buf_hdr_t *hdr, void *data, uint64_t size, void *tag)
|
|
|
|
{
|
|
|
|
arc_state_t *state = hdr->b_l1hdr.b_state;
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
|
|
|
|
/* protected by hash lock, if in the hash table */
|
|
|
|
if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {
|
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
|
|
|
ASSERT(state != arc_anon && state != arc_l2c_only);
|
|
|
|
|
|
|
|
(void) refcount_remove_many(&state->arcs_esize[type],
|
|
|
|
size, tag);
|
|
|
|
}
|
|
|
|
(void) refcount_remove_many(&state->arcs_size, size, tag);
|
|
|
|
|
|
|
|
VERIFY3U(hdr->b_type, ==, type);
|
|
|
|
if (type == ARC_BUFC_METADATA) {
|
|
|
|
zio_buf_free(data, size);
|
|
|
|
arc_space_return(size, ARC_SPACE_META);
|
|
|
|
} else {
|
|
|
|
ASSERT(type == ARC_BUFC_DATA);
|
|
|
|
zio_data_buf_free(data, size);
|
|
|
|
arc_space_return(size, ARC_SPACE_DATA);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This routine is called whenever a buffer is accessed.
|
|
|
|
* NOTE: the hash lock is dropped in this function.
|
|
|
|
*/
|
|
|
|
static void
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
clock_t now;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(MUTEX_HELD(hash_lock));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This buffer is not in the cache, and does not
|
|
|
|
* appear in our "ghost" list. Add the new buffer
|
|
|
|
* to the MRU state.
|
|
|
|
*/
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT0(hdr->b_l1hdr.b_arc_access);
|
|
|
|
hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_change_state(arc_mru, hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mru) {
|
2010-05-29 00:45:14 +04:00
|
|
|
now = ddi_get_lbolt();
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If this buffer is here because of a prefetch, then either:
|
|
|
|
* - clear the flag if this is a "referencing" read
|
|
|
|
* (any subsequent access will bump this into the MFU state).
|
|
|
|
* or
|
|
|
|
* - move the buffer to the head of the list if this is
|
|
|
|
* another prefetch (to make it less likely to be evicted).
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_PREFETCH(hdr)) {
|
|
|
|
if (refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
|
2015-01-13 06:52:19 +03:00
|
|
|
/* link protected by hash lock */
|
|
|
|
ASSERT(multilist_link_active(
|
2014-12-30 06:12:23 +03:00
|
|
|
&hdr->b_l1hdr.b_arc_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
|
2014-12-30 06:12:23 +03:00
|
|
|
atomic_inc_32(&hdr->b_l1hdr.b_mru_hits);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mru_hits);
|
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
2008-11-20 23:01:55 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This buffer has been "accessed" only once so far,
|
|
|
|
* but it is still in the cache. Move it to the MFU
|
|
|
|
* state.
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access +
|
|
|
|
ARC_MINTIME)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* More than 125ms have passed since we
|
|
|
|
* instantiated this buffer. Move it to the
|
|
|
|
* most frequently used state.
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = now;
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_change_state(arc_mfu, hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
atomic_inc_32(&hdr->b_l1hdr.b_mru_hits);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mru_hits);
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_state_t *new_state;
|
|
|
|
/*
|
|
|
|
* This buffer has been "accessed" recently, but
|
|
|
|
* was evicted from the cache. Move it to the
|
|
|
|
* MFU state.
|
|
|
|
*/
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_PREFETCH(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
new_state = arc_mru;
|
2014-12-30 06:12:23 +03:00
|
|
|
if (refcount_count(&hdr->b_l1hdr.b_refcnt) > 0)
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH);
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
new_state = arc_mfu;
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_change_state(new_state, hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
atomic_inc_32(&hdr->b_l1hdr.b_mru_ghost_hits);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mru_ghost_hits);
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mfu) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This buffer has been accessed more than once and is
|
|
|
|
* still in the cache. Keep it in the MFU state.
|
|
|
|
*
|
|
|
|
* NOTE: an add_reference() that occurred when we did
|
|
|
|
* the arc_read() will have kicked this off the list.
|
|
|
|
* If it was a prefetch, we will explicitly move it to
|
|
|
|
* the head of the list now.
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if ((HDR_PREFETCH(hdr)) != 0) {
|
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2015-01-13 06:52:19 +03:00
|
|
|
/* link protected by hash_lock */
|
|
|
|
ASSERT(multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
atomic_inc_32(&hdr->b_l1hdr.b_mfu_hits);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mfu_hits);
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
|
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_state_t *new_state = arc_mfu;
|
|
|
|
/*
|
|
|
|
* This buffer has been accessed more than once but has
|
|
|
|
* been evicted from the cache. Move it back to the
|
|
|
|
* MFU state.
|
|
|
|
*/
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_PREFETCH(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This is a prefetch access...
|
|
|
|
* move this block back to the MRU state.
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT0(refcount_count(&hdr->b_l1hdr.b_refcnt));
|
2008-11-20 23:01:55 +03:00
|
|
|
new_state = arc_mru;
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_change_state(new_state, hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
atomic_inc_32(&hdr->b_l1hdr.b_mfu_ghost_hits);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
|
2014-12-30 06:12:23 +03:00
|
|
|
} else if (hdr->b_l1hdr.b_state == arc_l2c_only) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This buffer is on the 2nd Level ARC.
|
|
|
|
*/
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
|
2014-12-06 20:24:32 +03:00
|
|
|
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_change_state(arc_mfu, hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2014-12-30 06:12:23 +03:00
|
|
|
cmn_err(CE_PANIC, "invalid arc state 0x%p",
|
|
|
|
hdr->b_l1hdr.b_state);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* a generic arc_done_func_t which you can use */
|
|
|
|
/* ARGSUSED */
|
|
|
|
void
|
|
|
|
arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zio == NULL || zio->io_error == 0)
|
2016-07-11 20:45:52 +03:00
|
|
|
bcopy(buf->b_data, arg, arc_buf_size(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(buf, arg);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* a generic arc_done_func_t */
|
|
|
|
void
|
|
|
|
arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
|
|
|
|
{
|
|
|
|
arc_buf_t **bufp = arg;
|
|
|
|
if (zio && zio->io_error) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(buf, arg);
|
2008-11-20 23:01:55 +03:00
|
|
|
*bufp = NULL;
|
|
|
|
} else {
|
|
|
|
*bufp = buf;
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(buf->b_data);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
|
|
|
arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp)
|
|
|
|
{
|
|
|
|
if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {
|
|
|
|
ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0);
|
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);
|
|
|
|
} else {
|
|
|
|
if (HDR_COMPRESSION_ENABLED(hdr)) {
|
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), ==,
|
|
|
|
BP_GET_COMPRESS(bp));
|
|
|
|
}
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));
|
|
|
|
ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_read_done(zio_t *zio)
|
|
|
|
{
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_hdr_t *hdr = zio->io_private;
|
2014-06-06 01:19:08 +04:00
|
|
|
kmutex_t *hash_lock = NULL;
|
2016-07-14 00:17:41 +03:00
|
|
|
arc_callback_t *callback_list;
|
|
|
|
arc_callback_t *acb;
|
2016-07-11 20:45:52 +03:00
|
|
|
boolean_t freeable = B_FALSE;
|
2016-07-14 00:17:41 +03:00
|
|
|
boolean_t no_zio_error = (zio->io_error == 0);
|
2016-07-11 20:45:52 +03:00
|
|
|
int callback_cnt = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* The hdr was inserted into hash-table and removed from lists
|
|
|
|
* prior to starting I/O. We should find this header, since
|
|
|
|
* it's in the hash table, and it should be legit since it's
|
|
|
|
* not possible to evict it during the I/O. The only possible
|
|
|
|
* reason for it not to be found is if we were freed during the
|
|
|
|
* read.
|
|
|
|
*/
|
2014-06-06 01:19:08 +04:00
|
|
|
if (HDR_IN_HASH_TABLE(hdr)) {
|
|
|
|
arc_buf_hdr_t *found;
|
|
|
|
|
|
|
|
ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp));
|
|
|
|
ASSERT3U(hdr->b_dva.dva_word[0], ==,
|
|
|
|
BP_IDENTITY(zio->io_bp)->dva_word[0]);
|
|
|
|
ASSERT3U(hdr->b_dva.dva_word[1], ==,
|
|
|
|
BP_IDENTITY(zio->io_bp)->dva_word[1]);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock);
|
2014-06-06 01:19:08 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT((found == hdr &&
|
2014-06-06 01:19:08 +04:00
|
|
|
DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
|
|
|
|
(found == hdr && HDR_L2_READING(hdr)));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hash_lock, !=, NULL);
|
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
if (no_zio_error) {
|
2016-06-02 07:04:53 +03:00
|
|
|
/* byteswap if necessary */
|
|
|
|
if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
|
|
|
|
if (BP_GET_LEVEL(zio->io_bp) > 0) {
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;
|
|
|
|
} else {
|
|
|
|
hdr->b_l1hdr.b_byteswap =
|
|
|
|
DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;
|
|
|
|
}
|
2014-06-06 01:19:08 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);
|
2014-12-30 06:12:23 +03:00
|
|
|
if (l2arc_noprefetch && HDR_PREFETCH(hdr))
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
callback_list = hdr->b_l1hdr.b_acb;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(callback_list, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
if (hash_lock && no_zio_error && hdr->b_l1hdr.b_state == arc_anon) {
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Only call arc_access on anonymous buffers. This is because
|
|
|
|
* if we've issued an I/O for an evicted buffer, we've already
|
|
|
|
* called arc_access (to prevent any simultaneous readers from
|
|
|
|
* getting confused).
|
|
|
|
*/
|
|
|
|
arc_access(hdr, hash_lock);
|
|
|
|
}
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/*
|
|
|
|
* If a read request has a callback (i.e. acb_done is not NULL), then we
|
|
|
|
* make a buf containing the data according to the parameters which were
|
|
|
|
* passed in. The implementation of arc_buf_alloc_impl() ensures that we
|
|
|
|
* aren't needlessly decompressing the data multiple times.
|
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
for (acb = callback_list; acb != NULL; acb = acb->acb_next) {
|
2016-07-14 00:17:41 +03:00
|
|
|
int error;
|
2016-07-11 20:45:52 +03:00
|
|
|
if (!acb->acb_done)
|
|
|
|
continue;
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/* This is a demand read since prefetches don't use callbacks */
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
callback_cnt++;
|
2016-07-14 00:17:41 +03:00
|
|
|
|
|
|
|
error = arc_buf_alloc_impl(hdr, acb->acb_private,
|
|
|
|
acb->acb_compressed, no_zio_error, &acb->acb_buf);
|
|
|
|
if (no_zio_error) {
|
|
|
|
zio->io_error = error;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_acb = NULL;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
2016-07-11 20:45:52 +03:00
|
|
|
if (callback_cnt == 0) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_PREFETCH(hdr));
|
|
|
|
ASSERT0(hdr->b_l1hdr.b_bufcnt);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt) ||
|
|
|
|
callback_list != NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
if (no_zio_error) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_verify(hdr, zio->io_bp);
|
|
|
|
} else {
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_l1hdr.b_state != arc_anon)
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_change_state(arc_anon, hdr, hash_lock);
|
|
|
|
if (HDR_IN_HASH_TABLE(hdr))
|
|
|
|
buf_hash_remove(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Broadcast before we drop the hash_lock to avoid the possibility
|
|
|
|
* that the hdr (and hence the cv) might be freed before we get to
|
|
|
|
* the cv_broadcast().
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
cv_broadcast(&hdr->b_l1hdr.b_cv);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hash_lock != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* This block was freed while we waited for the read to
|
|
|
|
* complete. It has been removed from the hash table and
|
|
|
|
* moved to the anonymous state (so that it won't show up
|
|
|
|
* in the cache).
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);
|
|
|
|
freeable = refcount_is_zero(&hdr->b_l1hdr.b_refcnt);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* execute each callback and free its structure */
|
|
|
|
while ((acb = callback_list) != NULL) {
|
|
|
|
if (acb->acb_done)
|
|
|
|
acb->acb_done(zio, acb->acb_buf, acb->acb_private);
|
|
|
|
|
|
|
|
if (acb->acb_zio_dummy != NULL) {
|
|
|
|
acb->acb_zio_dummy->io_error = zio->io_error;
|
|
|
|
zio_nowait(acb->acb_zio_dummy);
|
|
|
|
}
|
|
|
|
|
|
|
|
callback_list = acb->acb_next;
|
|
|
|
kmem_free(acb, sizeof (arc_callback_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (freeable)
|
|
|
|
arc_hdr_destroy(hdr);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-01-11 20:54:18 +04:00
|
|
|
* "Read" the block at the specified DVA (in bp) via the
|
2008-11-20 23:01:55 +03:00
|
|
|
* cache. If the block is found in the cache, invoke the provided
|
|
|
|
* callback immediately and return. Note that the `zio' parameter
|
|
|
|
* in the callback will be NULL in this case, since no IO was
|
|
|
|
* required. If the block is not in the cache pass the read request
|
|
|
|
* on to the spa with a substitute callback function, so that the
|
|
|
|
* requested block will be added to the cache.
|
|
|
|
*
|
|
|
|
* If a read request arrives for a block that has a read in-progress,
|
|
|
|
* either wait for the in-progress read to complete (and return the
|
|
|
|
* results); or, if this is a read with a "done" func, add a record
|
|
|
|
* to the read to invoke the "done" func when the read completes,
|
|
|
|
* and return; or just return.
|
|
|
|
*
|
|
|
|
* arc_read_done() will invoke all the requested "done" functions
|
|
|
|
* for readers of this block.
|
|
|
|
*/
|
|
|
|
int
|
2013-07-03 00:26:24 +04:00
|
|
|
arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
|
2014-12-06 20:24:32 +03:00
|
|
|
void *private, zio_priority_t priority, int zio_flags,
|
|
|
|
arc_flags_t *arc_flags, const zbookmark_phys_t *zb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_buf_hdr_t *hdr = NULL;
|
|
|
|
kmutex_t *hash_lock = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_t *rzio;
|
2011-11-12 02:07:54 +04:00
|
|
|
uint64_t guid = spa_load_guid(spa);
|
2016-07-11 20:45:52 +03:00
|
|
|
boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW) != 0;
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
int rc = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp) ||
|
|
|
|
BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
top:
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
/*
|
|
|
|
* Embedded BP's have no DVA and require no I/O to "read".
|
|
|
|
* Create an anonymous arc buf to back it.
|
|
|
|
*/
|
|
|
|
hdr = buf_hash_find(guid, bp, &hash_lock);
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (hdr != NULL && HDR_HAS_L1HDR(hdr) && hdr->b_l1hdr.b_pdata != NULL) {
|
|
|
|
arc_buf_t *buf = NULL;
|
2014-12-06 20:24:32 +03:00
|
|
|
*arc_flags |= ARC_FLAG_CACHED;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (HDR_IO_IN_PROGRESS(hdr)) {
|
|
|
|
|
2015-12-27 00:10:31 +03:00
|
|
|
if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&
|
|
|
|
priority == ZIO_PRIORITY_SYNC_READ) {
|
|
|
|
/*
|
|
|
|
* This sync read must wait for an
|
|
|
|
* in-progress async read (e.g. a predictive
|
|
|
|
* prefetch). Async reads are queued
|
|
|
|
* separately at the vdev_queue layer, so
|
|
|
|
* this is a form of priority inversion.
|
|
|
|
* Ideally, we would "inherit" the demand
|
|
|
|
* i/o's priority by moving the i/o from
|
|
|
|
* the async queue to the synchronous queue,
|
|
|
|
* but there is currently no mechanism to do
|
|
|
|
* so. Track this so that we can evaluate
|
|
|
|
* the magnitude of this potential performance
|
|
|
|
* problem.
|
|
|
|
*
|
|
|
|
* Note that if the prefetch i/o is already
|
|
|
|
* active (has been issued to the device),
|
|
|
|
* the prefetch improved performance, because
|
|
|
|
* we issued it sooner than we would have
|
|
|
|
* without the prefetch.
|
|
|
|
*/
|
|
|
|
DTRACE_PROBE1(arc__sync__wait__for__async,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_sync_wait_for_async);
|
|
|
|
}
|
|
|
|
if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr,
|
|
|
|
ARC_FLAG_PREDICTIVE_PREFETCH);
|
2015-12-27 00:10:31 +03:00
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_WAIT) {
|
2014-12-30 06:12:23 +03:00
|
|
|
cv_wait(&hdr->b_l1hdr.b_cv, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
goto top;
|
|
|
|
}
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (done) {
|
2015-12-27 00:10:31 +03:00
|
|
|
arc_callback_t *acb = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
acb = kmem_zalloc(sizeof (arc_callback_t),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
acb->acb_done = done;
|
|
|
|
acb->acb_private = private;
|
|
|
|
if (pio != NULL)
|
|
|
|
acb->acb_zio_dummy = zio_null(pio,
|
2009-02-18 23:51:31 +03:00
|
|
|
spa, NULL, NULL, NULL, zio_flags);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(acb->acb_done, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
acb->acb_next = hdr->b_l1hdr.b_acb;
|
|
|
|
hdr->b_l1hdr.b_acb = acb;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_state == arc_mru ||
|
|
|
|
hdr->b_l1hdr.b_state == arc_mfu);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (done) {
|
2015-12-27 00:10:31 +03:00
|
|
|
if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) {
|
|
|
|
/*
|
|
|
|
* This is a demand read which does not have to
|
|
|
|
* wait for i/o because we did a predictive
|
|
|
|
* prefetch i/o for it, which has completed.
|
|
|
|
*/
|
|
|
|
DTRACE_PROBE1(
|
|
|
|
arc__demand__hit__predictive__prefetch,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(
|
|
|
|
arcstat_demand_hit_predictive_prefetch);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr,
|
|
|
|
ARC_FLAG_PREDICTIVE_PREFETCH);
|
2015-12-27 00:10:31 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp) || !BP_IS_HOLE(bp));
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
/* Get a buf with the desired data in it. */
|
|
|
|
VERIFY0(arc_buf_alloc_impl(hdr, private,
|
|
|
|
compressed_read, B_TRUE, &buf));
|
2014-12-06 20:24:32 +03:00
|
|
|
} else if (*arc_flags & ARC_FLAG_PREFETCH &&
|
2014-12-30 06:12:23 +03:00
|
|
|
refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_access(hdr, hash_lock);
|
2014-12-06 20:24:32 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_L2CACHE)
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
ARCSTAT_BUMP(arcstat_hits);
|
2014-12-30 06:12:23 +03:00
|
|
|
ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
|
|
|
|
demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
|
2008-11-20 23:01:55 +03:00
|
|
|
data, metadata, hits);
|
|
|
|
|
|
|
|
if (done)
|
|
|
|
done(NULL, buf, private);
|
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t lsize = BP_GET_LSIZE(bp);
|
|
|
|
uint64_t psize = BP_GET_PSIZE(bp);
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_callback_t *acb;
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_t *vd = NULL;
|
2013-02-11 10:21:05 +04:00
|
|
|
uint64_t addr = 0;
|
2009-02-18 23:51:31 +03:00
|
|
|
boolean_t devw = B_FALSE;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t size;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-09-10 22:59:03 +04:00
|
|
|
/*
|
|
|
|
* Gracefully handle a damaged logical block size as a
|
2015-12-09 22:00:35 +03:00
|
|
|
* checksum error.
|
2014-09-10 22:59:03 +04:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (lsize > spa_maxblocksize(spa)) {
|
2015-12-09 22:00:35 +03:00
|
|
|
rc = SET_ERROR(ECKSUM);
|
2014-09-10 22:59:03 +04:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (hdr == NULL) {
|
|
|
|
/* this block is not in the cache */
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_buf_hdr_t *exists = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
|
|
|
|
BP_GET_COMPRESS(bp), type);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
hdr->b_dva = *BP_IDENTITY(bp);
|
|
|
|
hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
|
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
|
|
|
}
|
|
|
|
if (exists != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/* somebody beat us to the hash insert */
|
|
|
|
mutex_exit(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
buf_discard_identity(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_destroy(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
goto top; /* restart the IO request */
|
|
|
|
}
|
|
|
|
} else {
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
|
|
|
* This block is in the ghost cache. If it was L2-only
|
|
|
|
* (and thus didn't have an L1 hdr), we realloc the
|
|
|
|
* header to add an L1 hdr.
|
|
|
|
*/
|
|
|
|
if (!HDR_HAS_L1HDR(hdr)) {
|
|
|
|
hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,
|
|
|
|
hdr_full_cache);
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(GHOST_STATE(hdr->b_l1hdr.b_state));
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL);
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-12-27 00:10:31 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* This is a delicate dance that we play here.
|
|
|
|
* This hdr is in the ghost list so we access it
|
|
|
|
* to move it out of the ghost list before we
|
|
|
|
* initiate the read. If it's a prefetch then
|
|
|
|
* it won't have a callback so we'll remove the
|
|
|
|
* reference that arc_buf_alloc_impl() created. We
|
|
|
|
* do this after we've called arc_access() to
|
|
|
|
* avoid hitting an assert in remove_reference().
|
2015-12-27 00:10:31 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_access(hdr, hash_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_alloc_pdata(hdr);
|
|
|
|
}
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
|
|
|
size = arc_hdr_size(hdr);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If compression is enabled on the hdr, then will do
|
|
|
|
* RAW I/O and will store the compressed data in the hdr's
|
|
|
|
* data block. Otherwise, the hdr's data block will contain
|
|
|
|
* the uncompressed data.
|
|
|
|
*/
|
|
|
|
if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF) {
|
|
|
|
zio_flags |= ZIO_FLAG_RAW;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_PREFETCH)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);
|
|
|
|
if (*arc_flags & ARC_FLAG_L2CACHE)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
|
|
|
|
if (BP_GET_LEVEL(bp) > 0)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);
|
2015-12-27 00:10:31 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH)
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2014-11-21 03:09:39 +03:00
|
|
|
acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
acb->acb_done = done;
|
|
|
|
acb->acb_private = private;
|
2016-07-11 20:45:52 +03:00
|
|
|
acb->acb_compressed = compressed_read;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_acb = acb;
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr) &&
|
|
|
|
(vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
|
|
|
|
devw = hdr->b_l2hdr.b_dev->l2ad_writing;
|
|
|
|
addr = hdr->b_l2hdr.b_daddr;
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Lock out device removal.
|
|
|
|
*/
|
|
|
|
if (vdev_is_dead(vd) ||
|
|
|
|
!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
|
|
|
|
vd = NULL;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if (priority == ZIO_PRIORITY_ASYNC_READ)
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
|
|
|
|
else
|
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/*
|
|
|
|
* At this point, we have a level 1 cache miss. Try again in
|
|
|
|
* L2ARC if possible.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t, lsize, zbookmark_phys_t *, zb);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_misses);
|
2014-12-30 06:12:23 +03:00
|
|
|
ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr),
|
|
|
|
demand, prefetch, !HDR_ISTYPE_METADATA(hdr),
|
2008-11-20 23:01:55 +03:00
|
|
|
data, metadata, misses);
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Read from the L2ARC if the following are true:
|
2008-12-03 23:09:06 +03:00
|
|
|
* 1. The L2ARC vdev was previously cached.
|
|
|
|
* 2. This buffer still has L2ARC metadata.
|
|
|
|
* 3. This buffer isn't currently writing to the L2ARC.
|
|
|
|
* 4. The L2ARC entry wasn't evicted, which may
|
|
|
|
* also have invalidated the vdev.
|
2009-02-18 23:51:31 +03:00
|
|
|
* 5. This isn't prefetch and l2arc_noprefetch is set.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr) &&
|
2009-02-18 23:51:31 +03:00
|
|
|
!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
|
|
|
|
!(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
|
2008-11-20 23:01:55 +03:00
|
|
|
l2arc_read_callback_t *cb;
|
|
|
|
|
|
|
|
DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_hits);
|
2014-12-30 06:12:23 +03:00
|
|
|
atomic_inc_32(&hdr->b_l2hdr.b_hits);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
|
2014-11-21 03:09:39 +03:00
|
|
|
KM_SLEEP);
|
2016-06-02 07:04:53 +03:00
|
|
|
cb->l2rcb_hdr = hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
cb->l2rcb_bp = *bp;
|
|
|
|
cb->l2rcb_zb = *zb;
|
2008-12-03 23:09:06 +03:00
|
|
|
cb->l2rcb_flags = zio_flags;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-02-11 10:21:05 +04:00
|
|
|
ASSERT(addr >= VDEV_LABEL_START_SIZE &&
|
2016-06-02 07:04:53 +03:00
|
|
|
addr + lsize < vd->vdev_psize -
|
2013-02-11 10:21:05 +04:00
|
|
|
VDEV_LABEL_END_SIZE);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* l2arc read. The SCL_L2ARC lock will be
|
|
|
|
* released by l2arc_read_done().
|
2013-08-02 00:02:10 +04:00
|
|
|
* Issue a null zio if the underlying buffer
|
|
|
|
* was squashed to zero size by compression.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(HDR_GET_COMPRESS(hdr), !=,
|
|
|
|
ZIO_COMPRESS_EMPTY);
|
|
|
|
rzio = zio_read_phys(pio, vd, addr,
|
|
|
|
size, hdr->b_l1hdr.b_pdata,
|
|
|
|
ZIO_CHECKSUM_OFF,
|
|
|
|
l2arc_read_done, cb, priority,
|
|
|
|
zio_flags | ZIO_FLAG_DONT_CACHE |
|
|
|
|
ZIO_FLAG_CANFAIL |
|
|
|
|
ZIO_FLAG_DONT_PROPAGATE |
|
|
|
|
ZIO_FLAG_DONT_RETRY, B_FALSE);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
|
|
|
|
zio_t *, rzio);
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_read_bytes, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_NOWAIT) {
|
2008-12-03 23:09:06 +03:00
|
|
|
zio_nowait(rzio);
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(*arc_flags & ARC_FLAG_WAIT);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (zio_wait(rzio) == 0)
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/* l2arc read error; goto zio_read() */
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
DTRACE_PROBE1(l2arc__miss,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_misses);
|
|
|
|
if (HDR_L2_WRITING(hdr))
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rw_clash);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
} else {
|
|
|
|
if (vd != NULL)
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
|
|
|
if (l2arc_ndev != 0) {
|
|
|
|
DTRACE_PROBE1(l2arc__miss,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_misses);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
rzio = zio_read(pio, spa, bp, hdr->b_l1hdr.b_pdata, size,
|
|
|
|
arc_read_done, hdr, priority, zio_flags, zb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (*arc_flags & ARC_FLAG_WAIT) {
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
rc = zio_wait(rzio);
|
|
|
|
goto out;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(*arc_flags & ARC_FLAG_NOWAIT);
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_nowait(rzio);
|
|
|
|
}
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
|
|
|
|
out:
|
|
|
|
spa_read_history_add(spa, zb, *arc_flags);
|
|
|
|
return (rc);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_prune_t *
|
|
|
|
arc_add_prune_callback(arc_prune_func_t *func, void *private)
|
|
|
|
{
|
|
|
|
arc_prune_t *p;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
p = kmem_alloc(sizeof (*p), KM_SLEEP);
|
2011-12-23 00:20:43 +04:00
|
|
|
p->p_pfunc = func;
|
|
|
|
p->p_private = private;
|
|
|
|
list_link_init(&p->p_node);
|
|
|
|
refcount_create(&p->p_refcnt);
|
|
|
|
|
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
refcount_add(&p->p_refcnt, &arc_prune_list);
|
|
|
|
list_insert_head(&arc_prune_list, p);
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
|
|
|
|
return (p);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_remove_prune_callback(arc_prune_t *p)
|
|
|
|
{
|
2016-05-23 21:58:21 +03:00
|
|
|
boolean_t wait = B_FALSE;
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
list_remove(&arc_prune_list, p);
|
2016-05-23 21:58:21 +03:00
|
|
|
if (refcount_remove(&p->p_refcnt, &arc_prune_list) > 0)
|
|
|
|
wait = B_TRUE;
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_exit(&arc_prune_mtx);
|
2016-05-23 21:58:21 +03:00
|
|
|
|
|
|
|
/* wait for arc_prune_task to finish */
|
|
|
|
if (wait)
|
|
|
|
taskq_wait_outstanding(arc_prune_taskq, 0);
|
|
|
|
ASSERT0(refcount_count(&p->p_refcnt));
|
|
|
|
refcount_destroy(&p->p_refcnt);
|
|
|
|
kmem_free(p, sizeof (*p));
|
2011-12-23 00:20:43 +04:00
|
|
|
}
|
|
|
|
|
Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-07 02:46:55 +04:00
|
|
|
/*
|
|
|
|
* Notify the arc that a block was freed, and thus will never be used again.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_freed(spa_t *spa, const blkptr_t *bp)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
uint64_t guid = spa_load_guid(spa);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp));
|
|
|
|
|
|
|
|
hdr = buf_hash_find(guid, bp, &hash_lock);
|
Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-07 02:46:55 +04:00
|
|
|
if (hdr == NULL)
|
|
|
|
return;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* We might be trying to free a block that is still doing I/O
|
|
|
|
* (i.e. prefetch) or has a reference (i.e. a dedup-ed,
|
|
|
|
* dmu_sync-ed block). If this block is being prefetched, then it
|
|
|
|
* would still have the ARC_FLAG_IO_IN_PROGRESS flag set on the hdr
|
|
|
|
* until the I/O completes. A block may also have a reference if it is
|
|
|
|
* part of a dedup-ed, dmu_synced write. The dmu_sync() function would
|
|
|
|
* have written the new block to its final resting place on disk but
|
|
|
|
* without the dedup flag set. This would have left the hdr in the MRU
|
|
|
|
* state and discoverable. When the txg finally syncs it detects that
|
|
|
|
* the block was overridden in open context and issues an override I/O.
|
|
|
|
* Since this is a dedup block, the override I/O will determine if the
|
|
|
|
* block is already in the DDT. If so, then it will replace the io_bp
|
|
|
|
* with the bp from the DDT and allow the I/O to finish. When the I/O
|
|
|
|
* reaches the done callback, dbuf_write_override_done, it will
|
|
|
|
* check to see if the io_bp and io_bp_override are identical.
|
|
|
|
* If they are not, then it indicates that the bp was replaced with
|
|
|
|
* the bp in the DDT and the override bp is freed. This allows
|
|
|
|
* us to arrive here with a reference on a block that is being
|
|
|
|
* freed. So if we have an I/O in progress, or a reference to
|
|
|
|
* this hdr, then we don't destroy the hdr.
|
|
|
|
*/
|
|
|
|
if (!HDR_HAS_L1HDR(hdr) || (!HDR_IO_IN_PROGRESS(hdr) &&
|
|
|
|
refcount_is_zero(&hdr->b_l1hdr.b_refcnt))) {
|
|
|
|
arc_change_state(arc_anon, hdr, hash_lock);
|
|
|
|
arc_hdr_destroy(hdr);
|
Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-07 02:46:55 +04:00
|
|
|
mutex_exit(hash_lock);
|
2014-07-15 11:43:18 +04:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-06-11 21:12:34 +04:00
|
|
|
* Release this buffer from the cache, making it an anonymous buffer. This
|
|
|
|
* must be done after a read and prior to modifying the buffer contents.
|
2008-11-20 23:01:55 +03:00
|
|
|
* If the buffer has more than one reference, we must make
|
2008-12-03 23:09:06 +03:00
|
|
|
* a new hdr for the buffer.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_release(arc_buf_t *buf, void *tag)
|
|
|
|
{
|
2014-12-30 06:12:23 +03:00
|
|
|
kmutex_t *hash_lock;
|
|
|
|
arc_state_t *state;
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* It would be nice to assert that if its DMU metadata (level >
|
2010-05-29 00:45:14 +04:00
|
|
|
* 0 || it's the dnode file), then it must be syncing context.
|
|
|
|
* But we don't know that information at this level.
|
|
|
|
*/
|
|
|
|
|
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
|
|
|
* We don't grab the hash lock prior to this check, because if
|
|
|
|
* the buffer's header is in the arc_anon state, it won't be
|
|
|
|
* linked into the hash table.
|
|
|
|
*/
|
|
|
|
if (hdr->b_l1hdr.b_state == arc_anon) {
|
|
|
|
mutex_exit(&buf->b_evict_lock);
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
|
|
|
ASSERT(!HDR_IN_HASH_TABLE(hdr));
|
|
|
|
ASSERT(!HDR_HAS_L2HDR(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);
|
|
|
|
ASSERT(!list_link_active(&hdr->b_l1hdr.b_arc_node));
|
|
|
|
|
|
|
|
hdr->b_l1hdr.b_arc_access = 0;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the buf is being overridden then it may already
|
|
|
|
* have a hdr that is not empty.
|
|
|
|
*/
|
|
|
|
buf_discard_identity(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_thaw(buf);
|
|
|
|
|
|
|
|
return;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This assignment is only valid as long as the hash_lock is
|
|
|
|
* held, we must be careful not to reference state or the
|
|
|
|
* b_state field after dropping the lock.
|
|
|
|
*/
|
|
|
|
state = hdr->b_l1hdr.b_state;
|
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
|
|
|
ASSERT3P(state, !=, arc_anon);
|
|
|
|
|
|
|
|
/* this buffer is not on any list */
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3S(refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
|
|
|
if (HDR_HAS_L2HDR(hdr)) {
|
|
|
|
mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
2015-06-16 02:12:19 +03:00
|
|
|
* We have to recheck this conditional again now that
|
|
|
|
* we're holding the l2ad_mtx to prevent a race with
|
|
|
|
* another thread which might be concurrently calling
|
|
|
|
* l2arc_evict(). In that case, l2arc_evict() might have
|
|
|
|
* destroyed the header's L2 portion as we were waiting
|
|
|
|
* to acquire the l2ad_mtx.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
2015-06-16 02:12:19 +03:00
|
|
|
if (HDR_HAS_L2HDR(hdr))
|
|
|
|
arc_hdr_l2hdr_destroy(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Do we have more than one buf?
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (hdr->b_l1hdr.b_bufcnt > 1) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_hdr_t *nhdr;
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t spa = hdr->b_spa;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t psize = HDR_GET_PSIZE(hdr);
|
|
|
|
uint64_t lsize = HDR_GET_LSIZE(hdr);
|
|
|
|
enum zio_compress compress = HDR_GET_COMPRESS(hdr);
|
2014-12-30 06:12:23 +03:00
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_t *lastbuf = NULL;
|
|
|
|
VERIFY3U(hdr->b_type, ==, type);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL);
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) remove_reference(hdr, hash_lock, tag);
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
if (arc_buf_is_shared(buf) && !ARC_BUF_COMPRESSED(buf)) {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT(ARC_BUF_LAST(buf));
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Pull the data off of this hdr and attach it to
|
2016-06-02 07:04:53 +03:00
|
|
|
* a new anonymous hdr. Also find the last buffer
|
|
|
|
* in the hdr's buffer list.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
lastbuf = arc_buf_remove(hdr, buf);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(lastbuf, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* If the current arc_buf_t and the hdr are sharing their data
|
2016-07-14 00:17:41 +03:00
|
|
|
* buffer, then we must stop sharing that block.
|
2016-06-02 07:04:53 +03:00
|
|
|
*/
|
|
|
|
if (arc_buf_is_shared(buf)) {
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);
|
|
|
|
VERIFY(!arc_buf_is_shared(lastbuf));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First, sever the block sharing relationship between
|
|
|
|
* buf and the arc_buf_hdr_t. Then, setup a new
|
|
|
|
* block sharing relationship with the last buffer
|
|
|
|
* on the arc_buf_t list.
|
|
|
|
*/
|
|
|
|
arc_unshare_buf(hdr, buf);
|
2016-07-11 20:45:52 +03:00
|
|
|
|
|
|
|
/*
|
2016-07-14 00:17:41 +03:00
|
|
|
* Now we need to recreate the hdr's b_pdata. Since we
|
|
|
|
* have lastbuf handy, we try to share with it, but if
|
|
|
|
* we can't then we allocate a new b_pdata and copy the
|
|
|
|
* data from buf into it.
|
2016-07-11 20:45:52 +03:00
|
|
|
*/
|
2016-07-14 00:17:41 +03:00
|
|
|
if (arc_can_share(hdr, lastbuf)) {
|
|
|
|
arc_share_buf(hdr, lastbuf);
|
|
|
|
} else {
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_hdr_alloc_pdata(hdr);
|
|
|
|
bcopy(buf->b_data, hdr->b_l1hdr.b_pdata, psize);
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
VERIFY3P(lastbuf->b_data, !=, NULL);
|
|
|
|
} else if (HDR_SHARED_DATA(hdr)) {
|
2016-07-11 20:45:52 +03:00
|
|
|
/*
|
|
|
|
* Uncompressed shared buffers are always at the end
|
|
|
|
* of the list. Compressed buffers don't have the
|
|
|
|
* same requirements. This makes it hard to
|
|
|
|
* simply assert that the lastbuf is shared so
|
|
|
|
* we rely on the hdr's compression flags to determine
|
|
|
|
* if we have a compressed, shared buffer.
|
|
|
|
*/
|
|
|
|
ASSERT(arc_buf_is_shared(lastbuf) ||
|
|
|
|
HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT(!ARC_BUF_SHARED(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT3P(state, !=, arc_l2c_only);
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_remove_many(&state->arcs_size,
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2015-06-27 01:14:45 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {
|
|
|
|
ASSERT3P(state, !=, arc_l2c_only);
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_remove_many(&state->arcs_esize[type],
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_size(buf), buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2012-12-22 02:57:09 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l1hdr.b_bufcnt -= 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_cksum_verify(buf);
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_unwatch(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* Allocate a new hdr. The new hdr will contain a b_pdata
|
|
|
|
* buffer which will be freed in arc_write().
|
|
|
|
*/
|
|
|
|
nhdr = arc_hdr_alloc(spa, psize, lsize, compress, type);
|
|
|
|
ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL);
|
|
|
|
ASSERT0(nhdr->b_l1hdr.b_bufcnt);
|
|
|
|
ASSERT0(refcount_count(&nhdr->b_l1hdr.b_refcnt));
|
|
|
|
VERIFY3U(nhdr->b_type, ==, type);
|
|
|
|
ASSERT(!HDR_SHARED_DATA(nhdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
nhdr->b_l1hdr.b_buf = buf;
|
|
|
|
nhdr->b_l1hdr.b_bufcnt = 1;
|
2014-12-30 06:12:23 +03:00
|
|
|
nhdr->b_l1hdr.b_mru_hits = 0;
|
|
|
|
nhdr->b_l1hdr.b_mru_ghost_hits = 0;
|
|
|
|
nhdr->b_l1hdr.b_mfu_hits = 0;
|
|
|
|
nhdr->b_l1hdr.b_mfu_ghost_hits = 0;
|
|
|
|
nhdr->b_l1hdr.b_l2_hits = 0;
|
|
|
|
(void) refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf->b_hdr = nhdr;
|
2016-06-02 07:04:53 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_add_many(&arc_anon->arcs_size,
|
|
|
|
HDR_GET_LSIZE(nhdr), buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);
|
2015-01-13 06:52:19 +03:00
|
|
|
/* protected by hash lock, or hdr is on arc_anon */
|
|
|
|
ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l1hdr.b_mru_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mru_ghost_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_mfu_ghost_hits = 0;
|
|
|
|
hdr->b_l1hdr.b_l2_hits = 0;
|
|
|
|
arc_change_state(arc_anon, hdr, hash_lock);
|
|
|
|
hdr->b_l1hdr.b_arc_access = 0;
|
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
buf_discard_identity(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_thaw(buf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
arc_released(arc_buf_t *buf)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
int released;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2014-12-30 06:12:23 +03:00
|
|
|
released = (buf->b_data != NULL &&
|
|
|
|
buf->b_hdr->b_l1hdr.b_state == arc_anon);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
return (released);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
int
|
|
|
|
arc_referenced(arc_buf_t *buf)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
int referenced;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2014-12-30 06:12:23 +03:00
|
|
|
referenced = (refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
return (referenced);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_write_ready(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *callback = zio->io_private;
|
|
|
|
arc_buf_t *buf = callback->awcb_buf;
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t psize = BP_IS_HOLE(zio->io_bp) ? 0 : BP_GET_PSIZE(zio->io_bp);
|
|
|
|
enum zio_compress compress;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
ASSERT(!refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_bufcnt > 0);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* If we're reexecuting this zio because the pool suspended, then
|
|
|
|
* cleanup any state that was previously set the first time the
|
2016-07-11 20:45:52 +03:00
|
|
|
* callback was invoked.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (zio->io_flags & ZIO_FLAG_REEXECUTED) {
|
|
|
|
arc_cksum_free(hdr);
|
|
|
|
arc_buf_unwatch(buf);
|
|
|
|
if (hdr->b_l1hdr.b_pdata != NULL) {
|
|
|
|
if (arc_buf_is_shared(buf)) {
|
|
|
|
arc_unshare_buf(hdr, buf);
|
|
|
|
} else {
|
|
|
|
arc_hdr_free_pdata(hdr);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
|
|
|
ASSERT(!HDR_SHARED_DATA(hdr));
|
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
|
|
|
|
|
|
|
callback->awcb_ready(zio, buf, callback->awcb_private);
|
|
|
|
|
|
|
|
if (HDR_IO_IN_PROGRESS(hdr))
|
|
|
|
ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);
|
|
|
|
|
|
|
|
arc_cksum_compute(buf);
|
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
|
|
|
|
|
|
|
if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
|
|
|
|
compress = ZIO_COMPRESS_OFF;
|
|
|
|
} else {
|
|
|
|
ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(zio->io_bp));
|
|
|
|
compress = BP_GET_COMPRESS(zio->io_bp);
|
|
|
|
}
|
|
|
|
HDR_SET_PSIZE(hdr, psize);
|
|
|
|
arc_hdr_set_compress(hdr, compress);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the hdr is compressed, then copy the compressed
|
|
|
|
* zio contents into arc_buf_hdr_t. Otherwise, copy the original
|
|
|
|
* data buf into the hdr. Ideally, we would like to always copy the
|
|
|
|
* io_data into b_pdata but the user may have disabled compressed
|
|
|
|
* arc thus the on-disk block may or may not match what we maintain
|
|
|
|
* in the hdr's b_pdata field.
|
|
|
|
*/
|
2016-07-11 20:45:52 +03:00
|
|
|
if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
|
|
|
|
!ARC_BUF_COMPRESSED(buf)) {
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT3U(BP_GET_COMPRESS(zio->io_bp), !=, ZIO_COMPRESS_OFF);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(psize, >, 0);
|
|
|
|
arc_hdr_alloc_pdata(hdr);
|
|
|
|
bcopy(zio->io_data, hdr->b_l1hdr.b_pdata, psize);
|
|
|
|
} else {
|
|
|
|
ASSERT3P(buf->b_data, ==, zio->io_orig_data);
|
2016-07-11 20:45:52 +03:00
|
|
|
ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This hdr is not compressed so we're able to share
|
|
|
|
* the arc_buf_t data buffer with the hdr.
|
|
|
|
*/
|
|
|
|
arc_share_buf(hdr, buf);
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT0(bcmp(zio->io_orig_data, hdr->b_l1hdr.b_pdata,
|
2016-06-02 07:04:53 +03:00
|
|
|
HDR_GET_LSIZE(hdr)));
|
|
|
|
}
|
|
|
|
arc_hdr_verify(hdr, zio->io_bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-05-15 18:02:28 +03:00
|
|
|
static void
|
|
|
|
arc_write_children_ready(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *callback = zio->io_private;
|
|
|
|
arc_buf_t *buf = callback->awcb_buf;
|
|
|
|
|
|
|
|
callback->awcb_children_ready(zio, buf, callback->awcb_private);
|
|
|
|
}
|
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
|
|
|
* The SPA calls this callback for each physical write that happens on behalf
|
|
|
|
* of a logical write. See the comment in dbuf_write_physdone() for details.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_write_physdone(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *cb = zio->io_private;
|
|
|
|
if (cb->awcb_physdone != NULL)
|
|
|
|
cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_write_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *callback = zio->io_private;
|
|
|
|
arc_buf_t *buf = callback->awcb_buf;
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
if (zio->io_error == 0) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_verify(hdr, zio->io_bp);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
|
2013-12-09 22:37:51 +04:00
|
|
|
buf_discard_identity(hdr);
|
|
|
|
} else {
|
|
|
|
hdr->b_dva = *BP_IDENTITY(zio->io_bp);
|
|
|
|
hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_EMPTY(hdr));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2014-06-06 01:19:08 +04:00
|
|
|
* If the block to be written was all-zero or compressed enough to be
|
|
|
|
* embedded in the BP, no write was performed so there will be no
|
|
|
|
* dva/birth/checksum. The buffer must therefore remain anonymous
|
|
|
|
* (and uncached).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!HDR_EMPTY(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_hdr_t *exists;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
|
2016-07-14 00:17:41 +03:00
|
|
|
ASSERT3U(zio->io_error, ==, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_cksum_verify(buf);
|
|
|
|
|
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
2014-12-30 06:12:23 +03:00
|
|
|
if (exists != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This can only happen if we overwrite for
|
|
|
|
* sync-to-convergence, because we remove
|
|
|
|
* buffers from the hash table when we arc_free().
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
|
|
|
|
if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
|
|
|
|
panic("bad overwrite, hdr=%p exists=%p",
|
|
|
|
(void *)hdr, (void *)exists);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(refcount_is_zero(
|
|
|
|
&exists->b_l1hdr.b_refcnt));
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_change_state(arc_anon, exists, hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
arc_hdr_destroy(exists);
|
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
|
|
|
ASSERT3P(exists, ==, NULL);
|
2013-05-10 23:47:54 +04:00
|
|
|
} else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
|
|
|
|
/* nopwrite */
|
|
|
|
ASSERT(zio->io_prop.zp_nopwrite);
|
|
|
|
if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
|
|
|
|
panic("bad nopwrite, hdr=%p exists=%p",
|
|
|
|
(void *)hdr, (void *)exists);
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
/* Dedup */
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_bufcnt == 1);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_state == arc_anon);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(BP_GET_DEDUP(zio->io_bp));
|
|
|
|
ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
2008-12-03 23:09:06 +03:00
|
|
|
/* if it's not anon, we are doing a scrub */
|
2014-12-30 06:12:23 +03:00
|
|
|
if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_access(hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
} else {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!refcount_is_zero(&hdr->b_l1hdr.b_refcnt));
|
2010-05-29 00:45:14 +04:00
|
|
|
callback->awcb_done(zio, buf, callback->awcb_private);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
kmem_free(callback, sizeof (arc_write_callback_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
zio_t *
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
|
2016-06-02 07:04:53 +03:00
|
|
|
blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc,
|
2016-05-15 18:02:28 +03:00
|
|
|
const zio_prop_t *zp, arc_done_func_t *ready,
|
|
|
|
arc_done_func_t *children_ready, arc_done_func_t *physdone,
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
arc_done_func_t *done, void *private, zio_priority_t priority,
|
2014-06-25 22:37:59 +04:00
|
|
|
int zio_flags, const zbookmark_phys_t *zb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
arc_write_callback_t *callback;
|
2008-12-03 23:09:06 +03:00
|
|
|
zio_t *zio;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(ready, !=, NULL);
|
|
|
|
ASSERT3P(done, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IO_ERROR(hdr));
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL);
|
|
|
|
ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (l2arc)
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);
|
2016-07-11 20:45:52 +03:00
|
|
|
if (ARC_BUF_COMPRESSED(buf)) {
|
|
|
|
ASSERT3U(zp->zp_compress, !=, ZIO_COMPRESS_OFF);
|
|
|
|
zio_flags |= ZIO_FLAG_RAW;
|
|
|
|
}
|
2014-11-21 03:09:39 +03:00
|
|
|
callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
callback->awcb_ready = ready;
|
2016-05-15 18:02:28 +03:00
|
|
|
callback->awcb_children_ready = children_ready;
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
callback->awcb_physdone = physdone;
|
2008-11-20 23:01:55 +03:00
|
|
|
callback->awcb_done = done;
|
|
|
|
callback->awcb_private = private;
|
|
|
|
callback->awcb_buf = buf;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/*
|
|
|
|
* The hdr's b_pdata is now stale, free it now. A new data block
|
|
|
|
* will be allocated when the zio pipeline calls arc_write_ready().
|
|
|
|
*/
|
|
|
|
if (hdr->b_l1hdr.b_pdata != NULL) {
|
|
|
|
/*
|
|
|
|
* If the buf is currently sharing the data block with
|
|
|
|
* the hdr then we need to break that relationship here.
|
|
|
|
* The hdr will remain with a NULL data pointer and the
|
|
|
|
* buf will take sole ownership of the block.
|
|
|
|
*/
|
|
|
|
if (arc_buf_is_shared(buf)) {
|
|
|
|
arc_unshare_buf(hdr, buf);
|
|
|
|
} else {
|
|
|
|
arc_hdr_free_pdata(hdr);
|
|
|
|
}
|
|
|
|
VERIFY3P(buf->b_data, !=, NULL);
|
|
|
|
arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);
|
|
|
|
}
|
|
|
|
ASSERT(!arc_buf_is_shared(buf));
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, ==, NULL);
|
|
|
|
|
2016-07-11 20:45:52 +03:00
|
|
|
zio = zio_write(pio, spa, txg, bp, buf->b_data,
|
|
|
|
HDR_GET_LSIZE(hdr), arc_buf_size(buf), zp,
|
2016-05-15 18:02:28 +03:00
|
|
|
arc_write_ready,
|
|
|
|
(children_ready != NULL) ? arc_write_children_ready : NULL,
|
|
|
|
arc_write_physdone, arc_write_done, callback,
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
priority, zio_flags, zb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (zio);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
arc_memory_throttle(uint64_t reserve, uint64_t txg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
#ifdef _KERNEL
|
2015-07-28 21:30:00 +03:00
|
|
|
uint64_t available_memory = ptob(freemem);
|
|
|
|
static uint64_t page_load = 0;
|
|
|
|
static uint64_t last_txg = 0;
|
|
|
|
#ifdef __linux__
|
|
|
|
pgcnt_t minfree = btop(arc_sys_free / 4);
|
|
|
|
#endif
|
2013-02-01 21:33:04 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
if (freemem > physmem * arc_lotsfree_percent / 100)
|
|
|
|
return (0);
|
|
|
|
|
2015-07-28 21:30:00 +03:00
|
|
|
if (txg > last_txg) {
|
|
|
|
last_txg = txg;
|
|
|
|
page_load = 0;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If we are in pageout, we know that memory is already tight,
|
|
|
|
* the arc is already going to be evicting, so we just want to
|
|
|
|
* continue to let page writes occur as quickly as possible.
|
|
|
|
*/
|
|
|
|
if (current_is_kswapd()) {
|
|
|
|
if (page_load > MAX(ptob(minfree), available_memory) / 4) {
|
|
|
|
DMU_TX_STAT_BUMP(dmu_tx_memory_reclaim);
|
|
|
|
return (SET_ERROR(ERESTART));
|
|
|
|
}
|
|
|
|
/* Note: reserve is inflated, so we deflate */
|
|
|
|
page_load += reserve / 8;
|
|
|
|
return (0);
|
|
|
|
} else if (page_load > 0 && arc_reclaim_needed()) {
|
2015-06-26 21:28:18 +03:00
|
|
|
/* memory is low, delay before restarting */
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
|
2012-01-20 22:58:57 +04:00
|
|
|
DMU_TX_STAT_BUMP(dmu_tx_memory_reclaim);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EAGAIN));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-07-28 21:30:00 +03:00
|
|
|
page_load = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
#endif
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_tempreserve_clear(uint64_t reserve)
|
|
|
|
{
|
|
|
|
atomic_add_64(&arc_tempreserve, -reserve);
|
|
|
|
ASSERT((int64_t)arc_tempreserve >= 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
arc_tempreserve_space(uint64_t reserve, uint64_t txg)
|
|
|
|
{
|
|
|
|
int error;
|
2009-07-03 02:44:48 +04:00
|
|
|
uint64_t anon_size;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-01-22 16:37:37 +03:00
|
|
|
if (!arc_no_grow &&
|
|
|
|
reserve > arc_c/4 &&
|
|
|
|
reserve * 4 > (2ULL << SPA_MAXBLOCKSHIFT))
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_c = MIN(arc_c_max, reserve * 4);
|
2014-04-29 00:56:47 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Throttle when the calculated memory footprint for the TXG
|
|
|
|
* exceeds the target ARC size.
|
|
|
|
*/
|
2012-01-20 22:58:57 +04:00
|
|
|
if (reserve > arc_c) {
|
|
|
|
DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);
|
2014-04-29 00:56:47 +04:00
|
|
|
return (SET_ERROR(ERESTART));
|
2012-01-20 22:58:57 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Don't count loaned bufs as in flight dirty data to prevent long
|
|
|
|
* network delays from blocking transactions that are ready to be
|
|
|
|
* assigned to a txg.
|
|
|
|
*/
|
2015-06-27 01:14:45 +03:00
|
|
|
anon_size = MAX((int64_t)(refcount_count(&arc_anon->arcs_size) -
|
|
|
|
arc_loaned_bytes), 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Writes will, almost always, require additional memory allocations
|
2013-06-11 21:12:34 +04:00
|
|
|
* in order to compress/encrypt/etc the data. We therefore need to
|
2008-11-20 23:01:55 +03:00
|
|
|
* make sure that there is sufficient available memory for this.
|
|
|
|
*/
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
error = arc_memory_throttle(reserve, txg);
|
|
|
|
if (error != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Throttle writes when the amount of dirty data in the cache
|
|
|
|
* gets too large. We try to keep the cache less than half full
|
|
|
|
* of dirty blocks so that our sync times don't grow too large.
|
|
|
|
* Note: if two requests come in concurrently, we might let them
|
|
|
|
* both succeed, when one of them should fail. Not a huge deal.
|
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
|
|
|
|
anon_size > arc_c / 4) {
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t meta_esize =
|
|
|
|
refcount_count(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
uint64_t data_esize =
|
|
|
|
refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
|
2008-11-20 23:01:55 +03:00
|
|
|
dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
|
|
|
|
"anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_tempreserve >> 10, meta_esize >> 10,
|
|
|
|
data_esize >> 10, reserve >> 10, arc_c >> 10);
|
2012-01-20 22:58:57 +04:00
|
|
|
DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ERESTART));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
atomic_add_64(&arc_tempreserve, reserve);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2012-01-31 01:28:40 +04:00
|
|
|
static void
|
|
|
|
arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
|
|
|
|
kstat_named_t *evict_data, kstat_named_t *evict_metadata)
|
|
|
|
{
|
2015-06-27 01:14:45 +03:00
|
|
|
size->value.ui64 = refcount_count(&state->arcs_size);
|
2016-06-02 07:04:53 +03:00
|
|
|
evict_data->value.ui64 =
|
|
|
|
refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
evict_metadata->value.ui64 =
|
|
|
|
refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);
|
2012-01-31 01:28:40 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
arc_kstat_update(kstat_t *ksp, int rw)
|
|
|
|
{
|
|
|
|
arc_stats_t *as = ksp->ks_data;
|
|
|
|
|
|
|
|
if (rw == KSTAT_WRITE) {
|
2015-06-27 00:54:17 +03:00
|
|
|
return (EACCES);
|
2012-01-31 01:28:40 +04:00
|
|
|
} else {
|
|
|
|
arc_kstat_update_state(arc_anon,
|
|
|
|
&as->arcstat_anon_size,
|
2015-06-27 00:54:17 +03:00
|
|
|
&as->arcstat_anon_evictable_data,
|
|
|
|
&as->arcstat_anon_evictable_metadata);
|
2012-01-31 01:28:40 +04:00
|
|
|
arc_kstat_update_state(arc_mru,
|
|
|
|
&as->arcstat_mru_size,
|
2015-06-27 00:54:17 +03:00
|
|
|
&as->arcstat_mru_evictable_data,
|
|
|
|
&as->arcstat_mru_evictable_metadata);
|
2012-01-31 01:28:40 +04:00
|
|
|
arc_kstat_update_state(arc_mru_ghost,
|
|
|
|
&as->arcstat_mru_ghost_size,
|
2015-06-27 00:54:17 +03:00
|
|
|
&as->arcstat_mru_ghost_evictable_data,
|
|
|
|
&as->arcstat_mru_ghost_evictable_metadata);
|
2012-01-31 01:28:40 +04:00
|
|
|
arc_kstat_update_state(arc_mfu,
|
|
|
|
&as->arcstat_mfu_size,
|
2015-06-27 00:54:17 +03:00
|
|
|
&as->arcstat_mfu_evictable_data,
|
|
|
|
&as->arcstat_mfu_evictable_metadata);
|
2012-03-27 21:10:26 +04:00
|
|
|
arc_kstat_update_state(arc_mfu_ghost,
|
2012-01-31 01:28:40 +04:00
|
|
|
&as->arcstat_mfu_ghost_size,
|
2015-06-27 00:54:17 +03:00
|
|
|
&as->arcstat_mfu_ghost_evictable_data,
|
|
|
|
&as->arcstat_mfu_ghost_evictable_metadata);
|
2012-01-31 01:28:40 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* This function *must* return indices evenly distributed between all
|
|
|
|
* sublists of the multilist. This is needed due to how the ARC eviction
|
|
|
|
* code is laid out; arc_evict_state() assumes ARC buffers are evenly
|
|
|
|
* distributed between all sublists and uses this assumption when
|
|
|
|
* deciding which sublist to evict from and how much to evict from it.
|
|
|
|
*/
|
|
|
|
unsigned int
|
|
|
|
arc_state_multilist_index_func(multilist_t *ml, void *obj)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = obj;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We rely on b_dva to generate evenly distributed index
|
|
|
|
* numbers using buf_hash below. So, as an added precaution,
|
|
|
|
* let's make sure we never add empty buffers to the arc lists.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(!HDR_EMPTY(hdr));
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The assumption here, is the hash value for a given
|
|
|
|
* arc_buf_hdr_t will remain constant throughout its lifetime
|
|
|
|
* (i.e. its b_spa, b_dva, and b_birth fields don't change).
|
|
|
|
* Thus, we don't need to store the header's sublist index
|
|
|
|
* on insertion, as this index can be recalculated on removal.
|
|
|
|
*
|
|
|
|
* Also, the low order bits of the hash value are thought to be
|
|
|
|
* distributed evenly. Otherwise, in the case that the multilist
|
|
|
|
* has a power of two number of sublists, each sublists' usage
|
|
|
|
* would not be evenly distributed.
|
|
|
|
*/
|
|
|
|
return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %
|
|
|
|
multilist_get_num_sublists(ml));
|
|
|
|
}
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* Called during module initialization and periodically thereafter to
|
|
|
|
* apply reasonable changes to the exposed performance tunings. Non-zero
|
|
|
|
* zfs_* values which differ from the currently set values will be applied.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_tuning_update(void)
|
|
|
|
{
|
2016-08-11 06:15:37 +03:00
|
|
|
uint64_t percent;
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Valid range: 64M - <all physical memory> */
|
|
|
|
if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) &&
|
|
|
|
(zfs_arc_max > 64 << 20) && (zfs_arc_max < ptob(physmem)) &&
|
|
|
|
(zfs_arc_max > arc_c_min)) {
|
|
|
|
arc_c_max = zfs_arc_max;
|
|
|
|
arc_c = arc_c_max;
|
|
|
|
arc_p = (arc_c >> 1);
|
2016-08-11 06:15:37 +03:00
|
|
|
/* Valid range of arc_meta_limit: arc_meta_min - arc_c_max */
|
|
|
|
percent = MIN(zfs_arc_meta_limit_percent, 100);
|
|
|
|
arc_meta_limit = MAX(arc_meta_min, (percent * arc_c_max) / 100);
|
|
|
|
percent = MIN(zfs_arc_dnode_limit_percent, 100);
|
|
|
|
arc_dnode_limit = (percent * arc_meta_limit) / 100;
|
2015-06-26 21:28:18 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Valid range: 32M - <arc_c_max> */
|
|
|
|
if ((zfs_arc_min) && (zfs_arc_min != arc_c_min) &&
|
|
|
|
(zfs_arc_min >= 2ULL << SPA_MAXBLOCKSHIFT) &&
|
|
|
|
(zfs_arc_min <= arc_c_max)) {
|
|
|
|
arc_c_min = zfs_arc_min;
|
|
|
|
arc_c = MAX(arc_c, arc_c_min);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Valid range: 16M - <arc_c_max> */
|
|
|
|
if ((zfs_arc_meta_min) && (zfs_arc_meta_min != arc_meta_min) &&
|
|
|
|
(zfs_arc_meta_min >= 1ULL << SPA_MAXBLOCKSHIFT) &&
|
|
|
|
(zfs_arc_meta_min <= arc_c_max)) {
|
|
|
|
arc_meta_min = zfs_arc_meta_min;
|
|
|
|
arc_meta_limit = MAX(arc_meta_limit, arc_meta_min);
|
2016-07-13 15:42:40 +03:00
|
|
|
arc_dnode_limit = arc_meta_limit / 10;
|
2015-06-26 21:28:18 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Valid range: <arc_meta_min> - <arc_c_max> */
|
|
|
|
if ((zfs_arc_meta_limit) && (zfs_arc_meta_limit != arc_meta_limit) &&
|
|
|
|
(zfs_arc_meta_limit >= zfs_arc_meta_min) &&
|
|
|
|
(zfs_arc_meta_limit <= arc_c_max))
|
|
|
|
arc_meta_limit = zfs_arc_meta_limit;
|
|
|
|
|
2016-07-13 15:42:40 +03:00
|
|
|
/* Valid range: <arc_meta_min> - <arc_c_max> */
|
|
|
|
if ((zfs_arc_dnode_limit) && (zfs_arc_dnode_limit != arc_dnode_limit) &&
|
|
|
|
(zfs_arc_dnode_limit >= zfs_arc_meta_min) &&
|
|
|
|
(zfs_arc_dnode_limit <= arc_c_max))
|
|
|
|
arc_dnode_limit = zfs_arc_dnode_limit;
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Valid range: 1 - N */
|
|
|
|
if (zfs_arc_grow_retry)
|
|
|
|
arc_grow_retry = zfs_arc_grow_retry;
|
|
|
|
|
|
|
|
/* Valid range: 1 - N */
|
|
|
|
if (zfs_arc_shrink_shift) {
|
|
|
|
arc_shrink_shift = zfs_arc_shrink_shift;
|
|
|
|
arc_no_grow_shift = MIN(arc_no_grow_shift, arc_shrink_shift -1);
|
|
|
|
}
|
|
|
|
|
2015-06-27 01:59:23 +03:00
|
|
|
/* Valid range: 1 - N */
|
|
|
|
if (zfs_arc_p_min_shift)
|
|
|
|
arc_p_min_shift = zfs_arc_p_min_shift;
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Valid range: 1 - N ticks */
|
|
|
|
if (zfs_arc_min_prefetch_lifespan)
|
|
|
|
arc_min_prefetch_lifespan = zfs_arc_min_prefetch_lifespan;
|
2015-07-27 23:17:32 +03:00
|
|
|
|
2015-07-28 21:30:00 +03:00
|
|
|
/* Valid range: 0 - 100 */
|
|
|
|
if ((zfs_arc_lotsfree_percent >= 0) &&
|
|
|
|
(zfs_arc_lotsfree_percent <= 100))
|
|
|
|
arc_lotsfree_percent = zfs_arc_lotsfree_percent;
|
|
|
|
|
2015-07-27 23:17:32 +03:00
|
|
|
/* Valid range: 0 - <all physical memory> */
|
|
|
|
if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free))
|
|
|
|
arc_sys_free = MIN(MAX(zfs_arc_sys_free, 0), ptob(physmem));
|
2015-07-28 21:30:00 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
static void
|
|
|
|
arc_state_init(void)
|
|
|
|
{
|
|
|
|
arc_anon = &ARC_anon;
|
|
|
|
arc_mru = &ARC_mru;
|
|
|
|
arc_mru_ghost = &ARC_mru_ghost;
|
|
|
|
arc_mfu = &ARC_mfu;
|
|
|
|
arc_mfu_ghost = &ARC_mfu_ghost;
|
|
|
|
arc_l2c_only = &ARC_l2c_only;
|
|
|
|
|
|
|
|
multilist_create(&arc_mru->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_mru->arcs_list[ARC_BUFC_DATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_mfu->arcs_list[ARC_BUFC_DATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
multilist_create(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
|
|
|
|
sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node),
|
|
|
|
zfs_arc_num_sublists_per_state, arc_state_multilist_index_func);
|
|
|
|
|
|
|
|
refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
|
|
|
|
refcount_create(&arc_anon->arcs_size);
|
|
|
|
refcount_create(&arc_mru->arcs_size);
|
|
|
|
refcount_create(&arc_mru_ghost->arcs_size);
|
|
|
|
refcount_create(&arc_mfu->arcs_size);
|
|
|
|
refcount_create(&arc_mfu_ghost->arcs_size);
|
|
|
|
refcount_create(&arc_l2c_only->arcs_size);
|
|
|
|
|
|
|
|
arc_anon->arcs_state = ARC_STATE_ANON;
|
|
|
|
arc_mru->arcs_state = ARC_STATE_MRU;
|
|
|
|
arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST;
|
|
|
|
arc_mfu->arcs_state = ARC_STATE_MFU;
|
|
|
|
arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST;
|
|
|
|
arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_state_fini(void)
|
|
|
|
{
|
|
|
|
refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);
|
|
|
|
refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);
|
|
|
|
|
|
|
|
refcount_destroy(&arc_anon->arcs_size);
|
|
|
|
refcount_destroy(&arc_mru->arcs_size);
|
|
|
|
refcount_destroy(&arc_mru_ghost->arcs_size);
|
|
|
|
refcount_destroy(&arc_mfu->arcs_size);
|
|
|
|
refcount_destroy(&arc_mfu_ghost->arcs_size);
|
|
|
|
refcount_destroy(&arc_l2c_only->arcs_size);
|
|
|
|
|
|
|
|
multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
|
2016-09-23 20:55:10 +03:00
|
|
|
multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_DATA]);
|
2016-06-02 07:04:53 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
arc_max_bytes(void)
|
|
|
|
{
|
|
|
|
return (arc_c_max);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
arc_init(void)
|
|
|
|
{
|
2015-06-26 21:28:18 +03:00
|
|
|
/*
|
|
|
|
* allmem is "all memory that we could possibly use".
|
|
|
|
*/
|
|
|
|
#ifdef _KERNEL
|
|
|
|
uint64_t allmem = ptob(physmem);
|
|
|
|
#else
|
|
|
|
uint64_t allmem = (physmem * PAGESIZE) / 2;
|
|
|
|
#endif
|
2016-08-11 06:15:37 +03:00
|
|
|
uint64_t percent;
|
2015-06-26 21:28:18 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_init(&arc_reclaim_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&arc_reclaim_thread_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
cv_init(&arc_reclaim_waiters_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* Convert seconds to clock ticks */
|
2015-06-26 21:28:18 +03:00
|
|
|
arc_min_prefetch_lifespan = 1 * hz;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
#ifdef _KERNEL
|
2011-03-30 05:08:59 +04:00
|
|
|
/*
|
|
|
|
* Register a shrinker to support synchronous (direct) memory
|
|
|
|
* reclaim from the arc. This is done to prevent kswapd from
|
|
|
|
* swapping out pages when it is preferable to shrink the arc.
|
|
|
|
*/
|
|
|
|
spl_register_shrinker(&arc_shrinker);
|
2015-07-27 23:17:32 +03:00
|
|
|
|
|
|
|
/* Set to 1/64 of all memory or a minimum of 512K */
|
|
|
|
arc_sys_free = MAX(ptob(physmem / 64), (512 * 1024));
|
|
|
|
arc_need_free = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
#endif
|
|
|
|
|
2016-01-24 22:11:15 +03:00
|
|
|
/* Set max to 1/2 of all memory */
|
|
|
|
arc_c_max = allmem / 2;
|
|
|
|
|
2016-01-12 00:52:17 +03:00
|
|
|
/*
|
|
|
|
* In userland, there's only the memory pressure that we artificially
|
|
|
|
* create (see arc_available_memory()). Don't let arc_c get too
|
|
|
|
* small, because it can cause transactions to be larger than
|
|
|
|
* arc_c, causing arc_tempreserve_space() to fail.
|
|
|
|
*/
|
|
|
|
#ifndef _KERNEL
|
2016-01-24 22:11:15 +03:00
|
|
|
arc_c_min = MAX(arc_c_max / 2, 2ULL << SPA_MAXBLOCKSHIFT);
|
2016-01-12 00:52:17 +03:00
|
|
|
#else
|
2015-06-04 16:06:27 +03:00
|
|
|
arc_c_min = 2ULL << SPA_MAXBLOCKSHIFT;
|
2016-01-12 00:52:17 +03:00
|
|
|
#endif
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_c = arc_c_max;
|
|
|
|
arc_p = (arc_c >> 1);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_size = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Set min to 1/2 of arc_c_min */
|
|
|
|
arc_meta_min = 1ULL << SPA_MAXBLOCKSHIFT;
|
|
|
|
/* Initialize maximum observed usage to zero */
|
2011-03-24 22:13:55 +03:00
|
|
|
arc_meta_max = 0;
|
2016-08-11 06:15:37 +03:00
|
|
|
/*
|
|
|
|
* Set arc_meta_limit to a percent of arc_c_max with a floor of
|
|
|
|
* arc_meta_min, and a ceiling of arc_c_max.
|
|
|
|
*/
|
|
|
|
percent = MIN(zfs_arc_meta_limit_percent, 100);
|
|
|
|
arc_meta_limit = MAX(arc_meta_min, (percent * arc_c_max) / 100);
|
|
|
|
percent = MIN(zfs_arc_dnode_limit_percent, 100);
|
|
|
|
arc_dnode_limit = (percent * arc_meta_limit) / 100;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
/* Apply user specified tunings */
|
|
|
|
arc_tuning_update();
|
2015-06-25 01:49:08 +03:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
if (zfs_arc_num_sublists_per_state < 1)
|
2015-06-26 21:28:18 +03:00
|
|
|
zfs_arc_num_sublists_per_state = MAX(boot_ncpus, 1);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* if kmem_flags are set, lets try to use less memory */
|
|
|
|
if (kmem_debugging())
|
|
|
|
arc_c = arc_c / 2;
|
|
|
|
if (arc_c < arc_c_min)
|
|
|
|
arc_c = arc_c_min;
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_state_init();
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_init();
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
list_create(&arc_prune_list, sizeof (arc_prune_t),
|
|
|
|
offsetof(arc_prune_t, p_node));
|
|
|
|
mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-07-24 20:08:31 +03:00
|
|
|
arc_prune_taskq = taskq_create("arc_prune", max_ncpus, defclsyspri,
|
2015-06-03 21:43:30 +03:00
|
|
|
max_ncpus, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);
|
2015-05-30 17:57:53 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_reclaim_thread_exit = B_FALSE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
|
|
|
|
sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
|
|
|
|
|
|
|
|
if (arc_ksp != NULL) {
|
|
|
|
arc_ksp->ks_data = &arc_stats;
|
2012-01-31 01:28:40 +04:00
|
|
|
arc_ksp->ks_update = arc_kstat_update;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_install(arc_ksp);
|
|
|
|
}
|
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
(void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
|
2015-07-24 20:08:31 +03:00
|
|
|
TS_RUN, defclsyspri);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_dead = B_FALSE;
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_warm = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
|
|
|
* Calculate maximum amount of dirty data per pool.
|
|
|
|
*
|
|
|
|
* If it has been set by a module parameter, take that.
|
|
|
|
* Otherwise, use a percentage of physical memory defined by
|
|
|
|
* zfs_dirty_data_max_percent (default 10%) with a cap at
|
|
|
|
* zfs_dirty_data_max_max (default 25% of physical memory).
|
|
|
|
*/
|
|
|
|
if (zfs_dirty_data_max_max == 0)
|
2015-10-31 02:10:01 +03:00
|
|
|
zfs_dirty_data_max_max = (uint64_t)physmem * PAGESIZE *
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
zfs_dirty_data_max_max_percent / 100;
|
|
|
|
|
|
|
|
if (zfs_dirty_data_max == 0) {
|
2015-10-31 02:10:01 +03:00
|
|
|
zfs_dirty_data_max = (uint64_t)physmem * PAGESIZE *
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
zfs_dirty_data_max_percent / 100;
|
|
|
|
zfs_dirty_data_max = MIN(zfs_dirty_data_max,
|
|
|
|
zfs_dirty_data_max_max);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_fini(void)
|
|
|
|
{
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_prune_t *p;
|
|
|
|
|
2011-03-30 05:08:59 +04:00
|
|
|
#ifdef _KERNEL
|
|
|
|
spl_unregister_shrinker(&arc_shrinker);
|
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_enter(&arc_reclaim_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_reclaim_thread_exit = B_TRUE;
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* The reclaim thread will set arc_reclaim_thread_exit back to
|
2016-06-02 07:04:53 +03:00
|
|
|
* B_FALSE when it is finished exiting; we're waiting for that.
|
2015-01-13 06:52:19 +03:00
|
|
|
*/
|
|
|
|
while (arc_reclaim_thread_exit) {
|
|
|
|
cv_signal(&arc_reclaim_thread_cv);
|
|
|
|
cv_wait(&arc_reclaim_thread_cv, &arc_reclaim_lock);
|
|
|
|
}
|
|
|
|
mutex_exit(&arc_reclaim_lock);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* Use B_TRUE to ensure *all* buffers are evicted */
|
|
|
|
arc_flush(NULL, B_TRUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_dead = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (arc_ksp != NULL) {
|
|
|
|
kstat_delete(arc_ksp);
|
|
|
|
arc_ksp = NULL;
|
|
|
|
}
|
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
taskq_wait(arc_prune_taskq);
|
|
|
|
taskq_destroy(arc_prune_taskq);
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
while ((p = list_head(&arc_prune_list)) != NULL) {
|
|
|
|
list_remove(&arc_prune_list, p);
|
|
|
|
refcount_remove(&p->p_refcnt, &arc_prune_list);
|
|
|
|
refcount_destroy(&p->p_refcnt);
|
|
|
|
kmem_free(p, sizeof (*p));
|
|
|
|
}
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
|
|
|
|
list_destroy(&arc_prune_list);
|
|
|
|
mutex_destroy(&arc_prune_mtx);
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_destroy(&arc_reclaim_lock);
|
|
|
|
cv_destroy(&arc_reclaim_thread_cv);
|
|
|
|
cv_destroy(&arc_reclaim_waiters_cv);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_state_fini();
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_fini();
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT0(arc_loaned_bytes);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Level 2 ARC
|
|
|
|
*
|
|
|
|
* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
|
|
|
|
* It uses dedicated storage devices to hold cached data, which are populated
|
|
|
|
* using large infrequent writes. The main role of this cache is to boost
|
|
|
|
* the performance of random read workloads. The intended L2ARC devices
|
|
|
|
* include short-stroked disks, solid state disks, and other media with
|
|
|
|
* substantially faster read latency than disk.
|
|
|
|
*
|
|
|
|
* +-----------------------+
|
|
|
|
* | ARC |
|
|
|
|
* +-----------------------+
|
|
|
|
* | ^ ^
|
|
|
|
* | | |
|
|
|
|
* l2arc_feed_thread() arc_read()
|
|
|
|
* | | |
|
|
|
|
* | l2arc read |
|
|
|
|
* V | |
|
|
|
|
* +---------------+ |
|
|
|
|
* | L2ARC | |
|
|
|
|
* +---------------+ |
|
|
|
|
* | ^ |
|
|
|
|
* l2arc_write() | |
|
|
|
|
* | | |
|
|
|
|
* V | |
|
|
|
|
* +-------+ +-------+
|
|
|
|
* | vdev | | vdev |
|
|
|
|
* | cache | | cache |
|
|
|
|
* +-------+ +-------+
|
|
|
|
* +=========+ .-----.
|
|
|
|
* : L2ARC : |-_____-|
|
|
|
|
* : devices : | Disks |
|
|
|
|
* +=========+ `-_____-'
|
|
|
|
*
|
|
|
|
* Read requests are satisfied from the following sources, in order:
|
|
|
|
*
|
|
|
|
* 1) ARC
|
|
|
|
* 2) vdev cache of L2ARC devices
|
|
|
|
* 3) L2ARC devices
|
|
|
|
* 4) vdev cache of disks
|
|
|
|
* 5) disks
|
|
|
|
*
|
|
|
|
* Some L2ARC device types exhibit extremely slow write performance.
|
|
|
|
* To accommodate for this there are some significant differences between
|
|
|
|
* the L2ARC and traditional cache design:
|
|
|
|
*
|
|
|
|
* 1. There is no eviction path from the ARC to the L2ARC. Evictions from
|
|
|
|
* the ARC behave as usual, freeing buffers and placing headers on ghost
|
|
|
|
* lists. The ARC does not send buffers to the L2ARC during eviction as
|
|
|
|
* this would add inflated write latencies for all ARC memory pressure.
|
|
|
|
*
|
|
|
|
* 2. The L2ARC attempts to cache data from the ARC before it is evicted.
|
|
|
|
* It does this by periodically scanning buffers from the eviction-end of
|
|
|
|
* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
|
2013-08-02 00:02:10 +04:00
|
|
|
* not already there. It scans until a headroom of buffers is satisfied,
|
|
|
|
* which itself is a buffer for ARC eviction. If a compressible buffer is
|
|
|
|
* found during scanning and selected for writing to an L2ARC device, we
|
|
|
|
* temporarily boost scanning headroom during the next scan cycle to make
|
|
|
|
* sure we adapt to compression effects (which might significantly reduce
|
|
|
|
* the data volume we write to L2ARC). The thread that does this is
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_feed_thread(), illustrated below; example sizes are included to
|
|
|
|
* provide a better sense of ratio than this diagram:
|
|
|
|
*
|
|
|
|
* head --> tail
|
|
|
|
* +---------------------+----------+
|
|
|
|
* ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC
|
|
|
|
* +---------------------+----------+ | o L2ARC eligible
|
|
|
|
* ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer
|
|
|
|
* +---------------------+----------+ |
|
|
|
|
* 15.9 Gbytes ^ 32 Mbytes |
|
|
|
|
* headroom |
|
|
|
|
* l2arc_feed_thread()
|
|
|
|
* |
|
|
|
|
* l2arc write hand <--[oooo]--'
|
|
|
|
* | 8 Mbyte
|
|
|
|
* | write max
|
|
|
|
* V
|
|
|
|
* +==============================+
|
|
|
|
* L2ARC dev |####|#|###|###| |####| ... |
|
|
|
|
* +==============================+
|
|
|
|
* 32 Gbytes
|
|
|
|
*
|
|
|
|
* 3. If an ARC buffer is copied to the L2ARC but then hit instead of
|
|
|
|
* evicted, then the L2ARC has cached a buffer much sooner than it probably
|
|
|
|
* needed to, potentially wasting L2ARC device bandwidth and storage. It is
|
|
|
|
* safe to say that this is an uncommon case, since buffers at the end of
|
|
|
|
* the ARC lists have moved there due to inactivity.
|
|
|
|
*
|
|
|
|
* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
|
|
|
|
* then the L2ARC simply misses copying some buffers. This serves as a
|
|
|
|
* pressure valve to prevent heavy read workloads from both stalling the ARC
|
|
|
|
* with waits and clogging the L2ARC with writes. This also helps prevent
|
|
|
|
* the potential for the L2ARC to churn if it attempts to cache content too
|
|
|
|
* quickly, such as during backups of the entire pool.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 5. After system boot and before the ARC has filled main memory, there are
|
|
|
|
* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
|
|
|
|
* lists can remain mostly static. Instead of searching from tail of these
|
|
|
|
* lists as pictured, the l2arc_feed_thread() will search from the list heads
|
|
|
|
* for eligible buffers, greatly increasing its chance of finding them.
|
|
|
|
*
|
|
|
|
* The L2ARC device write speed is also boosted during this time so that
|
|
|
|
* the L2ARC warms up faster. Since there have been no ARC evictions yet,
|
|
|
|
* there are no L2ARC reads, and no fear of degrading read performance
|
|
|
|
* through increased writes.
|
|
|
|
*
|
|
|
|
* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
|
2008-11-20 23:01:55 +03:00
|
|
|
* the vdev queue can aggregate them into larger and fewer writes. Each
|
|
|
|
* device is written to in a rotor fashion, sweeping writes through
|
|
|
|
* available space then repeating.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 7. The L2ARC does not store dirty content. It never needs to flush
|
2008-11-20 23:01:55 +03:00
|
|
|
* write buffers back to disk based storage.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 8. If an ARC buffer is written (and dirtied) which also exists in the
|
2008-11-20 23:01:55 +03:00
|
|
|
* L2ARC, the now stale L2ARC buffer is immediately dropped.
|
|
|
|
*
|
|
|
|
* The performance of the L2ARC can be tweaked by a number of tunables, which
|
|
|
|
* may be necessary for different workloads:
|
|
|
|
*
|
|
|
|
* l2arc_write_max max write bytes per interval
|
2008-12-03 23:09:06 +03:00
|
|
|
* l2arc_write_boost extra write bytes during device warmup
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_noprefetch skip caching prefetched buffers
|
|
|
|
* l2arc_headroom number of max device writes to precache
|
2013-08-02 00:02:10 +04:00
|
|
|
* l2arc_headroom_boost when we find compressed buffers during ARC
|
|
|
|
* scanning, we multiply headroom by this
|
|
|
|
* percentage factor for the next scan cycle,
|
|
|
|
* since more compressed buffers are likely to
|
|
|
|
* be present
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_feed_secs seconds between L2ARC writing
|
|
|
|
*
|
|
|
|
* Tunables may be removed or added as future performance improvements are
|
|
|
|
* integrated, and also may become zpool properties.
|
2009-02-18 23:51:31 +03:00
|
|
|
*
|
|
|
|
* There are three key functions that control how the L2ARC warms up:
|
|
|
|
*
|
|
|
|
* l2arc_write_eligible() check if a buffer is eligible to cache
|
|
|
|
* l2arc_write_size() calculate how much to write
|
|
|
|
* l2arc_write_interval() calculate sleep delay between writes
|
|
|
|
*
|
|
|
|
* These three functions determine what to write, how much, and how quickly
|
|
|
|
* to send writes.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
static boolean_t
|
2014-12-06 20:24:32 +03:00
|
|
|
l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)
|
2009-02-18 23:51:31 +03:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* A buffer is *not* eligible for the L2ARC if it:
|
|
|
|
* 1. belongs to a different spa.
|
2010-05-29 00:45:14 +04:00
|
|
|
* 2. is already cached on the L2ARC.
|
|
|
|
* 3. has an I/O in progress (it may be an incomplete read).
|
|
|
|
* 4. is flagged not eligible (zfs property).
|
2009-02-18 23:51:31 +03:00
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||
|
2014-12-06 20:24:32 +03:00
|
|
|
HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))
|
2009-02-18 23:51:31 +03:00
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
return (B_TRUE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static uint64_t
|
2013-08-02 00:02:10 +04:00
|
|
|
l2arc_write_size(void)
|
2009-02-18 23:51:31 +03:00
|
|
|
{
|
|
|
|
uint64_t size;
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/*
|
|
|
|
* Make sure our globals have meaningful values in case the user
|
|
|
|
* altered them.
|
|
|
|
*/
|
|
|
|
size = l2arc_write_max;
|
|
|
|
if (size == 0) {
|
|
|
|
cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must "
|
|
|
|
"be greater than zero, resetting it to the default (%d)",
|
|
|
|
L2ARC_WRITE_SIZE);
|
|
|
|
size = l2arc_write_max = L2ARC_WRITE_SIZE;
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
if (arc_warm == B_FALSE)
|
2013-08-02 00:02:10 +04:00
|
|
|
size += l2arc_write_boost;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
return (size);
|
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
static clock_t
|
|
|
|
l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
clock_t interval, next, now;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the ARC lists are busy, increase our write rate; if the
|
|
|
|
* lists are stale, idle back. This is achieved by checking
|
|
|
|
* how much we previously wrote - if it was more than half of
|
|
|
|
* what we wanted, schedule the next write much sooner.
|
|
|
|
*/
|
|
|
|
if (l2arc_feed_again && wrote > (wanted / 2))
|
|
|
|
interval = (hz * l2arc_feed_min_ms) / 1000;
|
|
|
|
else
|
|
|
|
interval = hz * l2arc_feed_secs;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
now = ddi_get_lbolt();
|
|
|
|
next = MAX(now, MIN(now + interval, began + interval));
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
return (next);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Cycle through L2ARC devices. This is how L2ARC load balances.
|
2008-12-03 23:09:06 +03:00
|
|
|
* If a device is returned, this also returns holding the spa config lock.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static l2arc_dev_t *
|
|
|
|
l2arc_dev_get_next(void)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_dev_t *first, *next = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Lock out the removal of spas (spa_namespace_lock), then removal
|
|
|
|
* of cache devices (l2arc_dev_mtx). Once a device has been selected,
|
|
|
|
* both locks will be dropped and a spa config lock held instead.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
/* if there are no vdevs, there is nothing to do */
|
|
|
|
if (l2arc_ndev == 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
first = NULL;
|
|
|
|
next = l2arc_dev_last;
|
|
|
|
do {
|
|
|
|
/* loop around the list looking for a non-faulted vdev */
|
|
|
|
if (next == NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
next = list_head(l2arc_dev_list);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
next = list_next(l2arc_dev_list, next);
|
|
|
|
if (next == NULL)
|
|
|
|
next = list_head(l2arc_dev_list);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* if we have come back to the start, bail out */
|
|
|
|
if (first == NULL)
|
|
|
|
first = next;
|
|
|
|
else if (next == first)
|
|
|
|
break;
|
|
|
|
|
|
|
|
} while (vdev_is_dead(next->l2ad_vdev));
|
|
|
|
|
|
|
|
/* if we were unable to find any usable vdevs, return NULL */
|
|
|
|
if (vdev_is_dead(next->l2ad_vdev))
|
|
|
|
next = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
l2arc_dev_last = next;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
out:
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Grab the config lock to prevent the 'next' device from being
|
|
|
|
* removed while we are writing to it.
|
|
|
|
*/
|
|
|
|
if (next != NULL)
|
|
|
|
spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (next);
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Free buffers that were tagged for destruction.
|
|
|
|
*/
|
|
|
|
static void
|
2010-08-26 20:52:41 +04:00
|
|
|
l2arc_do_free_on_write(void)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
|
|
|
list_t *buflist;
|
|
|
|
l2arc_data_free_t *df, *df_prev;
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_free_on_write_mtx);
|
|
|
|
buflist = l2arc_free_on_write;
|
|
|
|
|
|
|
|
for (df = list_tail(buflist); df; df = df_prev) {
|
|
|
|
df_prev = list_prev(buflist, df);
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(df->l2df_data, !=, NULL);
|
|
|
|
if (df->l2df_type == ARC_BUFC_METADATA) {
|
|
|
|
zio_buf_free(df->l2df_data, df->l2df_size);
|
|
|
|
} else {
|
|
|
|
ASSERT(df->l2df_type == ARC_BUFC_DATA);
|
|
|
|
zio_data_buf_free(df->l2df_data, df->l2df_size);
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
list_remove(buflist, df);
|
|
|
|
kmem_free(df, sizeof (l2arc_data_free_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&l2arc_free_on_write_mtx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* A write to a cache device has completed. Update all headers to allow
|
|
|
|
* reads from these buffers to begin.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_write_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
l2arc_write_callback_t *cb;
|
|
|
|
l2arc_dev_t *dev;
|
|
|
|
list_t *buflist;
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *head, *hdr, *hdr_prev;
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock;
|
2014-05-22 13:11:57 +04:00
|
|
|
int64_t bytes_dropped = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = zio->io_private;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(cb, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
dev = cb->l2wcb_dev;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(dev, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
head = cb->l2wcb_head;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(head, !=, NULL);
|
2014-12-30 06:12:23 +03:00
|
|
|
buflist = &dev->l2ad_buflist;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(buflist, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
|
|
|
|
l2arc_write_callback_t *, cb);
|
|
|
|
|
|
|
|
if (zio->io_error != 0)
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All writes completed, or an error was hit.
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
top:
|
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-06 20:24:32 +03:00
|
|
|
for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {
|
|
|
|
hdr_prev = list_prev(buflist, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We cannot use mutex_enter or else we can deadlock
|
|
|
|
* with l2arc_write_buffers (due to swapping the order
|
|
|
|
* the hash lock and l2ad_mtx are taken).
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Missed the hash lock. We must retry so we
|
|
|
|
* don't leave the ARC_FLAG_L2_WRITING bit set.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to rescan the headers we've
|
|
|
|
* already marked as having been written out, so
|
|
|
|
* we reinsert the head node so we can pick up
|
|
|
|
* where we left off.
|
|
|
|
*/
|
|
|
|
list_remove(buflist, head);
|
|
|
|
list_insert_after(buflist, hdr, head);
|
|
|
|
|
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We wait for the hash lock to become available
|
|
|
|
* to try and prevent busy waiting, and increase
|
|
|
|
* the chance we'll be able to acquire the lock
|
|
|
|
* the next time around.
|
|
|
|
*/
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
goto top;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* We could not have been moved into the arc_l2c_only
|
|
|
|
* state while in-flight due to our ARC_FLAG_L2_WRITING
|
|
|
|
* bit being set. Let's just ensure that's being enforced.
|
|
|
|
*/
|
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
|
|
|
|
2016-02-10 21:42:01 +03:00
|
|
|
/*
|
|
|
|
* Skipped - drop L2ARC entry and mark the header as no
|
|
|
|
* longer L2 eligibile.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (zio->io_error != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Error - drop L2ARC entry.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
list_remove(buflist, hdr);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);
|
2014-12-30 06:12:23 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_asize, -arc_hdr_size(hdr));
|
|
|
|
ARCSTAT_INCR(arcstat_l2_size, -HDR_GET_LSIZE(hdr));
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
bytes_dropped += arc_hdr_size(hdr);
|
2015-06-16 02:12:19 +03:00
|
|
|
(void) refcount_remove_many(&dev->l2ad_alloc,
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_size(hdr), hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-01-13 06:52:19 +03:00
|
|
|
* Allow ARC to begin reads and ghost list evictions to
|
|
|
|
* this L2ARC entry.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
atomic_inc_64(&l2arc_writes_done);
|
|
|
|
list_remove(buflist, head);
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(!HDR_HAS_L1HDR(head));
|
|
|
|
kmem_cache_free(hdr_l2only_cache, head);
|
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-05-22 13:11:57 +04:00
|
|
|
vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_do_free_on_write();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
kmem_free(cb, sizeof (l2arc_write_callback_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A read to a cache device completed. Validate buffer contents before
|
|
|
|
* handing over to the regular ARC routines.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_read_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
l2arc_read_callback_t *cb;
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
kmutex_t *hash_lock;
|
2016-06-02 07:04:53 +03:00
|
|
|
boolean_t valid_cksum;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(zio->io_vd, !=, NULL);
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
|
|
|
|
|
|
|
|
spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = zio->io_private;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(cb, !=, NULL);
|
|
|
|
hdr = cb->l2rcb_hdr;
|
|
|
|
ASSERT3P(hdr, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(zio->io_data, !=, NULL);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Check this survived the L2ARC journey.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(zio->io_data, ==, hdr->b_l1hdr.b_pdata);
|
|
|
|
zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
|
|
|
|
zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
|
|
|
|
|
|
|
|
valid_cksum = arc_cksum_is_equal(hdr, zio);
|
|
|
|
if (valid_cksum && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
2016-06-02 07:04:53 +03:00
|
|
|
zio->io_private = hdr;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_read_done(zio);
|
|
|
|
} else {
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
/*
|
|
|
|
* Buffer didn't survive caching. Increment stats and
|
|
|
|
* reissue to the original storage device.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (zio->io_error != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_io_error);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2013-03-08 22:41:28 +04:00
|
|
|
zio->io_error = SET_ERROR(EIO);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!valid_cksum)
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_cksum_bad);
|
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* If there's no waiter, issue an async i/o to the primary
|
|
|
|
* storage now. If there *is* a waiter, the caller must
|
|
|
|
* issue the i/o in a context where it's OK to block.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
if (zio->io_waiter == NULL) {
|
|
|
|
zio_t *pio = zio_unique_parent(zio);
|
|
|
|
|
|
|
|
ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
zio_nowait(zio_read(pio, zio->io_spa, zio->io_bp,
|
|
|
|
hdr->b_l1hdr.b_pdata, zio->io_size, arc_read_done,
|
|
|
|
hdr, zio->io_priority, cb->l2rcb_flags,
|
|
|
|
&cb->l2rcb_zb));
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
kmem_free(cb, sizeof (l2arc_read_callback_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is the list priority from which the L2ARC will search for pages to
|
|
|
|
* cache. This is used within loops (0..3) to cycle through lists in the
|
|
|
|
* desired order. This order can have a significant effect on cache
|
|
|
|
* performance.
|
|
|
|
*
|
|
|
|
* Currently the metadata lists are hit first, MFU then MRU, followed by
|
|
|
|
* the data lists. This function returns a locked list, and also returns
|
|
|
|
* the lock pointer.
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
static multilist_sublist_t *
|
|
|
|
l2arc_sublist_lock(int list_num)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_t *ml = NULL;
|
|
|
|
unsigned int idx;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(list_num >= 0 && list_num <= 3);
|
|
|
|
|
|
|
|
switch (list_num) {
|
|
|
|
case 0:
|
2015-01-13 06:52:19 +03:00
|
|
|
ml = &arc_mfu->arcs_list[ARC_BUFC_METADATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 1:
|
2015-01-13 06:52:19 +03:00
|
|
|
ml = &arc_mru->arcs_list[ARC_BUFC_METADATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 2:
|
2015-01-13 06:52:19 +03:00
|
|
|
ml = &arc_mfu->arcs_list[ARC_BUFC_DATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
case 3:
|
2015-01-13 06:52:19 +03:00
|
|
|
ml = &arc_mru->arcs_list[ARC_BUFC_DATA];
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/*
|
|
|
|
* Return a randomly-selected sublist. This is acceptable
|
|
|
|
* because the caller feeds only a little bit of data for each
|
|
|
|
* call (8MB). Subsequent calls will result in different
|
|
|
|
* sublists being selected.
|
|
|
|
*/
|
|
|
|
idx = multilist_get_random_index(ml);
|
|
|
|
return (multilist_sublist_lock(ml, idx));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Evict buffers from the device write hand to the distance specified in
|
|
|
|
* bytes. This distance may span populated buffers, it may span nothing.
|
|
|
|
* This is clearing a region on the L2ARC device ready for writing.
|
|
|
|
* If the 'all' boolean is set, every buffer is evicted.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
|
|
|
|
{
|
|
|
|
list_t *buflist;
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr, *hdr_prev;
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock;
|
|
|
|
uint64_t taddr;
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
buflist = &dev->l2ad_buflist;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (!all && dev->l2ad_first) {
|
|
|
|
/*
|
|
|
|
* This is the first sweep through the device. There is
|
|
|
|
* nothing to evict.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* When nearing the end of the device, evict to the end
|
|
|
|
* before the device write hand jumps to the start.
|
|
|
|
*/
|
|
|
|
taddr = dev->l2ad_end;
|
|
|
|
} else {
|
|
|
|
taddr = dev->l2ad_hand + distance;
|
|
|
|
}
|
|
|
|
DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
|
|
|
|
uint64_t, taddr, boolean_t, all);
|
|
|
|
|
|
|
|
top:
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-06 20:24:32 +03:00
|
|
|
for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {
|
|
|
|
hdr_prev = list_prev(buflist, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We cannot use mutex_enter or else we can deadlock
|
|
|
|
* with l2arc_write_buffers (due to swapping the order
|
|
|
|
* the hash lock and l2ad_mtx are taken).
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
|
|
|
/*
|
|
|
|
* Missed the hash lock. Retry.
|
|
|
|
*/
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
goto top;
|
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (HDR_L2_WRITE_HEAD(hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We hit a write head node. Leave it for
|
|
|
|
* l2arc_write_done().
|
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
list_remove(buflist, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
if (!all && HDR_HAS_L2HDR(hdr) &&
|
|
|
|
(hdr->b_l2hdr.b_daddr > taddr ||
|
|
|
|
hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* We've evicted to the target address,
|
|
|
|
* or the end of the device.
|
|
|
|
*/
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(HDR_HAS_L2HDR(hdr));
|
|
|
|
if (!HDR_HAS_L1HDR(hdr)) {
|
2014-12-06 20:24:32 +03:00
|
|
|
ASSERT(!HDR_L2_READING(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This doesn't exist in the ARC. Destroy.
|
|
|
|
* arc_hdr_destroy() will call list_remove()
|
|
|
|
* and decrement arcstat_l2_size.
|
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_change_state(arc_anon, hdr, hash_lock);
|
|
|
|
arc_hdr_destroy(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2014-12-30 06:12:23 +03:00
|
|
|
ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_l1cached);
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Invalidate issued or about to be issued
|
|
|
|
* reads, since we may be about to write
|
|
|
|
* over this location.
|
|
|
|
*/
|
2014-12-06 20:24:32 +03:00
|
|
|
if (HDR_L2_READING(hdr)) {
|
2008-12-03 23:09:06 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_reading);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
/* Ensure this header has finished being written */
|
|
|
|
ASSERT(!HDR_L2_WRITING(hdr));
|
2015-06-16 02:12:19 +03:00
|
|
|
|
|
|
|
arc_hdr_l2hdr_destroy(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find and write ARC buffers to the L2ARC device.
|
|
|
|
*
|
2014-12-06 20:24:32 +03:00
|
|
|
* An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid
|
2008-11-20 23:01:55 +03:00
|
|
|
* for reading until they have completed writing.
|
2013-08-02 00:02:10 +04:00
|
|
|
* The headroom_boost is an in-out parameter used to maintain headroom boost
|
|
|
|
* state between calls to this function.
|
|
|
|
*
|
|
|
|
* Returns the number of bytes actually written (which may be smaller than
|
|
|
|
* the delta by which the device hand has changed due to alignment).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
static uint64_t
|
2016-06-02 07:04:53 +03:00
|
|
|
l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_buf_hdr_t *hdr, *hdr_prev, *head;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t write_asize, write_psize, write_sz, headroom;
|
2013-08-02 00:02:10 +04:00
|
|
|
boolean_t full;
|
2008-11-20 23:01:55 +03:00
|
|
|
l2arc_write_callback_t *cb;
|
|
|
|
zio_t *pio, *wzio;
|
2011-11-12 02:07:54 +04:00
|
|
|
uint64_t guid = spa_load_guid(spa);
|
2010-08-26 20:52:39 +04:00
|
|
|
int try;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(dev->l2ad_vdev, !=, NULL);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
pio = NULL;
|
2016-06-02 07:04:53 +03:00
|
|
|
write_sz = write_asize = write_psize = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
full = B_FALSE;
|
2014-12-30 06:12:23 +03:00
|
|
|
head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Copy buffers for L2ARC writing.
|
|
|
|
*/
|
2010-08-26 20:52:39 +04:00
|
|
|
for (try = 0; try <= 3; try++) {
|
2015-01-13 06:52:19 +03:00
|
|
|
multilist_sublist_t *mls = l2arc_sublist_lock(try);
|
2013-08-02 00:02:10 +04:00
|
|
|
uint64_t passed_sz = 0;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* L2ARC fast warmup.
|
|
|
|
*
|
|
|
|
* Until the ARC is warm and starts to evict, read from the
|
|
|
|
* head of the ARC lists rather than the tail.
|
|
|
|
*/
|
|
|
|
if (arc_warm == B_FALSE)
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr = multilist_sublist_head(mls);
|
2008-12-03 23:09:06 +03:00
|
|
|
else
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr = multilist_sublist_tail(mls);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
headroom = target_sz * l2arc_headroom;
|
2016-06-02 07:04:53 +03:00
|
|
|
if (zfs_compressed_arc_enabled)
|
2013-08-02 00:02:10 +04:00
|
|
|
headroom = (headroom * l2arc_headroom_boost) / 100;
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
for (; hdr; hdr = hdr_prev) {
|
2013-08-02 00:02:10 +04:00
|
|
|
kmutex_t *hash_lock;
|
2016-06-02 07:04:53 +03:00
|
|
|
uint64_t asize, size;
|
|
|
|
void *to_write;
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (arc_warm == B_FALSE)
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr_prev = multilist_sublist_next(mls, hdr);
|
2008-12-03 23:09:06 +03:00
|
|
|
else
|
2015-01-13 06:52:19 +03:00
|
|
|
hdr_prev = multilist_sublist_prev(mls, hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
2013-08-02 00:02:10 +04:00
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Skip this buffer rather than waiting.
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
passed_sz += HDR_GET_LSIZE(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (passed_sz > headroom) {
|
|
|
|
/*
|
|
|
|
* Searched too far.
|
|
|
|
*/
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2014-12-06 20:24:32 +03:00
|
|
|
if (!l2arc_write_eligible(guid, hdr)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
if ((write_asize + HDR_GET_LSIZE(hdr)) > target_sz) {
|
2008-11-20 23:01:55 +03:00
|
|
|
full = B_TRUE;
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (pio == NULL) {
|
|
|
|
/*
|
|
|
|
* Insert a dummy header on the buflist so
|
|
|
|
* l2arc_write_done() can find where the
|
|
|
|
* write buffers begin without searching.
|
|
|
|
*/
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-30 06:12:23 +03:00
|
|
|
list_insert_head(&dev->l2ad_buflist, head);
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-06-29 20:02:03 +03:00
|
|
|
cb = kmem_alloc(
|
|
|
|
sizeof (l2arc_write_callback_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
cb->l2wcb_dev = dev;
|
|
|
|
cb->l2wcb_head = head;
|
|
|
|
pio = zio_root(spa, l2arc_write_done, cb,
|
|
|
|
ZIO_FLAG_CANFAIL);
|
|
|
|
}
|
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
hdr->b_l2hdr.b_dev = dev;
|
|
|
|
hdr->b_l2hdr.b_hits = 0;
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
|
|
|
|
arc_hdr_set_flags(hdr,
|
|
|
|
ARC_FLAG_L2_WRITING | ARC_FLAG_HAS_L2HDR);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_enter(&dev->l2ad_mtx);
|
2014-12-30 06:12:23 +03:00
|
|
|
list_insert_head(&dev->l2ad_buflist, hdr);
|
2015-01-13 06:52:19 +03:00
|
|
|
mutex_exit(&dev->l2ad_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* We rely on the L1 portion of the header below, so
|
|
|
|
* it's invalid for this header to have been evicted out
|
|
|
|
* of the ghost cache, prior to being written out. The
|
|
|
|
* ARC_FLAG_L2_WRITING bit ensures this won't happen.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT(HDR_HAS_L1HDR(hdr));
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);
|
|
|
|
ASSERT3P(hdr->b_l1hdr.b_pdata, !=, NULL);
|
|
|
|
ASSERT3U(arc_hdr_size(hdr), >, 0);
|
|
|
|
size = arc_hdr_size(hdr);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
(void) refcount_add_many(&dev->l2ad_alloc, size, hdr);
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2016-02-10 21:42:01 +03:00
|
|
|
/*
|
2016-06-02 07:04:53 +03:00
|
|
|
* Normally the L2ARC can use the hdr's data, but if
|
|
|
|
* we're sharing data between the hdr and one of its
|
|
|
|
* bufs, L2ARC needs its own copy of the data so that
|
|
|
|
* the ZIO below can't race with the buf consumer. To
|
|
|
|
* ensure that this copy will be available for the
|
|
|
|
* lifetime of the ZIO and be cleaned up afterwards, we
|
|
|
|
* add it to the l2arc_free_on_write queue.
|
2016-02-10 21:42:01 +03:00
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
if (!HDR_SHARED_DATA(hdr)) {
|
|
|
|
to_write = hdr->b_l1hdr.b_pdata;
|
|
|
|
} else {
|
|
|
|
arc_buf_contents_t type = arc_buf_type(hdr);
|
|
|
|
if (type == ARC_BUFC_METADATA) {
|
|
|
|
to_write = zio_buf_alloc(size);
|
|
|
|
} else {
|
|
|
|
ASSERT3U(type, ==, ARC_BUFC_DATA);
|
|
|
|
to_write = zio_data_buf_alloc(size);
|
|
|
|
}
|
2016-02-10 21:42:01 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
bcopy(hdr->b_l1hdr.b_pdata, to_write, size);
|
|
|
|
l2arc_free_data_on_write(to_write, size, type);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
wzio = zio_write_phys(pio, dev->l2ad_vdev,
|
2016-06-02 07:04:53 +03:00
|
|
|
hdr->b_l2hdr.b_daddr, size, to_write,
|
|
|
|
ZIO_CHECKSUM_OFF, NULL, hdr,
|
|
|
|
ZIO_PRIORITY_ASYNC_WRITE,
|
2008-11-20 23:01:55 +03:00
|
|
|
ZIO_FLAG_CANFAIL, B_FALSE);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
write_sz += HDR_GET_LSIZE(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
|
|
|
|
zio_t *, wzio);
|
2015-06-16 02:12:19 +03:00
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
write_asize += size;
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Keep the clock hand suitably device-aligned.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
asize = vdev_psize_to_asize(dev->l2ad_vdev, size);
|
|
|
|
write_psize += asize;
|
|
|
|
dev->l2ad_hand += asize;
|
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
(void) zio_nowait(wzio);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
|
|
|
|
multilist_sublist_unlock(mls);
|
|
|
|
|
|
|
|
if (full == B_TRUE)
|
|
|
|
break;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
/* No buffers selected for writing? */
|
|
|
|
if (pio == NULL) {
|
|
|
|
ASSERT0(write_sz);
|
|
|
|
ASSERT(!HDR_HAS_L1HDR(head));
|
|
|
|
kmem_cache_free(hdr_l2only_cache, head);
|
|
|
|
return (0);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
ASSERT3U(write_asize, <=, target_sz);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_sent);
|
2013-08-02 00:02:10 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_write_bytes, write_asize);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_size, write_sz);
|
2016-06-02 07:04:53 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_asize, write_asize);
|
|
|
|
vdev_space_update(dev->l2ad_vdev, write_asize, 0, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Bump device hand to the device start if it is approaching the end.
|
|
|
|
* l2arc_evict() will already have evicted ahead for this case.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
dev->l2ad_hand = dev->l2ad_start;
|
|
|
|
dev->l2ad_first = B_FALSE;
|
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
dev->l2ad_writing = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) zio_wait(pio);
|
2009-02-18 23:51:31 +03:00
|
|
|
dev->l2ad_writing = B_FALSE;
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
return (write_asize);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This thread feeds the L2ARC at regular intervals. This is the beating
|
|
|
|
* heart of the L2ARC.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_feed_thread(void)
|
|
|
|
{
|
|
|
|
callb_cpr_t cpr;
|
|
|
|
l2arc_dev_t *dev;
|
|
|
|
spa_t *spa;
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t size, wrote;
|
2010-05-29 00:45:14 +04:00
|
|
|
clock_t begin, next = ddi_get_lbolt();
|
2015-03-31 06:43:29 +03:00
|
|
|
fstrans_cookie_t cookie;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_feed_thr_lock);
|
|
|
|
|
2015-03-31 06:43:29 +03:00
|
|
|
cookie = spl_fstrans_mark();
|
2008-11-20 23:01:55 +03:00
|
|
|
while (l2arc_thread_exit == 0) {
|
|
|
|
CALLB_CPR_SAFE_BEGIN(&cpr);
|
2015-06-11 20:47:19 +03:00
|
|
|
(void) cv_timedwait_sig(&l2arc_feed_thr_cv,
|
2010-12-10 23:00:00 +03:00
|
|
|
&l2arc_feed_thr_lock, next);
|
2008-11-20 23:01:55 +03:00
|
|
|
CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
next = ddi_get_lbolt() + hz;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Quick check for L2ARC devices.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
if (l2arc_ndev == 0) {
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
continue;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
2010-05-29 00:45:14 +04:00
|
|
|
begin = ddi_get_lbolt();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* This selects the next l2arc device to write to, and in
|
|
|
|
* doing so the next spa to feed from: dev->l2ad_spa. This
|
|
|
|
* will return NULL if there are now no l2arc devices or if
|
|
|
|
* they are all faulted.
|
|
|
|
*
|
|
|
|
* If a device is returned, its spa's config lock is also
|
|
|
|
* held to prevent device removal. l2arc_dev_get_next()
|
|
|
|
* will grab and release l2arc_dev_mtx.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if ((dev = l2arc_dev_get_next()) == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
spa = dev->l2ad_spa;
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(spa, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
|
|
|
* If the pool is read-only then force the feed thread to
|
|
|
|
* sleep a little longer.
|
|
|
|
*/
|
|
|
|
if (!spa_writeable(spa)) {
|
|
|
|
next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Avoid contributing to memory pressure.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-06-26 21:28:18 +03:00
|
|
|
if (arc_reclaim_needed()) {
|
2008-12-03 23:09:06 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_feeds);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
size = l2arc_write_size();
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Evict L2ARC buffers that will be overwritten.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_evict(dev, size, B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Write ARC buffers.
|
|
|
|
*/
|
2016-06-02 07:04:53 +03:00
|
|
|
wrote = l2arc_write_buffers(spa, dev, size);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate interval between writes.
|
|
|
|
*/
|
|
|
|
next = l2arc_write_interval(begin, size, wrote);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-03-31 06:43:29 +03:00
|
|
|
spl_fstrans_unmark(cookie);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
l2arc_thread_exit = 0;
|
|
|
|
cv_broadcast(&l2arc_feed_thr_cv);
|
|
|
|
CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
boolean_t
|
|
|
|
l2arc_vdev_present(vdev_t *vd)
|
|
|
|
{
|
|
|
|
l2arc_dev_t *dev;
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
for (dev = list_head(l2arc_dev_list); dev != NULL;
|
|
|
|
dev = list_next(l2arc_dev_list, dev)) {
|
|
|
|
if (dev->l2ad_vdev == vd)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
return (dev != NULL);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Add a vdev for use by the L2ARC. By this point the spa has already
|
|
|
|
* validated the vdev and opened it.
|
|
|
|
*/
|
|
|
|
void
|
2009-07-03 02:44:48 +04:00
|
|
|
l2arc_add_vdev(spa_t *spa, vdev_t *vd)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
l2arc_dev_t *adddev;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(!l2arc_vdev_present(vd));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Create a new l2arc device entry.
|
|
|
|
*/
|
|
|
|
adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
|
|
|
|
adddev->l2ad_spa = spa;
|
|
|
|
adddev->l2ad_vdev = vd;
|
2009-07-03 02:44:48 +04:00
|
|
|
adddev->l2ad_start = VDEV_LABEL_START_SIZE;
|
|
|
|
adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
adddev->l2ad_hand = adddev->l2ad_start;
|
|
|
|
adddev->l2ad_first = B_TRUE;
|
2009-02-18 23:51:31 +03:00
|
|
|
adddev->l2ad_writing = B_FALSE;
|
2010-08-26 21:26:44 +04:00
|
|
|
list_link_init(&adddev->l2ad_node);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-12-30 06:12:23 +03:00
|
|
|
mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This is a list of all ARC buffers that are still valid on the
|
|
|
|
* device.
|
|
|
|
*/
|
2014-12-30 06:12:23 +03:00
|
|
|
list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
|
2015-06-16 02:12:19 +03:00
|
|
|
refcount_create(&adddev->l2ad_alloc);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Add device to global list
|
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
list_insert_head(l2arc_dev_list, adddev);
|
|
|
|
atomic_inc_64(&l2arc_ndev);
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a vdev from the L2ARC.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
l2arc_remove_vdev(vdev_t *vd)
|
|
|
|
{
|
|
|
|
l2arc_dev_t *dev, *nextdev, *remdev = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the device by vdev
|
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
|
|
|
|
nextdev = list_next(l2arc_dev_list, dev);
|
|
|
|
if (vd == dev->l2ad_vdev) {
|
|
|
|
remdev = dev;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2016-06-02 07:04:53 +03:00
|
|
|
ASSERT3P(remdev, !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove device from global list
|
|
|
|
*/
|
|
|
|
list_remove(l2arc_dev_list, remdev);
|
|
|
|
l2arc_dev_last = NULL; /* may have been invalidated */
|
2008-12-03 23:09:06 +03:00
|
|
|
atomic_dec_64(&l2arc_ndev);
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Clear all buflists and ARC references. L2ARC device flush.
|
|
|
|
*/
|
|
|
|
l2arc_evict(remdev, 0, B_TRUE);
|
2014-12-30 06:12:23 +03:00
|
|
|
list_destroy(&remdev->l2ad_buflist);
|
|
|
|
mutex_destroy(&remdev->l2ad_mtx);
|
2015-06-16 02:12:19 +03:00
|
|
|
refcount_destroy(&remdev->l2ad_alloc);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(remdev, sizeof (l2arc_dev_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_init(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
l2arc_thread_exit = 0;
|
|
|
|
l2arc_ndev = 0;
|
|
|
|
l2arc_writes_sent = 0;
|
|
|
|
l2arc_writes_done = 0;
|
|
|
|
|
|
|
|
mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
|
|
|
|
l2arc_dev_list = &L2ARC_dev_list;
|
|
|
|
l2arc_free_on_write = &L2ARC_free_on_write;
|
|
|
|
list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
|
|
|
|
offsetof(l2arc_dev_t, l2ad_node));
|
|
|
|
list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
|
|
|
|
offsetof(l2arc_data_free_t, l2df_list_node));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_fini(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* This is called from dmu_fini(), which is called from spa_fini();
|
|
|
|
* Because of this, we can assume that all l2arc devices have
|
|
|
|
* already been removed when the pools themselves were removed.
|
|
|
|
*/
|
|
|
|
|
|
|
|
l2arc_do_free_on_write();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_destroy(&l2arc_feed_thr_lock);
|
|
|
|
cv_destroy(&l2arc_feed_thr_cv);
|
|
|
|
mutex_destroy(&l2arc_dev_mtx);
|
|
|
|
mutex_destroy(&l2arc_free_on_write_mtx);
|
|
|
|
|
|
|
|
list_destroy(l2arc_dev_list);
|
|
|
|
list_destroy(l2arc_free_on_write);
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
void
|
|
|
|
l2arc_start(void)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
if (!(spa_mode_global & FWRITE))
|
2008-12-03 23:09:06 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
|
2015-07-24 20:08:31 +03:00
|
|
|
TS_RUN, defclsyspri);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
l2arc_stop(void)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
if (!(spa_mode_global & FWRITE))
|
2008-12-03 23:09:06 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_feed_thr_lock);
|
|
|
|
cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
|
|
|
|
l2arc_thread_exit = 1;
|
|
|
|
while (l2arc_thread_exit != 0)
|
|
|
|
cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
|
|
|
|
mutex_exit(&l2arc_feed_thr_lock);
|
|
|
|
}
|
2010-08-26 22:49:16 +04:00
|
|
|
|
|
|
|
#if defined(_KERNEL) && defined(HAVE_SPL)
|
2014-11-13 21:09:05 +03:00
|
|
|
EXPORT_SYMBOL(arc_buf_size);
|
|
|
|
EXPORT_SYMBOL(arc_write);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(arc_read);
|
2013-10-03 04:11:19 +04:00
|
|
|
EXPORT_SYMBOL(arc_buf_info);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(arc_getbuf_func);
|
2011-12-23 00:20:43 +04:00
|
|
|
EXPORT_SYMBOL(arc_add_prune_callback);
|
|
|
|
EXPORT_SYMBOL(arc_remove_prune_callback);
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_min, ulong, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_min, "Min arc size");
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_max, ulong, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_max, "Max arc size");
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_meta_limit, ulong, 0644);
|
2010-08-26 22:49:16 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_limit, "Meta limit for arc size");
|
2011-03-31 05:59:17 +04:00
|
|
|
|
2016-08-11 06:15:37 +03:00
|
|
|
module_param(zfs_arc_meta_limit_percent, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_limit_percent,
|
|
|
|
"Percent of arc size for arc meta limit");
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
module_param(zfs_arc_meta_min, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_min, "Min arc metadata");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_meta_prune, int, 0644);
|
2015-03-18 01:07:47 +03:00
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_prune, "Meta objects to scan for prune");
|
2011-05-04 02:09:28 +04:00
|
|
|
|
2015-06-26 21:28:18 +03:00
|
|
|
module_param(zfs_arc_meta_adjust_restarts, int, 0644);
|
2015-03-18 01:08:22 +03:00
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_adjust_restarts,
|
|
|
|
"Limit number of restarts in arc_adjust_meta");
|
|
|
|
|
2015-05-30 17:57:53 +03:00
|
|
|
module_param(zfs_arc_meta_strategy, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_strategy, "Meta reclaim strategy");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_grow_retry, int, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_grow_retry, "Seconds before growing arc size");
|
|
|
|
|
Disable aggressive arc_p growth by default
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-11 21:40:13 +04:00
|
|
|
module_param(zfs_arc_p_aggressive_disable, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow");
|
|
|
|
|
2014-01-03 22:36:26 +04:00
|
|
|
module_param(zfs_arc_p_dampener_disable, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_p_dampener_disable, "disable arc_p adapt dampener");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_shrink_shift, int, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)");
|
|
|
|
|
2015-06-27 01:59:23 +03:00
|
|
|
module_param(zfs_arc_p_min_shift, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_p_min_shift, "arc_c shift to calc min/max arc_p");
|
|
|
|
|
2014-08-20 21:09:40 +04:00
|
|
|
module_param(zfs_arc_average_blocksize, int, 0444);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_average_blocksize, "Target average block size");
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
module_param(zfs_compressed_arc_enabled, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_average_blocksize, "Disable compressed arc buffers");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_min_prefetch_lifespan, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_min_prefetch_lifespan, "Min life of prefetch block");
|
|
|
|
|
2015-01-13 06:52:19 +03:00
|
|
|
module_param(zfs_arc_num_sublists_per_state, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_num_sublists_per_state,
|
|
|
|
"Number of sublists used in each of the ARC state lists");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_write_max, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_write_max, "Max write bytes per interval");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_write_boost, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_write_boost, "Extra write bytes during device warmup");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_headroom, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_headroom, "Number of max device writes to precache");
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
module_param(l2arc_headroom_boost, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(l2arc_headroom_boost, "Compressed l2arc_headroom multiplier");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_feed_secs, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_feed_secs, "Seconds between L2ARC writing");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_feed_min_ms, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_feed_min_ms, "Min feed interval in milliseconds");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_noprefetch, int, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_noprefetch, "Skip caching prefetched buffers");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_feed_again, int, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_feed_again, "Turbo L2ARC warmup");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_norw, int, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_norw, "No reads during writes");
|
|
|
|
|
2015-07-28 21:30:00 +03:00
|
|
|
module_param(zfs_arc_lotsfree_percent, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_lotsfree_percent,
|
|
|
|
"System free memory I/O throttle in bytes");
|
|
|
|
|
2015-07-27 23:17:32 +03:00
|
|
|
module_param(zfs_arc_sys_free, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_sys_free, "System free memory target size in bytes");
|
|
|
|
|
2016-07-13 15:42:40 +03:00
|
|
|
module_param(zfs_arc_dnode_limit, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_dnode_limit, "Minimum bytes of dnodes in arc");
|
|
|
|
|
2016-08-11 06:15:37 +03:00
|
|
|
module_param(zfs_arc_dnode_limit_percent, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_dnode_limit_percent,
|
|
|
|
"Percent of ARC meta buffers for dnodes");
|
|
|
|
|
2016-07-13 15:42:40 +03:00
|
|
|
module_param(zfs_arc_dnode_reduce_percent, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_dnode_reduce_percent,
|
|
|
|
"Percentage of excess dnodes to try to unpin");
|
|
|
|
|
2010-08-26 22:49:16 +04:00
|
|
|
#endif
|