2008-11-20 23:01:55 +03:00
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2011, 2014 by Delphix. All rights reserved.
 * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
 * Copyright 2014 Nexenta Systems, Inc. All rights reserved.
 */

/*
 * DVA-based Adjustable Replacement Cache
 *
 * While much of the theory of operation used here is
 * based on the self-tuning, low overhead replacement cache
 * presented by Megiddo and Modha at FAST 2003, there are some
 * significant differences:
 *
 * 1. The Megiddo and Modha model assumes any page is evictable.
 * Pages in its cache cannot be "locked" into memory. This makes
 * the eviction algorithm simple: evict the last page in the list.
 * This also makes the performance characteristics easy to reason
 * about. Our cache is not so simple. At any given moment, some
 * subset of the blocks in the cache are un-evictable because we
 * have handed out a reference to them. Blocks are only evictable
 * when there are no external references active. This makes
 * eviction far more problematic: we choose to evict the evictable
 * blocks that are the "lowest" in the list.
 *
 * There are times when it is not possible to evict the requested
 * space. In these circumstances we are unable to adjust the cache
 * size. To prevent the cache growing unbounded at these times we
 * implement a "cache throttle" that slows the flow of new data
 * into the cache until we can make space available.
 *
 * 2. The Megiddo and Modha model assumes a fixed cache size.
 * Pages are evicted when the cache is full and there is a cache
 * miss. Our model has a variable sized cache. It grows with
 * high use, but also tries to react to memory pressure from the
 * operating system: decreasing its size when system memory is
 * tight.
 *
 * 3. The Megiddo and Modha model assumes a fixed page size. All
 * elements of the cache are therefore exactly the same size. So
 * when adjusting the cache size following a cache miss, it's simply
 * a matter of choosing a single page to evict. In our model, we
 * have variable sized cache blocks (ranging from 512 bytes to
 * 128K bytes). We therefore choose a set of blocks to evict to make
 * space for a cache miss that approximates as closely as possible
 * the space used by the new block.
 *
 * See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
 * by N. Megiddo & D. Modha, FAST 2003
 */
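
As an illustration of differences 1 and 3 in the comment above, here is a minimal, self-contained sketch. It is not the code in this file. It walks a list from the "lowest" entry upward, skips blocks that still have external references, and evicts variable-sized blocks until roughly the requested space has been freed. The type sketch_blk_t and the function sketch_evict() are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached block; not the real arc_buf_hdr_t. */
typedef struct sketch_blk {
	uint64_t size;		/* block size in bytes (variable) */
	int refcnt;		/* external references; > 0 means un-evictable */
	int evicted;		/* set once the block has been evicted */
} sketch_blk_t;

/*
 * Walk the list from the "lowest" entry (index 0) upward, evicting
 * evictable blocks until at least `needed` bytes have been freed.
 * Returns the number of bytes actually freed, which may be less than
 * `needed` when too many blocks are still referenced.
 */
static uint64_t
sketch_evict(sketch_blk_t *list, size_t n, uint64_t needed)
{
	uint64_t freed = 0;

	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n && freed < needed; i++) {
		if (list[i].evicted || list[i].refcnt > 0)
			continue;	/* already gone, or still referenced */
		list[i].evicted = 1;
		freed += list[i].size;
	}
	return (freed);
}
```

Because blocks differ in size, the freed total can only approximate the request: it may overshoot by part of one block, or fall short when referenced blocks pin too much of the list. The second case is the situation the "cache throttle" described above is meant to handle.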

/*
 * The locking model:
 *
 * A new reference to a cache buffer can be obtained in two
 * ways: 1) via a hash table lookup using the DVA as a key,
 * or 2) via one of the ARC lists. The arc_read() interface
 * uses method 1, while the internal arc algorithms for
 * adjusting the cache use method 2. We therefore provide two
 * types of locks: 1) the hash table lock array, and 2) the
 * arc list locks.
 *
 * Buffers do not have their own mutexes, rather they rely on the
 * hash table mutexes for the bulk of their protection (i.e. most
 * fields in the arc_buf_hdr_t are protected by these mutexes).
 *
 * buf_hash_find() returns the appropriate mutex (held) when it
 * locates the requested buffer in the hash table. It returns
 * NULL for the mutex if the buffer was not in the table.
 *
 * buf_hash_remove() expects the appropriate hash mutex to be
 * already held before it is invoked.
 *
 * Each arc state also has a mutex which is used to protect the
 * buffer list associated with the state. When attempting to
 * obtain a hash table lock while holding an arc list lock you
 * must use mutex_tryenter() to avoid deadlock. Also note that
 * the active state mutex must be held before the ghost state mutex.
 *
 * Arc buffers may have an associated eviction callback function.
 * This function will be invoked prior to removing the buffer (e.g.
 * in arc_do_user_evicts()). Note however that the data associated
 * with the buffer may be evicted prior to the callback. The callback
 * must be made with *no locks held* (to prevent deadlock). Additionally,
 * the users of callbacks must ensure that their private data is
 * protected from simultaneous callbacks from arc_buf_evict()
 * and arc_do_user_evicts().
 *
 * It is also possible to register a callback which is run when the
 * arc_meta_limit is reached and no buffers can be safely evicted. In
 * this case the arc user should drop a reference on some arc buffers so
 * they can be reclaimed and the arc_meta_limit honored. For example,
 * when using the ZPL each dentry holds a reference on a znode. These
 * dentries must be pruned before the arc buffer holding the znode can
 * be safely evicted.
 *
 * Note that the majority of the performance stats are manipulated
 * with atomic operations.
 *
 * The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
 *
 * - L2ARC buflist creation
 * - L2ARC buflist eviction
 * - L2ARC write completion, which walks L2ARC buflists
 * - ARC header destruction, as it removes from L2ARC buflists
 * - ARC header release, as it removes from L2ARC buflists
 */
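
The try-lock rule in the comment above can be sketched in userland as follows. This is only an illustration under stated assumptions: POSIX pthread_mutex_trylock() stands in for the kernel's mutex_tryenter(), and sketch_list_lock, sketch_hash_lock, and sketch_visit_from_list() are hypothetical names that do not appear in this file.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for an arc list lock and a hash table lock. */
static pthread_mutex_t sketch_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sketch_hash_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Visit a buffer starting from the list side.  The list lock is taken
 * first, so the hash lock may only be *tried*: blocking on it here
 * could deadlock against a thread that holds the hash lock and is
 * waiting for the list lock.
 * Returns 1 if the buffer was processed, 0 if it was skipped.
 */
static int
sketch_visit_from_list(void)
{
	int processed = 0;

	pthread_mutex_lock(&sketch_list_lock);
	if (pthread_mutex_trylock(&sketch_hash_lock) == 0) {
		/* Both locks held: safe to examine the buffer here. */
		processed = 1;
		pthread_mutex_unlock(&sketch_hash_lock);
	}
	/* On trylock failure we simply skip this buffer and move on. */
	pthread_mutex_unlock(&sketch_list_lock);
	return (processed);
}
```

Skipping a busy buffer instead of waiting for it trades a little eviction precision for the guarantee that the two lock orders can never wait on each other.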

#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/zio_compress.h>
#include <sys/zfs_context.h>
#include <sys/arc.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
#include <sys/dsl_pool.h>
#ifdef _KERNEL
#include <sys/vmsystm.h>
#include <vm/anon.h>
#include <sys/fs/swapnode.h>
#include <sys/zpl.h>
#endif
#include <sys/callb.h>
#include <sys/kstat.h>
#include <sys/dmu_tx.h>
#include <zfs_fletcher.h>

#ifndef _KERNEL
/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
boolean_t arc_watch = B_FALSE;
#endif

static kmutex_t		arc_reclaim_thr_lock;
static kcondvar_t	arc_reclaim_thr_cv;	/* used to signal reclaim thr */
static uint8_t		arc_thread_exit;

/* number of bytes to prune from caches when arc_meta_limit is reached */
int zfs_arc_meta_prune = 1048576;

typedef enum arc_reclaim_strategy {
	ARC_RECLAIM_AGGR,		/* Aggressive reclaim strategy */
	ARC_RECLAIM_CONS		/* Conservative reclaim strategy */
} arc_reclaim_strategy_t;

/*
 * The number of iterations through arc_evict_*() before we
 * drop & reacquire the lock.
 */
int arc_evict_iterations = 100;

/* number of seconds before growing cache again */
int zfs_arc_grow_retry = 5;

Disable aggressive arc_p growth by default
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constantly drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-11 21:40:13 +04:00
/* disable anon data aggressively growing arc_p */
int zfs_arc_p_aggressive_disable = 1;

/* disable arc_p adapt dampener in arc_adapt */
int zfs_arc_p_dampener_disable = 1;

/* log2(fraction of arc to reclaim) */
int zfs_arc_shrink_shift = 5;
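
One way to read "log2(fraction of arc to reclaim)" is sketched below, assuming the shift is applied to the current cache size, so that a shift of 5 corresponds to reclaiming 1/32 of it. How the tunable is actually consumed is not shown in this section, so treat that as an assumption; sketch_reclaim_bytes() is a hypothetical helper, not a function from this file.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes to reclaim when the shift is
 * interpreted as log2 of the denominator, i.e. cache_size / 2^shift.
 */
static uint64_t
sketch_reclaim_bytes(uint64_t cache_size, int shrink_shift)
{
	assert(shrink_shift >= 0 && shrink_shift < 64);
	return (cache_size >> shrink_shift);
}
```

Under this reading, a 1 GiB cache with the default shift of 5 would give up 32 MiB per shrink step; a larger shift makes each step smaller.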

/*
 * minimum lifespan of a prefetch block in clock ticks
 * (initialized in arc_init())
 */
int zfs_arc_min_prefetch_lifespan = HZ;

/* disable the proactive arc throttle due to low memory */
int zfs_arc_memory_throttle_disable = 1;

/* disable duplicate buffer eviction */
int zfs_disable_dup_eviction = 0;

Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
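The 32-bit overflow described in the tunables note above can be illustrated with a small sketch. It is only a model: the helper names store_as_ulong32() and dirty_max_from_percent() are hypothetical and do not exist in the port, and a uint32_t stands in for an unsigned long on a 32-bit build.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: models storing a 64-bit byte count in a 32-bit
 * unsigned long, as would happen on a 32-bit build.
 */
static uint32_t
store_as_ulong32(uint64_t bytes)
{
	return ((uint32_t)bytes);	/* keeps only the low 32 bits */
}

/*
 * Hypothetical helper: derive the cap from a percentage of physical
 * memory instead of a fixed 2^32 constant.
 */
static uint64_t
dirty_max_from_percent(uint64_t physmem_bytes, unsigned percent)
{
	return (physmem_bytes / 100 * percent);
}
```

With 2 GiB of physical memory and 25 percent, the derived cap fits in 32 bits, whereas a fixed value of 2^32 truncates to zero.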
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
/*
 * If this percent of memory is free, don't throttle.
 */
int arc_lotsfree_percent = 10;

static int arc_dead;

/* expiration time for arc_no_grow */
static clock_t arc_grow_time = 0;

/*
 * The arc has filled available memory and has now warmed up.
 */
static boolean_t arc_warm;

/*
 * These tunables are for performance analysis.
 */
unsigned long zfs_arc_max = 0;
unsigned long zfs_arc_min = 0;
unsigned long zfs_arc_meta_limit = 0;

/*
 * Note that buffers can be in one of 6 states:
 *	ARC_anon	- anonymous (discussed below)
 *	ARC_mru		- recently used, currently cached
 *	ARC_mru_ghost	- recently used, no longer in cache
 *	ARC_mfu		- frequently used, currently cached
 *	ARC_mfu_ghost	- frequently used, no longer in cache
 *	ARC_l2c_only	- exists in L2ARC but not other states
 * When there are no active references to a buffer, it is
 * linked onto a list in one of these arc states. These are
 * the only buffers that can be evicted or deleted. Within each
 * state there are multiple lists, one for meta-data and one for
 * non-meta-data. Meta-data (indirect blocks, blocks of dnodes,
 * etc.) is tracked separately so that it can be managed more
 * explicitly: favored over data, limited explicitly.
 *
 * Anonymous buffers are buffers that are not associated with
 * a DVA. These are buffers that hold dirty block copies
 * before they are written to stable storage. By definition,
 * they are "ref'd" and are considered part of arc_mru
 * that cannot be freed. Generally, they will acquire a DVA
 * as they are written and migrate onto the arc_mru list.
 *
 * The ARC_l2c_only state is for buffers that are in the second
 * level ARC but no longer in any of the ARC_m* lists. The second
 * level ARC itself may also contain buffers that are in any of
 * the ARC_m* states - meaning that a buffer can exist in two
 * places. The reason for the ARC_l2c_only state is to keep the
 * buffer header in the hash table, so that reads that hit the
 * second level ARC benefit from these fast lookups.
 */

typedef struct arc_state {
	list_t	arcs_list[ARC_BUFC_NUMTYPES];	/* list of evictable buffers */
	uint64_t arcs_lsize[ARC_BUFC_NUMTYPES];	/* amount of evictable data */
	uint64_t arcs_size;	/* total amount of data in this state */
	kmutex_t arcs_mtx;
	arc_state_type_t arcs_state;
} arc_state_t;

/* The 6 states: */
static arc_state_t ARC_anon;
static arc_state_t ARC_mru;
static arc_state_t ARC_mru_ghost;
static arc_state_t ARC_mfu;
static arc_state_t ARC_mfu_ghost;
static arc_state_t ARC_l2c_only;

typedef struct arc_stats {
	kstat_named_t arcstat_hits;
	kstat_named_t arcstat_misses;
	kstat_named_t arcstat_demand_data_hits;
	kstat_named_t arcstat_demand_data_misses;
	kstat_named_t arcstat_demand_metadata_hits;
	kstat_named_t arcstat_demand_metadata_misses;
	kstat_named_t arcstat_prefetch_data_hits;
	kstat_named_t arcstat_prefetch_data_misses;
	kstat_named_t arcstat_prefetch_metadata_hits;
	kstat_named_t arcstat_prefetch_metadata_misses;
	kstat_named_t arcstat_mru_hits;
	kstat_named_t arcstat_mru_ghost_hits;
	kstat_named_t arcstat_mfu_hits;
	kstat_named_t arcstat_mfu_ghost_hits;
	kstat_named_t arcstat_deleted;
	kstat_named_t arcstat_recycle_miss;
	/*
	 * Number of buffers that could not be evicted because the hash lock
	 * was held by another thread. The lock may not necessarily be held
	 * by something using the same buffer, since hash locks are shared
	 * by multiple buffers.
	 */
	kstat_named_t arcstat_mutex_miss;
	/*
	 * Number of buffers skipped because they have I/O in progress, are
	 * indirect prefetch buffers that have not lived long enough, or are
	 * not from the spa we're trying to evict from.
	 */
	kstat_named_t arcstat_evict_skip;
	kstat_named_t arcstat_evict_l2_cached;
	kstat_named_t arcstat_evict_l2_eligible;
	kstat_named_t arcstat_evict_l2_ineligible;
	kstat_named_t arcstat_hash_elements;
	kstat_named_t arcstat_hash_elements_max;
	kstat_named_t arcstat_hash_collisions;
	kstat_named_t arcstat_hash_chains;
	kstat_named_t arcstat_hash_chain_max;
	kstat_named_t arcstat_p;
	kstat_named_t arcstat_c;
	kstat_named_t arcstat_c_min;
	kstat_named_t arcstat_c_max;
	kstat_named_t arcstat_size;
	kstat_named_t arcstat_hdr_size;
	kstat_named_t arcstat_data_size;
	kstat_named_t arcstat_meta_size;
	kstat_named_t arcstat_other_size;
	kstat_named_t arcstat_anon_size;
	kstat_named_t arcstat_anon_evict_data;
	kstat_named_t arcstat_anon_evict_metadata;
	kstat_named_t arcstat_mru_size;
	kstat_named_t arcstat_mru_evict_data;
	kstat_named_t arcstat_mru_evict_metadata;
	kstat_named_t arcstat_mru_ghost_size;
	kstat_named_t arcstat_mru_ghost_evict_data;
	kstat_named_t arcstat_mru_ghost_evict_metadata;
	kstat_named_t arcstat_mfu_size;
	kstat_named_t arcstat_mfu_evict_data;
	kstat_named_t arcstat_mfu_evict_metadata;
	kstat_named_t arcstat_mfu_ghost_size;
	kstat_named_t arcstat_mfu_ghost_evict_data;
	kstat_named_t arcstat_mfu_ghost_evict_metadata;
	kstat_named_t arcstat_l2_hits;
	kstat_named_t arcstat_l2_misses;
	kstat_named_t arcstat_l2_feeds;
	kstat_named_t arcstat_l2_rw_clash;
	kstat_named_t arcstat_l2_read_bytes;
	kstat_named_t arcstat_l2_write_bytes;
	kstat_named_t arcstat_l2_writes_sent;
	kstat_named_t arcstat_l2_writes_done;
	kstat_named_t arcstat_l2_writes_error;
	kstat_named_t arcstat_l2_writes_hdr_miss;
	kstat_named_t arcstat_l2_evict_lock_retry;
	kstat_named_t arcstat_l2_evict_reading;
	kstat_named_t arcstat_l2_free_on_write;
	kstat_named_t arcstat_l2_abort_lowmem;
	kstat_named_t arcstat_l2_cksum_bad;
	kstat_named_t arcstat_l2_io_error;
	kstat_named_t arcstat_l2_size;
	kstat_named_t arcstat_l2_asize;
	kstat_named_t arcstat_l2_hdr_size;
	kstat_named_t arcstat_l2_compress_successes;
	kstat_named_t arcstat_l2_compress_zeros;
	kstat_named_t arcstat_l2_compress_failures;
	kstat_named_t arcstat_memory_throttle_count;
	kstat_named_t arcstat_duplicate_buffers;
	kstat_named_t arcstat_duplicate_buffers_size;
	kstat_named_t arcstat_duplicate_reads;
	kstat_named_t arcstat_memory_direct_count;
	kstat_named_t arcstat_memory_indirect_count;
	kstat_named_t arcstat_no_grow;
	kstat_named_t arcstat_tempreserve;
	kstat_named_t arcstat_loaned_bytes;
	kstat_named_t arcstat_prune;
	kstat_named_t arcstat_meta_used;
	kstat_named_t arcstat_meta_limit;
	kstat_named_t arcstat_meta_max;
} arc_stats_t;

static arc_stats_t arc_stats = {
	{ "hits", KSTAT_DATA_UINT64 },
	{ "misses", KSTAT_DATA_UINT64 },
	{ "demand_data_hits", KSTAT_DATA_UINT64 },
	{ "demand_data_misses", KSTAT_DATA_UINT64 },
	{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
	{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
	{ "mru_hits", KSTAT_DATA_UINT64 },
	{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
	{ "mfu_hits", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
	{ "deleted", KSTAT_DATA_UINT64 },
	{ "recycle_miss", KSTAT_DATA_UINT64 },
	{ "mutex_miss", KSTAT_DATA_UINT64 },
	{ "evict_skip", KSTAT_DATA_UINT64 },
	{ "evict_l2_cached", KSTAT_DATA_UINT64 },
	{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
	{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
	{ "hash_elements", KSTAT_DATA_UINT64 },
	{ "hash_elements_max", KSTAT_DATA_UINT64 },
	{ "hash_collisions", KSTAT_DATA_UINT64 },
	{ "hash_chains", KSTAT_DATA_UINT64 },
	{ "hash_chain_max", KSTAT_DATA_UINT64 },
	{ "p", KSTAT_DATA_UINT64 },
	{ "c", KSTAT_DATA_UINT64 },
	{ "c_min", KSTAT_DATA_UINT64 },
	{ "c_max", KSTAT_DATA_UINT64 },
	{ "size", KSTAT_DATA_UINT64 },
	{ "hdr_size", KSTAT_DATA_UINT64 },
	{ "data_size", KSTAT_DATA_UINT64 },
	{ "meta_size", KSTAT_DATA_UINT64 },
	{ "other_size", KSTAT_DATA_UINT64 },
	{ "anon_size", KSTAT_DATA_UINT64 },
	{ "anon_evict_data", KSTAT_DATA_UINT64 },
	{ "anon_evict_metadata", KSTAT_DATA_UINT64 },
	{ "mru_size", KSTAT_DATA_UINT64 },
	{ "mru_evict_data", KSTAT_DATA_UINT64 },
	{ "mru_evict_metadata", KSTAT_DATA_UINT64 },
	{ "mru_ghost_size", KSTAT_DATA_UINT64 },
	{ "mru_ghost_evict_data", KSTAT_DATA_UINT64 },
	{ "mru_ghost_evict_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_size", KSTAT_DATA_UINT64 },
	{ "mfu_evict_data", KSTAT_DATA_UINT64 },
	{ "mfu_evict_metadata", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_size", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_evict_data", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_evict_metadata", KSTAT_DATA_UINT64 },
	{ "l2_hits", KSTAT_DATA_UINT64 },
	{ "l2_misses", KSTAT_DATA_UINT64 },
	{ "l2_feeds", KSTAT_DATA_UINT64 },
	{ "l2_rw_clash", KSTAT_DATA_UINT64 },
	{ "l2_read_bytes", KSTAT_DATA_UINT64 },
	{ "l2_write_bytes", KSTAT_DATA_UINT64 },
	{ "l2_writes_sent", KSTAT_DATA_UINT64 },
	{ "l2_writes_done", KSTAT_DATA_UINT64 },
	{ "l2_writes_error", KSTAT_DATA_UINT64 },
	{ "l2_writes_hdr_miss", KSTAT_DATA_UINT64 },
	{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
	{ "l2_evict_reading", KSTAT_DATA_UINT64 },
	{ "l2_free_on_write", KSTAT_DATA_UINT64 },
	{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
	{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
	{ "l2_io_error", KSTAT_DATA_UINT64 },
	{ "l2_size", KSTAT_DATA_UINT64 },
	{ "l2_asize", KSTAT_DATA_UINT64 },
	{ "l2_hdr_size", KSTAT_DATA_UINT64 },
	{ "l2_compress_successes", KSTAT_DATA_UINT64 },
	{ "l2_compress_zeros", KSTAT_DATA_UINT64 },
	{ "l2_compress_failures", KSTAT_DATA_UINT64 },
	{ "memory_throttle_count", KSTAT_DATA_UINT64 },
	{ "duplicate_buffers", KSTAT_DATA_UINT64 },
	{ "duplicate_buffers_size", KSTAT_DATA_UINT64 },
	{ "duplicate_reads", KSTAT_DATA_UINT64 },
	{ "memory_direct_count", KSTAT_DATA_UINT64 },
	{ "memory_indirect_count", KSTAT_DATA_UINT64 },
	{ "arc_no_grow", KSTAT_DATA_UINT64 },
	{ "arc_tempreserve", KSTAT_DATA_UINT64 },
	{ "arc_loaned_bytes", KSTAT_DATA_UINT64 },
	{ "arc_prune", KSTAT_DATA_UINT64 },
	{ "arc_meta_used", KSTAT_DATA_UINT64 },
	{ "arc_meta_limit", KSTAT_DATA_UINT64 },
	{ "arc_meta_max", KSTAT_DATA_UINT64 },
};

#define	ARCSTAT(stat)	(arc_stats.stat.value.ui64)

#define	ARCSTAT_INCR(stat, val) \
	atomic_add_64(&arc_stats.stat.value.ui64, (val))

#define	ARCSTAT_BUMP(stat)	ARCSTAT_INCR(stat, 1)
#define	ARCSTAT_BUMPDOWN(stat)	ARCSTAT_INCR(stat, -1)

#define	ARCSTAT_MAX(stat, val) { \
	uint64_t m; \
	while ((val) > (m = arc_stats.stat.value.ui64) && \
	    (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
		continue; \
}
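
/*
 * Illustration only, not part of this file: a minimal user-space sketch of
 * the compare-and-swap "running maximum" loop that ARCSTAT_MAX expresses.
 * It substitutes C11 <stdatomic.h> for atomic_cas_64(); the names stat_max,
 * stat_update_max() and stat_read_max() are hypothetical.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for one 64-bit statistics counter. */
static _Atomic uint64_t stat_max;

/*
 * Raise stat_max to val unless a value at least as large has already
 * been stored, possibly by another thread.
 */
static void
stat_update_max(uint64_t val)
{
	uint64_t cur = atomic_load(&stat_max);

	/*
	 * On failure, atomic_compare_exchange_weak() reloads cur with the
	 * value currently stored, so each retry compares against fresh data.
	 */
	while (val > cur &&
	    !atomic_compare_exchange_weak(&stat_max, &cur, val))
		continue;
}

/* Read the current maximum. */
static uint64_t
stat_read_max(void)
{
	return (atomic_load(&stat_max));
}
```

/*
 * The loop exits either because the stored value is already >= val or
 * because the swap succeeded, so a smaller value never overwrites a larger.
 */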

#define	ARCSTAT_MAXSTAT(stat) \
	ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)

/*
 * We define a macro to allow ARC hits/misses to be easily broken down by
 * two separate conditions, giving a total of four different subtypes for
 * each of hits and misses (so eight statistics total).
 */
#define	ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
	if (cond1) { \
		if (cond2) { \
			ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
		} else { \
			ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
		} \
	} else { \
		if (cond2) { \
			ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
		} else { \
			ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
		} \
}
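
/*
 * Illustration only, not part of this file: a minimal sketch of how two
 * conditions plus token pasting select one of four counters, in the spirit
 * of ARCSTAT_CONDSTAT. The cnt_* counters, COND_BUMP and record_hit() are
 * hypothetical, plain increments replace the atomic ARCSTAT_BUMP, and the
 * do/while (0) wrapper is an addition the macro above does not have.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters; the names mirror the tokens pasted below. */
static uint64_t cnt_demand_data_hits;
static uint64_t cnt_demand_metadata_hits;
static uint64_t cnt_prefetch_data_hits;
static uint64_t cnt_prefetch_metadata_hits;

/* Pick one of four counters from two conditions via token pasting. */
#define	COND_BUMP(c1, s1, n1, c2, s2, n2, stat) \
	do { \
		if (c1) { \
			if (c2) \
				cnt_##s1##_##s2##_##stat++; \
			else \
				cnt_##s1##_##n2##_##stat++; \
		} else { \
			if (c2) \
				cnt_##n1##_##s2##_##stat++; \
			else \
				cnt_##n1##_##n2##_##stat++; \
		} \
	} while (0)

/* Record a hit, split by demand vs. prefetch and data vs. metadata. */
static void
record_hit(int is_prefetch, int is_metadata)
{
	COND_BUMP(!is_prefetch, demand, prefetch,
	    !is_metadata, data, metadata, hits);
}
```

/*
 * For example, record_hit(1, 1) expands to an increment of
 * cnt_prefetch_metadata_hits and leaves the other three untouched.
 */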

kstat_t *arc_ksp;
static arc_state_t *arc_anon;
static arc_state_t *arc_mru;
static arc_state_t *arc_mru_ghost;
static arc_state_t *arc_mfu;
static arc_state_t *arc_mfu_ghost;
static arc_state_t *arc_l2c_only;

/*
 * There are several ARC variables that are critical to export as kstats --
 * but we don't want to have to grovel around in the kstat whenever we wish to
 * manipulate them. For these variables, we therefore define them to be in
 * terms of the statistic variable. This assures that we are not introducing
 * the possibility of inconsistency by having shadow copies of the variables,
 * while still allowing the code to be readable.
 */
#define	arc_size	ARCSTAT(arcstat_size)	/* actual total arc size */
#define	arc_p		ARCSTAT(arcstat_p)	/* target size of MRU */
#define	arc_c		ARCSTAT(arcstat_c)	/* target size of cache */
#define	arc_c_min	ARCSTAT(arcstat_c_min)	/* min target cache size */
#define	arc_c_max	ARCSTAT(arcstat_c_max)	/* max target cache size */
#define	arc_no_grow	ARCSTAT(arcstat_no_grow)
#define	arc_tempreserve	ARCSTAT(arcstat_tempreserve)
#define	arc_loaned_bytes	ARCSTAT(arcstat_loaned_bytes)
#define	arc_meta_limit	ARCSTAT(arcstat_meta_limit) /* max size for metadata */
#define	arc_meta_used	ARCSTAT(arcstat_meta_used) /* size of metadata */
#define arc_meta_max ARCSTAT(arcstat_meta_max) /* max size of metadata */
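
/*
 * Illustration only, not part of this file: a minimal sketch of the
 * "variable defined in terms of the statistic" idea described above.
 * struct demo_stats, demo_size, demo_c and demo_over_target() are
 * hypothetical and far simpler than the kstat-backed originals.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an exported statistics block. */
struct demo_stats {
	uint64_t size;
	uint64_t c;
};

static struct demo_stats demo_stats;

/* The "variables" are only names for fields of the exported structure. */
#define	demo_size	(demo_stats.size)
#define	demo_c		(demo_stats.c)

/* Returns nonzero when the cache holds more than its target size. */
static int
demo_over_target(void)
{
	return (demo_size > demo_c);
}
```

/*
 * Because demo_size expands to the structure field itself, an update such
 * as "demo_size -= 60" is immediately visible to anything that reads the
 * structure; there is no shadow copy to keep in sync.
 */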

#define	L2ARC_IS_VALID_COMPRESS(_c_) \
	((_c_) == ZIO_COMPRESS_LZ4 || (_c_) == ZIO_COMPRESS_EMPTY)

typedef struct l2arc_buf_hdr l2arc_buf_hdr_t;

typedef struct arc_callback arc_callback_t;

struct arc_callback {
	void			*acb_private;
	arc_done_func_t		*acb_done;
	arc_buf_t		*acb_buf;
	zio_t			*acb_zio_dummy;
	arc_callback_t		*acb_next;
};

typedef struct arc_write_callback arc_write_callback_t;

struct arc_write_callback {
	void		*awcb_private;
	arc_done_func_t	*awcb_ready;
	arc_done_func_t	*awcb_physdone;
	arc_done_func_t	*awcb_done;
	arc_buf_t	*awcb_buf;
};

struct arc_buf_hdr {
	/* protected by hash lock */
	dva_t			b_dva;
	uint64_t		b_birth;
	uint64_t		b_cksum0;

	kmutex_t		b_freeze_lock;
	zio_cksum_t		*b_freeze_cksum;

	arc_buf_hdr_t		*b_hash_next;
	arc_buf_t		*b_buf;
	uint32_t		b_flags;
	uint32_t		b_datacnt;

	arc_callback_t		*b_acb;
	kcondvar_t		b_cv;

	/* immutable */
	arc_buf_contents_t	b_type;
	uint64_t		b_size;
	uint64_t		b_spa;

	/* protected by arc state mutex */
	arc_state_t		*b_state;
	list_node_t		b_arc_node;

	/* updated atomically */
	clock_t			b_arc_access;
	uint32_t		b_mru_hits;
	uint32_t		b_mru_ghost_hits;
	uint32_t		b_mfu_hits;
	uint32_t		b_mfu_ghost_hits;
	uint32_t		b_l2_hits;

	/* self protecting */
	refcount_t		b_refcnt;

	l2arc_buf_hdr_t		*b_l2hdr;
	list_node_t		b_l2node;
};

static list_t arc_prune_list;
static kmutex_t arc_prune_mtx;
static arc_buf_t *arc_eviction_list;
static kmutex_t arc_eviction_mtx;
static arc_buf_hdr_t arc_eviction_hdr;
static void arc_get_data_buf(arc_buf_t *buf);
static void arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock);
static int arc_evict_needed(arc_buf_contents_t type);
static void arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes,
    arc_buf_contents_t type);
static void arc_buf_watch(arc_buf_t *buf);

static boolean_t l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab);

#define	GHOST_STATE(state) \
	((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
(state) == arc_l2c_only)
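
/*
 * Illustration only, not part of this file: a minimal sketch of the
 * pointer-identity test that GHOST_STATE performs. demo_state_t, the
 * demo_* objects and demo_is_ghost() are hypothetical; the real states
 * also carry lists, sizes and a mutex.
 */

```c
#include <assert.h>

/* Hypothetical, stripped-down state object. */
typedef struct demo_state {
	const char *name;
} demo_state_t;

static demo_state_t demo_mru = { "mru" };
static demo_state_t demo_mru_ghost = { "mru_ghost" };
static demo_state_t demo_mfu = { "mfu" };
static demo_state_t demo_mfu_ghost = { "mfu_ghost" };
static demo_state_t demo_l2c_only = { "l2c_only" };

/*
 * A state counts as "ghost" when its buffers are no longer held in the
 * in-memory cache: the two ghost lists and the L2-only state.
 */
static int
demo_is_ghost(const demo_state_t *state)
{
	return (state == &demo_mru_ghost || state == &demo_mfu_ghost ||
	    state == &demo_l2c_only);
}
```

/*
 * Each state is a single static object, so comparing addresses is enough;
 * no per-state type tag needs to be read.
 */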

/*
 * Private ARC flags. These flags are private ARC only flags that will show up
 * in b_flags in the arc_buf_hdr_t. Some flags are publicly declared, and can
 * be passed in as arc_flags in things like arc_read. However, these flags
 * should never be passed and should only be set by ARC code. When adding new
 * public flags, make sure not to smash the private ones.
 */

#define	ARC_IN_HASH_TABLE	(1 << 9)	/* this buffer is hashed */
#define	ARC_IO_IN_PROGRESS	(1 << 10)	/* I/O in progress for buf */
#define	ARC_IO_ERROR		(1 << 11)	/* I/O failed for buf */
#define	ARC_FREED_IN_READ	(1 << 12)	/* buf freed while in read */
#define	ARC_BUF_AVAILABLE	(1 << 13)	/* block not in active use */
#define	ARC_INDIRECT		(1 << 14)	/* this is an indirect block */
#define	ARC_FREE_IN_PROGRESS	(1 << 15)	/* hdr about to be freed */
#define	ARC_L2_WRITING		(1 << 16)	/* L2ARC write in progress */
#define	ARC_L2_EVICTED		(1 << 17)	/* evicted during I/O */
#define	ARC_L2_WRITE_HEAD	(1 << 18)	/* head of write list */

#define	HDR_IN_HASH_TABLE(hdr)	((hdr)->b_flags & ARC_IN_HASH_TABLE)
#define	HDR_IO_IN_PROGRESS(hdr)	((hdr)->b_flags & ARC_IO_IN_PROGRESS)
#define	HDR_IO_ERROR(hdr)	((hdr)->b_flags & ARC_IO_ERROR)
#define	HDR_PREFETCH(hdr)	((hdr)->b_flags & ARC_PREFETCH)
#define	HDR_FREED_IN_READ(hdr)	((hdr)->b_flags & ARC_FREED_IN_READ)
#define	HDR_BUF_AVAILABLE(hdr)	((hdr)->b_flags & ARC_BUF_AVAILABLE)
#define	HDR_FREE_IN_PROGRESS(hdr)	((hdr)->b_flags & ARC_FREE_IN_PROGRESS)
#define	HDR_L2CACHE(hdr)	((hdr)->b_flags & ARC_L2CACHE)
#define	HDR_L2_READING(hdr)	((hdr)->b_flags & ARC_IO_IN_PROGRESS && \
				    (hdr)->b_l2hdr != NULL)
#define	HDR_L2_WRITING(hdr)	((hdr)->b_flags & ARC_L2_WRITING)
#define	HDR_L2_EVICTED(hdr)	((hdr)->b_flags & ARC_L2_EVICTED)
#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_L2_WRITE_HEAD)
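/*
 * Illustrative sketch, not part of the ARC code: the HDR_* macros above are
 * plain bit tests against b_flags. The names ex_hdr_t, EX_* and ex_hdr_*()
 * below are hypothetical stand-ins that reuse the same bit positions to show
 * that one private bit can be set, tested and cleared without disturbing its
 * neighbors.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins that mirror the private flag bit positions. */
#define	EX_IN_HASH_TABLE	(1 << 9)
#define	EX_IO_IN_PROGRESS	(1 << 10)
#define	EX_IO_ERROR		(1 << 11)

typedef struct ex_hdr {
	uint32_t b_flags;
} ex_hdr_t;

/* Same shape as the HDR_* test macros: mask the flag word. */
#define	EX_HDR_IN_HASH_TABLE(hdr)	((hdr)->b_flags & EX_IN_HASH_TABLE)
#define	EX_HDR_IO_IN_PROGRESS(hdr)	((hdr)->b_flags & EX_IO_IN_PROGRESS)

/* Set one or more flag bits. */
static void
ex_hdr_set(ex_hdr_t *hdr, uint32_t flags)
{
	hdr->b_flags |= flags;
}

/* Clear one or more flag bits, leaving all others untouched. */
static void
ex_hdr_clear(ex_hdr_t *hdr, uint32_t flags)
{
	hdr->b_flags &= ~flags;
}
```

/*
 * Note that a test macro of this shape yields the masked value (for example
 * 1 << 9), not 0 or 1, so it should be used in a boolean context rather than
 * compared against 1.
 */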
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Other sizes
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define HDR_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
|
|
|
|
#define L2HDR_SIZE ((int64_t)sizeof (l2arc_buf_hdr_t))
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Hash table routines
|
|
|
|
*/
|
|
|
|
|
2010-08-26 22:46:09 +04:00
|
|
|
#define HT_LOCK_ALIGN 64
|
|
|
|
#define HT_LOCK_PAD (P2NPHASE(sizeof (kmutex_t), (HT_LOCK_ALIGN)))
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
struct ht_lock {
|
|
|
|
kmutex_t ht_lock;
|
|
|
|
#ifdef _KERNEL
|
2010-08-26 22:46:09 +04:00
|
|
|
unsigned char pad[HT_LOCK_PAD];
|
2008-11-20 23:01:55 +03:00
|
|
|
#endif
|
|
|
|
};
|
|
|
|
|
|
|
|
#define BUF_LOCKS 256
|
|
|
|
typedef struct buf_hash_table {
|
|
|
|
uint64_t ht_mask;
|
|
|
|
arc_buf_hdr_t **ht_table;
|
|
|
|
struct ht_lock ht_locks[BUF_LOCKS];
|
|
|
|
} buf_hash_table_t;
|
|
|
|
|
|
|
|
static buf_hash_table_t buf_hash_table;
|
|
|
|
|
|
|
|
#define BUF_HASH_INDEX(spa, dva, birth) \
|
|
|
|
(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
|
|
|
|
#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[(idx) & (BUF_LOCKS-1)])
|
|
|
|
#define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
|
2010-05-29 00:45:14 +04:00
|
|
|
#define HDR_LOCK(hdr) \
|
|
|
|
(BUF_HASH_LOCK(BUF_HASH_INDEX((hdr)->b_spa, &(hdr)->b_dva, (hdr)->b_birth)))
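/*
 * Illustrative sketch, not part of the ARC code: the hash table uses a fixed
 * array of locks, each padded out to a 64-byte boundary, and maps a bucket
 * index to a lock by masking. The example below assumes a hypothetical
 * 40-byte ex_mutex_t in place of kmutex_t, and EX_P2NPHASE is assumed to
 * match the arithmetic of P2NPHASE (bytes needed to reach the next multiple
 * of a power-of-two alignment). All ex_/EX_ names are made up.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	EX_LOCK_ALIGN	64
#define	EX_BUF_LOCKS	256	/* must be a power of two */

/* Bytes from x up to the next multiple of align (0 if already aligned). */
#define	EX_P2NPHASE(x, align)	(-(x) & ((align) - 1))

/* Hypothetical 40-byte stand-in for kmutex_t. */
typedef struct ex_mutex {
	unsigned char opaque[40];
} ex_mutex_t;

/* One lock per 64-byte slot, so neighboring locks never share a slot. */
struct ex_ht_lock {
	ex_mutex_t ht_lock;
	unsigned char pad[EX_P2NPHASE(sizeof (ex_mutex_t), EX_LOCK_ALIGN)];
};

static struct ex_ht_lock ex_locks[EX_BUF_LOCKS];

/* Map a hash-table bucket index to the lock stripe that guards it. */
static struct ex_ht_lock *
ex_bucket_lock(uint64_t idx)
{
	return (&ex_locks[idx & (EX_BUF_LOCKS - 1)]);
}
```

/*
 * Many buckets share one lock under this scheme: any two bucket indices that
 * agree in their low eight bits map to the same stripe.
 */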
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
uint64_t zfs_crc64_table[256];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Level 2 ARC
|
|
|
|
*/
|
|
|
|
|
|
|
|
#define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
|
2013-08-02 00:02:10 +04:00
|
|
|
#define L2ARC_HEADROOM 2 /* num of writes */
|
|
|
|
/*
|
|
|
|
* If we discover during ARC scan any buffers to be compressed, we boost
|
|
|
|
* our headroom for the next scanning cycle by this percentage multiple.
|
|
|
|
*/
|
|
|
|
#define L2ARC_HEADROOM_BOOST 200
|
2009-02-18 23:51:31 +03:00
|
|
|
#define L2ARC_FEED_SECS 1 /* caching interval secs */
|
|
|
|
#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
#define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent)
|
|
|
|
#define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done)
|
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/* L2ARC Performance Tunables */
|
2011-07-08 23:41:57 +04:00
|
|
|
unsigned long l2arc_write_max = L2ARC_WRITE_SIZE; /* def max write size */
|
|
|
|
unsigned long l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra warmup write */
|
|
|
|
unsigned long l2arc_headroom = L2ARC_HEADROOM; /* # of dev writes */
|
2013-08-02 00:02:10 +04:00
|
|
|
unsigned long l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;
|
2011-07-08 23:41:57 +04:00
|
|
|
unsigned long l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
|
|
|
|
unsigned long l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval msecs */
|
|
|
|
int l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
|
2013-08-02 00:02:10 +04:00
|
|
|
int l2arc_nocompress = B_FALSE; /* don't compress bufs */
|
2011-07-08 23:41:57 +04:00
|
|
|
int l2arc_feed_again = B_TRUE; /* turbo warmup */
|
2013-07-24 20:57:56 +04:00
|
|
|
int l2arc_norw = B_FALSE; /* no reads during writes */
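/*
 * Illustrative sketch, not part of the ARC code: how the headroom tunables
 * above could combine into a scan limit in bytes. The formula is an
 * assumption (write size times headroom, then scaled by the boost
 * percentage when compressible buffers were seen); the code that actually
 * applies these tunables is outside this section. ex_l2arc_headroom() is a
 * hypothetical helper.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the number of bytes of ARC list to scan for one feed cycle.
 * boost_pct is a percentage, so 200 doubles the headroom.
 */
static uint64_t
ex_l2arc_headroom(uint64_t write_sz, uint64_t headroom, uint64_t boost_pct,
    int boost)
{
	uint64_t bytes = write_sz * headroom;

	if (boost)
		bytes = (bytes * boost_pct) / 100;
	return (bytes);
}
```

/*
 * With the defaults above (8 MiB writes, headroom 2, boost 200) this sketch
 * gives 16 MiB unboosted and 32 MiB boosted.
 */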
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* L2ARC Internals
|
|
|
|
*/
|
|
|
|
typedef struct l2arc_dev {
|
|
|
|
vdev_t *l2ad_vdev; /* vdev */
|
|
|
|
spa_t *l2ad_spa; /* spa */
|
|
|
|
uint64_t l2ad_hand; /* next write location */
|
|
|
|
uint64_t l2ad_start; /* first addr on device */
|
|
|
|
uint64_t l2ad_end; /* last addr on device */
|
|
|
|
uint64_t l2ad_evict; /* last addr eviction reached */
|
|
|
|
boolean_t l2ad_first; /* first sweep through */
|
2009-02-18 23:51:31 +03:00
|
|
|
boolean_t l2ad_writing; /* currently writing */
|
2008-11-20 23:01:55 +03:00
|
|
|
list_t *l2ad_buflist; /* buffer list */
|
|
|
|
list_node_t l2ad_node; /* device list node */
|
|
|
|
} l2arc_dev_t;
|
|
|
|
|
|
|
|
static list_t L2ARC_dev_list; /* device list */
|
|
|
|
static list_t *l2arc_dev_list; /* device list pointer */
|
|
|
|
static kmutex_t l2arc_dev_mtx; /* device list mutex */
|
|
|
|
static l2arc_dev_t *l2arc_dev_last; /* last device used */
|
|
|
|
static kmutex_t l2arc_buflist_mtx; /* mutex for all buflists */
|
|
|
|
static list_t L2ARC_free_on_write; /* free after write buf list */
|
|
|
|
static list_t *l2arc_free_on_write; /* free after write list ptr */
|
|
|
|
static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
|
|
|
|
static uint64_t l2arc_ndev; /* number of devices */
|
|
|
|
|
|
|
|
typedef struct l2arc_read_callback {
|
2013-08-02 00:02:10 +04:00
|
|
|
arc_buf_t *l2rcb_buf; /* read buffer */
|
|
|
|
spa_t *l2rcb_spa; /* spa */
|
|
|
|
blkptr_t l2rcb_bp; /* original blkptr */
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t l2rcb_zb; /* original bookmark */
|
2013-08-02 00:02:10 +04:00
|
|
|
int l2rcb_flags; /* original flags */
|
|
|
|
enum zio_compress l2rcb_compress; /* applied compress */
|
2008-11-20 23:01:55 +03:00
|
|
|
} l2arc_read_callback_t;
|
|
|
|
|
|
|
|
typedef struct l2arc_write_callback {
|
|
|
|
l2arc_dev_t *l2wcb_dev; /* device info */
|
|
|
|
arc_buf_hdr_t *l2wcb_head; /* head of write buflist */
|
|
|
|
} l2arc_write_callback_t;
|
|
|
|
|
|
|
|
struct l2arc_buf_hdr {
|
|
|
|
/* protected by arc_buf_hdr mutex */
|
2013-08-02 00:02:10 +04:00
|
|
|
l2arc_dev_t *b_dev; /* L2ARC device */
|
|
|
|
uint64_t b_daddr; /* disk address, offset byte */
|
|
|
|
/* compression applied to buffer data */
|
|
|
|
enum zio_compress b_compress;
|
|
|
|
/* real alloc'd buffer size depending on b_compress applied */
|
2013-10-03 04:11:19 +04:00
|
|
|
uint32_t b_hits;
|
2014-02-01 04:35:53 +04:00
|
|
|
uint64_t b_asize;
|
2013-08-02 00:02:10 +04:00
|
|
|
/* temporary buffer holder for in-flight compressed data */
|
|
|
|
void *b_tmp_cdata;
|
2008-11-20 23:01:55 +03:00
|
|
|
};
|
|
|
|
|
|
|
|
typedef struct l2arc_data_free {
|
|
|
|
/* protected by l2arc_free_on_write_mtx */
|
|
|
|
void *l2df_data;
|
|
|
|
size_t l2df_size;
|
|
|
|
void (*l2df_func)(void *, size_t);
|
|
|
|
list_node_t l2df_list_node;
|
|
|
|
} l2arc_data_free_t;
|
|
|
|
|
|
|
|
static kmutex_t l2arc_feed_thr_lock;
|
|
|
|
static kcondvar_t l2arc_feed_thr_cv;
|
|
|
|
static uint8_t l2arc_thread_exit;
|
|
|
|
|
|
|
|
static void l2arc_read_done(zio_t *zio);
|
|
|
|
static void l2arc_hdr_stat_add(void);
|
|
|
|
static void l2arc_hdr_stat_remove(void);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
static boolean_t l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr);
|
|
|
|
static void l2arc_decompress_zio(zio_t *zio, arc_buf_hdr_t *hdr,
|
|
|
|
enum zio_compress c);
|
|
|
|
static void l2arc_release_cdata_buf(arc_buf_hdr_t *ab);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static uint64_t
|
2009-02-18 23:51:31 +03:00
|
|
|
buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
uint8_t *vdva = (uint8_t *)dva;
|
|
|
|
uint64_t crc = -1ULL;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
|
|
|
|
|
|
|
|
for (i = 0; i < sizeof (dva_t); i++)
|
|
|
|
crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
crc ^= (spa>>8) ^ birth;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (crc);
|
|
|
|
}
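/*
 * Illustrative sketch, not part of the ARC code: a standalone version of the
 * table-driven CRC-64 used by buf_hash(), with the table built the same way
 * buf_init() builds zfs_crc64_table. EX_CRC64_POLY is assumed to equal
 * ZFS_CRC64_POLY (the reflected ECMA-182 polynomial), and the DVA is
 * modelled as two 64-bit words. All ex_/EX_ names are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of ZFS_CRC64_POLY: ECMA-182 polynomial, reflected form. */
#define	EX_CRC64_POLY	0xC96C5795D7870F42ULL

static uint64_t ex_crc64_table[256];

/* Build the 256-entry lookup table, one bit at a time per entry. */
static void
ex_crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t ct = (uint64_t)i;

		for (int j = 8; j > 0; j--)
			ct = (ct >> 1) ^ (-(ct & 1) & EX_CRC64_POLY);
		ex_crc64_table[i] = ct;
	}
}

/* Hash a (spa, dva, birth) identity; ex_crc64_init() must run first. */
static uint64_t
ex_buf_hash(uint64_t spa, const uint64_t dva[2], uint64_t birth)
{
	const uint8_t *vdva = (const uint8_t *)dva;
	uint64_t crc = -1ULL;

	/* Entry 128 equals the polynomial only if the table is initialized. */
	assert(ex_crc64_table[128] == EX_CRC64_POLY);

	for (size_t i = 0; i < 2 * sizeof (uint64_t); i++)
		crc = (crc >> 8) ^ ex_crc64_table[(crc ^ vdva[i]) & 0xFF];

	/* Fold in the pool identity and birth txg after the CRC pass. */
	crc ^= (spa >> 8) ^ birth;

	return (crc);
}
```

/*
 * Only the DVA bytes pass through the CRC; spa and birth are XORed in
 * afterwards, so two identities that differ only in birth always hash
 * differently.
 */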
|
|
|
|
|
|
|
|
#define BUF_EMPTY(buf) \
|
|
|
|
((buf)->b_dva.dva_word[0] == 0 && \
|
|
|
|
(buf)->b_dva.dva_word[1] == 0 && \
|
2013-12-09 22:37:51 +04:00
|
|
|
(buf)->b_cksum0 == 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
#define BUF_EQUAL(spa, dva, birth, buf) \
|
|
|
|
((buf)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
|
|
|
|
((buf)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
|
|
|
|
((buf)->b_birth == birth) && ((buf)->b_spa == spa)
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
buf_discard_identity(arc_buf_hdr_t *hdr)
|
|
|
|
{
|
|
|
|
hdr->b_dva.dva_word[0] = 0;
|
|
|
|
hdr->b_dva.dva_word[1] = 0;
|
|
|
|
hdr->b_birth = 0;
|
|
|
|
hdr->b_cksum0 = 0;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static arc_buf_hdr_t *
|
2014-06-06 01:19:08 +04:00
|
|
|
buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
const dva_t *dva = BP_IDENTITY(bp);
|
|
|
|
uint64_t birth = BP_PHYSICAL_BIRTH(bp);
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
|
|
|
|
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
|
|
|
|
arc_buf_hdr_t *buf;
|
|
|
|
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
for (buf = buf_hash_table.ht_table[idx]; buf != NULL;
|
|
|
|
buf = buf->b_hash_next) {
|
|
|
|
if (BUF_EQUAL(spa, dva, birth, buf)) {
|
|
|
|
*lockp = hash_lock;
|
|
|
|
return (buf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
*lockp = NULL;
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Insert an entry into the hash table. If there is already an element
|
|
|
|
* equal to elem in the hash table, then the already existing element
|
|
|
|
* will be returned and the new element will not be inserted.
|
|
|
|
* Otherwise returns NULL.
|
|
|
|
*/
|
|
|
|
static arc_buf_hdr_t *
|
|
|
|
buf_hash_insert(arc_buf_hdr_t *buf, kmutex_t **lockp)
|
|
|
|
{
|
|
|
|
uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
|
|
|
|
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
|
|
|
|
arc_buf_hdr_t *fbuf;
|
|
|
|
uint32_t i;
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT(!DVA_IS_EMPTY(&buf->b_dva));
|
|
|
|
ASSERT(buf->b_birth != 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(!HDR_IN_HASH_TABLE(buf));
|
|
|
|
*lockp = hash_lock;
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
for (fbuf = buf_hash_table.ht_table[idx], i = 0; fbuf != NULL;
|
|
|
|
fbuf = fbuf->b_hash_next, i++) {
|
|
|
|
if (BUF_EQUAL(buf->b_spa, &buf->b_dva, buf->b_birth, fbuf))
|
|
|
|
return (fbuf);
|
|
|
|
}
|
|
|
|
|
|
|
|
buf->b_hash_next = buf_hash_table.ht_table[idx];
|
|
|
|
buf_hash_table.ht_table[idx] = buf;
|
|
|
|
buf->b_flags |= ARC_IN_HASH_TABLE;
|
|
|
|
|
|
|
|
/* collect some hash table performance data */
|
|
|
|
if (i > 0) {
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_collisions);
|
|
|
|
if (i == 1)
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_chains);
|
|
|
|
|
|
|
|
ARCSTAT_MAX(arcstat_hash_chain_max, i);
|
|
|
|
}
|
|
|
|
|
|
|
|
ARCSTAT_BUMP(arcstat_hash_elements);
|
|
|
|
ARCSTAT_MAXSTAT(arcstat_hash_elements);
|
|
|
|
|
|
|
|
return (NULL);
|
|
|
|
}
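/*
 * Illustrative sketch, not part of the ARC code: the insert-or-return-
 * existing contract of buf_hash_insert() on a tiny chained hash table.
 * Locking and statistics are left out, the key is a single integer rather
 * than (spa, dva, birth), and every ex_/EX_ name is hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	EX_NBUCKETS	8	/* must be a power of two */

typedef struct ex_node {
	uint64_t key;
	struct ex_node *hash_next;
} ex_node_t;

static ex_node_t *ex_table[EX_NBUCKETS];

/* Walk one chain and return the node with a matching key, or NULL. */
static ex_node_t *
ex_hash_find(uint64_t key)
{
	ex_node_t *n;

	for (n = ex_table[key & (EX_NBUCKETS - 1)]; n != NULL;
	    n = n->hash_next) {
		if (n->key == key)
			return (n);
	}
	return (NULL);
}

/*
 * If an equal key is already present, return that node and leave the table
 * unchanged. Otherwise link the new node at the chain head and return NULL.
 */
static ex_node_t *
ex_hash_insert(ex_node_t *node)
{
	uint64_t idx = node->key & (EX_NBUCKETS - 1);
	ex_node_t *found = ex_hash_find(node->key);

	if (found != NULL)
		return (found);

	node->hash_next = ex_table[idx];
	ex_table[idx] = node;
	return (NULL);
}
```

/*
 * A non-NULL return means the caller's node was not inserted, which is the
 * case the caller has to handle, typically by using the existing entry.
 */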
|
|
|
|
|
|
|
|
static void
|
|
|
|
buf_hash_remove(arc_buf_hdr_t *buf)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *fbuf, **bufp;
|
|
|
|
uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
|
|
|
|
ASSERT(HDR_IN_HASH_TABLE(buf));
|
|
|
|
|
|
|
|
bufp = &buf_hash_table.ht_table[idx];
|
|
|
|
while ((fbuf = *bufp) != buf) {
|
|
|
|
ASSERT(fbuf != NULL);
|
|
|
|
bufp = &fbuf->b_hash_next;
|
|
|
|
}
|
|
|
|
*bufp = buf->b_hash_next;
|
|
|
|
buf->b_hash_next = NULL;
|
|
|
|
buf->b_flags &= ~ARC_IN_HASH_TABLE;
|
|
|
|
|
|
|
|
/* collect some hash table performance data */
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_hash_elements);
|
|
|
|
|
|
|
|
if (buf_hash_table.ht_table[idx] &&
|
|
|
|
buf_hash_table.ht_table[idx]->b_hash_next == NULL)
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_hash_chains);
|
|
|
|
}
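/*
 * Illustrative sketch, not part of the ARC code: the pointer-to-pointer
 * unlink idiom that buf_hash_remove() uses. Walking a pointer to the link
 * field, rather than a pointer to the node, removes the head and interior
 * nodes with the same code path. ex_link_t and ex_chain_remove() are
 * hypothetical names.
 */

```c
#include <assert.h>
#include <stddef.h>

typedef struct ex_link {
	int id;
	struct ex_link *next;
} ex_link_t;

/* Unlink victim from the chain rooted at *headp; it must be present. */
static void
ex_chain_remove(ex_link_t **headp, ex_link_t *victim)
{
	ex_link_t **linkp = headp;

	/* Advance until linkp addresses the pointer that points at victim. */
	while (*linkp != victim) {
		assert(*linkp != NULL);
		linkp = &(*linkp)->next;
	}

	*linkp = victim->next;
	victim->next = NULL;
}
```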
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Global data structures and functions for the buf kmem cache.
|
|
|
|
*/
|
|
|
|
static kmem_cache_t *hdr_cache;
|
|
|
|
static kmem_cache_t *buf_cache;
|
2013-11-20 01:34:46 +04:00
|
|
|
static kmem_cache_t *l2arc_hdr_cache;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static void
|
|
|
|
buf_fini(void)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2010-08-26 22:46:09 +04:00
|
|
|
#if defined(_KERNEL) && defined(HAVE_SPL)
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* Large allocations which do not require contiguous pages
|
|
|
|
* should be using vmem_free() in the Linux kernel.
|
|
|
|
*/
|
2010-08-26 22:46:09 +04:00
|
|
|
vmem_free(buf_hash_table.ht_table,
|
|
|
|
(buf_hash_table.ht_mask + 1) * sizeof (void *));
|
|
|
|
#else
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(buf_hash_table.ht_table,
|
|
|
|
(buf_hash_table.ht_mask + 1) * sizeof (void *));
|
2010-08-26 22:46:09 +04:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
for (i = 0; i < BUF_LOCKS; i++)
|
|
|
|
mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
|
|
|
|
kmem_cache_destroy(hdr_cache);
|
|
|
|
kmem_cache_destroy(buf_cache);
|
2013-11-20 01:34:46 +04:00
|
|
|
kmem_cache_destroy(l2arc_hdr_cache);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Constructor callback - called when the cache is empty
|
|
|
|
* and a new buf is requested.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
hdr_cons(void *vbuf, void *unused, int kmflag)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *buf = vbuf;
|
|
|
|
|
|
|
|
bzero(buf, sizeof (arc_buf_hdr_t));
|
|
|
|
refcount_create(&buf->b_refcnt);
|
|
|
|
cv_init(&buf->b_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
mutex_init(&buf->b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
|
2010-08-26 21:26:44 +04:00
|
|
|
list_link_init(&buf->b_arc_node);
|
|
|
|
list_link_init(&buf->b_l2node);
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_consume(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static int
|
|
|
|
buf_cons(void *vbuf, void *unused, int kmflag)
|
|
|
|
{
|
|
|
|
arc_buf_t *buf = vbuf;
|
|
|
|
|
|
|
|
bzero(buf, sizeof (arc_buf_t));
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL);
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Destructor callback - called when a cached buf is
|
|
|
|
* no longer required.
|
|
|
|
*/
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
hdr_dest(void *vbuf, void *unused)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *buf = vbuf;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(BUF_EMPTY(buf));
|
2008-11-20 23:01:55 +03:00
|
|
|
refcount_destroy(&buf->b_refcnt);
|
|
|
|
cv_destroy(&buf->b_cv);
|
|
|
|
mutex_destroy(&buf->b_freeze_lock);
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_return(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
buf_dest(void *vbuf, void *unused)
|
|
|
|
{
|
|
|
|
arc_buf_t *buf = vbuf;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_destroy(&buf->b_evict_lock);
|
2009-02-18 23:51:31 +03:00
|
|
|
arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
buf_init(void)
|
|
|
|
{
|
|
|
|
uint64_t *ct;
|
|
|
|
uint64_t hsize = 1ULL << 12;
|
|
|
|
int i, j;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The hash table is big enough to fill all of physical memory
|
|
|
|
* with an average 64K block size. The table will take up
|
|
|
|
* totalmem*sizeof(void*)/64K (e.g. 128KB/GB with 8-byte pointers).
|
|
|
|
*/
|
|
|
|
while (hsize * 65536 < physmem * PAGESIZE)
|
|
|
|
hsize <<= 1;
|
|
|
|
retry:
|
|
|
|
buf_hash_table.ht_mask = hsize - 1;
|
2010-08-26 22:46:09 +04:00
|
|
|
#if defined(_KERNEL) && defined(HAVE_SPL)
|
2013-11-01 23:26:11 +04:00
|
|
|
/*
|
|
|
|
* Large allocations which do not require contiguous pages
|
|
|
|
* should be using vmem_alloc() in the Linux kernel.
|
|
|
|
*/
|
2010-08-26 22:46:09 +04:00
|
|
|
buf_hash_table.ht_table =
|
|
|
|
vmem_zalloc(hsize * sizeof (void*), KM_SLEEP);
|
|
|
|
#else
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_hash_table.ht_table =
|
|
|
|
kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
|
2010-08-26 22:46:09 +04:00
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
if (buf_hash_table.ht_table == NULL) {
|
|
|
|
ASSERT(hsize > (1ULL << 8));
|
|
|
|
hsize >>= 1;
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
|
|
|
hdr_cache = kmem_cache_create("arc_buf_hdr_t", sizeof (arc_buf_hdr_t),
|
2012-03-14 01:29:16 +04:00
|
|
|
0, hdr_cons, hdr_dest, NULL, NULL, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
|
2008-12-03 23:09:06 +03:00
|
|
|
0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
|
2013-11-20 01:34:46 +04:00
|
|
|
l2arc_hdr_cache = kmem_cache_create("l2arc_buf_hdr_t", L2HDR_SIZE,
|
|
|
|
0, NULL, NULL, NULL, NULL, NULL, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < 256; i++)
|
|
|
|
for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
|
|
|
|
*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
|
|
|
|
|
|
|
|
for (i = 0; i < BUF_LOCKS; i++) {
|
|
|
|
mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
|
|
|
|
NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
}
|
|
|
|
}
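/*
 * Illustrative sketch, not part of the ARC code: the hash table sizing loop
 * from buf_init(), pulled out into a hypothetical helper that takes the
 * memory size in bytes instead of physmem * PAGESIZE. It starts at 4096
 * buckets and doubles until there is one bucket per 64K of memory.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Return the bucket count chosen for a machine with mem_bytes of memory. */
static uint64_t
ex_hash_size(uint64_t mem_bytes)
{
	uint64_t hsize = 1ULL << 12;

	while (hsize * 65536 < mem_bytes)
		hsize <<= 1;
	return (hsize);
}
```

/*
 * For 1 GiB this yields 16384 buckets, which at 8 bytes per pointer is the
 * 128KB per GB quoted in the comment inside buf_init(). Small machines never
 * drop below the 4096-bucket starting point.
 */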
|
|
|
|
|
|
|
|
#define ARC_MINTIME (hz>>4) /* 62 ms */
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_cksum_verify(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
zio_cksum_t zc;
|
|
|
|
|
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_enter(&buf->b_hdr->b_freeze_lock);
|
|
|
|
if (buf->b_hdr->b_freeze_cksum == NULL ||
|
|
|
|
(buf->b_hdr->b_flags & ARC_IO_ERROR)) {
|
|
|
|
mutex_exit(&buf->b_hdr->b_freeze_lock);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
|
|
|
|
if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
|
|
|
|
panic("buffer modified while frozen!");
|
|
|
|
mutex_exit(&buf->b_hdr->b_freeze_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
arc_cksum_equal(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
zio_cksum_t zc;
|
|
|
|
int equal;
|
|
|
|
|
|
|
|
mutex_enter(&buf->b_hdr->b_freeze_lock);
|
|
|
|
fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
|
|
|
|
equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
|
|
|
|
mutex_exit(&buf->b_hdr->b_freeze_lock);
|
|
|
|
|
|
|
|
return (equal);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_cksum_compute(arc_buf_t *buf, boolean_t force)
|
|
|
|
{
|
|
|
|
if (!force && !(zfs_flags & ZFS_DEBUG_MODIFY))
|
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_enter(&buf->b_hdr->b_freeze_lock);
|
|
|
|
if (buf->b_hdr->b_freeze_cksum != NULL) {
|
|
|
|
mutex_exit(&buf->b_hdr->b_freeze_lock);
|
|
|
|
return;
|
|
|
|
}
|
2012-04-10 21:55:17 +04:00
|
|
|
buf->b_hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),
|
2013-11-01 23:26:11 +04:00
|
|
|
KM_PUSHPAGE);
|
2008-11-20 23:01:55 +03:00
|
|
|
fletcher_2_native(buf->b_data, buf->b_hdr->b_size,
|
|
|
|
buf->b_hdr->b_freeze_cksum);
|
|
|
|
mutex_exit(&buf->b_hdr->b_freeze_lock);
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_watch(buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef _KERNEL
|
|
|
|
void
|
|
|
|
arc_buf_sigsegv(int sig, siginfo_t *si, void *unused)
|
|
|
|
{
|
|
|
|
panic("Got SIGSEGV at address: 0x%lx\n", (long) si->si_addr);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
arc_buf_unwatch(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
#ifndef _KERNEL
|
|
|
|
if (arc_watch) {
|
|
|
|
ASSERT0(mprotect(buf->b_data, buf->b_hdr->b_size,
|
|
|
|
PROT_READ | PROT_WRITE));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
|
|
|
/* ARGSUSED */
|
|
|
|
static void
|
|
|
|
arc_buf_watch(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
#ifndef _KERNEL
|
|
|
|
if (arc_watch)
|
|
|
|
ASSERT0(mprotect(buf->b_data, buf->b_hdr->b_size, PROT_READ));
|
|
|
|
#endif
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_buf_thaw(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
if (zfs_flags & ZFS_DEBUG_MODIFY) {
|
|
|
|
if (buf->b_hdr->b_state != arc_anon)
|
|
|
|
panic("modifying non-anon buffer!");
|
|
|
|
if (buf->b_hdr->b_flags & ARC_IO_IN_PROGRESS)
|
|
|
|
panic("modifying buffer while i/o in progress!");
|
|
|
|
arc_cksum_verify(buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_enter(&buf->b_hdr->b_freeze_lock);
|
|
|
|
if (buf->b_hdr->b_freeze_cksum != NULL) {
|
|
|
|
kmem_free(buf->b_hdr->b_freeze_cksum, sizeof (zio_cksum_t));
|
|
|
|
buf->b_hdr->b_freeze_cksum = NULL;
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&buf->b_hdr->b_freeze_lock);
|
2013-05-17 01:18:06 +04:00
|
|
|
|
|
|
|
arc_buf_unwatch(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_buf_freeze(arc_buf_t *buf)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
kmutex_t *hash_lock;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
|
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
hash_lock = HDR_LOCK(buf->b_hdr);
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(buf->b_hdr->b_freeze_cksum != NULL ||
|
|
|
|
buf->b_hdr->b_state == arc_anon);
|
|
|
|
arc_cksum_compute(buf, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(hash_lock);
|
2013-05-17 01:18:06 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
}
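/*
 * Illustrative sketch, not part of the ARC code: the freeze/verify/thaw idea
 * behind arc_buf_freeze(), arc_cksum_verify() and arc_buf_thaw(). Freezing
 * records a checksum of the data, verifying recomputes and compares it, and
 * thawing discards it. The checksum is a simplified Fletcher-2 style sum
 * written for this example; locking, the debug-flag check and the panic are
 * omitted, and all ex_ names are hypothetical.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct ex_cksum {
	uint64_t word[4];
} ex_cksum_t;

typedef struct ex_buf {
	uint64_t *data;
	size_t size;		/* bytes, a multiple of 16 */
	bool frozen;
	ex_cksum_t freeze_cksum;
} ex_buf_t;

/* Two interleaved running sums over pairs of 64-bit words. */
static void
ex_fletcher_2(const uint64_t *data, size_t size, ex_cksum_t *out)
{
	uint64_t a0 = 0, a1 = 0, b0 = 0, b1 = 0;
	size_t nwords = size / sizeof (uint64_t);

	for (size_t i = 0; i + 1 < nwords; i += 2) {
		a0 += data[i];
		a1 += data[i + 1];
		b0 += a0;
		b1 += a1;
	}
	out->word[0] = a0;
	out->word[1] = a1;
	out->word[2] = b0;
	out->word[3] = b1;
}

/* Record the checksum once; freezing an already frozen buffer is a no-op. */
static void
ex_buf_freeze(ex_buf_t *buf)
{
	if (buf->frozen)
		return;
	ex_fletcher_2(buf->data, buf->size, &buf->freeze_cksum);
	buf->frozen = true;
}

/* True if the buffer is not frozen or still matches its frozen checksum. */
static bool
ex_buf_verify(const ex_buf_t *buf)
{
	ex_cksum_t now;

	if (!buf->frozen)
		return (true);
	ex_fletcher_2(buf->data, buf->size, &now);
	return (memcmp(&now, &buf->freeze_cksum, sizeof (now)) == 0);
}

/* Drop the recorded checksum so the buffer may be modified again. */
static void
ex_buf_thaw(ex_buf_t *buf)
{
	buf->frozen = false;
}
```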
|
|
|
|
|
|
|
|
static void
|
|
|
|
add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(hash_lock));
|
|
|
|
|
|
|
|
if ((refcount_add(&ab->b_refcnt, tag) == 1) &&
|
|
|
|
(ab->b_state != arc_anon)) {
|
|
|
|
uint64_t delta = ab->b_size * ab->b_datacnt;
|
|
|
|
list_t *list = &ab->b_state->arcs_list[ab->b_type];
|
|
|
|
uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type];
|
|
|
|
|
|
|
|
ASSERT(!MUTEX_HELD(&ab->b_state->arcs_mtx));
|
|
|
|
mutex_enter(&ab->b_state->arcs_mtx);
|
|
|
|
ASSERT(list_link_active(&ab->b_arc_node));
|
|
|
|
list_remove(list, ab);
|
|
|
|
if (GHOST_STATE(ab->b_state)) {
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(ab->b_datacnt);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT3P(ab->b_buf, ==, NULL);
|
|
|
|
delta = ab->b_size;
|
|
|
|
}
|
|
|
|
ASSERT(delta > 0);
|
|
|
|
ASSERT3U(*size, >=, delta);
|
|
|
|
atomic_add_64(size, -delta);
|
|
|
|
mutex_exit(&ab->b_state->arcs_mtx);
|
2008-12-03 23:09:06 +03:00
|
|
|
/* remove the prefetch flag if we get a reference */
|
2008-11-20 23:01:55 +03:00
|
|
|
if (ab->b_flags & ARC_PREFETCH)
|
|
|
|
ab->b_flags &= ~ARC_PREFETCH;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
remove_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
|
|
|
|
{
|
|
|
|
int cnt;
|
|
|
|
arc_state_t *state = ab->b_state;
|
|
|
|
|
|
|
|
ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
|
|
|
|
ASSERT(!GHOST_STATE(state));
|
|
|
|
|
|
|
|
if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) &&
|
|
|
|
(state != arc_anon)) {
|
|
|
|
uint64_t *size = &state->arcs_lsize[ab->b_type];
|
|
|
|
|
|
|
|
ASSERT(!MUTEX_HELD(&state->arcs_mtx));
|
|
|
|
mutex_enter(&state->arcs_mtx);
|
|
|
|
ASSERT(!list_link_active(&ab->b_arc_node));
|
|
|
|
list_insert_head(&state->arcs_list[ab->b_type], ab);
|
|
|
|
ASSERT(ab->b_datacnt > 0);
|
|
|
|
atomic_add_64(size, ab->b_size * ab->b_datacnt);
|
|
|
|
mutex_exit(&state->arcs_mtx);
|
|
|
|
}
|
|
|
|
return (cnt);
|
|
|
|
}
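/*
 * Illustrative sketch, not part of the ARC code: the accounting rule shared
 * by add_reference() and remove_reference(). A header counts toward its
 * state's evictable bytes only while it has no holders, so the first hold
 * subtracts size * datacnt and the last release adds it back. List
 * manipulation, locking, the arc_anon special case and ghost states are
 * omitted; all ex_ names are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

typedef struct ex_state {
	uint64_t evictable_bytes;
} ex_state_t;

typedef struct ex_refhdr {
	uint64_t size;		/* bytes per data buffer */
	uint32_t datacnt;	/* number of data buffers */
	int64_t refcnt;		/* current holders */
	ex_state_t *state;
} ex_refhdr_t;

/* Take a hold; the first hold makes the header non-evictable. */
static void
ex_add_reference(ex_refhdr_t *h)
{
	if (++h->refcnt == 1) {
		uint64_t delta = h->size * h->datacnt;

		assert(h->state->evictable_bytes >= delta);
		h->state->evictable_bytes -= delta;
	}
}

/* Drop a hold and return the remaining count; zero means evictable again. */
static int64_t
ex_remove_reference(ex_refhdr_t *h)
{
	assert(h->refcnt > 0);
	if (--h->refcnt == 0)
		h->state->evictable_bytes += h->size * h->datacnt;
	return (h->refcnt);
}
```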
|
|
|
|
|
2013-10-03 04:11:19 +04:00
|
|
|
/*
|
|
|
|
* Returns detailed information about a specific arc buffer. When the
|
|
|
|
* state_index argument is set the function will calculate the arc header
|
|
|
|
* list position for its arc state. Since this requires a linear traversal
|
|
|
|
* callers are strongly encouraged not to do this. However, it can be helpful
|
|
|
|
* for targeted analysis so the functionality is provided.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr = ab->b_hdr;
|
|
|
|
arc_state_t *state = hdr->b_state;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
memset(abi, 0, sizeof (arc_buf_info_t));
|
2013-10-03 04:11:19 +04:00
|
|
|
abi->abi_flags = hdr->b_flags;
|
|
|
|
abi->abi_datacnt = hdr->b_datacnt;
|
|
|
|
abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON;
|
|
|
|
abi->abi_state_contents = hdr->b_type;
|
|
|
|
abi->abi_state_index = -1;
|
|
|
|
abi->abi_size = hdr->b_size;
|
|
|
|
abi->abi_access = hdr->b_arc_access;
|
|
|
|
abi->abi_mru_hits = hdr->b_mru_hits;
|
|
|
|
abi->abi_mru_ghost_hits = hdr->b_mru_ghost_hits;
|
|
|
|
abi->abi_mfu_hits = hdr->b_mfu_hits;
|
|
|
|
abi->abi_mfu_ghost_hits = hdr->b_mfu_ghost_hits;
|
|
|
|
abi->abi_holds = refcount_count(&hdr->b_refcnt);
|
|
|
|
|
|
|
|
if (hdr->b_l2hdr) {
|
|
|
|
abi->abi_l2arc_dattr = hdr->b_l2hdr->b_daddr;
|
|
|
|
abi->abi_l2arc_asize = hdr->b_l2hdr->b_asize;
|
|
|
|
abi->abi_l2arc_compress = hdr->b_l2hdr->b_compress;
|
|
|
|
abi->abi_l2arc_hits = hdr->b_l2hdr->b_hits;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (state && state_index && list_link_active(&hdr->b_arc_node)) {
|
|
|
|
list_t *list = &state->arcs_list[hdr->b_type];
|
|
|
|
arc_buf_hdr_t *h;
|
|
|
|
|
|
|
|
mutex_enter(&state->arcs_mtx);
|
|
|
|
for (h = list_head(list); h != NULL; h = list_next(list, h)) {
|
|
|
|
abi->abi_state_index++;
|
|
|
|
if (h == hdr)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
mutex_exit(&state->arcs_mtx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Move the supplied buffer to the indicated state. The mutex
|
|
|
|
* for the buffer must be held by the caller.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *ab, kmutex_t *hash_lock)
|
|
|
|
{
|
|
|
|
arc_state_t *old_state = ab->b_state;
|
|
|
|
int64_t refcnt = refcount_count(&ab->b_refcnt);
|
|
|
|
uint64_t from_delta, to_delta;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(hash_lock));
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
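The 32-bit overflow described in the notes above can be made concrete with a small sketch. It models a 32-bit unsigned long with uint32_t, shows that 2^32 wraps to 0 when stored in it, and derives a cap from a percentage of physical RAM instead. The names store_in_ulong32() and dirty_data_max_max_from_ram() are hypothetical and exist only for illustration; this is not the code in arc_init().

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 32-bit unsigned long (assumption for illustration). */
typedef uint32_t ulong32_t;

/* Storing a 64-bit value in a 32-bit unsigned type wraps modulo 2^32. */
static ulong32_t
store_in_ulong32(uint64_t value)
{
	return ((ulong32_t)value);
}

/*
 * Hypothetical helper: derive the cap as a percentage of physical RAM.
 * Dividing before multiplying avoids overflowing the 64-bit intermediate
 * at the cost of a small rounding error.
 */
static uint64_t
dirty_data_max_max_from_ram(uint64_t physmem_bytes, unsigned percent)
{
	assert(percent <= 100);
	return (physmem_bytes / 100 * percent);
}
```

With 2 GiB of RAM and 25% the result fits in 32 bits; with 32 GiB it does not, which is the residual case the notes say must be handled by overriding the defaults.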
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
	ASSERT3P(new_state, !=, old_state);
	ASSERT(refcnt == 0 || ab->b_datacnt > 0);
	ASSERT(ab->b_datacnt == 0 || !GHOST_STATE(new_state));
	ASSERT(ab->b_datacnt <= 1 || old_state != arc_anon);

	from_delta = to_delta = ab->b_datacnt * ab->b_size;

	/*
	 * If this buffer is evictable, transfer it from the
	 * old state list to the new state list.
	 */
	if (refcnt == 0) {
		if (old_state != arc_anon) {
			int use_mutex = !MUTEX_HELD(&old_state->arcs_mtx);
			uint64_t *size = &old_state->arcs_lsize[ab->b_type];

			if (use_mutex)
				mutex_enter(&old_state->arcs_mtx);

			ASSERT(list_link_active(&ab->b_arc_node));
			list_remove(&old_state->arcs_list[ab->b_type], ab);

			/*
			 * If prefetching out of the ghost cache,
			 * we will have a non-zero datacnt.
			 */
			if (GHOST_STATE(old_state) && ab->b_datacnt == 0) {
				/* ghost elements have a ghost size */
				ASSERT(ab->b_buf == NULL);
				from_delta = ab->b_size;
			}
			ASSERT3U(*size, >=, from_delta);
			atomic_add_64(size, -from_delta);

			if (use_mutex)
				mutex_exit(&old_state->arcs_mtx);
		}
		if (new_state != arc_anon) {
			int use_mutex = !MUTEX_HELD(&new_state->arcs_mtx);
			uint64_t *size = &new_state->arcs_lsize[ab->b_type];

			if (use_mutex)
				mutex_enter(&new_state->arcs_mtx);

			list_insert_head(&new_state->arcs_list[ab->b_type], ab);

			/* ghost elements have a ghost size */
			if (GHOST_STATE(new_state)) {
				ASSERT(ab->b_datacnt == 0);
				ASSERT(ab->b_buf == NULL);
				to_delta = ab->b_size;
			}
			atomic_add_64(size, to_delta);

			if (use_mutex)
				mutex_exit(&new_state->arcs_mtx);
		}
	}

	ASSERT(!BUF_EMPTY(ab));
	if (new_state == arc_anon && HDR_IN_HASH_TABLE(ab))
		buf_hash_remove(ab);

	/* adjust state sizes */
	if (to_delta)
		atomic_add_64(&new_state->arcs_size, to_delta);
	if (from_delta) {
		ASSERT3U(old_state->arcs_size, >=, from_delta);
		atomic_add_64(&old_state->arcs_size, -from_delta);
	}
	ab->b_state = new_state;

	/* adjust l2arc hdr stats */
	if (new_state == arc_l2c_only)
		l2arc_hdr_stat_add();
	else if (old_state == arc_l2c_only)
		l2arc_hdr_stat_remove();
}

void
arc_space_consume(uint64_t space, arc_space_type_t type)
{
	ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);

	switch (type) {
	default:
		break;
	case ARC_SPACE_DATA:
		ARCSTAT_INCR(arcstat_data_size, space);
		break;
	case ARC_SPACE_META:
		ARCSTAT_INCR(arcstat_meta_size, space);
		break;
	case ARC_SPACE_OTHER:
		ARCSTAT_INCR(arcstat_other_size, space);
		break;
	case ARC_SPACE_HDRS:
		ARCSTAT_INCR(arcstat_hdr_size, space);
		break;
	case ARC_SPACE_L2HDRS:
		ARCSTAT_INCR(arcstat_l2_hdr_size, space);
		break;
	}

	if (type != ARC_SPACE_DATA)
		ARCSTAT_INCR(arcstat_meta_used, space);

	atomic_add_64(&arc_size, space);
}

void
arc_space_return(uint64_t space, arc_space_type_t type)
{
	ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);

	switch (type) {
	default:
		break;
	case ARC_SPACE_DATA:
		ARCSTAT_INCR(arcstat_data_size, -space);
		break;
	case ARC_SPACE_META:
		ARCSTAT_INCR(arcstat_meta_size, -space);
		break;
	case ARC_SPACE_OTHER:
		ARCSTAT_INCR(arcstat_other_size, -space);
		break;
	case ARC_SPACE_HDRS:
		ARCSTAT_INCR(arcstat_hdr_size, -space);
		break;
	case ARC_SPACE_L2HDRS:
		ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
		break;
	}

	if (type != ARC_SPACE_DATA) {
		ASSERT(arc_meta_used >= space);
		if (arc_meta_max < arc_meta_used)
			arc_meta_max = arc_meta_used;
		ARCSTAT_INCR(arcstat_meta_used, -space);
	}

	ASSERT(arc_size >= space);
	atomic_add_64(&arc_size, -space);
}
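The consume/return pairing used by arc_space_consume() and arc_space_return() can be modeled in user space as follows. This is a simplified, single-threaded sketch: space_acct_t, space_consume() and space_return() are hypothetical names, only two space types are modeled, and plain additions stand in for the atomic and kstat updates the real functions use.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
	SPACE_DATA,
	SPACE_META,
	SPACE_NUMTYPES
} space_type_t;

typedef struct {
	uint64_t by_type[SPACE_NUMTYPES];
	uint64_t meta_used;	/* everything that is not SPACE_DATA */
	uint64_t meta_max;	/* high-water mark, sampled on return */
	uint64_t total;
} space_acct_t;

static void
space_consume(space_acct_t *a, uint64_t space, space_type_t type)
{
	assert(type < SPACE_NUMTYPES);
	a->by_type[type] += space;
	if (type != SPACE_DATA)
		a->meta_used += space;
	a->total += space;
}

static void
space_return(space_acct_t *a, uint64_t space, space_type_t type)
{
	assert(type < SPACE_NUMTYPES);
	assert(a->by_type[type] >= space);
	a->by_type[type] -= space;
	if (type != SPACE_DATA) {
		assert(a->meta_used >= space);
		/* Record the peak before lowering the counter. */
		if (a->meta_max < a->meta_used)
			a->meta_max = a->meta_used;
		a->meta_used -= space;
	}
	assert(a->total >= space);
	a->total -= space;
}
```

As in the original, the high-water mark is only sampled on the return path, so it reflects the peak seen at the time of the most recent release.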

arc_buf_t *
arc_buf_alloc(spa_t *spa, int size, void *tag, arc_buf_contents_t type)
{
	arc_buf_hdr_t *hdr;
	arc_buf_t *buf;

	ASSERT3U(size, >, 0);
	hdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
	ASSERT(BUF_EMPTY(hdr));
	hdr->b_size = size;
	hdr->b_type = type;
	hdr->b_spa = spa_load_guid(spa);
	hdr->b_state = arc_anon;
	hdr->b_arc_access = 0;
	hdr->b_mru_hits = 0;
	hdr->b_mru_ghost_hits = 0;
	hdr->b_mfu_hits = 0;
	hdr->b_mfu_ghost_hits = 0;
	hdr->b_l2_hits = 0;
	buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_efunc = NULL;
	buf->b_private = NULL;
	buf->b_next = NULL;
	hdr->b_buf = buf;
	arc_get_data_buf(buf);
	hdr->b_datacnt = 1;
	hdr->b_flags = 0;
	ASSERT(refcount_is_zero(&hdr->b_refcnt));
	(void) refcount_add(&hdr->b_refcnt, tag);

	return (buf);
}

static char *arc_onloan_tag = "onloan";

/*
 * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
 * flight data by arc_tempreserve_space() until they are "returned". Loaned
 * buffers must be returned to the arc before they can be used by the DMU or
 * freed.
 */
arc_buf_t *
arc_loan_buf(spa_t *spa, int size)
{
	arc_buf_t *buf;

	buf = arc_buf_alloc(spa, size, arc_onloan_tag, ARC_BUFC_DATA);

	atomic_add_64(&arc_loaned_bytes, size);
	return (buf);
}

/*
 * Return a loaned arc buffer to the arc.
 */
void
arc_return_buf(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT(buf->b_data != NULL);
	(void) refcount_add(&hdr->b_refcnt, tag);
	(void) refcount_remove(&hdr->b_refcnt, arc_onloan_tag);

	atomic_add_64(&arc_loaned_bytes, -hdr->b_size);
}

/* Detach an arc_buf from a dbuf (tag) */
void
arc_loan_inuse_buf(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr;

	ASSERT(buf->b_data != NULL);
	hdr = buf->b_hdr;
	(void) refcount_add(&hdr->b_refcnt, arc_onloan_tag);
	(void) refcount_remove(&hdr->b_refcnt, tag);
	buf->b_efunc = NULL;
	buf->b_private = NULL;

	atomic_add_64(&arc_loaned_bytes, hdr->b_size);
}

static arc_buf_t *
arc_buf_clone(arc_buf_t *from)
{
	arc_buf_t *buf;
	arc_buf_hdr_t *hdr = from->b_hdr;
	uint64_t size = hdr->b_size;

	ASSERT(hdr->b_state != arc_anon);

	buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_efunc = NULL;
	buf->b_private = NULL;
	buf->b_next = hdr->b_buf;
	hdr->b_buf = buf;
	arc_get_data_buf(buf);
	bcopy(from->b_data, buf->b_data, size);

	/*
	 * This buffer already exists in the arc so create a duplicate
	 * copy for the caller. If the buffer is associated with user data
	 * then track the size and number of duplicates. These stats will be
	 * updated as duplicate buffers are created and destroyed.
	 */
	if (hdr->b_type == ARC_BUFC_DATA) {
		ARCSTAT_BUMP(arcstat_duplicate_buffers);
		ARCSTAT_INCR(arcstat_duplicate_buffers_size, size);
	}
	hdr->b_datacnt += 1;
	return (buf);
}

void
arc_buf_add_ref(arc_buf_t *buf, void* tag)
{
	arc_buf_hdr_t *hdr;
	kmutex_t *hash_lock;

	/*
	 * Check to see if this buffer is evicted. Callers
	 * must verify b_data != NULL to know if the add_ref
	 * was successful.
	 */
	mutex_enter(&buf->b_evict_lock);
	if (buf->b_data == NULL) {
		mutex_exit(&buf->b_evict_lock);
		return;
	}
	hash_lock = HDR_LOCK(buf->b_hdr);
	mutex_enter(hash_lock);
	hdr = buf->b_hdr;
	ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
	mutex_exit(&buf->b_evict_lock);

	ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
	add_reference(hdr, hash_lock, tag);
	DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
	arc_access(hdr, hash_lock);
	mutex_exit(hash_lock);
	ARCSTAT_BUMP(arcstat_hits);
	ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
	    demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
	    data, metadata, hits);
}

/*
 * Free the arc data buffer. If it is an l2arc write in progress,
 * the buffer is placed on l2arc_free_on_write to be freed later.
 */
static void
arc_buf_data_free(arc_buf_t *buf, void (*free_func)(void *, size_t))
{
	arc_buf_hdr_t *hdr = buf->b_hdr;

	if (HDR_L2_WRITING(hdr)) {
		l2arc_data_free_t *df;
		df = kmem_alloc(sizeof (l2arc_data_free_t), KM_PUSHPAGE);
		df->l2df_data = buf->b_data;
		df->l2df_size = hdr->b_size;
		df->l2df_func = free_func;
		mutex_enter(&l2arc_free_on_write_mtx);
		list_insert_head(l2arc_free_on_write, df);
		mutex_exit(&l2arc_free_on_write_mtx);
		ARCSTAT_BUMP(arcstat_l2_free_on_write);
	} else {
		free_func(buf->b_data, hdr->b_size);
	}
}
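The free-on-write pattern in arc_buf_data_free() (free immediately, or queue the free while a writer may still be reading the data) can be sketched in user space as below. All names here (deferred_free_t, data_free(), deferred_free_drain(), plain_free()) are hypothetical, malloc()/free() stand in for the kernel allocators, and the mutex that protects the real list is omitted, so this version is single-threaded only.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*free_func_t)(void *, size_t);

typedef struct deferred_free {
	void *data;
	size_t size;
	free_func_t func;
	struct deferred_free *next;
} deferred_free_t;

static deferred_free_t *deferred_head = NULL;

/*
 * Free `data' now, or queue the free if a write is still in progress.
 * Returns 0 if freed, 1 if deferred, -1 if the queue entry could not
 * be allocated.
 */
static int
data_free(void *data, size_t size, free_func_t func, int write_in_progress)
{
	if (write_in_progress) {
		deferred_free_t *df = malloc(sizeof (*df));

		if (df == NULL)
			return (-1);
		df->data = data;
		df->size = size;
		df->func = func;
		df->next = deferred_head;
		deferred_head = df;
		return (1);
	}
	func(data, size);
	return (0);
}

/* Run every queued free; returns how many entries were released. */
static size_t
deferred_free_drain(void)
{
	size_t n = 0;

	while (deferred_head != NULL) {
		deferred_free_t *df = deferred_head;

		deferred_head = df->next;
		df->func(df->data, df->size);
		free(df);
		n++;
	}
	return (n);
}

/* Example free function matching the (void *, size_t) signature. */
static void
plain_free(void *data, size_t size)
{
	(void) size;
	free(data);
}
```

Storing the free function alongside the data lets one queue serve buffers that came from different allocators, which is why the original passes free_func through rather than hard-coding it.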

static void
arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all)
{
	arc_buf_t **bufp;

	/* free up data associated with the buf */
	if (buf->b_data) {
		arc_state_t *state = buf->b_hdr->b_state;
		uint64_t size = buf->b_hdr->b_size;
		arc_buf_contents_t type = buf->b_hdr->b_type;

		arc_cksum_verify(buf);
		arc_buf_unwatch(buf);

		if (!recycle) {
			if (type == ARC_BUFC_METADATA) {
				arc_buf_data_free(buf, zio_buf_free);
				arc_space_return(size, ARC_SPACE_META);
			} else {
				ASSERT(type == ARC_BUFC_DATA);
				arc_buf_data_free(buf, zio_data_buf_free);
				arc_space_return(size, ARC_SPACE_DATA);
			}
		}
		if (list_link_active(&buf->b_hdr->b_arc_node)) {
			uint64_t *cnt = &state->arcs_lsize[type];

			ASSERT(refcount_is_zero(&buf->b_hdr->b_refcnt));
			ASSERT(state != arc_anon);

			ASSERT3U(*cnt, >=, size);
			atomic_add_64(cnt, -size);
		}
		ASSERT3U(state->arcs_size, >=, size);
		atomic_add_64(&state->arcs_size, -size);
		buf->b_data = NULL;

		/*
		 * If we're destroying a duplicate buffer make sure
		 * that the appropriate statistics are updated.
		 */
		if (buf->b_hdr->b_datacnt > 1 &&
		    buf->b_hdr->b_type == ARC_BUFC_DATA) {
			ARCSTAT_BUMPDOWN(arcstat_duplicate_buffers);
			ARCSTAT_INCR(arcstat_duplicate_buffers_size, -size);
		}
		ASSERT(buf->b_hdr->b_datacnt > 0);
		buf->b_hdr->b_datacnt -= 1;
	}

	/* only remove the buf if requested */
	if (!all)
		return;

	/* remove the buf from the hdr list */
	for (bufp = &buf->b_hdr->b_buf; *bufp != buf; bufp = &(*bufp)->b_next)
		continue;
	*bufp = buf->b_next;
	buf->b_next = NULL;

	ASSERT(buf->b_efunc == NULL);

	/* clean up the buf */
	buf->b_hdr = NULL;
	kmem_cache_free(buf_cache, buf);
}
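The "remove the buf from the hdr list" loop in arc_buf_destroy() uses the pointer-to-pointer idiom for unlinking from a singly linked list: it walks the link fields themselves, so the head and interior cases need no special handling. A minimal stand-alone sketch follows; node_t and list_unlink() are illustrative names, not part of arc.c.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
	int id;
	struct node *next;
} node_t;

/*
 * Unlink `target' from the list rooted at *headp.  The caller must
 * guarantee that `target' is on the list; otherwise the walk runs off
 * the end, exactly as the loop in arc_buf_destroy() would.
 */
static void
list_unlink(node_t **headp, node_t *target)
{
	node_t **np;

	/* np always points at the link that leads to the current node. */
	for (np = headp; *np != target; np = &(*np)->next)
		continue;
	*np = target->next;
	target->next = NULL;
}
```

Because np addresses the link rather than the node, the same two assignments handle removing the first, a middle, or the last element.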

static void
arc_hdr_destroy(arc_buf_hdr_t *hdr)
{
	l2arc_buf_hdr_t *l2hdr = hdr->b_l2hdr;

	ASSERT(refcount_is_zero(&hdr->b_refcnt));
	ASSERT3P(hdr->b_state, ==, arc_anon);
	ASSERT(!HDR_IO_IN_PROGRESS(hdr));

	if (l2hdr != NULL) {
		boolean_t buflist_held = MUTEX_HELD(&l2arc_buflist_mtx);
		/*
		 * To prevent arc_free() and l2arc_evict() from
		 * attempting to free the same buffer at the same time,
		 * a FREE_IN_PROGRESS flag is given to arc_free() to
		 * give it priority. l2arc_evict() can't destroy this
		 * header while we are waiting on l2arc_buflist_mtx.
		 *
		 * The hdr may be removed from l2ad_buflist before we
		 * grab l2arc_buflist_mtx, so b_l2hdr is rechecked.
		 */
		if (!buflist_held) {
			mutex_enter(&l2arc_buflist_mtx);
			l2hdr = hdr->b_l2hdr;
		}

		if (l2hdr != NULL) {
			list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
			ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
			ARCSTAT_INCR(arcstat_l2_asize, -l2hdr->b_asize);
			vdev_space_update(l2hdr->b_dev->l2ad_vdev,
			    -l2hdr->b_asize, 0, 0);
			kmem_cache_free(l2arc_hdr_cache, l2hdr);
Fix inaccurate arcstat_l2_hdr_size calculations
Based on the comments in arc.c we know that buffers can exist in both
the arc and the l2arc; in that case both an arc_buf_hdr_t and an
l2arc_buf_hdr_t are allocated. However, the current logic only
accounts for the memory that an l2arc_buf_hdr_t takes up when the
buffer's state changes to or from arc_l2c_only. This will cause obvious
deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr_t)
is larger than ZOL's. We can implement the calculation in the
following simple way:
1. When we allocate an l2arc_buf_hdr_t we add its memory consumption
immediately and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
the buffer only stays in l2arc we should also add the memory
its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to
arcstat_l2_hdr_size since L2HDR_SIZE is already accounted for
in step 1; the same applies when transferring arc bufs out of the
l2arc-only state.
The test box has two 4-core Intel Xeon CPUs (2.13GHz) and 16GB of
memory, and the tests were set up in the following way:
1. Fdisked a SATA disk into two partitions, one partition for zpool
storage and the other one used as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc related statistics
in /proc/spl/kstat/zfs/arcstats. After the reading ended the
l2_hdr_size and l2_size were shown like this:
l2_size 4 4403780608
l2_hdr_size 4 0
which was weird.
4. After applying this patch and rerunning steps 1-3, the results were
as follows:
l2_size 4 4306443264
l2_hdr_size 4 535600
These numbers make sense: on 64-bit systems the
sizeof(l2arc_buf_hdr_t) is 16 bytes. Assume all blocks cached by
l2arc are 128KB; then 535600/16*128*1024=4387635200. Since not all
blocks are equal-sized, the theoretical result will be a little
bigger, as we can see.
Since I'm familiar with the systemtap instrumentation tool I used it to
examine what had happened. The script looked like this:
probe module("zfs").function("arc_change_state")
{
    if ($new_state == $arc_l2c_only)
        printf("change arc buf to arc_l2c_only\n")
}
It prints out some information each time the function
arc_change_state is called with the argument new_state set to
arc_l2c_only. I gathered the trace logs and found that none of the arc
bufs entered the arc_l2c_only state while the tests were running; this
was the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2c_only when the pool or the filesystem was offlined.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
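The back-of-the-envelope check in the message above (header bytes divided by header size gives a header count, multiplied by an assumed uniform block size) can be written out as a tiny helper. The function name l2_size_estimate() is hypothetical; the 16-byte header size and 128KB block size are the figures quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the bytes cached in the l2arc from the bytes spent on
 * l2arc headers, assuming every cached block has the same size.
 */
static uint64_t
l2_size_estimate(uint64_t l2_hdr_size, uint64_t hdr_bytes,
    uint64_t block_bytes)
{
	assert(hdr_bytes != 0);
	return (l2_hdr_size / hdr_bytes * block_bytes);
}
```

Calling l2_size_estimate(535600, 16, 128 * 1024) reproduces the 4387635200 figure, which sits slightly above the reported l2_size of 4306443264 because not every cached block is a full 128KB.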
2013-06-22 16:35:18 +04:00
			arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
			if (hdr->b_state == arc_l2c_only)
				l2arc_hdr_stat_remove();
			hdr->b_l2hdr = NULL;
		}

		if (!buflist_held)
			mutex_exit(&l2arc_buflist_mtx);
	}

	if (!BUF_EMPTY(hdr)) {
		ASSERT(!HDR_IN_HASH_TABLE(hdr));
		buf_discard_identity(hdr);
	}
	while (hdr->b_buf) {
		arc_buf_t *buf = hdr->b_buf;

		if (buf->b_efunc) {
			mutex_enter(&arc_eviction_mtx);
			mutex_enter(&buf->b_evict_lock);
			ASSERT(buf->b_hdr != NULL);
			arc_buf_destroy(hdr->b_buf, FALSE, FALSE);
			hdr->b_buf = buf->b_next;
			buf->b_hdr = &arc_eviction_hdr;
			buf->b_next = arc_eviction_list;
			arc_eviction_list = buf;
			mutex_exit(&buf->b_evict_lock);
			mutex_exit(&arc_eviction_mtx);
		} else {
			arc_buf_destroy(hdr->b_buf, FALSE, TRUE);
		}
	}
	if (hdr->b_freeze_cksum != NULL) {
		kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
		hdr->b_freeze_cksum = NULL;
	}

	ASSERT(!list_link_active(&hdr->b_arc_node));
	ASSERT3P(hdr->b_hash_next, ==, NULL);
	ASSERT3P(hdr->b_acb, ==, NULL);
	kmem_cache_free(hdr_cache, hdr);
}

void
arc_buf_free(arc_buf_t *buf, void *tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	int hashed = hdr->b_state != arc_anon;

	ASSERT(buf->b_efunc == NULL);
	ASSERT(buf->b_data != NULL);

	if (hashed) {
		kmutex_t *hash_lock = HDR_LOCK(hdr);

		mutex_enter(hash_lock);
		hdr = buf->b_hdr;
		ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));

		(void) remove_reference(hdr, hash_lock, tag);
		if (hdr->b_datacnt > 1) {
			arc_buf_destroy(buf, FALSE, TRUE);
		} else {
			ASSERT(buf == hdr->b_buf);
			ASSERT(buf->b_efunc == NULL);
			hdr->b_flags |= ARC_BUF_AVAILABLE;
		}
		mutex_exit(hash_lock);
	} else if (HDR_IO_IN_PROGRESS(hdr)) {
		int destroy_hdr;
		/*
		 * We are in the middle of an async write. Don't destroy
		 * this buffer unless the write completes before we finish
		 * decrementing the reference count.
		 */
		mutex_enter(&arc_eviction_mtx);
		(void) remove_reference(hdr, NULL, tag);
		ASSERT(refcount_is_zero(&hdr->b_refcnt));
		destroy_hdr = !HDR_IO_IN_PROGRESS(hdr);
		mutex_exit(&arc_eviction_mtx);
		if (destroy_hdr)
			arc_hdr_destroy(hdr);
	} else {
		if (remove_reference(hdr, NULL, tag) > 0)
			arc_buf_destroy(buf, FALSE, TRUE);
		else
			arc_hdr_destroy(hdr);
	}
}

boolean_t
arc_buf_remove_ref(arc_buf_t *buf, void* tag)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	kmutex_t *hash_lock = NULL;
	boolean_t no_callback = (buf->b_efunc == NULL);

	if (hdr->b_state == arc_anon) {
		ASSERT(hdr->b_datacnt == 1);
		arc_buf_free(buf, tag);
		return (no_callback);
	}

	hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);
	hdr = buf->b_hdr;
	ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
	ASSERT(hdr->b_state != arc_anon);
	ASSERT(buf->b_data != NULL);

	(void) remove_reference(hdr, hash_lock, tag);
	if (hdr->b_datacnt > 1) {
		if (no_callback)
			arc_buf_destroy(buf, FALSE, TRUE);
	} else if (no_callback) {
		ASSERT(hdr->b_buf == buf && buf->b_next == NULL);
		ASSERT(buf->b_efunc == NULL);
		hdr->b_flags |= ARC_BUF_AVAILABLE;
	}
	ASSERT(no_callback || hdr->b_datacnt > 1 ||
	    refcount_is_zero(&hdr->b_refcnt));
	mutex_exit(hash_lock);
	return (no_callback);
}

int
arc_buf_size(arc_buf_t *buf)
{
	return (buf->b_hdr->b_size);
}

/*
 * Called from the DMU to determine if the current buffer should be
 * evicted. In order to ensure proper locking, the eviction must be initiated
 * from the DMU. Return true if the buffer is associated with user data and
 * duplicate buffers still exist.
 */
boolean_t
arc_buf_eviction_needed(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr;
	boolean_t evict_needed = B_FALSE;

	if (zfs_disable_dup_eviction)
		return (B_FALSE);

	mutex_enter(&buf->b_evict_lock);
	hdr = buf->b_hdr;
	if (hdr == NULL) {
		/*
		 * We are in arc_do_user_evicts(); let that function
		 * perform the eviction.
		 */
		ASSERT(buf->b_data == NULL);
		mutex_exit(&buf->b_evict_lock);
		return (B_FALSE);
	} else if (buf->b_data == NULL) {
		/*
		 * We have already been added to the arc eviction list;
		 * recommend eviction.
		 */
		ASSERT3P(hdr, ==, &arc_eviction_hdr);
		mutex_exit(&buf->b_evict_lock);
		return (B_TRUE);
	}

	if (hdr->b_datacnt > 1 && hdr->b_type == ARC_BUFC_DATA)
		evict_needed = B_TRUE;

	mutex_exit(&buf->b_evict_lock);
	return (evict_needed);
}

/*
 * Evict buffers from list until we've removed the specified number of
 * bytes. Move the removed buffers to the appropriate evict state.
 * If the recycle flag is set, then attempt to "recycle" a buffer:
 * - look for a buffer to evict that is `bytes' long.
 * - return the data block from this buffer rather than freeing it.
 * This flag is used by callers that are trying to make space for a
 * new buffer in a full arc cache.
 *
 * This function makes a "best effort". It skips over any buffers
 * it can't get a hash_lock on, and so may not catch all candidates.
 * It may also return without evicting as much space as requested.
 */
static void *
arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle,
    arc_buf_contents_t type)
{
	arc_state_t *evicted_state;
	uint64_t bytes_evicted = 0, skipped = 0, missed = 0;
	arc_buf_hdr_t *ab, *ab_prev = NULL;
	list_t *list = &state->arcs_list[type];
	kmutex_t *hash_lock;
	boolean_t have_lock;
	void *stolen = NULL;
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
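One way to picture the dynamic tuning of concurrent async writes described in item 1 is a linear ramp between a minimum and maximum number of active i/os as the amount of dirty data grows. The sketch below is an illustration under that assumption only; the function name async_write_max_active() and its parameters are hypothetical, and the authoritative description is the block comment in vdev_queue.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: scale the async write limit linearly from
 * min_active to max_active as dirty data rises from min_dirty_pct to
 * max_dirty_pct of dirty_max.
 */
static uint32_t
async_write_max_active(uint64_t dirty, uint64_t dirty_max,
    uint32_t min_active, uint32_t max_active,
    uint32_t min_dirty_pct, uint32_t max_dirty_pct)
{
	uint64_t min_bytes = dirty_max / 100 * min_dirty_pct;
	uint64_t max_bytes = dirty_max / 100 * max_dirty_pct;

	assert(min_active <= max_active);
	assert(min_dirty_pct < max_dirty_pct && max_dirty_pct <= 100);

	if (dirty <= min_bytes)
		return (min_active);
	if (dirty >= max_bytes)
		return (max_active);

	/* min_bytes < dirty < max_bytes here, so the divisor is non-zero. */
	return (min_active + (uint32_t)((dirty - min_bytes) *
	    (max_active - min_active) / (max_bytes - min_bytes)));
}
```

With little dirty data the limit stays at the minimum, which favors sync i/o latency; as dirty data approaches the upper threshold the limit rises toward the maximum, which favors write throughput.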
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
	arc_buf_hdr_t marker = {{{ 0 }}};
	int count = 0;

	ASSERT(state == arc_mru || state == arc_mfu);

	evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
Prioritize "metadata" in arc_get_data_buf
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
one or more older "metadata" buffers will be removed from the arc, preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer, unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost, such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
top:
	mutex_enter(&state->arcs_mtx);
	mutex_enter(&evicted_state->arcs_mtx);

	for (ab = list_tail(list); ab; ab = ab_prev) {
		ab_prev = list_prev(list, ab);
		/* prefetch buffers have a minimum lifespan */
		if (HDR_IO_IN_PROGRESS(ab) ||
		    (spa && ab->b_spa != spa) ||
		    (ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
		    ddi_get_lbolt() - ab->b_arc_access <
		    zfs_arc_min_prefetch_lifespan)) {
			skipped++;
			continue;
		}
		/* "lookahead" for better eviction candidate */
		if (recycle && ab->b_size != bytes &&
		    ab_prev && ab_prev->b_size == bytes)
			continue;
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
		/* ignore markers */
		if (ab->b_spa == 0)
			continue;

		/*
		 * It may take a long time to evict all the bufs requested.
		 * To avoid blocking all arc activity, periodically drop
		 * the arcs_mtx and give other threads a chance to run
		 * before reacquiring the lock.
		 *
		 * If we are looking for a buffer to recycle, we are in
		 * the hot code path, so don't sleep.
		 */
		if (!recycle && count++ > arc_evict_iterations) {
			list_insert_after(list, ab, &marker);
			mutex_exit(&evicted_state->arcs_mtx);
			mutex_exit(&state->arcs_mtx);
			kpreempt(KPREEMPT_SYNC);
			mutex_enter(&state->arcs_mtx);
			mutex_enter(&evicted_state->arcs_mtx);
			ab_prev = list_prev(list, &marker);
			list_remove(list, &marker);
			count = 0;
			continue;
		}

		hash_lock = HDR_LOCK(ab);
		have_lock = MUTEX_HELD(hash_lock);
		if (have_lock || mutex_tryenter(hash_lock)) {
			ASSERT0(refcount_count(&ab->b_refcnt));
			ASSERT(ab->b_datacnt > 0);
			while (ab->b_buf) {
				arc_buf_t *buf = ab->b_buf;
				if (!mutex_tryenter(&buf->b_evict_lock)) {
					missed += 1;
					break;
				}
				if (buf->b_data) {
					bytes_evicted += ab->b_size;
					if (recycle && ab->b_type == type &&
					    ab->b_size == bytes &&
					    !HDR_L2_WRITING(ab)) {
						stolen = buf->b_data;
						recycle = FALSE;
					}
				}
				if (buf->b_efunc) {
					mutex_enter(&arc_eviction_mtx);
					arc_buf_destroy(buf,
					    buf->b_data == stolen, FALSE);
					ab->b_buf = buf->b_next;
					buf->b_hdr = &arc_eviction_hdr;
					buf->b_next = arc_eviction_list;
					arc_eviction_list = buf;
					mutex_exit(&arc_eviction_mtx);
					mutex_exit(&buf->b_evict_lock);
				} else {
					mutex_exit(&buf->b_evict_lock);
					arc_buf_destroy(buf,
					    buf->b_data == stolen, TRUE);
				}
			}

			if (ab->b_l2hdr) {
				ARCSTAT_INCR(arcstat_evict_l2_cached,
				    ab->b_size);
			} else {
				if (l2arc_write_eligible(ab->b_spa, ab)) {
					ARCSTAT_INCR(arcstat_evict_l2_eligible,
					    ab->b_size);
				} else {
					ARCSTAT_INCR(
					    arcstat_evict_l2_ineligible,
					    ab->b_size);
				}
			}

			if (ab->b_datacnt == 0) {
				arc_change_state(evicted_state, ab, hash_lock);
				ASSERT(HDR_IN_HASH_TABLE(ab));
				ab->b_flags |= ARC_IN_HASH_TABLE;
				ab->b_flags &= ~ARC_BUF_AVAILABLE;
				DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, ab);
			}
			if (!have_lock)
				mutex_exit(hash_lock);
			if (bytes >= 0 && bytes_evicted >= bytes)
				break;
		} else {
			missed += 1;
		}
	}

	mutex_exit(&evicted_state->arcs_mtx);
	mutex_exit(&state->arcs_mtx);
	if (list == &state->arcs_list[ARC_BUFC_DATA] &&
	    (bytes < 0 || bytes_evicted < bytes)) {
		/* Prevent second pass from recycling metadata into data */
		recycle = FALSE;
		type = ARC_BUFC_METADATA;
		list = &state->arcs_list[type];
		goto top;
	}

	if (bytes_evicted < bytes)
		dprintf("only evicted %lld bytes from %x\n",
		    (longlong_t)bytes_evicted, state);

	if (skipped)
		ARCSTAT_INCR(arcstat_evict_skip, skipped);

	if (missed)
		ARCSTAT_INCR(arcstat_mutex_miss, missed);

	/*
	 * Note: we have just evicted some data into the ghost state,
	 * potentially putting the ghost size over the desired size. Rather
	 * than evicting from the ghost list in this hot code path, leave
	 * this chore to the arc_reclaim_thread().
	 */

	return (stolen);
}

/*
 * Remove buffers from list until we've removed the specified number of
 * bytes. Destroy the buffers that are removed.
 */
static void
arc_evict_ghost(arc_state_t *state, uint64_t spa, int64_t bytes,
    arc_buf_contents_t type)
{
	arc_buf_hdr_t *ab, *ab_prev;
	arc_buf_hdr_t marker;
	list_t *list = &state->arcs_list[type];
	kmutex_t *hash_lock;
	uint64_t bytes_deleted = 0;
	uint64_t bufs_skipped = 0;
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The last four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
	int count = 0;
2008-11-20 23:01:55 +03:00

	ASSERT(GHOST_STATE(state));
2013-11-01 23:26:11 +04:00
	bzero(&marker, sizeof (marker));
2008-11-20 23:01:55 +03:00
top:
	mutex_enter(&state->arcs_mtx);
	for (ab = list_tail(list); ab; ab = ab_prev) {
		ab_prev = list_prev(list, ab);
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
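As a rough illustration of the dynamic tuning described in item 1, the sketch below scales the async write concurrency limit linearly with the amount of dirty data. It assumes the limit sits at the minimum below the lower dirty-percent threshold and at the maximum above the upper one, which is what the tunable names zfs_vdev_async_write_active_{min,max}_dirty_percent suggest. The struct, the function name, and the sample values are hypothetical and are not taken from vdev_queue.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the zfs_vdev_async_write_* tunables. */
typedef struct sk_aw_tunables {
	uint32_t min_active;		/* cf. ..._async_write_min_active */
	uint32_t max_active;		/* cf. ..._async_write_max_active */
	uint32_t min_dirty_percent;	/* cf. ..._active_min_dirty_percent */
	uint32_t max_dirty_percent;	/* cf. ..._active_max_dirty_percent */
} sk_aw_tunables_t;

/*
 * Return how many async writes may be outstanding for a given amount
 * of dirty data. Assumption: linear interpolation between the two
 * dirty-percent thresholds, clamped outside them.
 */
static uint32_t
sk_async_write_limit(const sk_aw_tunables_t *t, uint64_t dirty,
    uint64_t dirty_max)
{
	uint64_t min_bytes, max_bytes;

	assert(t->min_active <= t->max_active);
	assert(t->min_dirty_percent <= t->max_dirty_percent);

	min_bytes = dirty_max * t->min_dirty_percent / 100;
	max_bytes = dirty_max * t->max_dirty_percent / 100;

	if (dirty <= min_bytes)
		return (t->min_active);
	if (dirty >= max_bytes)
		return (t->max_active);

	/* Here min_bytes < dirty < max_bytes, so the divisor is nonzero. */
	return ((uint32_t)(t->min_active +
	    (dirty - min_bytes) * (t->max_active - t->min_active) /
	    (max_bytes - min_bytes)));
}
```

With a light write load the limit stays low, which leaves device queue slots for sync i/o; as dirty data accumulates the limit rises toward the maximum to favor write throughput.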
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
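The idea in item 2, a per-transaction delay that depends only on the current amount of dirty data, can be sketched as follows. The curve used here (zero delay below a dirty-percent threshold, then growing steeply as dirty data approaches the maximum) is an assumption for illustration; the text above only says the delay is based on the amount of dirty data. The function name, its parameters, and the delay cap are hypothetical, loosely named after zfs_delay_scale, zfs_delay_min_dirty_percent and zfs_dirty_data_max.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a dirty-data-based write throttle. Returns the delay, in
 * nanoseconds, to apply to one transaction.
 *
 * Assumed curve: delay_scale * (dirty - min) / (dirty_max - dirty),
 * capped at max_delay_ns. Inputs are assumed small enough that the
 * 64-bit multiplication does not overflow.
 */
static uint64_t
sk_tx_delay_ns(uint64_t dirty, uint64_t dirty_max,
    uint32_t min_dirty_percent, uint64_t delay_scale,
    uint64_t max_delay_ns)
{
	uint64_t min_bytes = dirty_max * min_dirty_percent / 100;
	uint64_t delay;

	assert(min_dirty_percent <= 100);

	/* Below the threshold, transactions are not delayed at all. */
	if (dirty <= min_bytes)
		return (0);

	/* At or beyond the limit, apply the largest allowed delay. */
	if (dirty >= dirty_max)
		return (max_delay_ns);

	delay = delay_scale * (dirty - min_bytes) / (dirty_max - dirty);
	return (delay < max_delay_ns ? delay : max_delay_ns);
}
```

Because the delay is a smooth function of dirty data rather than a wait for the next txg, every writer sees a small, similar delay under sustained load instead of an occasional multi-second stall.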
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The last four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
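The 32-bit overflow concern raised in the porting notes for zfs_dirty_data_max_max can be made concrete with a small sketch. It models a 32-bit unsigned long with uint32_t, shows that a fixed default of 2^32 bytes wraps to zero, and computes a default as a percentage of physical memory instead. All names here are hypothetical; the real initialization lives in arc_init() and is not reproduced.

```c
#include <assert.h>
#include <stdint.h>

/* Models `unsigned long` on a 32-bit architecture (assumption). */
typedef uint32_t sk_ulong32_t;

/* Store a fixed 2^32-byte default in a 32-bit slot: it wraps to 0. */
static sk_ulong32_t
sk_store_fixed_default(void)
{
	uint64_t four_gib = 1ULL << 32;

	return ((sk_ulong32_t)four_gib);
}

/* Default expressed as a percentage of physical memory, in bytes. */
static uint64_t
sk_percent_of_physmem(uint64_t physmem_bytes, uint32_t percent)
{
	return (physmem_bytes * percent / 100);
}

/* Nonzero if the value can be stored in a 32-bit unsigned long. */
static int
sk_fits_32bit(uint64_t value)
{
	return (value <= UINT32_MAX);
}
```

With 2 GiB of RAM, 25% is 512 MiB and fits comfortably. With 16 GiB, 25% is 4 GiB and still does not fit in 32 bits, which matches the note that the percentage-based default reduces but does not eliminate the overflow risk.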
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
		if (ab->b_type > ARC_BUFC_NUMTYPES)
			panic("invalid ab=%p", (void *)ab);
2008-11-20 23:01:55 +03:00
		if (spa && ab->b_spa != spa)
			continue;
2010-08-27 01:24:34 +04:00

		/* ignore markers */
		if (ab->b_spa == 0)
			continue;

2008-11-20 23:01:55 +03:00
		hash_lock = HDR_LOCK(ab);
2010-05-29 00:45:14 +04:00
		/* caller may be trying to modify this buffer, skip it */
		if (MUTEX_HELD(hash_lock))
			continue;
2013-08-29 07:01:20 +04:00

		/*
		 * It may take a long time to evict all the bufs requested.
		 * To avoid blocking all arc activity, periodically drop
		 * the arcs_mtx and give other threads a chance to run
		 * before reacquiring the lock.
		 */
		if (count++ > arc_evict_iterations) {
			list_insert_after(list, ab, &marker);
			mutex_exit(&state->arcs_mtx);
			kpreempt(KPREEMPT_SYNC);
			mutex_enter(&state->arcs_mtx);
			ab_prev = list_prev(list, &marker);
			list_remove(list, &marker);
			count = 0;
			continue;
		}
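The marker technique used in the block above (park a placeholder in the list, release the lock, then resume from the placeholder) can be shown in isolation. The following is a minimal user-space sketch, not the kernel code: the list type, the is_marker flag, and the yield callback standing in for the lock drop are all hypothetical. The real code identifies markers by b_spa == 0 and uses list_t and arcs_mtx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal circular list; not the kernel's list_t. */
typedef struct sk_node {
	struct sk_node *prev, *next;
	int is_marker;
} sk_node_t;

typedef struct sk_list { sk_node_t head; } sk_list_t;

static void
sk_init(sk_list_t *l)
{
	l->head.prev = l->head.next = &l->head;
	l->head.is_marker = 0;
}

static void
sk_insert_after(sk_node_t *pos, sk_node_t *n)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

static void
sk_remove(sk_node_t *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
}

static sk_node_t *
sk_tail(sk_list_t *l)
{
	return (l->head.prev == &l->head ? NULL : l->head.prev);
}

static sk_node_t *
sk_prev(sk_list_t *l, sk_node_t *n)
{
	return (n->prev == &l->head ? NULL : n->prev);
}

/* Walk tail to head; after `batch` nodes, park a marker and yield. */
static int
sk_walk(sk_list_t *l, int batch, void (*yield)(void *), void *arg)
{
	sk_node_t marker = { NULL, NULL, 1 };
	sk_node_t *n, *prev;
	int visited = 0, count = 0;

	for (n = sk_tail(l); n != NULL; n = prev) {
		prev = sk_prev(l, n);
		if (n->is_marker)
			continue;	/* some other walker's placeholder */
		if (count++ >= batch) {
			sk_insert_after(n, &marker);
			yield(arg);	/* the list lock would be dropped here */
			prev = sk_prev(l, &marker);	/* n may be gone by now */
			sk_remove(&marker);
			count = 0;
			continue;
		}
		visited++;
	}
	return (visited);
}

/* Simulates another thread removing one node while we yielded. */
static void
sk_remove_once(void *arg)
{
	sk_node_t **victim = arg;

	if (*victim != NULL) {
		sk_remove(*victim);
		*victim = NULL;
	}
}

static void sk_noop(void *arg) { (void)arg; }

/* Build a five-node list and walk it with a batch size of two. */
static int
sk_demo(int remove_during_yield)
{
	sk_list_t l;
	sk_node_t nodes[5] = { { NULL, NULL, 0 } };
	sk_node_t *victim = &nodes[2];
	int i;

	sk_init(&l);
	for (i = 0; i < 5; i++)
		sk_insert_after(l.head.prev, &nodes[i]);	/* append */
	if (remove_during_yield)
		return (sk_walk(&l, 2, sk_remove_once, &victim));
	return (sk_walk(&l, 2, sk_noop, NULL));
}
```

The marker stays linked while the lock is released, so the walker's position survives even if the node it was looking at is removed in the meantime. In sk_demo(1) the node being examined disappears during the yield and the walk simply continues with its predecessor.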
2008-11-20 23:01:55 +03:00
		if (mutex_tryenter(hash_lock)) {
			ASSERT(!HDR_IO_IN_PROGRESS(ab));
			ASSERT(ab->b_buf == NULL);
			ARCSTAT_BUMP(arcstat_deleted);
			bytes_deleted += ab->b_size;

			if (ab->b_l2hdr != NULL) {
				/*
				 * This buffer is cached on the 2nd Level ARC;
				 * don't destroy the header.
				 */
				arc_change_state(arc_l2c_only, ab, hash_lock);
				mutex_exit(hash_lock);
			} else {
				arc_change_state(arc_anon, ab, hash_lock);
				mutex_exit(hash_lock);
				arc_hdr_destroy(ab);
			}

			DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, ab);
			if (bytes >= 0 && bytes_deleted >= bytes)
				break;
2010-08-27 01:24:34 +04:00
		} else if (bytes < 0) {
			/*
			 * Insert a list marker and then wait for the
			 * hash lock to become available. Once it's
			 * available, restart from where we left off.
			 */
			list_insert_after(list, ab, &marker);
			mutex_exit(&state->arcs_mtx);
			mutex_enter(hash_lock);
			mutex_exit(hash_lock);
			mutex_enter(&state->arcs_mtx);
			ab_prev = list_prev(list, &marker);
			list_remove(list, &marker);
2013-08-29 07:01:20 +04:00
		} else {
2008-11-20 23:01:55 +03:00
			bufs_skipped += 1;
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(&state->arcs_mtx);
|
|
|
|
|
|
|
|
if (list == &state->arcs_list[ARC_BUFC_DATA] &&
|
|
|
|
(bytes < 0 || bytes_deleted < bytes)) {
|
|
|
|
list = &state->arcs_list[ARC_BUFC_METADATA];
|
|
|
|
goto top;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bufs_skipped) {
|
|
|
|
ARCSTAT_INCR(arcstat_mutex_miss, bufs_skipped);
|
|
|
|
ASSERT(bytes >= 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bytes_deleted < bytes)
|
2010-08-26 21:28:31 +04:00
|
|
|
dprintf("only deleted %lld bytes from %p\n",
|
2008-11-20 23:01:55 +03:00
|
|
|
(longlong_t)bytes_deleted, state);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_adjust(void)
|
|
|
|
{
|
2009-02-18 23:51:31 +03:00
|
|
|
int64_t adjustment, delta;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust MRU size
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
adjustment = MIN((int64_t)(arc_size - arc_c),
|
2014-01-03 22:11:14 +04:00
|
|
|
(int64_t)(arc_anon->arcs_size + arc_mru->arcs_size - arc_p));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Prioritize "metadata" in arc_get_data_buf
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
one or more older "metadata" buffers will be removed from the arc, preserving
the 9G "data" to 1G "metadata" ratio that was in place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer, unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost, such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
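
The type-selection rule described in this commit message can be sketched as a small pure function. This is only an illustration under stated assumptions: the names sketch_bufc_t and sketch_evict_type are hypothetical and do not appear in arc.c, and the real arc_get_data_buf also handles recycling and locking that are omitted here.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the ARC buffer content types. */
typedef enum {
	SKETCH_BUFC_DATA,
	SKETCH_BUFC_METADATA
} sketch_bufc_t;

/*
 * Choose which kind of buffer to evict when the cache is full and a
 * buffer of type "adding" is being inserted.  "data" is evicted by
 * default; "metadata" is evicted only when a metadata buffer is being
 * added and metadata usage has already reached its limit.
 */
static sketch_bufc_t
sketch_evict_type(sketch_bufc_t adding, uint64_t meta_used,
    uint64_t meta_limit)
{
	if (adding == SKETCH_BUFC_METADATA && meta_used >= meta_limit)
		return (SKETCH_BUFC_METADATA);

	return (SKETCH_BUFC_DATA);
}
```

With the scenario above (1G of metadata against a 4G limit), adding a metadata buffer selects "data" for eviction, so the metadata share of the cache is free to grow until it reaches the limit.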
2013-12-30 21:30:00 +04:00
|
|
|
if (adjustment > 0 && arc_mru->arcs_size > 0) {
|
|
|
|
delta = MIN(arc_mru->arcs_size, adjustment);
|
2010-08-26 20:52:39 +04:00
|
|
|
(void) arc_evict(arc_mru, 0, delta, FALSE, ARC_BUFC_DATA);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
/*
|
|
|
|
* Adjust MFU size
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
adjustment = arc_size - arc_c;
|
|
|
|
|
2013-12-30 21:30:00 +04:00
|
|
|
if (adjustment > 0 && arc_mfu->arcs_size > 0) {
|
|
|
|
delta = MIN(arc_mfu->arcs_size, adjustment);
|
2010-08-26 20:52:39 +04:00
|
|
|
(void) arc_evict(arc_mfu, 0, delta, FALSE, ARC_BUFC_DATA);
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
/*
|
|
|
|
* Adjust ghost lists
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
adjustment = arc_mru->arcs_size + arc_mru_ghost->arcs_size - arc_c;
|
|
|
|
|
|
|
|
if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
|
|
|
|
delta = MIN(arc_mru_ghost->arcs_size, adjustment);
|
2013-07-25 21:28:45 +04:00
|
|
|
arc_evict_ghost(arc_mru_ghost, 0, delta, ARC_BUFC_DATA);
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
adjustment =
|
|
|
|
arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size - arc_c;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
if (adjustment > 0 && arc_mfu_ghost->arcs_size > 0) {
|
|
|
|
delta = MIN(arc_mfu_ghost->arcs_size, adjustment);
|
2013-07-25 21:28:45 +04:00
|
|
|
arc_evict_ghost(arc_mfu_ghost, 0, delta, ARC_BUFC_DATA);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
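
The MRU and MFU eviction amounts computed by arc_adjust() can be illustrated with plain integers. The sketch below is an assumption-laden model rather than the function itself: the sketch_* names are hypothetical, the global ARC state is passed in as parameters, and the ghost-list steps are left out.

```c
#include <stdint.h>

static int64_t
sketch_min64(int64_t a, int64_t b)
{
	return (a < b ? a : b);
}

/*
 * Bytes to evict from the MRU list: the smaller of the overall overage
 * (size - c) and the amount by which anon + mru exceeds the MRU target
 * p, capped at the MRU size.  Returns 0 when nothing should be evicted.
 */
static int64_t
sketch_mru_evict_bytes(int64_t size, int64_t c, int64_t p, int64_t anon,
    int64_t mru)
{
	int64_t adjustment = sketch_min64(size - c, anon + mru - p);

	if (adjustment > 0 && mru > 0)
		return (sketch_min64(mru, adjustment));
	return (0);
}

/*
 * Bytes to evict from the MFU list.  "size" is the cache size after the
 * MRU step, so only the remaining overage is charged to the MFU.
 */
static int64_t
sketch_mfu_evict_bytes(int64_t size, int64_t c, int64_t mfu)
{
	int64_t adjustment = size - c;

	if (adjustment > 0 && mfu > 0)
		return (sketch_min64(mfu, adjustment));
	return (0);
}
```

For example, with size = 1100, c = 1000, p = 500, anon = 0 and mru = 530, the MRU gives up 30 bytes; if that eviction succeeds the size is 1070, and an MFU of 570 bytes gives up the remaining 70.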
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
/*
|
|
|
|
* Request that arc user drop references so that N bytes can be released
|
|
|
|
* from the cache. This provides a mechanism to ensure the arc can honor
|
|
|
|
* the arc_meta_limit and reclaim buffers which are pinned in the cache
|
|
|
|
* by higher layers (i.e. the zpl).
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_do_user_prune(int64_t adjustment)
|
|
|
|
{
|
|
|
|
arc_prune_func_t *func;
|
|
|
|
void *private;
|
|
|
|
arc_prune_t *cp, *np;
|
|
|
|
|
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
|
|
|
|
cp = list_head(&arc_prune_list);
|
|
|
|
while (cp != NULL) {
|
|
|
|
func = cp->p_pfunc;
|
|
|
|
private = cp->p_private;
|
|
|
|
np = list_next(&arc_prune_list, cp);
|
|
|
|
refcount_add(&cp->p_refcnt, func);
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
|
|
|
|
if (func != NULL)
|
|
|
|
func(adjustment, private);
|
|
|
|
|
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
|
|
|
|
/* User removed prune callback concurrently with execution */
|
|
|
|
if (refcount_remove(&cp->p_refcnt, func) == 0) {
|
|
|
|
ASSERT(!list_link_active(&cp->p_node));
|
|
|
|
refcount_destroy(&cp->p_refcnt);
|
|
|
|
kmem_free(cp, sizeof (*cp));
|
|
|
|
}
|
|
|
|
|
|
|
|
cp = np;
|
|
|
|
}
|
|
|
|
|
|
|
|
ARCSTAT_BUMP(arcstat_prune);
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
arc_do_user_evicts(void)
|
|
|
|
{
|
|
|
|
mutex_enter(&arc_eviction_mtx);
|
|
|
|
while (arc_eviction_list != NULL) {
|
|
|
|
arc_buf_t *buf = arc_eviction_list;
|
|
|
|
arc_eviction_list = buf->b_next;
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
buf->b_hdr = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&arc_eviction_mtx);
|
|
|
|
|
|
|
|
if (buf->b_efunc != NULL)
|
|
|
|
VERIFY(buf->b_efunc(buf) == 0);
|
|
|
|
|
|
|
|
buf->b_efunc = NULL;
|
|
|
|
buf->b_private = NULL;
|
|
|
|
kmem_cache_free(buf_cache, buf);
|
|
|
|
mutex_enter(&arc_eviction_mtx);
|
|
|
|
}
|
|
|
|
mutex_exit(&arc_eviction_mtx);
|
|
|
|
}
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
/*
|
|
|
|
* Evict only metadata objects from the cache, leaving the data objects.
|
|
|
|
* This is only used to enforce the tunable arc_meta_limit. If we are
|
|
|
|
* unable to evict enough buffers, notify the user via the prune callback.
|
|
|
|
*/
|
2014-01-03 23:40:52 +04:00
|
|
|
static void
|
|
|
|
arc_adjust_meta(void)
|
2011-12-23 00:20:43 +04:00
|
|
|
{
|
2014-01-03 23:40:52 +04:00
|
|
|
int64_t adjustmnt, delta;
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2014-01-03 23:40:52 +04:00
|
|
|
/*
|
|
|
|
* This differs slightly from the way we evict from the mru in
|
|
|
|
* arc_adjust because we don't have a "target" value (i.e. no
|
|
|
|
* "meta" arc_p). As a result, I think we can completely
|
|
|
|
* cannibalize the metadata in the MRU before we evict the
|
|
|
|
* metadata from the MFU. I think we probably need to implement a
|
|
|
|
* "metadata arc_p" value to do this properly.
|
|
|
|
*/
|
|
|
|
adjustmnt = arc_meta_used - arc_meta_limit;
|
|
|
|
|
|
|
|
if (adjustmnt > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) {
|
|
|
|
delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustmnt);
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_evict(arc_mru, 0, delta, FALSE, ARC_BUFC_METADATA);
|
2014-01-03 23:40:52 +04:00
|
|
|
adjustmnt -= delta;
|
2011-12-23 00:20:43 +04:00
|
|
|
}
|
|
|
|
|
2014-01-03 23:40:52 +04:00
|
|
|
/*
|
|
|
|
* We can't afford to recalculate adjustmnt here. If we do,
|
|
|
|
* new metadata buffers can sneak into the MRU or ANON lists,
|
|
|
|
* thus penalizing the MFU metadata. Although the fudge factor is
|
|
|
|
* small, it has been empirically shown to be significant for
|
|
|
|
* certain workloads (e.g. creating many empty directories). As
|
|
|
|
* such, we use the original calculation for adjustmnt, and
|
|
|
|
* simply decrement it by the amount of data evicted from the MRU.
|
|
|
|
*/
|
|
|
|
|
|
|
|
if (adjustmnt > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) {
|
|
|
|
delta = MIN(arc_mfu->arcs_lsize[ARC_BUFC_METADATA], adjustmnt);
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_evict(arc_mfu, 0, delta, FALSE, ARC_BUFC_METADATA);
|
|
|
|
}
|
|
|
|
|
2014-01-03 23:40:52 +04:00
|
|
|
adjustmnt = arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
|
|
|
|
arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] - arc_meta_limit;
|
|
|
|
|
|
|
|
if (adjustmnt > 0 && arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] > 0) {
|
|
|
|
delta = MIN(adjustmnt,
|
|
|
|
arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA]);
|
|
|
|
arc_evict_ghost(arc_mru_ghost, 0, delta, ARC_BUFC_METADATA);
|
|
|
|
}
|
|
|
|
|
|
|
|
adjustmnt = arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] +
|
|
|
|
arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA] - arc_meta_limit;
|
|
|
|
|
|
|
|
if (adjustmnt > 0 && arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA] > 0) {
|
|
|
|
delta = MIN(adjustmnt,
|
|
|
|
arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA]);
|
|
|
|
arc_evict_ghost(arc_mfu_ghost, 0, delta, ARC_BUFC_METADATA);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (arc_meta_used > arc_meta_limit)
|
2013-07-24 21:14:11 +04:00
|
|
|
arc_do_user_prune(zfs_arc_meta_prune);
|
2011-12-23 00:20:43 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Flush all *evictable* data from the cache for the given spa.
|
|
|
|
* NOTE: this will not touch "active" (i.e. referenced) data.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_flush(spa_t *spa)
|
|
|
|
{
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t guid = 0;
|
|
|
|
|
|
|
|
if (spa)
|
2011-11-12 02:07:54 +04:00
|
|
|
guid = spa_load_guid(spa);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
while (list_head(&arc_mru->arcs_list[ARC_BUFC_DATA])) {
|
2009-02-18 23:51:31 +03:00
|
|
|
(void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_DATA);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
while (list_head(&arc_mru->arcs_list[ARC_BUFC_METADATA])) {
|
2009-02-18 23:51:31 +03:00
|
|
|
(void) arc_evict(arc_mru, guid, -1, FALSE, ARC_BUFC_METADATA);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
while (list_head(&arc_mfu->arcs_list[ARC_BUFC_DATA])) {
|
2009-02-18 23:51:31 +03:00
|
|
|
(void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_DATA);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
while (list_head(&arc_mfu->arcs_list[ARC_BUFC_METADATA])) {
|
2009-02-18 23:51:31 +03:00
|
|
|
(void) arc_evict(arc_mfu, guid, -1, FALSE, ARC_BUFC_METADATA);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (spa)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2013-07-25 21:28:45 +04:00
|
|
|
arc_evict_ghost(arc_mru_ghost, guid, -1, ARC_BUFC_DATA);
|
|
|
|
arc_evict_ghost(arc_mfu_ghost, guid, -1, ARC_BUFC_DATA);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_enter(&arc_reclaim_thr_lock);
|
|
|
|
arc_do_user_evicts();
|
|
|
|
mutex_exit(&arc_reclaim_thr_lock);
|
|
|
|
ASSERT(spa || arc_eviction_list == NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2012-03-14 01:29:16 +04:00
|
|
|
arc_shrink(uint64_t bytes)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
if (arc_c > arc_c_min) {
|
|
|
|
uint64_t to_free;
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
to_free = bytes ? bytes : arc_c >> zfs_arc_shrink_shift;
|
2012-03-14 01:29:16 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (arc_c > arc_c_min + to_free)
|
|
|
|
atomic_add_64(&arc_c, -to_free);
|
|
|
|
else
|
|
|
|
arc_c = arc_c_min;
|
|
|
|
|
2013-12-12 00:25:30 +04:00
|
|
|
to_free = bytes ? bytes : arc_p >> zfs_arc_shrink_shift;
|
|
|
|
|
2014-01-03 22:20:21 +04:00
|
|
|
if (arc_p > to_free)
|
2013-12-12 00:25:30 +04:00
|
|
|
atomic_add_64(&arc_p, -to_free);
|
|
|
|
else
|
2014-01-03 22:20:21 +04:00
|
|
|
arc_p = 0;
|
2013-12-12 00:25:30 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
if (arc_c > arc_size)
|
|
|
|
arc_c = MAX(arc_size, arc_c_min);
|
|
|
|
if (arc_p > arc_c)
|
|
|
|
arc_p = (arc_c >> 1);
|
|
|
|
ASSERT(arc_c >= arc_c_min);
|
|
|
|
ASSERT((int64_t)arc_p >= 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (arc_size > arc_c)
|
|
|
|
arc_adjust();
|
|
|
|
}
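
The target-size arithmetic in arc_shrink() can be traced with a small model. This is a sketch under assumptions: sketch_targets_t and sketch_shrink are hypothetical names, the globals arc_c, arc_p, arc_c_min and arc_size become parameters, the atomic updates are replaced by plain assignments, and the trailing arc_adjust() call is omitted.

```c
#include <stdint.h>

/* Hypothetical pair holding the cache target (c) and MRU target (p). */
typedef struct {
	uint64_t c;
	uint64_t p;
} sketch_targets_t;

/*
 * Return the new targets after a shrink request of "bytes".  A request
 * of 0 means "shrink each target by 1/2^shift of its current value".
 */
static sketch_targets_t
sketch_shrink(uint64_t c, uint64_t p, uint64_t c_min, uint64_t size,
    uint64_t bytes, int shift)
{
	sketch_targets_t t = { c, p };
	uint64_t to_free;

	if (t.c <= c_min)
		return (t);	/* already at the floor; nothing to do */

	to_free = bytes ? bytes : t.c >> shift;
	t.c = (t.c > c_min + to_free) ? t.c - to_free : c_min;

	to_free = bytes ? bytes : t.p >> shift;
	t.p = (t.p > to_free) ? t.p - to_free : 0;

	/* Never leave the target above the current size (floor c_min). */
	if (t.c > size)
		t.c = (size > c_min) ? size : c_min;
	/* Keep the MRU target within the cache target. */
	if (t.p > t.c)
		t.p = t.c >> 1;

	return (t);
}
```

With c = 1024, p = 512, c_min = 256, size = 2000, bytes = 0 and shift = 5, the targets drop by 1/32 each, to c = 992 and p = 496.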
|
|
|
|
|
|
|
|
static void
|
2012-03-14 01:29:16 +04:00
|
|
|
arc_kmem_reap_now(arc_reclaim_strategy_t strat, uint64_t bytes)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
size_t i;
|
|
|
|
kmem_cache_t *prev_cache = NULL;
|
|
|
|
kmem_cache_t *prev_data_cache = NULL;
|
|
|
|
extern kmem_cache_t *zio_buf_cache[];
|
|
|
|
extern kmem_cache_t *zio_data_buf_cache[];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* An aggressive reclamation will shrink the cache size as well as
|
|
|
|
* reap free buffers from the arc kmem caches.
|
|
|
|
*/
|
|
|
|
if (strat == ARC_RECLAIM_AGGR)
|
2012-03-14 01:29:16 +04:00
|
|
|
arc_shrink(bytes);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
|
|
|
|
if (zio_buf_cache[i] != prev_cache) {
|
|
|
|
prev_cache = zio_buf_cache[i];
|
|
|
|
kmem_cache_reap_now(zio_buf_cache[i]);
|
|
|
|
}
|
|
|
|
if (zio_data_buf_cache[i] != prev_data_cache) {
|
|
|
|
prev_data_cache = zio_data_buf_cache[i];
|
|
|
|
kmem_cache_reap_now(zio_data_buf_cache[i]);
|
|
|
|
}
|
|
|
|
}
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_cache_reap_now(buf_cache);
|
|
|
|
kmem_cache_reap_now(hdr_cache);
|
|
|
|
}
|
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/*
|
|
|
|
* Unlike other ZFS implementations this thread is only responsible for
|
|
|
|
* adapting the target ARC size on Linux. The responsibility for memory
|
|
|
|
* reclamation has been entirely delegated to the arc_shrinker_func()
|
|
|
|
* which is registered with the VM. To reflect this change in behavior
|
|
|
|
* the arc_reclaim thread has been renamed to arc_adapt.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2012-03-14 01:29:16 +04:00
|
|
|
arc_adapt_thread(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
callb_cpr_t cpr;
|
|
|
|
|
|
|
|
CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG);
|
|
|
|
|
|
|
|
mutex_enter(&arc_reclaim_thr_lock);
|
|
|
|
while (arc_thread_exit == 0) {
|
2012-03-14 01:29:16 +04:00
|
|
|
#ifndef _KERNEL
|
|
|
|
arc_reclaim_strategy_t last_reclaim = ARC_RECLAIM_CONS;
|
|
|
|
|
|
|
|
if (spa_get_random(100) == 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (arc_no_grow) {
|
|
|
|
if (last_reclaim == ARC_RECLAIM_CONS) {
|
|
|
|
last_reclaim = ARC_RECLAIM_AGGR;
|
|
|
|
} else {
|
|
|
|
last_reclaim = ARC_RECLAIM_CONS;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
arc_no_grow = TRUE;
|
|
|
|
last_reclaim = ARC_RECLAIM_AGGR;
|
|
|
|
membar_producer();
|
|
|
|
}
|
|
|
|
|
|
|
|
/* reset the growth delay for every reclaim */
|
2013-11-01 23:26:11 +04:00
|
|
|
arc_grow_time = ddi_get_lbolt() +
|
|
|
|
(zfs_arc_grow_retry * hz);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
arc_kmem_reap_now(last_reclaim, 0);
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_warm = B_TRUE;
|
2012-03-14 01:29:16 +04:00
|
|
|
}
|
|
|
|
#endif /* !_KERNEL */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/* No recent memory pressure; allow the ARC to grow. */
|
2014-02-25 13:32:21 +04:00
|
|
|
if (arc_no_grow &&
|
|
|
|
ddi_time_after_eq(ddi_get_lbolt(), arc_grow_time))
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_no_grow = FALSE;
|
|
|
|
|
2014-01-03 23:40:52 +04:00
|
|
|
arc_adjust_meta();
|
2011-03-31 05:59:17 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
arc_adjust();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (arc_eviction_list != NULL)
|
|
|
|
arc_do_user_evicts();
|
|
|
|
|
|
|
|
/* block until needed, or one second, whichever is shorter */
|
|
|
|
CALLB_CPR_SAFE_BEGIN(&cpr);
|
2010-12-10 23:00:00 +03:00
|
|
|
(void) cv_timedwait_interruptible(&arc_reclaim_thr_cv,
|
2010-05-29 00:45:14 +04:00
|
|
|
&arc_reclaim_thr_lock, (ddi_get_lbolt() + hz));
|
2008-11-20 23:01:55 +03:00
|
|
|
CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_thr_lock);
|
2013-07-24 21:14:11 +04:00
|
|
|
|
|
|
|
|
|
|
|
/* Allow the module options to be changed */
|
|
|
|
if (zfs_arc_max > 64 << 20 &&
|
|
|
|
zfs_arc_max < physmem * PAGESIZE &&
|
|
|
|
zfs_arc_max != arc_c_max)
|
|
|
|
arc_c_max = zfs_arc_max;
|
|
|
|
|
|
|
|
if (zfs_arc_min > 0 &&
|
|
|
|
zfs_arc_min < arc_c_max &&
|
|
|
|
zfs_arc_min != arc_c_min)
|
|
|
|
arc_c_min = zfs_arc_min;
|
|
|
|
|
|
|
|
if (zfs_arc_meta_limit > 0 &&
|
|
|
|
zfs_arc_meta_limit <= arc_c_max &&
|
|
|
|
zfs_arc_meta_limit != arc_meta_limit)
|
|
|
|
arc_meta_limit = zfs_arc_meta_limit;
|
|
|
|
|
|
|
|
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
arc_thread_exit = 0;
|
|
|
|
cv_broadcast(&arc_reclaim_thr_cv);
|
|
|
|
CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_thr_lock */
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
2011-03-30 05:08:59 +04:00
|
|
|
#ifdef _KERNEL
|
|
|
|
/*
|
2012-03-14 01:29:16 +04:00
|
|
|
* Determine the amount of memory eligible for eviction contained in the
|
|
|
|
* ARC. All clean data reported by the ghost lists can always be safely
|
|
|
|
* evicted. Due to arc_c_min, the same does not hold for all clean data
|
|
|
|
* contained by the regular mru and mfu lists.
|
|
|
|
*
|
|
|
|
* In the case of the regular mru and mfu lists, we need to report as
|
|
|
|
* much clean data as possible, such that evicting that same reported
|
|
|
|
* data will not bring arc_size below arc_c_min. Thus, in certain
|
|
|
|
* circumstances, the total amount of clean data in the mru and mfu
|
|
|
|
* lists might not actually be evictable.
|
|
|
|
*
|
|
|
|
* The following two distinct cases are accounted for:
|
|
|
|
*
|
|
|
|
* 1. The sum of the amount of dirty data contained by both the mru and
|
|
|
|
* mfu lists, plus the ARC's other accounting (e.g. the anon list),
|
|
|
|
* is greater than or equal to arc_c_min.
|
|
|
|
* (i.e. amount of dirty data >= arc_c_min)
|
|
|
|
*
|
|
|
|
* This is the easy case; all clean data contained by the mru and mfu
|
|
|
|
* lists is evictable. Evicting all clean data can only drop arc_size
|
|
|
|
* to the amount of dirty data, which is at least arc_c_min.
|
|
|
|
*
|
|
|
|
* 2. The sum of the amount of dirty data contained by both the mru and
|
|
|
|
* mfu lists, plus the ARC's other accounting (e.g. the anon list),
|
|
|
|
* is less than arc_c_min.
|
|
|
|
* (i.e. arc_c_min > amount of dirty data)
|
|
|
|
*
|
|
|
|
* 2.1. arc_size is greater than or equal to arc_c_min.
|
|
|
|
* (i.e. arc_size >= arc_c_min > amount of dirty data)
|
|
|
|
*
|
|
|
|
* In this case, not all clean data from the regular mru and mfu
|
|
|
|
* lists is actually evictable; we must leave enough clean data
|
|
|
|
* to keep arc_size above arc_c_min. Thus, the maximum amount of
|
|
|
|
* evictable data from the two lists combined, is exactly the
|
|
|
|
* difference between arc_size and arc_c_min.
|
|
|
|
*
|
|
|
|
* 2.2. arc_size is less than arc_c_min
|
|
|
|
* (i.e. arc_c_min > arc_size > amount of dirty data)
|
|
|
|
*
|
|
|
|
* In this case, none of the data contained in the mru and mfu
|
|
|
|
* lists is evictable, even if it's clean. Since arc_size is
|
|
|
|
* already below arc_c_min, evicting any more would only
|
|
|
|
* increase this negative difference.
|
2011-03-30 05:08:59 +04:00
|
|
|
*/
|
2012-03-14 01:29:16 +04:00
|
|
|
static uint64_t
|
|
|
|
arc_evictable_memory(void) {
|
|
|
|
uint64_t arc_clean =
|
|
|
|
arc_mru->arcs_lsize[ARC_BUFC_DATA] +
|
|
|
|
arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
|
|
|
|
arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
|
|
|
|
arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
|
|
|
|
uint64_t ghost_clean =
|
|
|
|
arc_mru_ghost->arcs_lsize[ARC_BUFC_DATA] +
|
|
|
|
arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] +
|
|
|
|
arc_mfu_ghost->arcs_lsize[ARC_BUFC_DATA] +
|
|
|
|
arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA];
|
|
|
|
uint64_t arc_dirty = MAX((int64_t)arc_size - (int64_t)arc_clean, 0);
|
|
|
|
|
|
|
|
if (arc_dirty >= arc_c_min)
|
|
|
|
return (ghost_clean + arc_clean);
|
|
|
|
|
|
|
|
return (ghost_clean + MAX((int64_t)arc_size - (int64_t)arc_c_min, 0));
|
|
|
|
}
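
The three cases described in the block comment above arc_evictable_memory() can be checked with concrete numbers. The function below is a hypothetical restatement (sketch_evictable is not a name from arc.c) that takes the list totals as parameters instead of reading the global ARC state.

```c
#include <stdint.h>

/*
 * size        - current cache size (stands in for arc_size)
 * clean       - clean bytes on the regular mru and mfu lists
 * ghost_clean - bytes reported by the ghost lists
 * c_min       - minimum cache size (stands in for arc_c_min)
 */
static uint64_t
sketch_evictable(uint64_t size, uint64_t clean, uint64_t ghost_clean,
    uint64_t c_min)
{
	uint64_t dirty = (size > clean) ? size - clean : 0;

	/* Case 1: dirty data alone keeps the cache at or above c_min. */
	if (dirty >= c_min)
		return (ghost_clean + clean);

	/* Cases 2.1 and 2.2: only the excess over c_min is evictable. */
	return (ghost_clean + ((size > c_min) ? size - c_min : 0));
}
```

Taking c_min = 600 and ghost_clean = 50: with size = 1000 and clean = 300 (case 1) the result is 350; with size = 1000 and clean = 700 (case 2.1) it is 450; with size = 500 and clean = 400 (case 2.2) only the 50 ghost bytes are reported.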
|
|
|
|
|
2011-06-22 01:26:51 +04:00
|
|
|
static int
|
|
|
|
__arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc)
|
2011-03-30 05:08:59 +04:00
|
|
|
{
|
2012-03-14 01:29:16 +04:00
|
|
|
uint64_t pages;
|
2011-03-30 05:08:59 +04:00
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/* The arc is considered warm once reclaim has occurred */
|
|
|
|
if (unlikely(arc_warm == B_FALSE))
|
|
|
|
arc_warm = B_TRUE;
|
2011-03-30 05:08:59 +04:00
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/* Return the potential number of reclaimable pages */
|
|
|
|
pages = btop(arc_evictable_memory());
|
|
|
|
if (sc->nr_to_scan == 0)
|
|
|
|
return (pages);
|
2011-05-09 23:18:46 +04:00
|
|
|
|
|
|
|
/* Not allowed to perform filesystem reclaim */
|
2011-06-22 01:26:51 +04:00
|
|
|
if (!(sc->gfp_mask & __GFP_FS))
|
2011-05-09 23:18:46 +04:00
|
|
|
return (-1);
|
|
|
|
|
2011-03-30 05:08:59 +04:00
|
|
|
/* Reclaim in progress */
|
|
|
|
if (mutex_tryenter(&arc_reclaim_thr_lock) == 0)
|
|
|
|
return (-1);
|
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
/*
|
|
|
|
* Evict the requested number of pages by shrinking arc_c the
|
|
|
|
* requested amount. If there is nothing left to evict just
|
|
|
|
* reap whatever we can from the various arc slabs.
|
|
|
|
*/
|
|
|
|
if (pages > 0) {
|
|
|
|
arc_kmem_reap_now(ARC_RECLAIM_AGGR, ptob(sc->nr_to_scan));
|
2013-12-23 23:34:20 +04:00
|
|
|
pages = btop(arc_evictable_memory());
|
2012-03-14 01:29:16 +04:00
|
|
|
} else {
|
|
|
|
arc_kmem_reap_now(ARC_RECLAIM_CONS, ptob(sc->nr_to_scan));
|
2013-12-23 23:34:20 +04:00
|
|
|
pages = -1;
|
2012-03-14 01:29:16 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When direct reclaim is observed it usually indicates a rapid
|
|
|
|
* increase in memory pressure. This occurs because the kswapd
|
|
|
|
* threads were unable to asynchronously keep enough free memory
|
|
|
|
* available. In this case set arc_no_grow to briefly pause arc
|
|
|
|
* growth to avoid compounding the memory pressure.
|
|
|
|
*/
|
2011-03-30 05:08:59 +04:00
|
|
|
if (current_is_kswapd()) {
|
2012-03-14 01:29:16 +04:00
|
|
|
ARCSTAT_BUMP(arcstat_memory_indirect_count);
|
2011-03-30 05:08:59 +04:00
|
|
|
} else {
|
2012-03-14 01:29:16 +04:00
|
|
|
arc_no_grow = B_TRUE;
|
2013-07-24 21:14:11 +04:00
|
|
|
arc_grow_time = ddi_get_lbolt() + (zfs_arc_grow_retry * hz);
|
2012-03-14 01:29:16 +04:00
|
|
|
ARCSTAT_BUMP(arcstat_memory_direct_count);
|
2011-03-30 05:08:59 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&arc_reclaim_thr_lock);
|
|
|
|
|
2013-12-23 23:34:20 +04:00
|
|
|
return (pages);
|
2011-03-30 05:08:59 +04:00
|
|
|
}
|
2011-06-22 01:26:51 +04:00
|
|
|
SPL_SHRINKER_CALLBACK_WRAPPER(arc_shrinker_func);
|
2011-03-30 05:08:59 +04:00
|
|
|
|
|
|
|
SPL_SHRINKER_DECLARE(arc_shrinker, arc_shrinker_func, DEFAULT_SEEKS);
|
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Adapt arc info given the number of bytes we are trying to add and
|
|
|
|
* the state that we are coming from. This function is only called
|
|
|
|
* when we are adding new content to the cache.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
arc_adapt(int bytes, arc_state_t *state)
|
|
|
|
{
|
|
|
|
int mult;
|
|
|
|
|
|
|
|
if (state == arc_l2c_only)
|
|
|
|
return;
|
|
|
|
|
|
|
|
ASSERT(bytes > 0);
|
|
|
|
/*
|
|
|
|
* Adapt the target size of the MRU list:
|
|
|
|
* - if we just hit in the MRU ghost list, then increase
|
|
|
|
* the target size of the MRU list.
|
|
|
|
* - if we just hit in the MFU ghost list, then increase
|
|
|
|
* the target size of the MFU list by decreasing the
|
|
|
|
* target size of the MRU list.
|
|
|
|
*/
|
|
|
|
if (state == arc_mru_ghost) {
|
|
|
|
mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
|
|
|
|
1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size));
|
2014-01-03 22:36:26 +04:00
|
|
|
|
|
|
|
if (!zfs_arc_p_dampener_disable)
|
|
|
|
mult = MIN(mult, 10); /* avoid wild arc_p adjustment */
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-01-03 22:20:21 +04:00
|
|
|
arc_p = MIN(arc_c, arc_p + bytes * mult);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else if (state == arc_mfu_ghost) {
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t delta;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ?
|
|
|
|
1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size));
|
2014-01-03 22:36:26 +04:00
|
|
|
|
|
|
|
if (!zfs_arc_p_dampener_disable)
|
|
|
|
mult = MIN(mult, 10);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
delta = MIN(bytes * mult, arc_p);
|
2014-01-03 22:20:21 +04:00
|
|
|
arc_p = MAX(0, arc_p - delta);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
ASSERT((int64_t)arc_p >= 0);
|
|
|
|
|
|
|
|
if (arc_no_grow)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (arc_c >= arc_c_max)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're within (2 * maxblocksize) bytes of the target
|
|
|
|
* cache size, increment the target cache size
|
|
|
|
*/
|
|
|
|
if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
|
|
|
|
atomic_add_64(&arc_c, (int64_t)bytes);
|
|
|
|
if (arc_c > arc_c_max)
|
|
|
|
arc_c = arc_c_max;
|
|
|
|
else if (state == arc_anon)
|
|
|
|
atomic_add_64(&arc_p, (int64_t)bytes);
|
|
|
|
if (arc_p > arc_c)
|
|
|
|
arc_p = arc_c;
|
|
|
|
}
|
|
|
|
ASSERT((int64_t)arc_p >= 0);
|
|
|
|
}
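
/*
 * Illustrative sketch only, not part of the ARC. It isolates the
 * ghost-list arithmetic of arc_adapt() above so the effect of the
 * multiplier and the dampener can be tried in user space. The
 * sketch_arc_t type and the sketch_* helpers are hypothetical names;
 * their fields stand in for arc_c, arc_p, the two ghost list sizes
 * and zfs_arc_p_dampener_disable. Growth of arc_c is left out.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the ARC sizing state; not the real globals. */
typedef struct sketch_arc {
	uint64_t c;		/* target cache size (arc_c) */
	uint64_t p;		/* target MRU size (arc_p) */
	uint64_t mru_ghost;	/* bytes tracked by the MRU ghost list */
	uint64_t mfu_ghost;	/* bytes tracked by the MFU ghost list */
	int dampener_disable;	/* mirrors zfs_arc_p_dampener_disable */
} sketch_arc_t;

static uint64_t
sketch_min(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

/* Ratio of the other ghost list to the one that was hit, at least 1. */
static uint64_t
sketch_mult(const sketch_arc_t *arc, uint64_t hit, uint64_t other)
{
	uint64_t mult = (hit >= other) ? 1 : other / hit;

	if (!arc->dampener_disable)
		mult = sketch_min(mult, 10);
	return (mult);
}

/* Hit in the MRU ghost list: grow p, capped at c. */
static void
sketch_mru_ghost_hit(sketch_arc_t *arc, uint64_t bytes)
{
	uint64_t mult;

	assert(bytes > 0 && arc->mru_ghost > 0);
	mult = sketch_mult(arc, arc->mru_ghost, arc->mfu_ghost);
	arc->p = sketch_min(arc->c, arc->p + bytes * mult);
}

/* Hit in the MFU ghost list: shrink p, floored at zero. */
static void
sketch_mfu_ghost_hit(sketch_arc_t *arc, uint64_t bytes)
{
	uint64_t mult;

	assert(bytes > 0 && arc->mfu_ghost > 0);
	mult = sketch_mult(arc, arc->mfu_ghost, arc->mru_ghost);
	arc->p -= sketch_min(bytes * mult, arc->p);
}
```

/*
 * With a 10-byte MRU ghost list and a 250-byte MFU ghost list, the raw
 * multiplier for an MRU ghost hit is 25; the dampener caps it at 10.
 */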

/*
 * Check if the cache has reached its limits and eviction is required
 * prior to insert.
 */
static int
arc_evict_needed(arc_buf_contents_t type)
{
	if (type == ARC_BUFC_METADATA && arc_meta_used >= arc_meta_limit)
		return (1);

	if (arc_no_grow)
		return (1);

	return (arc_size > arc_c);
}

/*
 * The buffer, supplied as the first argument, needs a data block.
 * So, if we are at cache max, determine which cache should be victimized.
 * We have the following cases:
 *
 * 1. Insert for MRU, p > sizeof(arc_anon + arc_mru) ->
 * In this situation if we're out of space, but the resident size of the MFU is
 * under the limit, victimize the MFU cache to satisfy this insertion request.
 *
 * 2. Insert for MRU, p <= sizeof(arc_anon + arc_mru) ->
 * Here, we've used up all of the available space for the MRU, so we need to
 * evict from our own cache instead. Evict from the set of resident MRU
 * entries.
 *
 * 3. Insert for MFU (c - p) > sizeof(arc_mfu) ->
 * c minus p represents the MFU space in the cache, since p is the size of the
 * cache that is dedicated to the MRU. In this situation there's still space on
 * the MFU side, so the MRU side needs to be victimized.
 *
 * 4. Insert for MFU (c - p) < sizeof(arc_mfu) ->
 * MFU's resident set is consuming more space than it has been allotted. In
 * this situation, we must victimize our own cache, the MFU, for this insertion.
 */
static void
arc_get_data_buf(arc_buf_t *buf)
{
	arc_state_t *state = buf->b_hdr->b_state;
	uint64_t size = buf->b_hdr->b_size;
	arc_buf_contents_t type = buf->b_hdr->b_type;
Prioritize "metadata" in arc_get_data_buf
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-30 21:30:00 +04:00
	arc_buf_contents_t evict = ARC_BUFC_DATA;
	boolean_t recycle = TRUE;

	arc_adapt(size, state);

	/*
	 * We have not yet reached cache maximum size,
	 * just allocate a new buffer.
	 */
	if (!arc_evict_needed(type)) {
		if (type == ARC_BUFC_METADATA) {
			buf->b_data = zio_buf_alloc(size);
			arc_space_consume(size, ARC_SPACE_META);
		} else {
			ASSERT(type == ARC_BUFC_DATA);
			buf->b_data = zio_data_buf_alloc(size);
			arc_space_consume(size, ARC_SPACE_DATA);
		}
		goto out;
	}

	/*
	 * If we are prefetching from the mfu ghost list, this buffer
	 * will end up on the mru list; so steal space from there.
	 */
	if (state == arc_mfu_ghost)
		state = buf->b_hdr->b_flags & ARC_PREFETCH ? arc_mru : arc_mfu;
	else if (state == arc_mru_ghost)
		state = arc_mru;

	if (state == arc_mru || state == arc_anon) {
		uint64_t mru_used = arc_anon->arcs_size + arc_mru->arcs_size;
		state = (arc_mfu->arcs_lsize[type] >= size &&
		    arc_p > mru_used) ? arc_mfu : arc_mru;
	} else {
		/* MFU cases */
		uint64_t mfu_space = arc_c - arc_p;
		state = (arc_mru->arcs_lsize[type] >= size &&
		    mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu;
	}

	/*
	 * Evict data buffers prior to metadata buffers, unless we're
	 * over the metadata limit and adding a metadata buffer.
	 */
	if (type == ARC_BUFC_METADATA) {
		if (arc_meta_used >= arc_meta_limit)
			evict = ARC_BUFC_METADATA;
		else
			/*
			 * In this case, we're evicting data while
			 * adding metadata. Thus, to prevent recycling a
			 * data buffer into a metadata buffer, recycling
			 * is disabled in the following arc_evict call.
			 */
			recycle = FALSE;
	}

	if ((buf->b_data = arc_evict(state, 0, size, recycle, evict)) == NULL) {
		if (type == ARC_BUFC_METADATA) {
			buf->b_data = zio_buf_alloc(size);
			arc_space_consume(size, ARC_SPACE_META);

			/*
			 * If we are unable to recycle an existing meta buffer
			 * signal the reclaim thread. It will notify users
			 * via the prune callback to drop references. The
			 * prune callback is run in the context of the reclaim
			 * thread to avoid deadlocking on the hash_lock.
			 * Of course, only do this when recycle is true.
			 */
			if (recycle)
				cv_signal(&arc_reclaim_thr_cv);
		} else {
			ASSERT(type == ARC_BUFC_DATA);
			buf->b_data = zio_data_buf_alloc(size);
			arc_space_consume(size, ARC_SPACE_DATA);
		}

		/* Only bump this if we tried to recycle and failed */
		if (recycle)
			ARCSTAT_BUMP(arcstat_recycle_miss);
	}
	ASSERT(buf->b_data != NULL);
out:
	/*
	 * Update the state size. Note that ghost states have a
	 * "ghost size" and so don't need to be updated.
	 */
	if (!GHOST_STATE(buf->b_hdr->b_state)) {
		arc_buf_hdr_t *hdr = buf->b_hdr;

		atomic_add_64(&hdr->b_state->arcs_size, size);
		if (list_link_active(&hdr->b_arc_node)) {
			ASSERT(refcount_is_zero(&hdr->b_refcnt));
			atomic_add_64(&hdr->b_state->arcs_lsize[type], size);
		}
		/*
		 * If we are growing the cache, and we are adding anonymous
		 * data, and we have outgrown arc_p, update arc_p
		 */
Disable aggressive arc_p growth by default
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constantly drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-11 21:40:13 +04:00
		if (!zfs_arc_p_aggressive_disable &&
		    arc_size < arc_c && hdr->b_state == arc_anon &&
		    arc_anon->arcs_size + arc_mru->arcs_size > arc_p)
			arc_p = MIN(arc_c, arc_p + size);
	}
}
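
/*
 * Illustrative sketch only, not part of the ARC. It restates two
 * decisions made by arc_get_data_buf() above as pure functions: which
 * list gives up space (cases 1-4 in the block comment), and which
 * buffer type is evicted and whether the victim may be recycled. All
 * sk_* names are hypothetical. As a simplification, mru_evictable and
 * mfu_evictable stand in for arcs_lsize[type] of the buffer being
 * inserted, and p <= c is assumed.
 */

```c
#include <stdint.h>

typedef enum { SK_LIST_MRU, SK_LIST_MFU } sk_list_t;
typedef enum { SK_BUFC_DATA, SK_BUFC_METADATA } sk_bufc_t;

/* Hypothetical snapshot of the sizes the real function reads. */
typedef struct sk_cache {
	uint64_t c;		/* arc_c */
	uint64_t p;		/* arc_p */
	uint64_t anon_size;	/* arc_anon->arcs_size */
	uint64_t mru_size;	/* arc_mru->arcs_size */
	uint64_t mfu_size;	/* arc_mfu->arcs_size */
	uint64_t mru_evictable;	/* evictable bytes on the MRU list */
	uint64_t mfu_evictable;	/* evictable bytes on the MFU list */
	uint64_t meta_used;	/* arc_meta_used */
	uint64_t meta_limit;	/* arc_meta_limit */
} sk_cache_t;

/* Cases 1-4: the list to victimize for an insert of "size" bytes. */
static sk_list_t
sk_pick_victim(const sk_cache_t *a, sk_list_t dest, uint64_t size)
{
	if (dest == SK_LIST_MRU) {
		uint64_t mru_used = a->anon_size + a->mru_size;

		return ((a->mfu_evictable >= size && a->p > mru_used) ?
		    SK_LIST_MFU : SK_LIST_MRU);
	} else {
		uint64_t mfu_space = a->c - a->p;

		return ((a->mru_evictable >= size &&
		    mfu_space > a->mfu_size) ? SK_LIST_MRU : SK_LIST_MFU);
	}
}

/* Buffer type to evict; *recycle says if the victim may be reused. */
static sk_bufc_t
sk_pick_evict_type(const sk_cache_t *a, sk_bufc_t type, int *recycle)
{
	*recycle = 1;
	if (type == SK_BUFC_METADATA) {
		if (a->meta_used >= a->meta_limit)
			return (SK_BUFC_METADATA);
		/* Evicting data for metadata: never recycle across types. */
		*recycle = 0;
	}
	return (SK_BUFC_DATA);
}
```

/*
 * In the scenario from the commit message (10G cache, 4G metadata
 * limit, 1G of metadata in use), a metadata insert evicts data with
 * recycling disabled; once metadata use reaches 4G it evicts metadata.
 */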

/*
 * This routine is called whenever a buffer is accessed.
 * NOTE: the hash lock is dropped in this function.
 */
static void
arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock)
{
	clock_t now;

	ASSERT(MUTEX_HELD(hash_lock));

	if (buf->b_state == arc_anon) {
		/*
		 * This buffer is not in the cache, and does not
		 * appear in our "ghost" list. Add the new buffer
		 * to the MRU state.
		 */

		ASSERT(buf->b_arc_access == 0);
		buf->b_arc_access = ddi_get_lbolt();
		DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
		arc_change_state(arc_mru, buf, hash_lock);

	} else if (buf->b_state == arc_mru) {
		now = ddi_get_lbolt();

		/*
		 * If this buffer is here because of a prefetch, then either:
		 * - clear the flag if this is a "referencing" read
		 *   (any subsequent access will bump this into the MFU state).
		 * or
		 * - move the buffer to the head of the list if this is
		 *   another prefetch (to make it less likely to be evicted).
		 */
		if ((buf->b_flags & ARC_PREFETCH) != 0) {
			if (refcount_count(&buf->b_refcnt) == 0) {
				ASSERT(list_link_active(&buf->b_arc_node));
			} else {
				buf->b_flags &= ~ARC_PREFETCH;
				atomic_inc_32(&buf->b_mru_hits);
				ARCSTAT_BUMP(arcstat_mru_hits);
			}
			buf->b_arc_access = now;
			return;
		}

		/*
		 * This buffer has been "accessed" only once so far,
		 * but it is still in the cache. Move it to the MFU
		 * state.
		 */
		if (ddi_time_after(now, buf->b_arc_access + ARC_MINTIME)) {
			/*
			 * More than 125ms have passed since we
			 * instantiated this buffer. Move it to the
			 * most frequently used state.
			 */
			buf->b_arc_access = now;
			DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
			arc_change_state(arc_mfu, buf, hash_lock);
		}
		atomic_inc_32(&buf->b_mru_hits);
		ARCSTAT_BUMP(arcstat_mru_hits);
	} else if (buf->b_state == arc_mru_ghost) {
		arc_state_t *new_state;
		/*
		 * This buffer has been "accessed" recently, but
		 * was evicted from the cache. Move it to the
		 * MFU state.
		 */

		if (buf->b_flags & ARC_PREFETCH) {
			new_state = arc_mru;
			if (refcount_count(&buf->b_refcnt) > 0)
				buf->b_flags &= ~ARC_PREFETCH;
			DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
		} else {
			new_state = arc_mfu;
			DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
		}

		buf->b_arc_access = ddi_get_lbolt();
		arc_change_state(new_state, buf, hash_lock);

		atomic_inc_32(&buf->b_mru_ghost_hits);
		ARCSTAT_BUMP(arcstat_mru_ghost_hits);
	} else if (buf->b_state == arc_mfu) {
		/*
		 * This buffer has been accessed more than once and is
		 * still in the cache. Keep it in the MFU state.
		 *
		 * NOTE: an add_reference() that occurred when we did
		 * the arc_read() will have kicked this off the list.
		 * If it was a prefetch, we will explicitly move it to
		 * the head of the list now.
		 */
		if ((buf->b_flags & ARC_PREFETCH) != 0) {
			ASSERT(refcount_count(&buf->b_refcnt) == 0);
			ASSERT(list_link_active(&buf->b_arc_node));
		}
		atomic_inc_32(&buf->b_mfu_hits);
		ARCSTAT_BUMP(arcstat_mfu_hits);
		buf->b_arc_access = ddi_get_lbolt();
	} else if (buf->b_state == arc_mfu_ghost) {
		arc_state_t *new_state = arc_mfu;
		/*
		 * This buffer has been accessed more than once but has
		 * been evicted from the cache. Move it back to the
		 * MFU state.
		 */

		if (buf->b_flags & ARC_PREFETCH) {
			/*
			 * This is a prefetch access...
			 * move this block back to the MRU state.
			 */
			ASSERT0(refcount_count(&buf->b_refcnt));
			new_state = arc_mru;
		}

		buf->b_arc_access = ddi_get_lbolt();
		DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
		arc_change_state(new_state, buf, hash_lock);

		atomic_inc_32(&buf->b_mfu_ghost_hits);
		ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
	} else if (buf->b_state == arc_l2c_only) {
		/*
		 * This buffer is on the 2nd Level ARC.
		 */

		buf->b_arc_access = ddi_get_lbolt();
		DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
		arc_change_state(arc_mfu, buf, hash_lock);
	} else {
		ASSERT(!"invalid arc state");
	}
}
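
/*
 * Illustrative sketch only, not part of the ARC. It condenses the
 * state transitions of arc_access() above into one pure function and
 * leaves out hit counters, timestamps and flag updates. The sk_state_t
 * enum and sk_next_state() are hypothetical names. "prefetch" means
 * ARC_PREFETCH is set on the header; "aged" means more than
 * ARC_MINTIME has passed since the previous access.
 */

```c
typedef enum {
	SK_ANON,
	SK_MRU,
	SK_MRU_GHOST,
	SK_MFU,
	SK_MFU_GHOST,
	SK_L2C_ONLY
} sk_state_t;

/* Return the state a buffer ends up in after one access. */
static sk_state_t
sk_next_state(sk_state_t cur, int prefetch, int aged)
{
	switch (cur) {
	case SK_ANON:
		/* New to the cache and not on a ghost list. */
		return (SK_MRU);
	case SK_MRU:
		/* Prefetched buffers stay put; others promote once aged. */
		if (prefetch)
			return (SK_MRU);
		return (aged ? SK_MFU : SK_MRU);
	case SK_MRU_GHOST:
	case SK_MFU_GHOST:
		/* Ghost hit: back to MRU for a prefetch, else to MFU. */
		return (prefetch ? SK_MRU : SK_MFU);
	case SK_MFU:
		return (SK_MFU);
	case SK_L2C_ONLY:
		return (SK_MFU);
	}
	return (cur);
}
```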

/* a generic arc_done_func_t which you can use */
/* ARGSUSED */
void
arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
{
	if (zio == NULL || zio->io_error == 0)
		bcopy(buf->b_data, arg, buf->b_hdr->b_size);
	VERIFY(arc_buf_remove_ref(buf, arg));
}

/* a generic arc_done_func_t */
void
arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
{
	arc_buf_t **bufp = arg;
	if (zio && zio->io_error) {
		VERIFY(arc_buf_remove_ref(buf, arg));
		*bufp = NULL;
	} else {
		*bufp = buf;
		ASSERT(buf->b_data);
	}
}

static void
arc_read_done(zio_t *zio)
{
	arc_buf_hdr_t *hdr;
	arc_buf_t *buf;
	arc_buf_t *abuf;	/* buffer we're assigning to callback */
	kmutex_t *hash_lock = NULL;
	arc_callback_t *callback_list, *acb;
	int freeable = FALSE;

	buf = zio->io_private;
	hdr = buf->b_hdr;

	/*
	 * The hdr was inserted into hash-table and removed from lists
	 * prior to starting I/O. We should find this header, since
	 * it's in the hash table, and it should be legit since it's
	 * not possible to evict it during the I/O. The only possible
	 * reason for it not to be found is if we were freed during the
	 * read.
	 */
	if (HDR_IN_HASH_TABLE(hdr)) {
		arc_buf_hdr_t *found;

		ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp));
		ASSERT3U(hdr->b_dva.dva_word[0], ==,
		    BP_IDENTITY(zio->io_bp)->dva_word[0]);
		ASSERT3U(hdr->b_dva.dva_word[1], ==,
		    BP_IDENTITY(zio->io_bp)->dva_word[1]);

		found = buf_hash_find(hdr->b_spa, zio->io_bp,
		    &hash_lock);

		ASSERT((found == NULL && HDR_FREED_IN_READ(hdr) &&
		    hash_lock == NULL) ||
		    (found == hdr &&
		    DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
		    (found == hdr && HDR_L2_READING(hdr)));
	}

	hdr->b_flags &= ~ARC_L2_EVICTED;
	if (l2arc_noprefetch && (hdr->b_flags & ARC_PREFETCH))
		hdr->b_flags &= ~ARC_L2CACHE;

	/* byteswap if necessary */
	callback_list = hdr->b_acb;
	ASSERT(callback_list != NULL);
	if (BP_SHOULD_BYTESWAP(zio->io_bp) && zio->io_error == 0) {
		dmu_object_byteswap_t bswap =
		    DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));
		if (BP_GET_LEVEL(zio->io_bp) > 0)
			byteswap_uint64_array(buf->b_data, hdr->b_size);
		else
			dmu_ot_byteswap[bswap].ob_func(buf->b_data, hdr->b_size);
	}

	arc_cksum_compute(buf, B_FALSE);
	arc_buf_watch(buf);

	if (hash_lock && zio->io_error == 0 && hdr->b_state == arc_anon) {
		/*
		 * Only call arc_access on anonymous buffers. This is because
		 * if we've issued an I/O for an evicted buffer, we've already
		 * called arc_access (to prevent any simultaneous readers from
		 * getting confused).
		 */
		arc_access(hdr, hash_lock);
	}

	/* create copies of the data buffer for the callers */
	abuf = buf;
	for (acb = callback_list; acb; acb = acb->acb_next) {
		if (acb->acb_done) {
			if (abuf == NULL) {
				ARCSTAT_BUMP(arcstat_duplicate_reads);
				abuf = arc_buf_clone(buf);
			}
			acb->acb_buf = abuf;
			abuf = NULL;
		}
	}
	hdr->b_acb = NULL;
	hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
	ASSERT(!HDR_BUF_AVAILABLE(hdr));
	if (abuf == buf) {
		ASSERT(buf->b_efunc == NULL);
		ASSERT(hdr->b_datacnt == 1);
		hdr->b_flags |= ARC_BUF_AVAILABLE;
	}

	ASSERT(refcount_is_zero(&hdr->b_refcnt) || callback_list != NULL);

	if (zio->io_error != 0) {
		hdr->b_flags |= ARC_IO_ERROR;
		if (hdr->b_state != arc_anon)
			arc_change_state(arc_anon, hdr, hash_lock);
		if (HDR_IN_HASH_TABLE(hdr))
			buf_hash_remove(hdr);
		freeable = refcount_is_zero(&hdr->b_refcnt);
	}

	/*
	 * Broadcast before we drop the hash_lock to avoid the possibility
	 * that the hdr (and hence the cv) might be freed before we get to
	 * the cv_broadcast().
	 */
	cv_broadcast(&hdr->b_cv);

	if (hash_lock) {
		mutex_exit(hash_lock);
	} else {
		/*
		 * This block was freed while we waited for the read to
		 * complete. It has been removed from the hash table and
		 * moved to the anonymous state (so that it won't show up
		 * in the cache).
		 */
		ASSERT3P(hdr->b_state, ==, arc_anon);
		freeable = refcount_is_zero(&hdr->b_refcnt);
	}

	/* execute each callback and free its structure */
	while ((acb = callback_list) != NULL) {
		if (acb->acb_done)
			acb->acb_done(zio, acb->acb_buf, acb->acb_private);

		if (acb->acb_zio_dummy != NULL) {
			acb->acb_zio_dummy->io_error = zio->io_error;
			zio_nowait(acb->acb_zio_dummy);
		}

		callback_list = acb->acb_next;
		kmem_free(acb, sizeof (arc_callback_t));
	}

	if (freeable)
		arc_hdr_destroy(hdr);
}
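
/*
 * Illustrative sketch only, not part of the ARC. It models how
 * arc_read_done() above hands buffers to waiting callbacks: the first
 * callback with a "done" function receives the original buffer, and
 * each later one receives a fresh clone. The sk_cb_t type and
 * sk_assign_bufs() are hypothetical; buffers are represented by small
 * integer ids, with 0 meaning "none", so orig_id must be nonzero.
 */

```c
#include <stddef.h>

/* Hypothetical stand-in for arc_callback_t. */
typedef struct sk_cb {
	int has_done;		/* stands in for acb_done != NULL */
	int buf_id;		/* stands in for acb_buf; 0 = none */
	struct sk_cb *next;	/* stands in for acb_next */
} sk_cb_t;

/*
 * Assign a buffer id to every callback that has a "done" function.
 * Returns the number of clones that had to be created.
 */
static int
sk_assign_bufs(sk_cb_t *list, int orig_id)
{
	int next_clone = orig_id + 1;	/* id given to the next clone */
	int pending = orig_id;		/* 0 once it has been handed out */
	int clones = 0;
	sk_cb_t *cb;

	for (cb = list; cb != NULL; cb = cb->next) {
		if (!cb->has_done)
			continue;
		if (pending == 0) {
			/* Original already taken: make a clone. */
			pending = next_clone++;
			clones++;
		}
		cb->buf_id = pending;
		pending = 0;
	}
	return (clones);
}
```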

/*
 * "Read" the block at the specified DVA (in bp) via the
 * cache. If the block is found in the cache, invoke the provided
 * callback immediately and return. Note that the `zio' parameter
 * in the callback will be NULL in this case, since no IO was
 * required. If the block is not in the cache pass the read request
 * on to the spa with a substitute callback function, so that the
 * requested block will be added to the cache.
 *
 * If a read request arrives for a block that has a read in-progress,
 * either wait for the in-progress read to complete (and return the
 * results); or, if this is a read with a "done" func, add a record
 * to the read to invoke the "done" func when the read completes,
 * and return; or just return.
 *
 * arc_read_done() will invoke all the requested "done" functions
 * for readers of this block.
 */
int
arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_done_func_t *done,
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
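The 32-bit overflow described in the porting notes above can be illustrated with a small sketch. It models a 32-bit `unsigned long` with `uint32_t` to show that 2^32 wraps to zero, and shows one way a cap could be derived as a percentage of physical memory. The helper names `store_in_32bit_ulong` and `percent_of_physmem` are hypothetical, and the arithmetic is an assumption for illustration; it is not claimed to match what arc_init() does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model a 32-bit unsigned long with uint32_t: the conversion keeps only
 * the low 32 bits, so storing 2^32 yields 0.
 */
uint32_t
store_in_32bit_ulong(uint64_t value)
{
	return ((uint32_t)value);
}

/*
 * Hypothetical helper: derive a cap as a percentage of physical memory.
 * Dividing before multiplying avoids overflowing the intermediate
 * product, at the cost of slight rounding down.
 */
uint64_t
percent_of_physmem(uint64_t physmem_bytes, unsigned int percent)
{
	assert(percent <= 100);
	return (physmem_bytes / 100 * percent);
}
```

With 2 GiB of RAM, a 25% cap still fits in 32 bits, whereas a fixed 2^32 default does not. This is consistent with the note that a percentage-based default reduces, but does not eliminate, the overflow risk.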
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
void *private, zio_priority_t priority, int zio_flags, uint32_t *arc_flags,
|
2014-06-25 22:37:59 +04:00
|
|
|
const zbookmark_phys_t *zb)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_buf_hdr_t *hdr = NULL;
|
2010-08-26 20:58:04 +04:00
|
|
|
arc_buf_t *buf = NULL;
|
2014-06-06 01:19:08 +04:00
|
|
|
kmutex_t *hash_lock = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_t *rzio;
|
2011-11-12 02:07:54 +04:00
|
|
|
uint64_t guid = spa_load_guid(spa);
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
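The exported fields listed above map naturally onto a fixed-size record. The sketch below shows one possible layout using the types given in the list. The struct name, field names, and the `rh_set_name` helper are hypothetical, and `hrtime_sketch_t` assumes hrtime_t is a signed 64-bit nanosecond count; the actual structure used by the patch may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Assumption: hrtime_t is a signed 64-bit nanosecond count. */
typedef int64_t hrtime_sketch_t;

/* Hypothetical record layout; field names are illustrative only. */
typedef struct read_hist_rec {
	uint64_t rh_uid;		/* unique identifier */
	hrtime_sketch_t rh_start;	/* time added to the list */
	uint64_t rh_objset;		/* objset ID */
	uint64_t rh_object;		/* object number */
	uint64_t rh_level;		/* indirection level */
	uint64_t rh_blkid;		/* block ID */
	char rh_origin[24];		/* originating function name */
	uint32_t rh_aflags;		/* arc_flags from the call */
	pid_t rh_pid;			/* PID of the reading thread */
	char rh_comm[16];		/* command or thread name */
} read_hist_rec_t;

/* Copy src into a fixed-size field, always leaving it NUL-terminated. */
void
rh_set_name(char *dst, size_t dstlen, const char *src)
{
	assert(dstlen > 0);
	strncpy(dst, src, dstlen - 1);
	dst[dstlen - 1] = '\0';
}
```

Fixed-size character arrays keep each record self-contained, so a long function name is truncated to fit rather than requiring a separate allocation.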
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-07 03:09:05 +04:00
|
|
|
int rc = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp) ||
|
|
|
|
BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
top:
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
/*
|
|
|
|
* Embedded BP's have no DVA and require no I/O to "read";
|
|
|
|
* an anonymous arc buf is created below to back them. Only
|
|
|
|
* non-embedded BP's are looked up in the hash table.
|
|
|
|
*/
|
|
|
|
hdr = buf_hash_find(guid, bp, &hash_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (hdr != NULL && hdr->b_datacnt > 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
*arc_flags |= ARC_CACHED;
|
|
|
|
|
|
|
|
if (HDR_IO_IN_PROGRESS(hdr)) {
|
|
|
|
|
|
|
|
if (*arc_flags & ARC_WAIT) {
|
|
|
|
cv_wait(&hdr->b_cv, hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
goto top;
|
|
|
|
}
|
|
|
|
ASSERT(*arc_flags & ARC_NOWAIT);
|
|
|
|
|
|
|
|
if (done) {
|
|
|
|
arc_callback_t *acb = NULL;
|
|
|
|
|
|
|
|
acb = kmem_zalloc(sizeof (arc_callback_t),
|
2011-03-20 00:34:30 +03:00
|
|
|
KM_PUSHPAGE);
|
2008-11-20 23:01:55 +03:00
|
|
|
acb->acb_done = done;
|
|
|
|
acb->acb_private = private;
|
|
|
|
if (pio != NULL)
|
|
|
|
acb->acb_zio_dummy = zio_null(pio,
|
2009-02-18 23:51:31 +03:00
|
|
|
spa, NULL, NULL, NULL, zio_flags);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(acb->acb_done != NULL);
|
|
|
|
acb->acb_next = hdr->b_acb;
|
|
|
|
hdr->b_acb = acb;
|
|
|
|
add_reference(hdr, hash_lock, private);
|
|
|
|
mutex_exit(hash_lock);
|
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
|
|
|
|
|
|
|
|
if (done) {
|
|
|
|
add_reference(hdr, hash_lock, private);
|
|
|
|
/*
|
|
|
|
* If this block is already in use, create a new
|
|
|
|
* copy of the data so that we will be guaranteed
|
|
|
|
* that arc_release() will always succeed.
|
|
|
|
*/
|
|
|
|
buf = hdr->b_buf;
|
|
|
|
ASSERT(buf);
|
|
|
|
ASSERT(buf->b_data);
|
|
|
|
if (HDR_BUF_AVAILABLE(hdr)) {
|
|
|
|
ASSERT(buf->b_efunc == NULL);
|
|
|
|
hdr->b_flags &= ~ARC_BUF_AVAILABLE;
|
|
|
|
} else {
|
|
|
|
buf = arc_buf_clone(buf);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
} else if (*arc_flags & ARC_PREFETCH &&
|
|
|
|
refcount_count(&hdr->b_refcnt) == 0) {
|
|
|
|
hdr->b_flags |= ARC_PREFETCH;
|
|
|
|
}
|
|
|
|
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
|
|
|
|
arc_access(hdr, hash_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (*arc_flags & ARC_L2CACHE)
|
|
|
|
hdr->b_flags |= ARC_L2CACHE;
|
2013-08-02 00:02:10 +04:00
|
|
|
if (*arc_flags & ARC_L2COMPRESS)
|
|
|
|
hdr->b_flags |= ARC_L2COMPRESS;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
ARCSTAT_BUMP(arcstat_hits);
|
|
|
|
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
|
|
|
|
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
|
|
|
|
data, metadata, hits);
|
|
|
|
|
|
|
|
if (done)
|
|
|
|
done(NULL, buf, private);
|
|
|
|
} else {
|
|
|
|
uint64_t size = BP_GET_LSIZE(bp);
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_callback_t *acb;
|
2008-12-03 23:09:06 +03:00
|
|
|
vdev_t *vd = NULL;
|
2013-02-11 10:21:05 +04:00
|
|
|
uint64_t addr = 0;
|
2009-02-18 23:51:31 +03:00
|
|
|
boolean_t devw = B_FALSE;
|
2014-03-21 03:55:09 +04:00
|
|
|
enum zio_compress b_compress = ZIO_COMPRESS_OFF;
|
|
|
|
uint64_t b_asize = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (hdr == NULL) {
|
|
|
|
/* this block is not in the cache */
|
2014-06-06 01:19:08 +04:00
|
|
|
arc_buf_hdr_t *exists = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
|
|
|
|
buf = arc_buf_alloc(spa, size, private, type);
|
|
|
|
hdr = buf->b_hdr;
|
2014-06-06 01:19:08 +04:00
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
hdr->b_dva = *BP_IDENTITY(bp);
|
|
|
|
hdr->b_birth = BP_PHYSICAL_BIRTH(bp);
|
|
|
|
hdr->b_cksum0 = bp->blk_cksum.zc_word[0];
|
|
|
|
exists = buf_hash_insert(hdr, &hash_lock);
|
|
|
|
}
|
|
|
|
if (exists != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/* somebody beat us to the hash insert */
|
|
|
|
mutex_exit(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
buf_discard_identity(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) arc_buf_remove_ref(buf, private);
|
|
|
|
goto top; /* restart the IO request */
|
|
|
|
}
|
|
|
|
/* if this is a prefetch, we don't have a reference */
|
|
|
|
if (*arc_flags & ARC_PREFETCH) {
|
|
|
|
(void) remove_reference(hdr, hash_lock,
|
|
|
|
private);
|
|
|
|
hdr->b_flags |= ARC_PREFETCH;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
if (*arc_flags & ARC_L2CACHE)
|
|
|
|
hdr->b_flags |= ARC_L2CACHE;
|
2013-08-02 00:02:10 +04:00
|
|
|
if (*arc_flags & ARC_L2COMPRESS)
|
|
|
|
hdr->b_flags |= ARC_L2COMPRESS;
|
2008-11-20 23:01:55 +03:00
|
|
|
if (BP_GET_LEVEL(bp) > 0)
|
|
|
|
hdr->b_flags |= ARC_INDIRECT;
|
|
|
|
} else {
|
|
|
|
/* this block is in the ghost cache */
|
|
|
|
ASSERT(GHOST_STATE(hdr->b_state));
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(refcount_count(&hdr->b_refcnt));
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(hdr->b_buf == NULL);
|
|
|
|
|
|
|
|
/* if this is a prefetch, we don't have a reference */
|
|
|
|
if (*arc_flags & ARC_PREFETCH)
|
|
|
|
hdr->b_flags |= ARC_PREFETCH;
|
|
|
|
else
|
|
|
|
add_reference(hdr, hash_lock, private);
|
2008-12-03 23:09:06 +03:00
|
|
|
if (*arc_flags & ARC_L2CACHE)
|
|
|
|
hdr->b_flags |= ARC_L2CACHE;
|
2013-08-02 00:02:10 +04:00
|
|
|
if (*arc_flags & ARC_L2COMPRESS)
|
|
|
|
hdr->b_flags |= ARC_L2COMPRESS;
|
2008-11-20 23:01:55 +03:00
|
|
|
buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
|
|
|
|
buf->b_hdr = hdr;
|
|
|
|
buf->b_data = NULL;
|
|
|
|
buf->b_efunc = NULL;
|
|
|
|
buf->b_private = NULL;
|
|
|
|
buf->b_next = NULL;
|
|
|
|
hdr->b_buf = buf;
|
|
|
|
ASSERT(hdr->b_datacnt == 0);
|
|
|
|
hdr->b_datacnt = 1;
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_get_data_buf(buf);
|
|
|
|
arc_access(hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(!GHOST_STATE(hdr->b_state));
|
|
|
|
|
2011-03-20 00:34:30 +03:00
|
|
|
acb = kmem_zalloc(sizeof (arc_callback_t), KM_PUSHPAGE);
|
2008-11-20 23:01:55 +03:00
|
|
|
acb->acb_done = done;
|
|
|
|
acb->acb_private = private;
|
|
|
|
|
|
|
|
ASSERT(hdr->b_acb == NULL);
|
|
|
|
hdr->b_acb = acb;
|
|
|
|
hdr->b_flags |= ARC_IO_IN_PROGRESS;
|
|
|
|
|
2014-03-21 03:55:09 +04:00
|
|
|
if (hdr->b_l2hdr != NULL &&
|
2008-12-03 23:09:06 +03:00
|
|
|
(vd = hdr->b_l2hdr->b_dev->l2ad_vdev) != NULL) {
|
2009-02-18 23:51:31 +03:00
|
|
|
devw = hdr->b_l2hdr->b_dev->l2ad_writing;
|
2008-12-03 23:09:06 +03:00
|
|
|
addr = hdr->b_l2hdr->b_daddr;
|
2014-03-21 03:55:09 +04:00
|
|
|
b_compress = hdr->b_l2hdr->b_compress;
|
|
|
|
b_asize = hdr->b_l2hdr->b_asize;
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Lock out device removal.
|
|
|
|
*/
|
|
|
|
if (vdev_is_dead(vd) ||
|
|
|
|
!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
|
|
|
|
vd = NULL;
|
|
|
|
}
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
if (hash_lock != NULL)
|
|
|
|
mutex_exit(hash_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2013-06-11 21:12:34 +04:00
|
|
|
/*
|
|
|
|
* At this point, we have a level 1 cache miss. Try again in
|
|
|
|
* L2ARC if possible.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT3U(hdr->b_size, ==, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp,
|
2014-06-25 22:37:59 +04:00
|
|
|
uint64_t, size, zbookmark_phys_t *, zb);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_misses);
|
|
|
|
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
|
|
|
|
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
|
|
|
|
data, metadata, misses);
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Read from the L2ARC if the following are true:
|
2008-12-03 23:09:06 +03:00
|
|
|
* 1. The L2ARC vdev was previously cached.
|
|
|
|
* 2. This buffer still has L2ARC metadata.
|
|
|
|
* 3. This buffer isn't currently writing to the L2ARC.
|
|
|
|
* 4. The L2ARC entry wasn't evicted, which may
|
|
|
|
* also have invalidated the vdev.
|
2009-02-18 23:51:31 +03:00
|
|
|
* 5. This isn't a prefetch, or l2arc_noprefetch isn't set.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (hdr->b_l2hdr != NULL &&
|
2009-02-18 23:51:31 +03:00
|
|
|
!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
|
|
|
|
!(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
|
2008-11-20 23:01:55 +03:00
|
|
|
l2arc_read_callback_t *cb;
|
|
|
|
|
|
|
|
DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_hits);
|
2013-10-03 04:11:19 +04:00
|
|
|
atomic_inc_32(&hdr->b_l2hdr->b_hits);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
|
2011-03-20 00:34:30 +03:00
|
|
|
KM_PUSHPAGE);
|
2008-11-20 23:01:55 +03:00
|
|
|
cb->l2rcb_buf = buf;
|
|
|
|
cb->l2rcb_spa = spa;
|
|
|
|
cb->l2rcb_bp = *bp;
|
|
|
|
cb->l2rcb_zb = *zb;
|
2008-12-03 23:09:06 +03:00
|
|
|
cb->l2rcb_flags = zio_flags;
|
2014-03-21 03:55:09 +04:00
|
|
|
cb->l2rcb_compress = b_compress;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-02-11 10:21:05 +04:00
|
|
|
ASSERT(addr >= VDEV_LABEL_START_SIZE &&
|
|
|
|
addr + size < vd->vdev_psize -
|
|
|
|
VDEV_LABEL_END_SIZE);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* l2arc read. The SCL_L2ARC lock will be
|
|
|
|
* released by l2arc_read_done().
|
2013-08-02 00:02:10 +04:00
|
|
|
* Issue a null zio if the underlying buffer
|
|
|
|
* was squashed to zero size by compression.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-03-21 03:55:09 +04:00
|
|
|
if (b_compress == ZIO_COMPRESS_EMPTY) {
|
2013-08-02 00:02:10 +04:00
|
|
|
rzio = zio_null(pio, spa, vd,
|
|
|
|
l2arc_read_done, cb,
|
|
|
|
zio_flags | ZIO_FLAG_DONT_CACHE |
|
|
|
|
ZIO_FLAG_CANFAIL |
|
|
|
|
ZIO_FLAG_DONT_PROPAGATE |
|
|
|
|
ZIO_FLAG_DONT_RETRY);
|
|
|
|
} else {
|
|
|
|
rzio = zio_read_phys(pio, vd, addr,
|
2014-03-21 03:55:09 +04:00
|
|
|
b_asize, buf->b_data,
|
|
|
|
ZIO_CHECKSUM_OFF,
|
2013-08-02 00:02:10 +04:00
|
|
|
l2arc_read_done, cb, priority,
|
|
|
|
zio_flags | ZIO_FLAG_DONT_CACHE |
|
|
|
|
ZIO_FLAG_CANFAIL |
|
|
|
|
ZIO_FLAG_DONT_PROPAGATE |
|
|
|
|
ZIO_FLAG_DONT_RETRY, B_FALSE);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
|
|
|
|
zio_t *, rzio);
|
2014-03-21 03:55:09 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_read_bytes, b_asize);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (*arc_flags & ARC_NOWAIT) {
|
|
|
|
zio_nowait(rzio);
|
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(*arc_flags & ARC_WAIT);
|
|
|
|
if (zio_wait(rzio) == 0)
|
2013-09-07 03:09:05 +04:00
|
|
|
goto out;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/* l2arc read error; fall through to zio_read() */
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
DTRACE_PROBE1(l2arc__miss,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_misses);
|
|
|
|
if (HDR_L2_WRITING(hdr))
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_rw_clash);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
} else {
|
|
|
|
if (vd != NULL)
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, vd);
|
|
|
|
if (l2arc_ndev != 0) {
|
|
|
|
DTRACE_PROBE1(l2arc__miss,
|
|
|
|
arc_buf_hdr_t *, hdr);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_misses);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
rzio = zio_read(pio, spa, bp, buf->b_data, size,
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_read_done, buf, priority, zio_flags, zb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-07 03:09:05 +04:00
|
|
|
if (*arc_flags & ARC_WAIT) {
|
|
|
|
rc = zio_wait(rzio);
|
|
|
|
goto out;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(*arc_flags & ARC_NOWAIT);
|
|
|
|
zio_nowait(rzio);
|
|
|
|
}
|
2013-09-07 03:09:05 +04:00
|
|
|
|
|
|
|
out:
|
|
|
|
spa_read_history_add(spa, zb, *arc_flags);
|
|
|
|
return (rc);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_prune_t *
|
|
|
|
arc_add_prune_callback(arc_prune_func_t *func, void *private)
|
|
|
|
{
|
|
|
|
arc_prune_t *p;
|
|
|
|
|
2013-11-01 23:26:11 +04:00
|
|
|
p = kmem_alloc(sizeof (*p), KM_SLEEP);
|
2011-12-23 00:20:43 +04:00
|
|
|
p->p_pfunc = func;
|
|
|
|
p->p_private = private;
|
|
|
|
list_link_init(&p->p_node);
|
|
|
|
refcount_create(&p->p_refcnt);
|
|
|
|
|
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
refcount_add(&p->p_refcnt, &arc_prune_list);
|
|
|
|
list_insert_head(&arc_prune_list, p);
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
|
|
|
|
return (p);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_remove_prune_callback(arc_prune_t *p)
|
|
|
|
{
|
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
list_remove(&arc_prune_list, p);
|
|
|
|
if (refcount_remove(&p->p_refcnt, &arc_prune_list) == 0) {
|
|
|
|
refcount_destroy(&p->p_refcnt);
|
|
|
|
kmem_free(p, sizeof (*p));
|
|
|
|
}
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private)
|
|
|
|
{
|
|
|
|
ASSERT(buf->b_hdr != NULL);
|
|
|
|
ASSERT(buf->b_hdr->b_state != arc_anon);
|
|
|
|
ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt) || func == NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(buf->b_efunc == NULL);
|
|
|
|
ASSERT(!HDR_BUF_AVAILABLE(buf->b_hdr));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
buf->b_efunc = func;
|
|
|
|
buf->b_private = private;
|
|
|
|
}
|
|
|
|
|
Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f52089b9026785bd90257a3d3f6e5ee2
https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that would soon be needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-07 02:46:55 +04:00
|
|
|
/*
|
|
|
|
* Notify the arc that a block was freed, and thus will never be used again.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_freed(spa_t *spa, const blkptr_t *bp)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
uint64_t guid = spa_load_guid(spa);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp));
|
|
|
|
|
|
|
|
hdr = buf_hash_find(guid, bp, &hash_lock);
|
2013-06-07 02:46:55 +04:00
|
|
|
if (hdr == NULL)
|
|
|
|
return;
|
|
|
|
if (HDR_BUF_AVAILABLE(hdr)) {
|
|
|
|
arc_buf_t *buf = hdr->b_buf;
|
|
|
|
add_reference(hdr, hash_lock, FTAG);
|
|
|
|
hdr->b_flags &= ~ARC_BUF_AVAILABLE;
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
arc_release(buf, FTAG);
|
|
|
|
(void) arc_buf_remove_ref(buf, FTAG);
|
|
|
|
} else {
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* This is used by the DMU to let the ARC know that a buffer is
|
|
|
|
* being evicted, so the ARC should clean up. If this arc buf
|
|
|
|
* is not yet in the evicted state, it will be put there.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
arc_buf_evict(arc_buf_t *buf)
|
|
|
|
{
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
arc_buf_t **bufp;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
hdr = buf->b_hdr;
|
|
|
|
if (hdr == NULL) {
|
|
|
|
/*
|
|
|
|
* We are in arc_do_user_evicts().
|
|
|
|
*/
|
|
|
|
ASSERT(buf->b_data == NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else if (buf->b_data == NULL) {
|
|
|
|
arc_buf_t copy = *buf; /* structure assignment */
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* We are on the eviction list; process this buffer now
|
|
|
|
* but let arc_do_user_evicts() do the reaping.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
buf->b_efunc = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
VERIFY(copy.b_efunc(&copy) == 0);
|
|
|
|
return (1);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
hash_lock = HDR_LOCK(hdr);
|
|
|
|
mutex_enter(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
hdr = buf->b_hdr;
|
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT3U(refcount_count(&hdr->b_refcnt), <, hdr->b_datacnt);
|
|
|
|
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pull this buffer off of the hdr
|
|
|
|
*/
|
|
|
|
bufp = &hdr->b_buf;
|
|
|
|
while (*bufp != buf)
|
|
|
|
bufp = &(*bufp)->b_next;
|
|
|
|
*bufp = buf->b_next;
|
|
|
|
|
|
|
|
ASSERT(buf->b_data != NULL);
|
|
|
|
arc_buf_destroy(buf, FALSE, FALSE);
|
|
|
|
|
|
|
|
if (hdr->b_datacnt == 0) {
|
|
|
|
arc_state_t *old_state = hdr->b_state;
|
|
|
|
arc_state_t *evicted_state;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(hdr->b_buf == NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(refcount_is_zero(&hdr->b_refcnt));
|
|
|
|
|
|
|
|
evicted_state =
|
|
|
|
(old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
|
|
|
|
|
|
|
|
mutex_enter(&old_state->arcs_mtx);
|
|
|
|
mutex_enter(&evicted_state->arcs_mtx);
|
|
|
|
|
|
|
|
arc_change_state(evicted_state, hdr, hash_lock);
|
|
|
|
ASSERT(HDR_IN_HASH_TABLE(hdr));
|
|
|
|
hdr->b_flags |= ARC_IN_HASH_TABLE;
|
|
|
|
hdr->b_flags &= ~ARC_BUF_AVAILABLE;
|
|
|
|
|
|
|
|
mutex_exit(&evicted_state->arcs_mtx);
|
|
|
|
mutex_exit(&old_state->arcs_mtx);
|
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
VERIFY(buf->b_efunc(buf) == 0);
|
|
|
|
buf->b_efunc = NULL;
|
|
|
|
buf->b_private = NULL;
|
|
|
|
buf->b_hdr = NULL;
|
2010-05-29 00:45:14 +04:00
|
|
|
buf->b_next = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_cache_free(buf_cache, buf);
|
|
|
|
return (1);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-06-11 21:12:34 +04:00
|
|
|
* Release this buffer from the cache, making it an anonymous buffer. This
|
|
|
|
* must be done after a read and prior to modifying the buffer contents.
|
2008-11-20 23:01:55 +03:00
|
|
|
* If the buffer has more than one reference, we must make
|
2008-12-03 23:09:06 +03:00
|
|
|
* a new hdr for the buffer.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
arc_release(arc_buf_t *buf, void *tag)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_buf_hdr_t *hdr;
|
2010-05-29 00:45:14 +04:00
|
|
|
kmutex_t *hash_lock = NULL;
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_buf_hdr_t *l2hdr;
|
2010-08-26 20:58:04 +04:00
|
|
|
uint64_t buf_size = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* It would be nice to assert that if it's DMU metadata (level >
|
|
|
|
* 0 || it's the dnode file), then it must be syncing context.
|
|
|
|
* But we don't know that information at this level.
|
|
|
|
*/
|
|
|
|
|
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
hdr = buf->b_hdr;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* this buffer is not on any list */
|
|
|
|
ASSERT(refcount_count(&hdr->b_refcnt) > 0);
|
|
|
|
|
|
|
|
if (hdr->b_state == arc_anon) {
|
|
|
|
/* this buffer is already released */
|
|
|
|
ASSERT(buf->b_efunc == NULL);
|
2009-07-03 02:44:48 +04:00
|
|
|
} else {
|
|
|
|
hash_lock = HDR_LOCK(hdr);
|
|
|
|
mutex_enter(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
hdr = buf->b_hdr;
|
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
l2hdr = hdr->b_l2hdr;
|
|
|
|
if (l2hdr) {
|
|
|
|
mutex_enter(&l2arc_buflist_mtx);
|
|
|
|
hdr->b_l2hdr = NULL;
|
2013-08-30 23:12:45 +04:00
|
|
|
list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2013-02-11 10:21:05 +04:00
|
|
|
buf_size = hdr->b_size;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Do we have more than one buf?
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (hdr->b_datacnt > 1) {
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_hdr_t *nhdr;
|
|
|
|
arc_buf_t **bufp;
|
|
|
|
uint64_t blksz = hdr->b_size;
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t spa = hdr->b_spa;
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_contents_t type = hdr->b_type;
|
|
|
|
uint32_t flags = hdr->b_flags;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Pull the data off of this hdr and attach it to
|
|
|
|
* a new anonymous hdr.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
(void) remove_reference(hdr, hash_lock, tag);
|
|
|
|
bufp = &hdr->b_buf;
|
|
|
|
while (*bufp != buf)
|
|
|
|
bufp = &(*bufp)->b_next;
|
2010-05-29 00:45:14 +04:00
|
|
|
*bufp = buf->b_next;
|
2008-11-20 23:01:55 +03:00
|
|
|
buf->b_next = NULL;
|
|
|
|
|
|
|
|
ASSERT3U(hdr->b_state->arcs_size, >=, hdr->b_size);
|
|
|
|
atomic_add_64(&hdr->b_state->arcs_size, -hdr->b_size);
|
|
|
|
if (refcount_is_zero(&hdr->b_refcnt)) {
|
|
|
|
uint64_t *size = &hdr->b_state->arcs_lsize[hdr->b_type];
|
|
|
|
ASSERT3U(*size, >=, hdr->b_size);
|
|
|
|
atomic_add_64(size, -hdr->b_size);
|
|
|
|
}
|
2012-12-22 02:57:09 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We're releasing a duplicate user data buffer, update
|
|
|
|
* our statistics accordingly.
|
|
|
|
*/
|
|
|
|
if (hdr->b_type == ARC_BUFC_DATA) {
|
|
|
|
ARCSTAT_BUMPDOWN(arcstat_duplicate_buffers);
|
|
|
|
ARCSTAT_INCR(arcstat_duplicate_buffers_size,
|
|
|
|
-hdr->b_size);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
hdr->b_datacnt -= 1;
|
|
|
|
arc_cksum_verify(buf);
|
2013-05-17 01:18:06 +04:00
|
|
|
arc_buf_unwatch(buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
|
|
|
nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
|
|
|
|
nhdr->b_size = blksz;
|
|
|
|
nhdr->b_spa = spa;
|
|
|
|
nhdr->b_type = type;
|
|
|
|
nhdr->b_buf = buf;
|
|
|
|
nhdr->b_state = arc_anon;
|
|
|
|
nhdr->b_arc_access = 0;
|
2013-10-03 04:11:19 +04:00
|
|
|
nhdr->b_mru_hits = 0;
|
|
|
|
nhdr->b_mru_ghost_hits = 0;
|
|
|
|
nhdr->b_mfu_hits = 0;
|
|
|
|
nhdr->b_mfu_ghost_hits = 0;
|
|
|
|
nhdr->b_l2_hits = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
nhdr->b_flags = flags & ARC_L2_WRITING;
|
|
|
|
nhdr->b_l2hdr = NULL;
|
|
|
|
nhdr->b_datacnt = 1;
|
|
|
|
nhdr->b_freeze_cksum = NULL;
|
|
|
|
(void) refcount_add(&nhdr->b_refcnt, tag);
|
|
|
|
buf->b_hdr = nhdr;
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
atomic_add_64(&arc_anon->arcs_size, blksz);
|
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(refcount_count(&hdr->b_refcnt) == 1);
|
|
|
|
ASSERT(!list_link_active(&hdr->b_arc_node));
|
|
|
|
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
|
2010-05-29 00:45:14 +04:00
|
|
|
if (hdr->b_state != arc_anon)
|
|
|
|
arc_change_state(arc_anon, hdr, hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
hdr->b_arc_access = 0;
|
2013-10-03 04:11:19 +04:00
|
|
|
hdr->b_mru_hits = 0;
|
|
|
|
hdr->b_mru_ghost_hits = 0;
|
|
|
|
hdr->b_mfu_hits = 0;
|
|
|
|
hdr->b_mfu_ghost_hits = 0;
|
|
|
|
hdr->b_l2_hits = 0;
|
2010-05-29 00:45:14 +04:00
|
|
|
if (hash_lock)
|
|
|
|
mutex_exit(hash_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
buf_discard_identity(hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_buf_thaw(buf);
|
|
|
|
}
|
|
|
|
buf->b_efunc = NULL;
|
|
|
|
buf->b_private = NULL;
|
|
|
|
|
|
|
|
if (l2hdr) {
|
2013-08-02 00:02:10 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_asize, -l2hdr->b_asize);
|
2014-05-22 13:11:57 +04:00
|
|
|
vdev_space_update(l2hdr->b_dev->l2ad_vdev,
|
|
|
|
-l2hdr->b_asize, 0, 0);
|
2013-11-20 01:34:46 +04:00
|
|
|
kmem_cache_free(l2arc_hdr_cache, l2hdr);
|
Fix inaccurate arcstat_l2_hdr_size calculations
Based on the comments in arc.c we know that buffers can exist both
in arc and l2arc, under this circumstance both arc_buf_hdr_t and
l2arc_buf_hdr_t will be allocated. However the current logic only
cares for memory that l2arc_buf_hdr takes up when the buffer's
state transfers from or to arc_l2c_only. This will cause obvious
deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr)
is larger than ZOL's. We can implement the calculation in the
following simple way:
1. When we allocate an l2arc_buf_hdr_t we add its memory consumption
immediately, and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
the buffer stays only in l2arc we should also add the memory
its arc_buf_hdr_t consumes. We therefore only need to add HDR_SIZE
to arcstat_l2_hdr_size, since L2HDR_SIZE is already accounted for
in step 1; the same applies when transferring arc bufs out of the
l2arc-only state.
The testbox has two 4-core Intel Xeon CPUs (2.13GHz) and 16GB of memory,
and the tests were set up in the following way:
1. Fdisked a SATA disk into two partitions, one partition for zpool
storage and the other one was used as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc related statistics
in /proc/spl/kstat/zfs/arcstats. After the reading ended the
l2_hdr_size and l2_size were shown like this:
l2_size 4 4403780608
l2_hdr_size 4 0
which was weird.
4. After applying this patch and rerunning steps 1-3, the results were
as follows:
l2_size 4 4306443264
l2_hdr_size 4 535600
These numbers made sense: on 64-bit systems
sizeof(l2arc_buf_hdr_t) is 16 bytes. Assume all blocks cached by
l2arc are 128KB, so 535600/16*128*1024=4387635200. Since not all
blocks are equal-sized, the theoretical result will be a little
bigger, as we can see.
Since I'm familiar with the systemtap instrumentation tool, I used it to
examine what had happened. The script looked like this:
probe module("zfs").function("arc_change_state")
{
if ($new_state == $arc_l2c_only)
printf("change arc buf to arc_l2c_only\n")
}
It prints a message each time arc_change_state is called
with the argument new_state equal to arc_l2c_only. I
gathered the trace logs and found that none of the arc bufs entered
the arc_l2c_only state while the tests were running; this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2c_only when the pool or the filesystem was offlined.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
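The sanity check in step 4 of the message above is plain integer arithmetic: divide the reported header bytes by the per-header size to get a buffer count, then multiply by an assumed block size. A minimal sketch follows; the function name and parameter names are hypothetical, and the 16-byte header and 128KB block size are the assumptions stated in the commit message, not values read from a live system.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate how many bytes the L2ARC would hold if every cached block
 * had the same size. Returns 0 if hdr_bytes is 0 to avoid dividing
 * by zero.
 */
static uint64_t
example_l2_estimated_bytes(uint64_t l2_hdr_size, uint64_t hdr_bytes,
    uint64_t block_bytes)
{
	if (hdr_bytes == 0)
		return (0);
	/* number of headers, times the assumed size of each block */
	return ((l2_hdr_size / hdr_bytes) * block_bytes);
}
```

With the figures quoted above, example_l2_estimated_bytes(535600, 16, 128 * 1024) gives 33475 headers times 131072 bytes, i.e. 4387635200, slightly above the reported l2_size of 4306443264 because not every cached block is a full 128KB.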
2013-06-22 16:35:18 +04:00
|
|
|
arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_size, -buf_size);
|
|
|
|
mutex_exit(&l2arc_buflist_mtx);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
arc_released(arc_buf_t *buf)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
int released;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
released = (buf->b_data != NULL && buf->b_hdr->b_state == arc_anon);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
return (released);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
arc_has_callback(arc_buf_t *buf)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
int callback;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
callback = (buf->b_efunc != NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
return (callback);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef ZFS_DEBUG
|
|
|
|
int
|
|
|
|
arc_referenced(arc_buf_t *buf)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
int referenced;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
referenced = (refcount_count(&buf->b_hdr->b_refcnt));
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&buf->b_evict_lock);
|
2008-12-03 23:09:06 +03:00
|
|
|
return (referenced);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static void
|
|
|
|
arc_write_ready(zio_t *zio)
|
|
|
|
{
|
|
|
|
arc_write_callback_t *callback = zio->io_private;
|
|
|
|
arc_buf_t *buf = callback->awcb_buf;
|
|
|
|
arc_buf_hdr_t *hdr = buf->b_hdr;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt));
|
|
|
|
callback->awcb_ready(zio, buf, callback->awcb_private);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* If the IO is already in progress, then this is a re-write
|
2008-12-03 23:09:06 +03:00
|
|
|
* attempt, so we need to thaw and re-compute the cksum.
|
|
|
|
* It is the responsibility of the callback to handle the
|
|
|
|
* accounting for any re-write attempt.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
if (HDR_IO_IN_PROGRESS(hdr)) {
|
|
|
|
mutex_enter(&hdr->b_freeze_lock);
|
|
|
|
if (hdr->b_freeze_cksum != NULL) {
|
|
|
|
kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
|
|
|
|
hdr->b_freeze_cksum = NULL;
|
|
|
|
}
|
|
|
|
mutex_exit(&hdr->b_freeze_lock);
|
|
|
|
}
|
|
|
|
arc_cksum_compute(buf, B_FALSE);
|
|
|
|
hdr->b_flags |= ARC_IO_IN_PROGRESS;
|
|
|
|
}
|
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
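The 32-bit overflow concern in the porting notes above can be illustrated with fixed-width types. This is a sketch under stated assumptions: uint32_t stands in for a 32-bit unsigned long, and both helper names are hypothetical; it is not the code used in arc_init().

```c
#include <assert.h>
#include <stdint.h>

/* Model storing a 64-bit byte count in a 32-bit unsigned long. */
static uint32_t
example_store_in_32bit(uint64_t bytes)
{
	return ((uint32_t)bytes);	/* high 32 bits are discarded */
}

/*
 * Compute pct percent of total. The multiplication is done in 64 bits,
 * which is sufficient for realistic physical memory sizes.
 */
static uint64_t
example_percent_of(uint64_t total, unsigned int pct)
{
	return (total * pct / 100);
}
```

Under these assumptions, a value of 2^32 stored in 32 bits becomes 0, which is the overflow the notes describe. Taking 25% of physical RAM helps on small systems (25% of 8GB is 2^31 bytes, which still fits), but 25% of 16GB is again 2^32 and would also truncate to 0, consistent with the note that the change does not completely avoid the issue.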
2013-08-29 07:01:20 +04:00
/*
 * The SPA calls this callback for each physical write that happens on behalf
 * of a logical write. See the comment in dbuf_write_physdone() for details.
 */
static void
arc_write_physdone(zio_t *zio)
{
	arc_write_callback_t *cb = zio->io_private;
	if (cb->awcb_physdone != NULL)
		cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private);
}

static void
arc_write_done(zio_t *zio)
{
	arc_write_callback_t *callback = zio->io_private;
	arc_buf_t *buf = callback->awcb_buf;
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT(hdr->b_acb == NULL);

	if (zio->io_error == 0) {
		if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {
			buf_discard_identity(hdr);
		} else {
			hdr->b_dva = *BP_IDENTITY(zio->io_bp);
			hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp);
			hdr->b_cksum0 = zio->io_bp->blk_cksum.zc_word[0];
		}
	} else {
		ASSERT(BUF_EMPTY(hdr));
	}

	/*
	 * If the block to be written was all-zero or compressed enough to be
	 * embedded in the BP, no write was performed so there will be no
	 * dva/birth/checksum. The buffer must therefore remain anonymous
	 * (and uncached).
	 */
	if (!BUF_EMPTY(hdr)) {
		arc_buf_hdr_t *exists;
		kmutex_t *hash_lock;

		ASSERT(zio->io_error == 0);

		arc_cksum_verify(buf);

		exists = buf_hash_insert(hdr, &hash_lock);
		if (exists) {
			/*
			 * This can only happen if we overwrite for
			 * sync-to-convergence, because we remove
			 * buffers from the hash table when we arc_free().
			 */
			if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
				if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
					panic("bad overwrite, hdr=%p exists=%p",
					    (void *)hdr, (void *)exists);
				ASSERT(refcount_is_zero(&exists->b_refcnt));
				arc_change_state(arc_anon, exists, hash_lock);
				mutex_exit(hash_lock);
				arc_hdr_destroy(exists);
				exists = buf_hash_insert(hdr, &hash_lock);
				ASSERT3P(exists, ==, NULL);
			} else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
				/* nopwrite */
				ASSERT(zio->io_prop.zp_nopwrite);
				if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
					panic("bad nopwrite, hdr=%p exists=%p",
					    (void *)hdr, (void *)exists);
			} else {
				/* Dedup */
				ASSERT(hdr->b_datacnt == 1);
				ASSERT(hdr->b_state == arc_anon);
				ASSERT(BP_GET_DEDUP(zio->io_bp));
				ASSERT(BP_GET_LEVEL(zio->io_bp) == 0);
			}
		}
		hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
		/* if it's not anon, we are doing a scrub */
		if (!exists && hdr->b_state == arc_anon)
			arc_access(hdr, hash_lock);
		mutex_exit(hash_lock);
	} else {
		hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
	}

	ASSERT(!refcount_is_zero(&hdr->b_refcnt));
	callback->awcb_done(zio, buf, callback->awcb_private);

	kmem_free(callback, sizeof (arc_write_callback_t));
}

zio_t *
arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
    blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, boolean_t l2arc_compress,
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
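As a rough illustration of a throttle driven by the amount of dirty data, the
sketch below computes a per-transaction delay that is zero below a minimum
dirty threshold and grows as dirty data approaches the maximum. The helper
name sketch_tx_delay_ns() and the exact shape of the curve are assumptions
made for this example; the block comments in dsl_pool.c and above
dmu_tx_delay() remain the authoritative description. The parameters stand in
for zfs_dirty_data_max, zfs_delay_min_dirty_percent and zfs_delay_scale.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the real dmu_tx_delay()): return the delay, in
 * nanoseconds, to apply to one transaction given the pool's dirty bytes.
 */
static uint64_t
sketch_tx_delay_ns(uint64_t dirty, uint64_t dirty_max,
    uint64_t min_dirty_percent, uint64_t delay_scale_ns)
{
	uint64_t delay_min_bytes;

	assert(min_dirty_percent <= 100);

	/* Below this many dirty bytes no delay is applied at all. */
	delay_min_bytes = dirty_max * min_dirty_percent / 100;
	if (dirty <= delay_min_bytes)
		return (0);

	/* At or above the maximum the caller has to wait for data to drain. */
	if (dirty >= dirty_max)
		return (UINT64_MAX);

	/*
	 * Assumed curve: the delay rises smoothly as the remaining headroom
	 * (dirty_max - dirty) shrinks, so equal load yields equal delay.
	 */
	return (delay_scale_ns * (dirty - delay_min_bytes) /
	    (dirty_max - dirty));
}
```

With a maximum of 1000 bytes, a 60 percent threshold and a scale of 500000 ns,
this sketch returns no delay at 600 dirty bytes, 500000 ns at 800 and
1500000 ns at 900. Every transaction that sees the same amount of dirty data
gets the same delay, which is the property the new throttle relies on.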
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
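The 32-bit overflow note above can be illustrated with a small sketch that
derives a byte limit as a percentage of physical RAM and saturates at the
largest value the destination type can hold. The helper name
sketch_pct_of_physmem() and its type_max parameter are hypothetical and exist
only for this example; the port performs its own initialization in arc_init().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the port: compute percent% of physical
 * memory in bytes and clamp it to type_max (e.g. ULONG_MAX on the target).
 */
static uint64_t
sketch_pct_of_physmem(uint64_t physmem_bytes, unsigned int percent,
    uint64_t type_max)
{
	uint64_t limit;

	assert(percent <= 100);

	/* Divide first only when the product would overflow 64 bits. */
	if (physmem_bytes > UINT64_MAX / 100)
		limit = physmem_bytes / 100 * percent;
	else
		limit = physmem_bytes * percent / 100;

	/* Saturate rather than wrap when the type is too narrow. */
	return (limit > type_max ? type_max : limit);
}
```

With 32 GiB of RAM and 25 percent, the unclamped result is 8 GiB, which does
not fit in a 32-bit unsigned long, so the sketch returns the 32-bit maximum
instead of wrapping. With 2 GiB of RAM the result (512 MiB) fits and is
returned unchanged.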
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
    const zio_prop_t *zp, arc_done_func_t *ready, arc_done_func_t *physdone,
    arc_done_func_t *done, void *private, zio_priority_t priority,
    int zio_flags, const zbookmark_phys_t *zb)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	arc_write_callback_t *callback;
	zio_t *zio;

	ASSERT(ready != NULL);
	ASSERT(done != NULL);
	ASSERT(!HDR_IO_ERROR(hdr));
	ASSERT((hdr->b_flags & ARC_IO_IN_PROGRESS) == 0);
	ASSERT(hdr->b_acb == NULL);
	if (l2arc)
		hdr->b_flags |= ARC_L2CACHE;
	if (l2arc_compress)
		hdr->b_flags |= ARC_L2COMPRESS;
	callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_PUSHPAGE);
	callback->awcb_ready = ready;
	callback->awcb_physdone = physdone;
	callback->awcb_done = done;
	callback->awcb_private = private;
	callback->awcb_buf = buf;

	zio = zio_write(pio, spa, txg, bp, buf->b_data, hdr->b_size, zp,
	    arc_write_ready, arc_write_physdone, arc_write_done, callback,
	    priority, zio_flags, zb);

	return (zio);
}

static int
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The last four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have reintroduced the bug fixed in
zfsonlinux/zfs@c418410; that fix prevents users from setting the
zfs_vdev_aggregation_limit tunable larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
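The note above on zfs_dirty_data_max_max can be illustrated with a short sketch. It shows a limit derived as a percentage of physical memory, and what happens when a 64-bit byte count is stored in a 32-bit-wide unsigned type. Both helper names are hypothetical, and physmem_pages and page_size merely stand in for the kernel's physmem and PAGESIZE.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive a byte limit as a percentage of physical
 * memory, as the porting note describes for zfs_dirty_data_max_max.
 */
uint64_t
sketch_limit_from_percent(uint64_t physmem_pages, uint64_t page_size,
    uint64_t percent)
{
	return (physmem_pages * page_size * percent / 100);
}

/*
 * Illustrates the 32-bit hazard: storing a 64-bit byte count in a
 * 32-bit-wide unsigned type keeps only the low 32 bits.
 */
uint32_t
sketch_store_in_32bit(uint64_t bytes)
{
	return ((uint32_t)bytes);
}
```

With 4 GiB of RAM (1048576 pages of 4096 bytes), the 25% default works out to 1 GiB, which still fits in 32 bits, whereas the Illumos initial value of 2^32 truncates to 0.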
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
arc_memory_throttle(uint64_t reserve, uint64_t txg)
{
#ifdef _KERNEL
	if (zfs_arc_memory_throttle_disable)
		return (0);

	if (freemem <= physmem * arc_lotsfree_percent / 100) {
		ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
		DMU_TX_STAT_BUMP(dmu_tx_memory_reclaim);
		return (SET_ERROR(EAGAIN));
	}
#endif
	return (0);
}

void
arc_tempreserve_clear(uint64_t reserve)
{
	atomic_add_64(&arc_tempreserve, -reserve);
	ASSERT((int64_t)arc_tempreserve >= 0);
}

int
arc_tempreserve_space(uint64_t reserve, uint64_t txg)
{
	int error;
	uint64_t anon_size;

	if (reserve > arc_c/4 && !arc_no_grow)
		arc_c = MIN(arc_c_max, reserve * 4);

	/*
	 * Throttle when the calculated memory footprint for the TXG
	 * exceeds the target ARC size.
	 */
	if (reserve > arc_c) {
		DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);
		return (SET_ERROR(ERESTART));
	}

	/*
	 * Don't count loaned bufs as in flight dirty data to prevent long
	 * network delays from blocking transactions that are ready to be
	 * assigned to a txg.
	 */
	anon_size = MAX((int64_t)(arc_anon->arcs_size - arc_loaned_bytes), 0);

	/*
	 * Writes will, almost always, require additional memory allocations
	 * in order to compress/encrypt/etc the data. We therefore need to
	 * make sure that there is sufficient available memory for this.
	 */
	error = arc_memory_throttle(reserve, txg);
	if (error != 0)
		return (error);

	/*
	 * Throttle writes when the amount of dirty data in the cache
	 * gets too large. We try to keep the cache less than half full
	 * of dirty blocks so that our sync times don't grow too large.
	 * Note: if two requests come in concurrently, we might let them
	 * both succeed, when one of them should fail. Not a huge deal.
	 */

	if (reserve + arc_tempreserve + anon_size > arc_c / 2 &&
	    anon_size > arc_c / 4) {
		dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
		    "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
		    arc_tempreserve>>10,
		    arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
		    arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
		    reserve>>10, arc_c>>10);
		DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle);
		return (SET_ERROR(ERESTART));
	}
	atomic_add_64(&arc_tempreserve, reserve);
	return (0);
}

static void
arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,
    kstat_named_t *evict_data, kstat_named_t *evict_metadata)
{
	size->value.ui64 = state->arcs_size;
	evict_data->value.ui64 = state->arcs_lsize[ARC_BUFC_DATA];
	evict_metadata->value.ui64 = state->arcs_lsize[ARC_BUFC_METADATA];
}

static int
arc_kstat_update(kstat_t *ksp, int rw)
{
	arc_stats_t *as = ksp->ks_data;

	if (rw == KSTAT_WRITE) {
		return (SET_ERROR(EACCES));
	} else {
		arc_kstat_update_state(arc_anon,
		    &as->arcstat_anon_size,
		    &as->arcstat_anon_evict_data,
		    &as->arcstat_anon_evict_metadata);
		arc_kstat_update_state(arc_mru,
		    &as->arcstat_mru_size,
		    &as->arcstat_mru_evict_data,
		    &as->arcstat_mru_evict_metadata);
		arc_kstat_update_state(arc_mru_ghost,
		    &as->arcstat_mru_ghost_size,
		    &as->arcstat_mru_ghost_evict_data,
		    &as->arcstat_mru_ghost_evict_metadata);
		arc_kstat_update_state(arc_mfu,
		    &as->arcstat_mfu_size,
		    &as->arcstat_mfu_evict_data,
		    &as->arcstat_mfu_evict_metadata);
		arc_kstat_update_state(arc_mfu_ghost,
		    &as->arcstat_mfu_ghost_size,
		    &as->arcstat_mfu_ghost_evict_data,
		    &as->arcstat_mfu_ghost_evict_metadata);
	}

	return (0);
}

void
arc_init(void)
{
	mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL);

	/* Convert seconds to clock ticks */
	zfs_arc_min_prefetch_lifespan = 1 * hz;

	/* Start out with 1/8 of all memory */
	arc_c = physmem * PAGESIZE / 8;

#ifdef _KERNEL
	/*
	 * On architectures where the physical memory can be larger
	 * than the addressable space (Intel in 32-bit mode), we may
	 * need to limit the cache to 1/8 of VM size.
	 */
	arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
	/*
	 * Register a shrinker to support synchronous (direct) memory
	 * reclaim from the arc. This is done to prevent kswapd from
	 * swapping out pages when it is preferable to shrink the arc.
	 */
	spl_register_shrinker(&arc_shrinker);
#endif

	/* set min cache to 4MB */
	arc_c_min = 4<<20;
	/* set max to 1/2 of all memory */
	arc_c_max = arc_c * 4;

	/*
	 * Allow the tunables to override our calculations if they are
	 * reasonable (i.e. over 64MB)
	 */
	if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE)
		arc_c_max = zfs_arc_max;
	if (zfs_arc_min > 0 && zfs_arc_min <= arc_c_max)
		arc_c_min = zfs_arc_min;

	arc_c = arc_c_max;
	arc_p = (arc_c >> 1);

Set "arc_meta_limit" to 3/4 arc_c_max by default

Unfortunately, this change is a cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try to describe the problem.

Data buffers residing in the dbuf hash table (dbuf cache) keep a
hold on their respective dnode; this dnode in turn keeps a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to is that "data"
buffers have the potential to pin "metadata" in the arc (as a result of
these dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space gets filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will remain. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation won't occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to sidestep the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyway).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

  - 16K for the dnode object's block - [ 16384 bytes]
  - N * sizeof(dnode_t) -------------- [ N * 928 bytes]
  - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes]
  - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
  - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is independent of the size of the files, and depends
only on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a file size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                            File Sizes (S)
        |   512    |   1024   |   2048   |   4096   |   8192   |  16384   |
     ---+----------+----------+----------+----------+----------+----------+
      1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
      2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
   N  4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
      8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
     16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
     32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.
This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
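
The ratio table in the commit message above can be reproduced with a few lines of C. This is only a sketch: the structure sizes are the ones quoted in the message (they vary by build and platform), and the SK_ macros and sketch_ function names are hypothetical.

```c
#include <stdint.h>

/* Structure sizes quoted in the commit message above (bytes). */
#define	SK_DNODE_BLOCK		16384
#define	SK_SIZEOF_DNODE		928
#define	SK_SIZEOF_ARC_BUF	72
#define	SK_SIZEOF_ARC_BUF_HDR	264
#define	SK_SIZEOF_DBUF_IMPL	280

/* Metadata bytes pinned by n dnodes sharing one dnode object block. */
uint64_t
sketch_pinned_meta_bytes(uint64_t n)
{
	return (SK_DNODE_BLOCK + n * SK_SIZEOF_DNODE +
	    (n + 1) * (SK_SIZEOF_ARC_BUF + SK_SIZEOF_ARC_BUF_HDR +
	    SK_SIZEOF_DBUF_IMPL));
}

/* Ratio of pinned metadata to total (metadata + data) for file size s. */
double
sketch_pinned_meta_ratio(uint64_t n, uint64_t s)
{
	uint64_t meta = sketch_pinned_meta_bytes(n);

	return ((double)meta / (double)(meta + n * s));
}
```

Calling sketch_pinned_meta_ratio(n, s) for each N and S in the table gives the corresponding cell, and sketch_pinned_meta_bytes(n) matches the simplified form 17000 + N * 1544.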
	/* limit meta-data to 3/4 of the arc capacity */
	arc_meta_limit = (3 * arc_c_max) / 4;
	arc_meta_max = 0;

	/* Allow the tunable to override if it is reasonable */
	if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
		arc_meta_limit = zfs_arc_meta_limit;

	/* if kmem_flags are set, let's try to use less memory */
	if (kmem_debugging())
		arc_c = arc_c / 2;
	if (arc_c < arc_c_min)
		arc_c = arc_c_min;

	arc_anon = &ARC_anon;
	arc_mru = &ARC_mru;
	arc_mru_ghost = &ARC_mru_ghost;
	arc_mfu = &ARC_mfu;
	arc_mfu_ghost = &ARC_mfu_ghost;
	arc_l2c_only = &ARC_l2c_only;
	arc_size = 0;

	mutex_init(&arc_anon->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mru->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mru_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mfu->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mfu_ghost->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_l2c_only->arcs_mtx, NULL, MUTEX_DEFAULT, NULL);

	list_create(&arc_mru->arcs_list[ARC_BUFC_METADATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mru->arcs_list[ARC_BUFC_DATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mfu->arcs_list[ARC_BUFC_METADATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mfu->arcs_list[ARC_BUFC_DATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],
	    sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],
|
|
|
|
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
|
|
|
|
|
2013-10-03 04:11:19 +04:00
|
|
|
arc_anon->arcs_state = ARC_STATE_ANON;
|
|
|
|
arc_mru->arcs_state = ARC_STATE_MRU;
|
|
|
|
arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST;
|
|
|
|
arc_mfu->arcs_state = ARC_STATE_MFU;
|
|
|
|
arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST;
|
|
|
|
arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_init();
|
|
|
|
|
|
|
|
arc_thread_exit = 0;
|
2011-12-23 00:20:43 +04:00
|
|
|
list_create(&arc_prune_list, sizeof (arc_prune_t),
|
|
|
|
offsetof(arc_prune_t, p_node));
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_eviction_list = NULL;
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_init(&arc_eviction_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
bzero(&arc_eviction_hdr, sizeof (arc_buf_hdr_t));
|
|
|
|
|
|
|
|
arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
|
|
|
|
sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
|
|
|
|
|
|
|
|
if (arc_ksp != NULL) {
|
|
|
|
arc_ksp->ks_data = &arc_stats;
|
2012-01-31 01:28:40 +04:00
|
|
|
arc_ksp->ks_update = arc_kstat_update;
|
2008-11-20 23:01:55 +03:00
|
|
|
kstat_install(arc_ksp);
|
|
|
|
}
|
|
|
|
|
2012-03-14 01:29:16 +04:00
|
|
|
(void) thread_create(NULL, 0, arc_adapt_thread, NULL, 0, &p0,
|
2008-11-20 23:01:55 +03:00
|
|
|
TS_RUN, minclsyspri);
|
|
|
|
|
|
|
|
arc_dead = FALSE;
|
2008-12-03 23:09:06 +03:00
|
|
|
arc_warm = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
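One way to picture the dynamic tuning described in item 1 is a linear ramp
between a minimum and a maximum number of concurrent async writes, driven
by how much dirty data is outstanding. The sketch below is an assumption
about the shape of that logic, not a copy of vdev_queue.c; the function
name and every parameter are hypothetical stand-ins for the per-class
tunables named in this commit message.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: scale the async write limit linearly with the
 * amount of dirty data. Below the lower threshold the minimum applies,
 * above the upper threshold the maximum applies. Assumes
 * max_active >= min_active and max_dirty_pct >= min_dirty_pct.
 */
static uint32_t
async_write_max_active(uint64_t dirty, uint64_t dirty_max,
    uint32_t min_active, uint32_t max_active,
    uint32_t min_dirty_pct, uint32_t max_dirty_pct)
{
	uint64_t lo = dirty_max * min_dirty_pct / 100;
	uint64_t hi = dirty_max * max_dirty_pct / 100;

	if (dirty <= lo)
		return (min_active);
	if (dirty >= hi)
		return (max_active);

	/* lo < dirty < hi here, so (hi - lo) is never zero. */
	return (min_active + (uint32_t)((dirty - lo) *
	    (max_active - min_active) / (hi - lo)));
}
```

With little dirty data the device is left mostly free for sync i/o; as
dirty data accumulates the limit rises so that write throughput wins.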
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
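The 32-bit overflow mentioned in the porting notes can be illustrated
with a small sketch. It models a 32-bit unsigned long with uint32_t so
the behaviour is the same on any build host, and shows the percentage of
physical memory calculation done in 64 bits. Both helper names are
hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/*
 * Model storing a value in a 32-bit unsigned long: only the low
 * 32 bits survive, so 2^32 becomes 0.
 */
static uint32_t
store_in_32bit_ulong(uint64_t value)
{
	return ((uint32_t)value);
}

/*
 * Percentage of physical memory in bytes, computed in 64 bits.
 * physmem_pages and pagesize are illustrative arguments.
 */
static uint64_t
percent_of_physmem(uint64_t physmem_pages, uint64_t pagesize,
    uint64_t percent)
{
	return (physmem_pages * pagesize * percent / 100);
}
```

For example, a machine with 262144 pages of 4096 bytes (1 GiB) and a
25% limit yields 268435456 bytes, which fits comfortably in 32 bits,
whereas the constant 2^32 does not.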
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 07:01:20 +04:00
|
|
|
/*
|
|
|
|
* Calculate maximum amount of dirty data per pool.
|
|
|
|
*
|
|
|
|
* If it has been set by a module parameter, take that.
|
|
|
|
* Otherwise, use a percentage of physical memory defined by
|
|
|
|
* zfs_dirty_data_max_percent (default 10%) with a cap at
|
|
|
|
* zfs_dirty_data_max_max (default 25% of physical memory).
|
|
|
|
*/
|
|
|
|
if (zfs_dirty_data_max_max == 0)
|
|
|
|
zfs_dirty_data_max_max = physmem * PAGESIZE *
|
|
|
|
zfs_dirty_data_max_max_percent / 100;
|
|
|
|
|
|
|
|
if (zfs_dirty_data_max == 0) {
|
|
|
|
zfs_dirty_data_max = physmem * PAGESIZE *
|
|
|
|
zfs_dirty_data_max_percent / 100;
|
|
|
|
zfs_dirty_data_max = MIN(zfs_dirty_data_max,
|
|
|
|
zfs_dirty_data_max_max);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
arc_fini(void)
|
|
|
|
{
|
2011-12-23 00:20:43 +04:00
|
|
|
arc_prune_t *p;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&arc_reclaim_thr_lock);
|
2011-03-30 05:08:59 +04:00
|
|
|
#ifdef _KERNEL
|
|
|
|
spl_unregister_shrinker(&arc_shrinker);
|
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_thread_exit = 1;
|
|
|
|
while (arc_thread_exit != 0)
|
|
|
|
cv_wait(&arc_reclaim_thr_cv, &arc_reclaim_thr_lock);
|
|
|
|
mutex_exit(&arc_reclaim_thr_lock);
|
|
|
|
|
|
|
|
arc_flush(NULL);
|
|
|
|
|
|
|
|
arc_dead = TRUE;
|
|
|
|
|
|
|
|
if (arc_ksp != NULL) {
|
|
|
|
kstat_delete(arc_ksp);
|
|
|
|
arc_ksp = NULL;
|
|
|
|
}
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
mutex_enter(&arc_prune_mtx);
|
|
|
|
while ((p = list_head(&arc_prune_list)) != NULL) {
|
|
|
|
list_remove(&arc_prune_list, p);
|
|
|
|
refcount_remove(&p->p_refcnt, &arc_prune_list);
|
|
|
|
refcount_destroy(&p->p_refcnt);
|
|
|
|
kmem_free(p, sizeof (*p));
|
|
|
|
}
|
|
|
|
mutex_exit(&arc_prune_mtx);
|
|
|
|
|
|
|
|
list_destroy(&arc_prune_list);
|
|
|
|
mutex_destroy(&arc_prune_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_destroy(&arc_eviction_mtx);
|
|
|
|
mutex_destroy(&arc_reclaim_thr_lock);
|
|
|
|
cv_destroy(&arc_reclaim_thr_cv);
|
|
|
|
|
|
|
|
list_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
list_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);
|
|
|
|
list_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
list_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
list_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
list_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);
|
|
|
|
|
|
|
|
mutex_destroy(&arc_anon->arcs_mtx);
|
|
|
|
mutex_destroy(&arc_mru->arcs_mtx);
|
|
|
|
mutex_destroy(&arc_mru_ghost->arcs_mtx);
|
|
|
|
mutex_destroy(&arc_mfu->arcs_mtx);
|
|
|
|
mutex_destroy(&arc_mfu_ghost->arcs_mtx);
|
2009-01-16 00:59:39 +03:00
|
|
|
mutex_destroy(&arc_l2c_only->arcs_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
buf_fini();
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
ASSERT(arc_loaned_bytes == 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Level 2 ARC
|
|
|
|
*
|
|
|
|
* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
|
|
|
|
* It uses dedicated storage devices to hold cached data, which are populated
|
|
|
|
* using large infrequent writes. The main role of this cache is to boost
|
|
|
|
* the performance of random read workloads. The intended L2ARC devices
|
|
|
|
* include short-stroked disks, solid state disks, and other media with
|
|
|
|
* substantially faster read latency than disk.
|
|
|
|
*
|
|
|
|
*         +-----------------------+
*         |         ARC           |
*         +-----------------------+
*            |         ^     ^
*            |         |     |
*      l2arc_feed_thread()    arc_read()
*            |         |     |
*            |  l2arc read   |
*            V         |     |
*         +---------------+  |
*         |     L2ARC     |  |
*         +---------------+  |
*             |    ^         |
*    l2arc_write() |         |
*             |    |         |
*             V    |         |
*           +-------+     +-------+
*           | vdev  |     | vdev  |
*           | cache |     | cache |
*           +-------+     +-------+
*          +=========+     .-----.
*          : L2ARC   :    |-_____-|
*          : devices :    | Disks |
*          +=========+    `-_____-'
|
|
|
|
*
|
|
|
|
* Read requests are satisfied from the following sources, in order:
|
|
|
|
*
|
|
|
|
* 1) ARC
|
|
|
|
* 2) vdev cache of L2ARC devices
|
|
|
|
* 3) L2ARC devices
|
|
|
|
* 4) vdev cache of disks
|
|
|
|
* 5) disks
|
|
|
|
*
|
|
|
|
* Some L2ARC device types exhibit extremely slow write performance.
|
|
|
|
* To accommodate this, there are some significant differences between
|
|
|
|
* the L2ARC and traditional cache design:
|
|
|
|
*
|
|
|
|
* 1. There is no eviction path from the ARC to the L2ARC. Evictions from
|
|
|
|
* the ARC behave as usual, freeing buffers and placing headers on ghost
|
|
|
|
* lists. The ARC does not send buffers to the L2ARC during eviction as
|
|
|
|
* this would add inflated write latencies for all ARC memory pressure.
|
|
|
|
*
|
|
|
|
* 2. The L2ARC attempts to cache data from the ARC before it is evicted.
|
|
|
|
* It does this by periodically scanning buffers from the eviction-end of
|
|
|
|
* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
|
2013-08-02 00:02:10 +04:00
|
|
|
* not already there. It scans until a headroom of buffers is satisfied,
|
|
|
|
* which itself is a buffer for ARC eviction. If a compressible buffer is
|
|
|
|
* found during scanning and selected for writing to an L2ARC device, we
|
|
|
|
* temporarily boost scanning headroom during the next scan cycle to make
|
|
|
|
* sure we adapt to compression effects (which might significantly reduce
|
|
|
|
* the data volume we write to L2ARC). The thread that does this is
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_feed_thread(), illustrated below; example sizes are included to
|
|
|
|
* provide a better sense of ratio than this diagram:
|
|
|
|
*
|
|
|
|
*               head -->                        tail
*                +---------------------+----------+
*        ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
*                +---------------------+----------+   |   o L2ARC eligible
*        ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
*                +---------------------+----------+   |
*                      15.9 Gbytes     ^ 32 Mbytes    |
*                                   headroom          |
*                                            l2arc_feed_thread()
*                                                     |
*                         l2arc write hand <--[oooo]--'
*                                 |           8 Mbyte
*                                 |          write max
*                                 V
*                  +==============================+
*        L2ARC dev |####|#|###|###|    |####| ... |
*                  +==============================+
*                             32 Gbytes
|
|
|
|
*
|
|
|
|
* 3. If an ARC buffer is copied to the L2ARC but then hit instead of
|
|
|
|
* evicted, then the L2ARC has cached a buffer much sooner than it probably
|
|
|
|
* needed to, potentially wasting L2ARC device bandwidth and storage. It is
|
|
|
|
* safe to say that this is an uncommon case, since buffers at the end of
|
|
|
|
* the ARC lists have moved there due to inactivity.
|
|
|
|
*
|
|
|
|
* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
|
|
|
|
* then the L2ARC simply misses copying some buffers. This serves as a
|
|
|
|
* pressure valve to prevent heavy read workloads from both stalling the ARC
|
|
|
|
* with waits and clogging the L2ARC with writes. This also helps prevent
|
|
|
|
* the potential for the L2ARC to churn if it attempts to cache content too
|
|
|
|
* quickly, such as during backups of the entire pool.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 5. After system boot and before the ARC has filled main memory, there are
|
|
|
|
* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
|
|
|
|
* lists can remain mostly static. Instead of searching from the tail of these
|
|
|
|
* lists as pictured, the l2arc_feed_thread() will search from the list heads
|
|
|
|
* for eligible buffers, greatly increasing its chance of finding them.
|
|
|
|
*
|
|
|
|
* The L2ARC device write speed is also boosted during this time so that
|
|
|
|
* the L2ARC warms up faster. Since there have been no ARC evictions yet,
|
|
|
|
* there are no L2ARC reads, and no fear of degrading read performance
|
|
|
|
* through increased writes.
|
|
|
|
*
|
|
|
|
* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
|
2008-11-20 23:01:55 +03:00
|
|
|
* the vdev queue can aggregate them into larger and fewer writes. Each
|
|
|
|
* device is written to in a rotor fashion, sweeping writes through
|
|
|
|
* available space then repeating.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 7. The L2ARC does not store dirty content. It never needs to flush
|
2008-11-20 23:01:55 +03:00
|
|
|
* write buffers back to disk-based storage.
|
|
|
|
*
|
2008-12-03 23:09:06 +03:00
|
|
|
* 8. If an ARC buffer is written (and dirtied) which also exists in the
|
2008-11-20 23:01:55 +03:00
|
|
|
* L2ARC, the now stale L2ARC buffer is immediately dropped.
|
|
|
|
*
|
|
|
|
* The performance of the L2ARC can be tweaked by a number of tunables, which
|
|
|
|
* may be necessary for different workloads:
|
|
|
|
*
|
|
|
|
* l2arc_write_max max write bytes per interval
|
2008-12-03 23:09:06 +03:00
|
|
|
* l2arc_write_boost extra write bytes during device warmup
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_noprefetch skip caching prefetched buffers
|
2013-08-02 00:02:10 +04:00
|
|
|
* l2arc_nocompress skip compressing buffers
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_headroom number of max device writes to precache
|
2013-08-02 00:02:10 +04:00
|
|
|
* l2arc_headroom_boost when we find compressed buffers during ARC
|
|
|
|
* scanning, we multiply headroom by this
|
|
|
|
* percentage factor for the next scan cycle,
|
|
|
|
* since more compressed buffers are likely to
|
|
|
|
* be present
|
2008-11-20 23:01:55 +03:00
|
|
|
* l2arc_feed_secs seconds between L2ARC writing
|
|
|
|
*
|
|
|
|
* Tunables may be removed or added as future performance improvements are
|
|
|
|
* integrated, and also may become zpool properties.
|
2009-02-18 23:51:31 +03:00
|
|
|
*
|
|
|
|
* There are three key functions that control how the L2ARC warms up:
|
|
|
|
*
|
|
|
|
* l2arc_write_eligible() check if a buffer is eligible to cache
|
|
|
|
* l2arc_write_size() calculate how much to write
|
|
|
|
* l2arc_write_interval() calculate sleep delay between writes
|
|
|
|
*
|
|
|
|
* These three functions determine what to write, how much, and how quickly
|
|
|
|
* to send writes.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
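The relationship between l2arc_write_max, l2arc_headroom and
l2arc_headroom_boost described in the comment above can be sketched as
follows. This is a simplified illustration, not the code from
l2arc_write_buffers(); scan_headroom() and its parameters are
hypothetical, the multiplier of 4 is inferred from the 8 Mbyte and
32 Mbyte figures in the diagram, and the boost value of 200 is only an
example.

```c
#include <stdint.h>

/*
 * Bytes of ARC list to scan ahead of eviction.
 *
 * write_sz       - bytes the feed thread may write this cycle
 * headroom_mult  - number of max device writes to precache
 * boost_pct      - percentage factor applied after compressed buffers
 *                  were found in the previous scan
 * saw_compressed - nonzero if the previous scan found such buffers
 */
static uint64_t
scan_headroom(uint64_t write_sz, uint64_t headroom_mult,
    uint64_t boost_pct, int saw_compressed)
{
	uint64_t headroom = write_sz * headroom_mult;

	if (saw_compressed)
		headroom = headroom * boost_pct / 100;

	return (headroom);
}
```

With an 8 Mbyte write size and a multiplier of 4 the scan covers
32 Mbytes, matching the diagram; a boost of 200 percent would double
that to 64 Mbytes for the following cycle.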
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
static boolean_t
|
|
|
|
l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *ab)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* A buffer is *not* eligible for the L2ARC if it:
|
|
|
|
* 1. belongs to a different spa.
|
2010-05-29 00:45:14 +04:00
|
|
|
* 2. is already cached on the L2ARC.
|
|
|
|
* 3. has an I/O in progress (it may be an incomplete read).
|
|
|
|
* 4. is flagged not eligible (zfs property).
|
2009-02-18 23:51:31 +03:00
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
if (ab->b_spa != spa_guid || ab->b_l2hdr != NULL ||
|
2009-02-18 23:51:31 +03:00
|
|
|
HDR_IO_IN_PROGRESS(ab) || !HDR_L2CACHE(ab))
|
|
|
|
return (B_FALSE);
|
|
|
|
|
|
|
|
return (B_TRUE);
|
|
|
|
}
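The four exclusion rules in l2arc_write_eligible() can be exercised
outside the kernel with a simplified stand-in for arc_buf_hdr_t. The
struct and function below are hypothetical mock-ups written for this
illustration; the real header uses flag macros such as
HDR_IO_IN_PROGRESS() and HDR_L2CACHE() rather than plain booleans.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for arc_buf_hdr_t. */
typedef struct mock_hdr {
	uint64_t spa_guid;	/* pool the buffer belongs to */
	void *l2hdr;		/* non-NULL once cached on an L2ARC device */
	bool io_in_progress;	/* an I/O on this buffer is outstanding */
	bool l2cache;		/* flagged as eligible for the L2ARC */
} mock_hdr_t;

/* Mirrors the four rules: wrong pool, already cached, busy, not flagged. */
static bool
mock_write_eligible(uint64_t spa_guid, const mock_hdr_t *hdr)
{
	if (hdr->spa_guid != spa_guid)
		return (false);
	if (hdr->l2hdr != NULL)
		return (false);
	if (hdr->io_in_progress)
		return (false);
	if (!hdr->l2cache)
		return (false);

	return (true);
}
```

Writing the checks one per line makes each rule individually testable,
at the cost of being more verbose than the single compound condition
used in the function above.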
|
|
|
|
|
|
|
|
static uint64_t
|
2013-08-02 00:02:10 +04:00
|
|
|
l2arc_write_size(void)
|
2009-02-18 23:51:31 +03:00
|
|
|
{
|
|
|
|
uint64_t size;
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/*
|
|
|
|
* Make sure our globals have meaningful values in case the user
|
|
|
|
* altered them.
|
|
|
|
*/
|
|
|
|
size = l2arc_write_max;
|
|
|
|
if (size == 0) {
|
|
|
|
cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must "
|
|
|
|
"be greater than zero, resetting it to the default (%d)",
|
|
|
|
L2ARC_WRITE_SIZE);
|
|
|
|
size = l2arc_write_max = L2ARC_WRITE_SIZE;
|
|
|
|
}
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
if (arc_warm == B_FALSE)
|
2013-08-02 00:02:10 +04:00
|
|
|
size += l2arc_write_boost;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
return (size);
|
|
|
|
|
|
|
|
}
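A user-space sketch of the same sizing rule may help when reasoning
about tunable values. mock_write_size() and MOCK_WRITE_SIZE are
hypothetical; the 8 Mbyte default is an assumption taken from the
"8 Mbyte write max" label in the L2ARC diagram, and the warning message
printed by the kernel function is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed default, based on the 8 Mbyte "write max" in the diagram. */
#define	MOCK_WRITE_SIZE	(8ULL * 1024 * 1024)

/*
 * Return the number of bytes to write this cycle. A zero *write_max is
 * treated as invalid and reset to the default; while the ARC is still
 * cold (arc_warm == false) the boost is added on top.
 */
static uint64_t
mock_write_size(uint64_t *write_max, uint64_t write_boost, bool arc_warm)
{
	uint64_t size = *write_max;

	if (size == 0)
		size = *write_max = MOCK_WRITE_SIZE;

	if (!arc_warm)
		size += write_boost;

	return (size);
}
```

With both values at 8 Mbytes, a cold ARC is fed at up to 16 Mbytes per
cycle and drops back to 8 Mbytes once it is warm.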
|
|
|
|
|
|
|
|
static clock_t
|
|
|
|
l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
clock_t interval, next, now;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the ARC lists are busy, increase our write rate; if the
|
|
|
|
* lists are stale, idle back. This is achieved by checking
|
|
|
|
* how much we previously wrote - if it was more than half of
|
|
|
|
* what we wanted, schedule the next write much sooner.
|
|
|
|
*/
|
|
|
|
if (l2arc_feed_again && wrote > (wanted / 2))
|
|
|
|
interval = (hz * l2arc_feed_min_ms) / 1000;
|
|
|
|
else
|
|
|
|
interval = hz * l2arc_feed_secs;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
now = ddi_get_lbolt();
|
|
|
|
next = MAX(now, MIN(now + interval, began + interval));
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
return (next);
|
|
|
|
}
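The scheduling arithmetic in l2arc_write_interval() is easier to follow
with concrete tick values. The helpers below restate it in plain C;
their names, and the sample values of hz, feed_min_ms and feed_secs
used in the note that follows, are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define	MOCK_MAX(a, b)	((a) > (b) ? (a) : (b))
#define	MOCK_MIN(a, b)	((a) < (b) ? (a) : (b))

/* Pick the short interval when more than half of the target was written. */
static int64_t
mock_interval(int64_t hz, bool feed_again, uint64_t wanted, uint64_t wrote,
    int64_t feed_min_ms, int64_t feed_secs)
{
	if (feed_again && wrote > (wanted / 2))
		return ((hz * feed_min_ms) / 1000);

	return (hz * feed_secs);
}

/*
 * Next wakeup, in ticks: one interval after the write began, but never
 * in the past.
 */
static int64_t
mock_next_write(int64_t began, int64_t now, int64_t interval)
{
	return (MOCK_MAX(now, MOCK_MIN(now + interval, began + interval)));
}
```

Assuming hz = 100, a write that began at tick 1000 and finished at tick
1030 with a 100 tick interval is rescheduled for tick 1100. If the same
write only finished at tick 1250, the next one is due immediately at
tick 1250, because the interval has already elapsed.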
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
l2arc_hdr_stat_add(void)
|
|
|
|
{
|
Fix inaccurate arcstat_l2_hdr_size calculations
Based on the comments in arc.c we know that buffers can exist in both
the arc and the l2arc; in that case both an arc_buf_hdr_t and an
l2arc_buf_hdr_t will be allocated. However, the current logic only
accounts for the memory that l2arc_buf_hdr_t takes up when the buffer's
state transitions to or from arc_l2c_only. This causes obvious
deviations for illumos's zfs version, since its sizeof(l2arc_buf_hdr_t)
is larger than ZOL's. We can implement the calculation in the
following simple way:
1. When we allocate an l2arc_buf_hdr_t we add its memory consumption
immediately, and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
the buffer only stays in the l2arc we should also add the memory
its arc_buf_hdr_t consumes. So we only need to add HDR_SIZE to
arcstat_l2_hdr_size, since L2HDR_SIZE is already accounted for
in step 1; the same applies when transferring arc bufs out of the
l2arc-only state.
The test box has two 4-core Intel Xeon CPUs (2.13GHz) and 16GB of memory,
and the tests were set up in the following way:
1. Partitioned a SATA disk into two partitions with fdisk: one for zpool
storage and the other for use as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc-related statistics
in /proc/spl/kstat/zfs/arcstats. After the reads finished,
l2_hdr_size and l2_size looked like this:
l2_size 4 4403780608
l2_hdr_size 4 0
which was weird.
4. After applying this patch and rerunning steps 1-3, the results were
as follows:
l2_size 4 4306443264
l2_hdr_size 4 535600
These numbers make sense: on 64-bit systems
sizeof(l2arc_buf_hdr_t) is 16 bytes. Assuming all blocks cached by the
l2arc are 128KB, 535600/16*128*1024=4387635200. Since not all
blocks are equal-sized, the theoretical result will be a little
bigger, as we can see.
Since I'm familiar with the systemtap instrumentation tool, I used it to
examine what had happened. The script looked like this:
probe module("zfs").function("arc_change_state")
{
if ($new_state == $arc_l2c_only)
printf("change arc buf to arc_l2c_only\n")
}
It prints a message each time the function arc_change_state is
called with the argument new_state equal to arc_l2c_only. I
gathered the trace logs and found that none of the arc bufs entered
the arc_l2c_only state while the tests were running; this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2c_only when the pool or the filesystem was offlined.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
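The sanity check in step 4 of this commit message is plain integer
arithmetic and can be reproduced with a few lines of C. The helper name
is hypothetical; the 16-byte header size and 128KB block size are the
figures assumed in the message itself.

```c
#include <stdint.h>

/*
 * Estimate the amount of cached data from the header accounting:
 * (total header bytes / bytes per header) gives the number of cached
 * blocks, which is then multiplied by an assumed block size.
 */
static uint64_t
estimate_l2_size(uint64_t l2_hdr_size, uint64_t hdr_bytes,
    uint64_t block_bytes)
{
	return (l2_hdr_size / hdr_bytes * block_bytes);
}
```

With l2_hdr_size = 535600, 16 bytes per header and 128KB blocks, this
gives 33475 blocks and an estimate of 4387635200 bytes, close to the
reported l2_size of 4306443264.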
2013-06-22 16:35:18 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
l2arc_hdr_stat_remove(void)
|
|
|
|
{
|
2013-06-22 16:35:18 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_hdr_size, -HDR_SIZE);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Cycle through L2ARC devices. This is how L2ARC load balances.
|
2008-12-03 23:09:06 +03:00
|
|
|
* If a device is returned, this also returns holding the spa config lock.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
static l2arc_dev_t *
|
|
|
|
l2arc_dev_get_next(void)
|
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_dev_t *first, *next = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Lock out the removal of spas (spa_namespace_lock), then removal
|
|
|
|
* of cache devices (l2arc_dev_mtx). Once a device has been selected,
|
|
|
|
* both locks will be dropped and a spa config lock held instead.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
/* if there are no vdevs, there is nothing to do */
|
|
|
|
if (l2arc_ndev == 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
first = NULL;
|
|
|
|
next = l2arc_dev_last;
|
|
|
|
do {
|
|
|
|
/* loop around the list looking for a non-faulted vdev */
|
|
|
|
if (next == NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
next = list_head(l2arc_dev_list);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
|
|
|
next = list_next(l2arc_dev_list, next);
|
|
|
|
if (next == NULL)
|
|
|
|
next = list_head(l2arc_dev_list);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* if we have come back to the start, bail out */
|
|
|
|
if (first == NULL)
|
|
|
|
first = next;
|
|
|
|
else if (next == first)
|
|
|
|
break;
|
|
|
|
|
|
|
|
} while (vdev_is_dead(next->l2ad_vdev));
|
|
|
|
|
|
|
|
/* if we were unable to find any usable vdevs, return NULL */
|
|
|
|
if (vdev_is_dead(next->l2ad_vdev))
|
|
|
|
next = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
l2arc_dev_last = next;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
out:
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Grab the config lock to prevent the 'next' device from being
|
|
|
|
* removed while we are writing to it.
|
|
|
|
*/
|
|
|
|
if (next != NULL)
|
|
|
|
spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (next);
|
|
|
|
}
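/*
 * Illustrative sketch only, not part of the ARC code: the round-robin
 * walk in l2arc_dev_get_next() restated over a plain array so that it can
 * be compiled and run on its own. demo_dev_t, its 'dead' field (a stand-in
 * for vdev_is_dead(dev->l2ad_vdev)) and demo_dev_get_next() are
 * hypothetical names, and all locking is omitted.
 */

```c
#include <stdbool.h>

/* Hypothetical stand-in for l2arc_dev_t; only what the sketch needs. */
typedef struct demo_dev {
	bool dead;	/* models vdev_is_dead(dev->l2ad_vdev) */
} demo_dev_t;

/*
 * Return the index of the next live device after 'last' (pass -1 when
 * there is no previous device), or -1 if no device is usable.
 */
static int
demo_dev_get_next(const demo_dev_t *devs, int ndev, int last)
{
	int first = -1;
	int next = last;

	/* if there are no devices, there is nothing to do */
	if (ndev == 0)
		return (-1);

	do {
		/* advance, wrapping from the tail back to the head */
		next = (next + 1) % ndev;

		/* if we have come back to the start, bail out */
		if (first == -1)
			first = next;
		else if (next == first)
			break;
	} while (devs[next].dead);

	/* a full lap without finding a live device */
	if (devs[next].dead)
		return (-1);

	return (next);
}
```

/*
 * Feeding each result back in as 'last' cycles through the live devices
 * in order, which is the load-balancing behaviour described above.
 */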
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Free buffers that were tagged for destruction.
|
|
|
|
*/
|
|
|
|
static void
|
2010-08-26 20:52:41 +04:00
|
|
|
l2arc_do_free_on_write(void)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
|
|
|
list_t *buflist;
|
|
|
|
l2arc_data_free_t *df, *df_prev;
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_free_on_write_mtx);
|
|
|
|
buflist = l2arc_free_on_write;
|
|
|
|
|
|
|
|
for (df = list_tail(buflist); df; df = df_prev) {
|
|
|
|
df_prev = list_prev(buflist, df);
|
|
|
|
ASSERT(df->l2df_data != NULL);
|
|
|
|
ASSERT(df->l2df_func != NULL);
|
|
|
|
df->l2df_func(df->l2df_data, df->l2df_size);
|
|
|
|
list_remove(buflist, df);
|
|
|
|
kmem_free(df, sizeof (l2arc_data_free_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(&l2arc_free_on_write_mtx);
|
|
|
|
}
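/*
 * Illustrative sketch only: the deferred-free pattern used by
 * l2arc_do_free_on_write(), reduced to a singly linked list in user space.
 * demo_free_t, demo_defer_free() and demo_do_free_on_write() are
 * hypothetical names; the real code uses list_t, kmem_free() and holds
 * l2arc_free_on_write_mtx while draining.
 */

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical analogue of l2arc_data_free_t. */
typedef struct demo_free {
	void *data;			/* buffer to release later */
	size_t size;			/* size passed to the free function */
	void (*func)(void *, size_t);	/* how to release 'data' */
	struct demo_free *next;
} demo_free_t;

static size_t demo_bytes_freed;	/* running total, updated on each free */

static void
demo_free_data(void *data, size_t size)
{
	free(data);
	demo_bytes_freed += size;
}

/* Queue a freshly allocated buffer for deferred release; 0 on success. */
static int
demo_defer_free(demo_free_t **list, size_t size)
{
	demo_free_t *df = malloc(sizeof (*df));

	if (df == NULL)
		return (-1);
	df->data = malloc(size);
	if (df->data == NULL) {
		free(df);
		return (-1);
	}
	df->size = size;
	df->func = demo_free_data;
	df->next = *list;
	*list = df;
	return (0);
}

/* Drain the list; returns the number of entries released. */
static int
demo_do_free_on_write(demo_free_t **list)
{
	demo_free_t *df, *df_next;
	int n = 0;

	for (df = *list; df != NULL; df = df_next) {
		df_next = df->next;	/* save the link before freeing df */
		df->func(df->data, df->size);
		free(df);
		n++;
	}
	*list = NULL;
	return (n);
}
```

/*
 * As in the original loop, the neighbouring entry is fetched before the
 * current one is released, so the walk never touches freed memory.
 */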
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* A write to a cache device has completed. Update all headers to allow
|
|
|
|
* reads from these buffers to begin.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_write_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
l2arc_write_callback_t *cb;
|
|
|
|
l2arc_dev_t *dev;
|
|
|
|
list_t *buflist;
|
|
|
|
arc_buf_hdr_t *head, *ab, *ab_prev;
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_buf_hdr_t *abl2;
|
2008-11-20 23:01:55 +03:00
|
|
|
kmutex_t *hash_lock;
|
2014-05-22 13:11:57 +04:00
|
|
|
int64_t bytes_dropped = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = zio->io_private;
|
|
|
|
ASSERT(cb != NULL);
|
|
|
|
dev = cb->l2wcb_dev;
|
|
|
|
ASSERT(dev != NULL);
|
|
|
|
head = cb->l2wcb_head;
|
|
|
|
ASSERT(head != NULL);
|
|
|
|
buflist = dev->l2ad_buflist;
|
|
|
|
ASSERT(buflist != NULL);
|
|
|
|
DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
|
|
|
|
l2arc_write_callback_t *, cb);
|
|
|
|
|
|
|
|
if (zio->io_error != 0)
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_error);
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_buflist_mtx);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All writes completed, or an error was hit.
|
|
|
|
*/
|
|
|
|
for (ab = list_prev(buflist, head); ab; ab = ab_prev) {
|
|
|
|
ab_prev = list_prev(buflist, ab);
|
2013-10-15 02:29:45 +04:00
|
|
|
abl2 = ab->b_l2hdr;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Release the temporary compressed buffer as soon as possible.
|
|
|
|
*/
|
|
|
|
if (abl2->b_compress != ZIO_COMPRESS_OFF)
|
|
|
|
l2arc_release_cdata_buf(ab);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
hash_lock = HDR_LOCK(ab);
|
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
|
|
|
/*
|
|
|
|
* This buffer misses out. It may be in a stage
|
|
|
|
* of eviction. Its ARC_L2_WRITING flag will be
|
|
|
|
* left set, denying reads to this buffer.
|
|
|
|
*/
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_hdr_miss);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (zio->io_error != 0) {
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Error - drop L2ARC entry.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
list_remove(buflist, ab);
|
2013-08-02 00:02:10 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
|
2014-05-22 13:11:57 +04:00
|
|
|
bytes_dropped += abl2->b_asize;
|
2008-11-20 23:01:55 +03:00
|
|
|
ab->b_l2hdr = NULL;
|
2013-11-20 01:34:46 +04:00
|
|
|
kmem_cache_free(l2arc_hdr_cache, abl2);
|
Fix inaccurate arcstat_l2_hdr_size calculations
Based on the comments in arc.c we know that buffers can exist both
in arc and l2arc; in that case both an arc_buf_hdr_t and an
l2arc_buf_hdr_t are allocated. However, the current logic only
accounts for the memory that l2arc_buf_hdr takes up when the buffer's
state transitions to or from arc_l2c_only. This causes obvious
deviations for illumos's zfs version, since its sizeof(l2arc_buf_hdr)
is larger than ZOL's. We can implement the calculation in the
following simple way:
1. When we allocate an l2arc_buf_hdr_t, we add its memory consumption
immediately and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
the buffer stays only in l2arc we should also add the memory
its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to
arcstat_l2_hdr_size, since L2HDR_SIZE is already accounted for
in step 1; the same applies when transferring arc bufs out of the
l2arc-only state.
The test box has two 4-core Intel Xeon CPUs (2.13GHz) with 16GB of memory,
and the tests were set up in the following way:
1. Partitioned a SATA disk with fdisk into two partitions: one for zpool
storage and the other used as the cache device.
2. Generated files occupying 14GB altogether in the zpool
prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc-related statistics
in /proc/spl/kstat/zfs/arcstats. After the reads finished,
l2_hdr_size and l2_size looked like this:
l2_size 4 4403780608
l2_hdr_size 4 0
which was unexpected.
4. After applying this patch and rerunning steps 1-3, the results were
as follows:
l2_size 4 4306443264
l2_hdr_size 4 535600
These numbers make sense: on 64-bit systems
sizeof(l2arc_buf_hdr_t) is 16 bytes. Assume all blocks cached by
l2arc are 128KB, so 535600/16*128*1024=4387635200; since not all
blocks are equal-sized, the theoretical result is a little
bigger, as we can see.
Since I'm familiar with the systemtap instrumentation tool, I used it to
examine what had happened. The script looked like this:
probe module("zfs").function("arc_change_state")
{
if ($new_state == $arc_l2_only)
printf("change arc buf to arc_l2_only\n")
}
It prints a message each time the function
arc_change_state is called with new_state equal to arc_l2_only. I
gathered the trace logs and found that none of the arc bufs
entered the arc_l2_only state while the tests were running; this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2_only when the pool or the filesystem was offlined.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-06-22 16:35:18 +04:00
|
|
|
arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
|
2008-12-03 23:09:06 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Allow ARC to begin reads to this L2ARC entry.
|
|
|
|
*/
|
|
|
|
ab->b_flags &= ~ARC_L2_WRITING;
|
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
atomic_inc_64(&l2arc_writes_done);
|
|
|
|
list_remove(buflist, head);
|
|
|
|
kmem_cache_free(hdr_cache, head);
|
|
|
|
mutex_exit(&l2arc_buflist_mtx);
|
|
|
|
|
2014-05-22 13:11:57 +04:00
|
|
|
vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_do_free_on_write();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
kmem_free(cb, sizeof (l2arc_write_callback_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A read to a cache device completed. Validate buffer contents before
|
|
|
|
* handing over to the regular ARC routines.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_read_done(zio_t *zio)
|
|
|
|
{
|
|
|
|
l2arc_read_callback_t *cb;
|
|
|
|
arc_buf_hdr_t *hdr;
|
|
|
|
arc_buf_t *buf;
|
|
|
|
kmutex_t *hash_lock;
|
2008-12-03 23:09:06 +03:00
|
|
|
int equal;
|
|
|
|
|
|
|
|
ASSERT(zio->io_vd != NULL);
|
|
|
|
ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
|
|
|
|
|
|
|
|
spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
cb = zio->io_private;
|
|
|
|
ASSERT(cb != NULL);
|
|
|
|
buf = cb->l2rcb_buf;
|
|
|
|
ASSERT(buf != NULL);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
hash_lock = HDR_LOCK(buf->b_hdr);
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(hash_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
hdr = buf->b_hdr;
|
|
|
|
ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/*
|
|
|
|
* If the buffer was compressed, decompress it first.
|
|
|
|
*/
|
|
|
|
if (cb->l2rcb_compress != ZIO_COMPRESS_OFF)
|
|
|
|
l2arc_decompress_zio(zio, hdr, cb->l2rcb_compress);
|
|
|
|
ASSERT(zio->io_data != NULL);
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Check this survived the L2ARC journey.
|
|
|
|
*/
|
|
|
|
equal = arc_cksum_equal(buf);
|
|
|
|
if (equal && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
zio->io_private = buf;
|
2008-12-03 23:09:06 +03:00
|
|
|
zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
|
|
|
|
zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
|
2008-11-20 23:01:55 +03:00
|
|
|
arc_read_done(zio);
|
|
|
|
} else {
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
/*
|
|
|
|
* Buffer didn't survive caching. Increment stats and
|
|
|
|
* reissue to the original storage device.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (zio->io_error != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_io_error);
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2013-03-08 22:41:28 +04:00
|
|
|
zio->io_error = SET_ERROR(EIO);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
if (!equal)
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_cksum_bad);
|
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* If there's no waiter, issue an async i/o to the primary
|
|
|
|
* storage now. If there *is* a waiter, the caller must
|
|
|
|
* issue the i/o in a context where it's OK to block.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
if (zio->io_waiter == NULL) {
|
|
|
|
zio_t *pio = zio_unique_parent(zio);
|
|
|
|
|
|
|
|
ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);
|
|
|
|
|
|
|
|
zio_nowait(zio_read(pio, cb->l2rcb_spa, &cb->l2rcb_bp,
|
2008-12-03 23:09:06 +03:00
|
|
|
buf->b_data, zio->io_size, arc_read_done, buf,
|
|
|
|
zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb));
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
kmem_free(cb, sizeof (l2arc_read_callback_t));
|
|
|
|
}
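/*
 * Illustrative sketch only: the three-way decision l2arc_read_done() makes
 * once a cache-device read returns, written as a pure function. The enum
 * and function names are hypothetical; the real code additionally handles
 * locking, decompression and statistics.
 */

```c
#include <stdbool.h>

typedef enum demo_read_action {
	DEMO_HAND_TO_ARC,	/* buffer is good: call arc_read_done() */
	DEMO_REISSUE_ASYNC,	/* bad and no waiter: async read from pool */
	DEMO_LEAVE_TO_CALLER	/* bad with a waiter: the caller reissues */
} demo_read_action_t;

/*
 * cksum_equal models arc_cksum_equal(buf), io_error models zio->io_error,
 * l2_evicted models HDR_L2_EVICTED(hdr), and has_waiter models
 * zio->io_waiter != NULL.
 */
static demo_read_action_t
demo_l2arc_read_action(bool cksum_equal, int io_error, bool l2_evicted,
    bool has_waiter)
{
	/* the buffer survived the L2ARC journey only if all three hold */
	if (cksum_equal && io_error == 0 && !l2_evicted)
		return (DEMO_HAND_TO_ARC);

	/* otherwise fall back to the primary storage */
	return (has_waiter ? DEMO_LEAVE_TO_CALLER : DEMO_REISSUE_ASYNC);
}
```

/*
 * Note that a read racing with l2arc_evict() is treated like a checksum
 * failure: the data is discarded and fetched again from the pool.
 */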
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is the list priority from which the L2ARC will search for pages to
|
|
|
|
* cache. This is used within loops (0..3) to cycle through lists in the
|
|
|
|
* desired order. This order can have a significant effect on cache
|
|
|
|
* performance.
|
|
|
|
*
|
|
|
|
* Currently the metadata lists are hit first, MFU then MRU, followed by
|
|
|
|
* the data lists. This function returns a locked list, and also returns
|
|
|
|
* the lock pointer.
|
|
|
|
*/
|
|
|
|
static list_t *
|
|
|
|
l2arc_list_locked(int list_num, kmutex_t **lock)
|
|
|
|
{
|
2010-08-26 20:58:04 +04:00
|
|
|
list_t *list = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(list_num >= 0 && list_num <= 3);
|
|
|
|
|
|
|
|
switch (list_num) {
|
|
|
|
case 0:
|
|
|
|
list = &arc_mfu->arcs_list[ARC_BUFC_METADATA];
|
|
|
|
*lock = &arc_mfu->arcs_mtx;
|
|
|
|
break;
|
|
|
|
case 1:
|
|
|
|
list = &arc_mru->arcs_list[ARC_BUFC_METADATA];
|
|
|
|
*lock = &arc_mru->arcs_mtx;
|
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
list = &arc_mfu->arcs_list[ARC_BUFC_DATA];
|
|
|
|
*lock = &arc_mfu->arcs_mtx;
|
|
|
|
break;
|
|
|
|
case 3:
|
|
|
|
list = &arc_mru->arcs_list[ARC_BUFC_DATA];
|
|
|
|
*lock = &arc_mru->arcs_mtx;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(!(MUTEX_HELD(*lock)));
|
|
|
|
mutex_enter(*lock);
|
|
|
|
return (list);
|
|
|
|
}
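/*
 * Illustrative sketch only: the list priority encoded by the switch in
 * l2arc_list_locked(), expressed as a lookup table. The demo_* types are
 * hypothetical stand-ins for the arc_mfu/arc_mru states and the
 * ARC_BUFC_METADATA/ARC_BUFC_DATA types, and no lock is taken.
 */

```c
typedef enum { DEMO_MFU, DEMO_MRU } demo_state_t;
typedef enum { DEMO_METADATA, DEMO_DATA } demo_type_t;

typedef struct demo_list_id {
	demo_state_t state;
	demo_type_t type;
} demo_list_id_t;

/* Map a scan position (0..3) to the list visited at that position. */
static demo_list_id_t
demo_l2arc_list(int list_num)
{
	static const demo_list_id_t order[4] = {
		{ DEMO_MFU, DEMO_METADATA },	/* 0: metadata, MFU first */
		{ DEMO_MRU, DEMO_METADATA },	/* 1: metadata, then MRU */
		{ DEMO_MFU, DEMO_DATA },	/* 2: data, MFU */
		{ DEMO_MRU, DEMO_DATA },	/* 3: data, MRU */
	};

	/*
	 * The mask keeps the index in bounds for this sketch; the
	 * original asserts that list_num is within 0..3 instead.
	 */
	return (order[list_num & 3]);
}
```

/*
 * A table makes the ordering easy to change in one place, which matters
 * because the order can have a significant effect on cache performance.
 */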
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Evict buffers from the device write hand to the distance specified in
|
|
|
|
* bytes. This distance may span populated buffers, or it may span nothing.
|
|
|
|
* This is clearing a region on the L2ARC device ready for writing.
|
|
|
|
* If the 'all' boolean is set, every buffer is evicted.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
|
|
|
|
{
|
|
|
|
list_t *buflist;
|
|
|
|
l2arc_buf_hdr_t *abl2;
|
|
|
|
arc_buf_hdr_t *ab, *ab_prev;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
uint64_t taddr;
|
2014-05-22 13:11:57 +04:00
|
|
|
int64_t bytes_evicted = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
buflist = dev->l2ad_buflist;
|
|
|
|
|
|
|
|
if (buflist == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (!all && dev->l2ad_first) {
|
|
|
|
/*
|
|
|
|
* This is the first sweep through the device. There is
|
|
|
|
* nothing to evict.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* When nearing the end of the device, evict to the end
|
|
|
|
* before the device write hand jumps to the start.
|
|
|
|
*/
|
|
|
|
taddr = dev->l2ad_end;
|
|
|
|
} else {
|
|
|
|
taddr = dev->l2ad_hand + distance;
|
|
|
|
}
|
|
|
|
DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
|
|
|
|
uint64_t, taddr, boolean_t, all);
|
|
|
|
|
|
|
|
top:
|
|
|
|
mutex_enter(&l2arc_buflist_mtx);
|
|
|
|
for (ab = list_tail(buflist); ab; ab = ab_prev) {
|
|
|
|
ab_prev = list_prev(buflist, ab);
|
|
|
|
|
|
|
|
hash_lock = HDR_LOCK(ab);
|
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
|
|
|
/*
|
|
|
|
* Missed the hash lock. Retry.
|
|
|
|
*/
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
|
|
|
|
mutex_exit(&l2arc_buflist_mtx);
|
|
|
|
mutex_enter(hash_lock);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
goto top;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (HDR_L2_WRITE_HEAD(ab)) {
|
|
|
|
/*
|
|
|
|
* We hit a write head node. Leave it for
|
|
|
|
* l2arc_write_done().
|
|
|
|
*/
|
|
|
|
list_remove(buflist, ab);
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!all && ab->b_l2hdr != NULL &&
|
|
|
|
(ab->b_l2hdr->b_daddr > taddr ||
|
|
|
|
ab->b_l2hdr->b_daddr < dev->l2ad_hand)) {
|
|
|
|
/*
|
|
|
|
* We've evicted to the target address,
|
|
|
|
* or the end of the device.
|
|
|
|
*/
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (HDR_FREE_IN_PROGRESS(ab)) {
|
|
|
|
/*
|
|
|
|
* Already on the path to destruction.
|
|
|
|
*/
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ab->b_state == arc_l2c_only) {
|
|
|
|
ASSERT(!HDR_L2_READING(ab));
|
|
|
|
/*
|
|
|
|
* This doesn't exist in the ARC. Destroy.
|
|
|
|
* arc_hdr_destroy() will call list_remove()
|
|
|
|
* and decrement arcstat_l2_size.
|
|
|
|
*/
|
|
|
|
arc_change_state(arc_anon, ab, hash_lock);
|
|
|
|
arc_hdr_destroy(ab);
|
|
|
|
} else {
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Invalidate issued or about to be issued
|
|
|
|
* reads, since we may be about to write
|
|
|
|
* over this location.
|
|
|
|
*/
|
|
|
|
if (HDR_L2_READING(ab)) {
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_evict_reading);
|
|
|
|
ab->b_flags |= ARC_L2_EVICTED;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Tell ARC this no longer exists in L2ARC.
|
|
|
|
*/
|
|
|
|
if (ab->b_l2hdr != NULL) {
|
|
|
|
abl2 = ab->b_l2hdr;
|
2013-08-02 00:02:10 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_asize, -abl2->b_asize);
|
2014-05-22 13:11:57 +04:00
|
|
|
bytes_evicted += abl2->b_asize;
|
2008-11-20 23:01:55 +03:00
|
|
|
ab->b_l2hdr = NULL;
|
2013-11-20 01:34:46 +04:00
|
|
|
kmem_cache_free(l2arc_hdr_cache, abl2);
|
2013-06-22 16:35:18 +04:00
|
|
|
arc_space_return(L2HDR_SIZE, ARC_SPACE_L2HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
|
|
|
|
}
|
|
|
|
list_remove(buflist, ab);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This may have been leftover after a
|
|
|
|
* failed write.
|
|
|
|
*/
|
|
|
|
ab->b_flags &= ~ARC_L2_WRITING;
|
|
|
|
}
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_buflist_mtx);
|
|
|
|
|
2014-05-22 13:11:57 +04:00
|
|
|
vdev_space_update(dev->l2ad_vdev, -bytes_evicted, 0, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
dev->l2ad_evict = taddr;
|
|
|
|
}
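/*
 * Illustrative sketch only: how l2arc_evict() picks its target address and
 * decides that a buffer lies outside the region being cleared. The
 * function names are hypothetical. As in the original expression, the
 * sketch assumes end >= 2 * distance so the unsigned subtraction does not
 * wrap.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Address up to which buffers are evicted ahead of the write hand. */
static uint64_t
demo_evict_target(uint64_t hand, uint64_t end, uint64_t distance)
{
	/*
	 * When nearing the end of the device, evict to the end before
	 * the write hand jumps back to the start.
	 */
	if (hand >= end - (2 * distance))
		return (end);

	return (hand + distance);
}

/* True if a buffer at 'daddr' is outside the window [hand, taddr]. */
static bool
demo_outside_window(uint64_t daddr, uint64_t hand, uint64_t taddr)
{
	return (daddr > taddr || daddr < hand);
}
```

/*
 * With hand = 700, end = 1000 and distance = 100 the target is 800; once
 * the hand reaches 800 the target snaps to the end of the device.
 */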
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find and write ARC buffers to the L2ARC device.
|
|
|
|
*
|
|
|
|
* An ARC_L2_WRITING flag is set so that the L2ARC buffers are not valid
|
|
|
|
* for reading until they have completed writing.
|
2013-08-02 00:02:10 +04:00
|
|
|
* The headroom_boost is an in-out parameter used to maintain headroom boost
|
|
|
|
* state between calls to this function.
|
|
|
|
*
|
|
|
|
* Returns the number of bytes actually written (which may be smaller than
|
|
|
|
* the delta by which the device hand has changed due to alignment).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
static uint64_t
|
2013-08-02 00:02:10 +04:00
|
|
|
l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz,
|
|
|
|
boolean_t *headroom_boost)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
arc_buf_hdr_t *ab, *ab_prev, *head;
|
|
|
|
list_t *list;
|
2013-08-02 00:02:10 +04:00
|
|
|
uint64_t write_asize, write_psize, write_sz, headroom,
|
|
|
|
buf_compress_minsz;
|
2008-11-20 23:01:55 +03:00
|
|
|
void *buf_data;
|
2013-08-02 00:02:10 +04:00
|
|
|
kmutex_t *list_lock = NULL;
|
|
|
|
boolean_t full;
|
2008-11-20 23:01:55 +03:00
|
|
|
l2arc_write_callback_t *cb;
|
|
|
|
zio_t *pio, *wzio;
|
2011-11-12 02:07:54 +04:00
|
|
|
uint64_t guid = spa_load_guid(spa);
|
2010-08-26 20:52:39 +04:00
|
|
|
int try;
|
2013-08-02 00:02:10 +04:00
|
|
|
const boolean_t do_headroom_boost = *headroom_boost;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(dev->l2ad_vdev != NULL);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/* Lower the flag now, we might want to raise it again later. */
|
|
|
|
*headroom_boost = B_FALSE;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
pio = NULL;
|
2013-08-02 00:02:10 +04:00
|
|
|
write_sz = write_asize = write_psize = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
full = B_FALSE;
|
|
|
|
head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
|
|
|
|
head->b_flags |= ARC_L2_WRITE_HEAD;
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
/*
|
|
|
|
* We will want to try to compress buffers that are at least 2x the
|
|
|
|
* device sector size.
|
|
|
|
*/
|
|
|
|
buf_compress_minsz = 2 << dev->l2ad_vdev->vdev_ashift;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Copy buffers for L2ARC writing.
|
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_buflist_mtx);
|
2010-08-26 20:52:39 +04:00
|
|
|
for (try = 0; try <= 3; try++) {
|
2013-08-02 00:02:10 +04:00
|
|
|
uint64_t passed_sz = 0;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
list = l2arc_list_locked(try, &list_lock);
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* L2ARC fast warmup.
|
|
|
|
*
|
|
|
|
* Until the ARC is warm and starts to evict, read from the
|
|
|
|
* head of the ARC lists rather than the tail.
|
|
|
|
*/
|
|
|
|
if (arc_warm == B_FALSE)
|
|
|
|
ab = list_head(list);
|
|
|
|
else
|
|
|
|
ab = list_tail(list);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
headroom = target_sz * l2arc_headroom;
|
|
|
|
if (do_headroom_boost)
|
|
|
|
headroom = (headroom * l2arc_headroom_boost) / 100;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
for (; ab; ab = ab_prev) {
|
2013-08-02 00:02:10 +04:00
|
|
|
l2arc_buf_hdr_t *l2hdr;
|
|
|
|
kmutex_t *hash_lock;
|
|
|
|
uint64_t buf_sz;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (arc_warm == B_FALSE)
|
|
|
|
ab_prev = list_next(list, ab);
|
|
|
|
else
|
|
|
|
ab_prev = list_prev(list, ab);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
hash_lock = HDR_LOCK(ab);
|
2013-08-02 00:02:10 +04:00
|
|
|
if (!mutex_tryenter(hash_lock)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Skip this buffer rather than waiting.
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
passed_sz += ab->b_size;
|
|
|
|
if (passed_sz > headroom) {
|
|
|
|
/*
|
|
|
|
* Searched too far.
|
|
|
|
*/
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
if (!l2arc_write_eligible(guid, ab)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(hash_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((write_sz + ab->b_size) > target_sz) {
|
|
|
|
full = B_TRUE;
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (pio == NULL) {
|
|
|
|
/*
|
|
|
|
* Insert a dummy header on the buflist so
|
|
|
|
* l2arc_write_done() can find where the
|
|
|
|
* write buffers begin without searching.
|
|
|
|
*/
|
|
|
|
list_insert_head(dev->l2ad_buflist, head);
|
|
|
|
|
2012-04-10 21:55:17 +04:00
|
|
|
cb = kmem_alloc(sizeof (l2arc_write_callback_t),
|
2013-11-01 23:26:11 +04:00
|
|
|
KM_PUSHPAGE);
|
2008-11-20 23:01:55 +03:00
|
|
|
cb->l2wcb_dev = dev;
|
|
|
|
cb->l2wcb_head = head;
|
|
|
|
pio = zio_root(spa, l2arc_write_done, cb,
|
|
|
|
ZIO_FLAG_CANFAIL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create and add a new L2ARC header.
|
|
|
|
*/
|
2013-11-20 01:34:46 +04:00
|
|
|
l2hdr = kmem_cache_alloc(l2arc_hdr_cache, KM_PUSHPAGE);
|
2013-08-02 00:02:10 +04:00
|
|
|
l2hdr->b_dev = dev;
|
2013-11-20 01:34:46 +04:00
|
|
|
l2hdr->b_daddr = 0;
|
2013-06-22 16:35:18 +04:00
|
|
|
arc_space_consume(L2HDR_SIZE, ARC_SPACE_L2HDRS);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ab->b_flags |= ARC_L2_WRITING;
|
2013-08-02 00:02:10 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Temporarily stash the data buffer in b_tmp_cdata.
|
|
|
|
* The subsequent write step will pick it up from
|
|
|
|
* there. This is because we can't access ab->b_buf
|
|
|
|
* without holding the hash_lock, which we in turn
|
|
|
|
* can't access without holding the ARC list locks
|
|
|
|
* (which we want to avoid during compression/writing)
|
|
|
|
*/
|
|
|
|
l2hdr->b_compress = ZIO_COMPRESS_OFF;
|
|
|
|
l2hdr->b_asize = ab->b_size;
|
|
|
|
l2hdr->b_tmp_cdata = ab->b_buf->b_data;
|
2013-10-03 04:11:19 +04:00
|
|
|
l2hdr->b_hits = 0;
|
2013-08-02 00:02:10 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
buf_sz = ab->b_size;
|
2013-08-02 00:02:10 +04:00
|
|
|
ab->b_l2hdr = l2hdr;
|
|
|
|
|
|
|
|
list_insert_head(dev->l2ad_buflist, ab);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute and store the buffer cksum before
|
|
|
|
* writing. On debug builds the cksum is verified first.
|
|
|
|
*/
|
|
|
|
arc_cksum_verify(ab->b_buf);
|
|
|
|
arc_cksum_compute(ab->b_buf, B_TRUE);
|
|
|
|
|
|
|
|
mutex_exit(hash_lock);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
write_sz += buf_sz;
|
|
|
|
}
|
|
|
|
|
|
|
|
mutex_exit(list_lock);
|
|
|
|
|
|
|
|
if (full == B_TRUE)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* No buffers selected for writing? */
|
|
|
|
if (pio == NULL) {
|
|
|
|
ASSERT0(write_sz);
|
|
|
|
mutex_exit(&l2arc_buflist_mtx);
|
|
|
|
kmem_cache_free(hdr_cache, head);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Now start writing the buffers. We start at the write head
|
|
|
|
* and work backwards, retracing the course of the buffer selector
|
|
|
|
* loop above.
|
|
|
|
*/
|
|
|
|
for (ab = list_prev(dev->l2ad_buflist, head); ab;
|
|
|
|
ab = list_prev(dev->l2ad_buflist, ab)) {
|
|
|
|
l2arc_buf_hdr_t *l2hdr;
|
|
|
|
uint64_t buf_sz;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We shouldn't need to lock the buffer here, since we flagged
|
|
|
|
* it as ARC_L2_WRITING in the previous step, but we must take
|
|
|
|
* care to only access its L2 cache parameters. In particular,
|
|
|
|
* ab->b_buf may be invalid by now due to ARC eviction.
|
|
|
|
*/
|
|
|
|
l2hdr = ab->b_l2hdr;
|
|
|
|
l2hdr->b_daddr = dev->l2ad_hand;
|
|
|
|
|
|
|
|
if (!l2arc_nocompress && (ab->b_flags & ARC_L2COMPRESS) &&
|
|
|
|
l2hdr->b_asize >= buf_compress_minsz) {
|
|
|
|
if (l2arc_compress_buf(l2hdr)) {
|
|
|
|
/*
|
|
|
|
* If compression succeeded, enable headroom
|
|
|
|
* boost on the next scan cycle.
|
|
|
|
*/
|
|
|
|
*headroom_boost = B_TRUE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pick up the buffer data we had previously stashed away
|
|
|
|
* (and now potentially also compressed).
|
|
|
|
*/
|
|
|
|
buf_data = l2hdr->b_tmp_cdata;
|
|
|
|
buf_sz = l2hdr->b_asize;
|
|
|
|
|
|
|
|
/* Compression may have squashed the buffer to zero length. */
|
|
|
|
if (buf_sz != 0) {
|
|
|
|
uint64_t buf_p_sz;
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
wzio = zio_write_phys(pio, dev->l2ad_vdev,
|
|
|
|
dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
|
|
|
|
NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE,
|
|
|
|
ZIO_FLAG_CANFAIL, B_FALSE);
|
|
|
|
|
|
|
|
DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
|
|
|
|
zio_t *, wzio);
|
|
|
|
(void) zio_nowait(wzio);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
write_asize += buf_sz;
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* Keep the clock hand suitably device-aligned.
|
|
|
|
*/
|
2013-08-02 00:02:10 +04:00
|
|
|
buf_p_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
|
|
|
|
write_psize += buf_p_sz;
|
|
|
|
dev->l2ad_hand += buf_p_sz;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
mutex_exit(&l2arc_buflist_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
ASSERT3U(write_asize, <=, target_sz);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_writes_sent);
|
2013-08-02 00:02:10 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_write_bytes, write_asize);
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_INCR(arcstat_l2_size, write_sz);
|
2013-08-02 00:02:10 +04:00
|
|
|
ARCSTAT_INCR(arcstat_l2_asize, write_asize);
|
2014-05-22 13:11:57 +04:00
|
|
|
vdev_space_update(dev->l2ad_vdev, write_asize, 0, 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Bump device hand to the device start if it is approaching the end.
|
|
|
|
* l2arc_evict() will already have evicted ahead for this case.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
|
2008-11-20 23:01:55 +03:00
|
|
|
dev->l2ad_hand = dev->l2ad_start;
|
|
|
|
dev->l2ad_evict = dev->l2ad_start;
|
|
|
|
dev->l2ad_first = B_FALSE;
|
|
|
|
}
|
|
|
|
|
2009-02-18 23:51:31 +03:00
|
|
|
dev->l2ad_writing = B_TRUE;
|
2008-11-20 23:01:55 +03:00
|
|
|
(void) zio_wait(pio);
|
2009-02-18 23:51:31 +03:00
|
|
|
dev->l2ad_writing = B_FALSE;
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
return (write_asize);
|
|
|
|
}
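/*
 * Illustrative sketch only: the size arithmetic used by
 * l2arc_write_buffers(), pulled out into small pure functions. All names
 * are hypothetical. The multiplier and boost percentage are passed in as
 * parameters because the values of l2arc_headroom and
 * l2arc_headroom_boost are defined elsewhere in the file. demo_align() is
 * an assumed approximation of vdev_psize_to_asize() for a plain device
 * (round up to the sector size), not its actual implementation.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* How far (in bytes) to scan an ARC list for eligible buffers. */
static uint64_t
demo_headroom(uint64_t target_sz, uint64_t headroom_mult,
    uint64_t boost_pct, bool boost)
{
	uint64_t headroom = target_sz * headroom_mult;

	/* the boost is a percentage, so 200 doubles the headroom */
	if (boost)
		headroom = (headroom * boost_pct) / 100;
	return (headroom);
}

/* Smallest buffer worth compressing: twice the device sector size. */
static uint64_t
demo_compress_minsz(unsigned int ashift)
{
	return ((uint64_t)2 << ashift);
}

/* Round a write size up to a multiple of the sector size. */
static uint64_t
demo_align(uint64_t sz, unsigned int ashift)
{
	uint64_t sector = (uint64_t)1 << ashift;

	return ((sz + sector - 1) & ~(sector - 1));
}

/*
 * Wrap the write hand to the start of the device when it is within
 * target_sz of the end. Assumes end >= target_sz.
 */
static uint64_t
demo_next_hand(uint64_t hand, uint64_t start, uint64_t end,
    uint64_t target_sz)
{
	return (hand >= end - target_sz ? start : hand);
}
```

/*
 * Because the hand advances by the aligned size while the return value
 * counts the unaligned size, the bytes reported as written can be smaller
 * than the distance the hand has moved.
 */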
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compresses an L2ARC buffer.
|
|
|
|
* The data to be compressed must be prefilled in l2hdr->b_tmp_cdata and its
|
|
|
|
* size in l2hdr->b_asize. This routine tries to compress the data and
|
|
|
|
* depending on the compression result there are three possible outcomes:
|
|
|
|
* *) The buffer was incompressible. The original l2hdr contents were left
|
|
|
|
* untouched and are ready for writing to an L2 device.
|
|
|
|
* *) The buffer was all-zeros, so there is no need to write it to an L2
|
|
|
|
* device. To indicate this situation b_tmp_cdata is NULL'ed, b_asize is
|
|
|
|
* set to zero and b_compress is set to ZIO_COMPRESS_EMPTY.
|
|
|
|
* *) Compression succeeded and b_tmp_cdata was replaced with a temporary
|
|
|
|
* data buffer which holds the compressed data to be written, and b_asize
|
|
|
|
* tells us how much data there is. b_compress is set to the appropriate
|
|
|
|
* compression algorithm. Once writing is done, invoke
|
|
|
|
* l2arc_release_cdata_buf on this l2hdr to free this temporary buffer.
|
|
|
|
*
|
|
|
|
* Returns B_TRUE if compression succeeded (including the all-zeros case),
|
|
|
|
* or B_FALSE if it didn't (the buffer was incompressible).
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr)
|
|
|
|
{
|
|
|
|
void *cdata;
|
2014-06-06 01:19:08 +04:00
|
|
|
size_t csize, len, rounded;
|
2013-08-02 00:02:10 +04:00
|
|
|
|
|
|
|
ASSERT(l2hdr->b_compress == ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT(l2hdr->b_tmp_cdata != NULL);
|
|
|
|
|
|
|
|
len = l2hdr->b_asize;
|
|
|
|
cdata = zio_data_buf_alloc(len);
|
|
|
|
csize = zio_compress_data(ZIO_COMPRESS_LZ4, l2hdr->b_tmp_cdata,
|
|
|
|
cdata, l2hdr->b_asize);
|
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
rounded = P2ROUNDUP(csize, (size_t)SPA_MINBLOCKSIZE);
|
|
|
|
if (rounded > csize) {
|
|
|
|
bzero((char *)cdata + csize, rounded - csize);
|
|
|
|
csize = rounded;
|
|
|
|
}
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
if (csize == 0) {
|
|
|
|
/* zero block, indicate that there's nothing to write */
|
|
|
|
zio_data_buf_free(cdata, len);
|
|
|
|
l2hdr->b_compress = ZIO_COMPRESS_EMPTY;
|
|
|
|
l2hdr->b_asize = 0;
|
|
|
|
l2hdr->b_tmp_cdata = NULL;
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_compress_zeros);
|
|
|
|
return (B_TRUE);
|
|
|
|
} else if (csize > 0 && csize < len) {
|
|
|
|
/*
|
|
|
|
* Compression succeeded, we'll keep the cdata around for
|
|
|
|
* writing and release it afterwards.
|
|
|
|
*/
|
|
|
|
l2hdr->b_compress = ZIO_COMPRESS_LZ4;
|
|
|
|
l2hdr->b_asize = csize;
|
|
|
|
l2hdr->b_tmp_cdata = cdata;
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_compress_successes);
|
|
|
|
return (B_TRUE);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Compression failed, release the compressed buffer.
|
|
|
|
* l2hdr will be left unmodified.
|
|
|
|
*/
|
|
|
|
zio_data_buf_free(cdata, len);
|
|
|
|
ARCSTAT_BUMP(arcstat_l2_compress_failures);
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
}
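The rounding step in the routine above (pad the compressed size up to a block multiple and zero the tail) can be sketched on its own. This is an illustrative sketch only: `sketch_p2roundup()` and `sketch_pad_to_block()` are hypothetical names, `SKETCH_MINBLOCKSIZE` assumes the usual 512-byte value of `SPA_MINBLOCKSIZE`, and the capacity check is an addition that the original code does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	SKETCH_MINBLOCKSIZE	512	/* assumed value of SPA_MINBLOCKSIZE */

/* Round x up to the next multiple of align (align must be a power of 2). */
size_t
sketch_p2roundup(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((x + align - 1) & ~(align - 1));
}

/*
 * Zero-pad a payload of csize bytes held in buf (capacity cap) out to the
 * next block boundary. Returns the padded size, or SIZE_MAX if the padded
 * payload would not fit in buf.
 */
size_t
sketch_pad_to_block(void *buf, size_t cap, size_t csize)
{
	size_t rounded = sketch_p2roundup(csize, SKETCH_MINBLOCKSIZE);

	if (rounded > cap)
		return (SIZE_MAX);
	if (rounded > csize)
		memset((char *)buf + csize, 0, rounded - csize);
	return (rounded);
}
```

Zeroing the tail keeps stale bytes from the temporary buffer out of the padded region that would be written to the device.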
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Decompresses a zio read back from an l2arc device. On success, the
|
|
|
|
* underlying zio's io_data buffer is overwritten by the uncompressed
|
|
|
|
* version. On decompression error (corrupt compressed stream), the
|
|
|
|
* zio->io_error value is set to signal an I/O error.
|
|
|
|
*
|
|
|
|
* Please note that the compressed data stream is not checksummed. If the
|
|
|
|
* underlying device is experiencing data corruption, we may feed corrupt
|
|
|
|
* data to the decompressor, so the decompressor must be able to handle
|
|
|
|
* this situation (LZ4 does).
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_decompress_zio(zio_t *zio, arc_buf_hdr_t *hdr, enum zio_compress c)
|
|
|
|
{
|
|
|
|
uint64_t csize;
|
|
|
|
void *cdata;
|
|
|
|
|
|
|
|
ASSERT(L2ARC_IS_VALID_COMPRESS(c));
|
|
|
|
|
|
|
|
if (zio->io_error != 0) {
|
|
|
|
/*
|
|
|
|
* An io error has occurred, just restore the original io
|
|
|
|
* size in preparation for a main pool read.
|
|
|
|
*/
|
|
|
|
zio->io_orig_size = zio->io_size = hdr->b_size;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (c == ZIO_COMPRESS_EMPTY) {
|
|
|
|
/*
|
|
|
|
* An empty buffer results in a null zio, which means we
|
|
|
|
* need to fill its io_data after we're done restoring the
|
|
|
|
* buffer's contents.
|
|
|
|
*/
|
|
|
|
ASSERT(hdr->b_buf != NULL);
|
|
|
|
bzero(hdr->b_buf->b_data, hdr->b_size);
|
|
|
|
zio->io_data = zio->io_orig_data = hdr->b_buf->b_data;
|
|
|
|
} else {
|
|
|
|
ASSERT(zio->io_data != NULL);
|
|
|
|
/*
|
|
|
|
* We copy the compressed data from the start of the arc buffer
|
|
|
|
* (the zio_read will have pulled in only what we need, the
|
|
|
|
* rest is garbage which we will overwrite at decompression)
|
|
|
|
* and then decompress back to the ARC data buffer. This way we
|
|
|
|
* can minimize copying by simply decompressing back over the
|
|
|
|
* original compressed data (rather than decompressing to an
|
|
|
|
* aux buffer and then copying back the uncompressed buffer,
|
|
|
|
* which is likely to be much larger).
|
|
|
|
*/
|
|
|
|
csize = zio->io_size;
|
|
|
|
cdata = zio_data_buf_alloc(csize);
|
|
|
|
bcopy(zio->io_data, cdata, csize);
|
|
|
|
if (zio_decompress_data(c, cdata, zio->io_data, csize,
|
|
|
|
hdr->b_size) != 0)
|
2013-03-08 22:41:28 +04:00
|
|
|
zio->io_error = SET_ERROR(EIO);
|
2013-08-02 00:02:10 +04:00
|
|
|
zio_data_buf_free(cdata, csize);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Restore the expected uncompressed IO size. */
|
|
|
|
zio->io_orig_size = zio->io_size = hdr->b_size;
|
|
|
|
}
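The copy-aside-then-decompress-in-place pattern described in the routine above can be sketched as follows. This is a sketch under stated assumptions: `toy_rle_decode()` is a toy run-length decoder standing in for the real decompressor, and `decode_in_place()` is a hypothetical helper, not part of the ARC code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Toy run-length decoder, used only as a stand-in for a real decompressor.
 * Input is a sequence of (count, byte) pairs. Returns 0 on success, or -1
 * on a corrupt stream or if the output is not exactly dlen bytes.
 */
int
toy_rle_decode(const unsigned char *src, size_t slen,
    unsigned char *dst, size_t dlen)
{
	size_t out = 0;

	if (slen % 2 != 0)
		return (-1);
	for (size_t i = 0; i < slen; i += 2) {
		size_t run = src[i];

		/* Reject empty runs and runs that would overflow dst. */
		if (run == 0 || run > dlen - out)
			return (-1);
		memset(dst + out, src[i + 1], run);
		out += run;
	}
	return (out == dlen ? 0 : -1);
}

/*
 * buf holds csize compressed bytes at its start and has room for dsize
 * bytes. Copy the compressed bytes aside, then decode back over buf.
 */
int
decode_in_place(unsigned char *buf, size_t csize, size_t dsize)
{
	unsigned char *tmp;
	int err;

	assert(csize > 0 && csize <= dsize);
	tmp = malloc(csize);
	if (tmp == NULL)
		return (-1);
	memcpy(tmp, buf, csize);
	err = toy_rle_decode(tmp, csize, buf, dsize);
	free(tmp);
	return (err);
}
```

Only the smaller compressed payload is copied, which is the point of the approach: the larger uncompressed result is produced directly in its final location.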
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Releases the temporary b_tmp_cdata buffer in an l2arc header structure.
|
|
|
|
* This buffer serves as a temporary holder of compressed data while
|
|
|
|
* the buffer entry is being written to an l2arc device. Once that is
|
|
|
|
* done, we can dispose of it.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_release_cdata_buf(arc_buf_hdr_t *ab)
|
|
|
|
{
|
|
|
|
l2arc_buf_hdr_t *l2hdr = ab->b_l2hdr;
|
|
|
|
|
|
|
|
if (l2hdr->b_compress == ZIO_COMPRESS_LZ4) {
|
|
|
|
/*
|
|
|
|
* If the data was compressed, then we've allocated a
|
|
|
|
* temporary buffer for it, so now we need to release it.
|
|
|
|
*/
|
|
|
|
ASSERT(l2hdr->b_tmp_cdata != NULL);
|
|
|
|
zio_data_buf_free(l2hdr->b_tmp_cdata, ab->b_size);
|
|
|
|
}
|
|
|
|
l2hdr->b_tmp_cdata = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This thread feeds the L2ARC at regular intervals. This is the beating
|
|
|
|
* heart of the L2ARC.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
l2arc_feed_thread(void)
|
|
|
|
{
|
|
|
|
callb_cpr_t cpr;
|
|
|
|
l2arc_dev_t *dev;
|
|
|
|
spa_t *spa;
|
2009-02-18 23:51:31 +03:00
|
|
|
uint64_t size, wrote;
|
2010-05-29 00:45:14 +04:00
|
|
|
clock_t begin, next = ddi_get_lbolt();
|
2013-08-02 00:02:10 +04:00
|
|
|
boolean_t headroom_boost = B_FALSE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_feed_thr_lock);
|
|
|
|
|
|
|
|
while (l2arc_thread_exit == 0) {
|
|
|
|
CALLB_CPR_SAFE_BEGIN(&cpr);
|
2010-12-10 23:00:00 +03:00
|
|
|
(void) cv_timedwait_interruptible(&l2arc_feed_thr_cv,
|
|
|
|
&l2arc_feed_thr_lock, next);
|
2008-11-20 23:01:55 +03:00
|
|
|
CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
next = ddi_get_lbolt() + hz;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Quick check for L2ARC devices.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
if (l2arc_ndev == 0) {
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
continue;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
2010-05-29 00:45:14 +04:00
|
|
|
begin = ddi_get_lbolt();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* This selects the next l2arc device to write to, and in
|
|
|
|
* doing so the next spa to feed from: dev->l2ad_spa. This
|
|
|
|
* will return NULL if there are now no l2arc devices or if
|
|
|
|
* they are all faulted.
|
|
|
|
*
|
|
|
|
* If a device is returned, its spa's config lock is also
|
|
|
|
* held to prevent device removal. l2arc_dev_get_next()
|
|
|
|
* will grab and release l2arc_dev_mtx.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if ((dev = l2arc_dev_get_next()) == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
spa = dev->l2ad_spa;
|
|
|
|
ASSERT(spa != NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
/*
|
|
|
|
* If the pool is read-only then force the feed thread to
|
|
|
|
* sleep a little longer.
|
|
|
|
*/
|
|
|
|
if (!spa_writeable(spa)) {
|
|
|
|
next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Avoid contributing to memory pressure.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2012-03-14 01:29:16 +04:00
|
|
|
if (arc_no_grow) {
|
2008-12-03 23:09:06 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
|
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
2008-11-20 23:01:55 +03:00
|
|
|
continue;
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
ARCSTAT_BUMP(arcstat_l2_feeds);
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
size = l2arc_write_size();
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Evict L2ARC buffers that will be overwritten.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_evict(dev, size, B_FALSE);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Write ARC buffers.
|
|
|
|
*/
|
2013-08-02 00:02:10 +04:00
|
|
|
wrote = l2arc_write_buffers(spa, dev, size, &headroom_boost);
|
2009-02-18 23:51:31 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate interval between writes.
|
|
|
|
*/
|
|
|
|
next = l2arc_write_interval(begin, size, wrote);
|
2008-12-03 23:09:06 +03:00
|
|
|
spa_config_exit(spa, SCL_L2ARC, dev);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
l2arc_thread_exit = 0;
|
|
|
|
cv_broadcast(&l2arc_feed_thr_cv);
|
|
|
|
CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
|
|
|
|
thread_exit();
|
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
boolean_t
|
|
|
|
l2arc_vdev_present(vdev_t *vd)
|
|
|
|
{
|
|
|
|
l2arc_dev_t *dev;
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
for (dev = list_head(l2arc_dev_list); dev != NULL;
|
|
|
|
dev = list_next(l2arc_dev_list, dev)) {
|
|
|
|
if (dev->l2ad_vdev == vd)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
|
|
|
|
return (dev != NULL);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Add a vdev for use by the L2ARC. By this point the spa has already
|
|
|
|
* validated the vdev and opened it.
|
|
|
|
*/
|
|
|
|
void
|
2009-07-03 02:44:48 +04:00
|
|
|
l2arc_add_vdev(spa_t *spa, vdev_t *vd)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
l2arc_dev_t *adddev;
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
ASSERT(!l2arc_vdev_present(vd));
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Create a new l2arc device entry.
|
|
|
|
*/
|
|
|
|
adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
|
|
|
|
adddev->l2ad_spa = spa;
|
|
|
|
adddev->l2ad_vdev = vd;
|
2009-07-03 02:44:48 +04:00
|
|
|
adddev->l2ad_start = VDEV_LABEL_START_SIZE;
|
|
|
|
adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);
|
2008-11-20 23:01:55 +03:00
|
|
|
adddev->l2ad_hand = adddev->l2ad_start;
|
|
|
|
adddev->l2ad_evict = adddev->l2ad_start;
|
|
|
|
adddev->l2ad_first = B_TRUE;
|
2009-02-18 23:51:31 +03:00
|
|
|
adddev->l2ad_writing = B_FALSE;
|
2010-08-26 21:26:44 +04:00
|
|
|
list_link_init(&adddev->l2ad_node);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is a list of all ARC buffers that are still valid on the
|
|
|
|
* device.
|
|
|
|
*/
|
|
|
|
adddev->l2ad_buflist = kmem_zalloc(sizeof (list_t), KM_SLEEP);
|
|
|
|
list_create(adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
|
|
|
|
offsetof(arc_buf_hdr_t, b_l2node));
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Add device to global list
|
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
list_insert_head(l2arc_dev_list, adddev);
|
|
|
|
atomic_inc_64(&l2arc_ndev);
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a vdev from the L2ARC.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
l2arc_remove_vdev(vdev_t *vd)
|
|
|
|
{
|
|
|
|
l2arc_dev_t *dev, *nextdev, *remdev = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find the device by vdev
|
|
|
|
*/
|
|
|
|
mutex_enter(&l2arc_dev_mtx);
|
|
|
|
for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
|
|
|
|
nextdev = list_next(l2arc_dev_list, dev);
|
|
|
|
if (vd == dev->l2ad_vdev) {
|
|
|
|
remdev = dev;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ASSERT(remdev != NULL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove device from global list
|
|
|
|
*/
|
|
|
|
list_remove(l2arc_dev_list, remdev);
|
|
|
|
l2arc_dev_last = NULL; /* may have been invalidated */
|
2008-12-03 23:09:06 +03:00
|
|
|
atomic_dec_64(&l2arc_ndev);
|
|
|
|
mutex_exit(&l2arc_dev_mtx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Clear all buflists and ARC references. L2ARC device flush.
|
|
|
|
*/
|
|
|
|
l2arc_evict(remdev, 0, B_TRUE);
|
|
|
|
list_destroy(remdev->l2ad_buflist);
|
|
|
|
kmem_free(remdev->l2ad_buflist, sizeof (list_t));
|
|
|
|
kmem_free(remdev, sizeof (l2arc_dev_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_init(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
l2arc_thread_exit = 0;
|
|
|
|
l2arc_ndev = 0;
|
|
|
|
l2arc_writes_sent = 0;
|
|
|
|
l2arc_writes_done = 0;
|
|
|
|
|
|
|
|
mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&l2arc_buflist_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
|
|
|
|
l2arc_dev_list = &L2ARC_dev_list;
|
|
|
|
l2arc_free_on_write = &L2ARC_free_on_write;
|
|
|
|
list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
|
|
|
|
offsetof(l2arc_dev_t, l2ad_node));
|
|
|
|
list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
|
|
|
|
offsetof(l2arc_data_free_t, l2df_list_node));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-12-03 23:09:06 +03:00
|
|
|
l2arc_fini(void)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2008-12-03 23:09:06 +03:00
|
|
|
/*
|
|
|
|
* This is called from dmu_fini(), which is called from spa_fini().
|
|
|
|
* Because of this, we can assume that all l2arc devices have
|
|
|
|
* already been removed when the pools themselves were removed.
|
|
|
|
*/
|
|
|
|
|
|
|
|
l2arc_do_free_on_write();
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mutex_destroy(&l2arc_feed_thr_lock);
|
|
|
|
cv_destroy(&l2arc_feed_thr_cv);
|
|
|
|
mutex_destroy(&l2arc_dev_mtx);
|
|
|
|
mutex_destroy(&l2arc_buflist_mtx);
|
|
|
|
mutex_destroy(&l2arc_free_on_write_mtx);
|
|
|
|
|
|
|
|
list_destroy(l2arc_dev_list);
|
|
|
|
list_destroy(l2arc_free_on_write);
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
void
|
|
|
|
l2arc_start(void)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
if (!(spa_mode_global & FWRITE))
|
2008-12-03 23:09:06 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
|
|
|
|
TS_RUN, minclsyspri);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
l2arc_stop(void)
|
|
|
|
{
|
2009-01-16 00:59:39 +03:00
|
|
|
if (!(spa_mode_global & FWRITE))
|
2008-12-03 23:09:06 +03:00
|
|
|
return;
|
|
|
|
|
|
|
|
mutex_enter(&l2arc_feed_thr_lock);
|
|
|
|
cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
|
|
|
|
l2arc_thread_exit = 1;
|
|
|
|
while (l2arc_thread_exit != 0)
|
|
|
|
cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
|
|
|
|
mutex_exit(&l2arc_feed_thr_lock);
|
|
|
|
}
|
2010-08-26 22:49:16 +04:00
|
|
|
|
|
|
|
#if defined(_KERNEL) && defined(HAVE_SPL)
|
|
|
|
EXPORT_SYMBOL(arc_read);
|
|
|
|
EXPORT_SYMBOL(arc_buf_remove_ref);
|
2013-10-03 04:11:19 +04:00
|
|
|
EXPORT_SYMBOL(arc_buf_info);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(arc_getbuf_func);
|
2011-12-23 00:20:43 +04:00
|
|
|
EXPORT_SYMBOL(arc_add_prune_callback);
|
|
|
|
EXPORT_SYMBOL(arc_remove_prune_callback);
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_min, ulong, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_min, "Min arc size");
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_max, ulong, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_max, "Max arc size");
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_meta_limit, ulong, 0644);
|
2010-08-26 22:49:16 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_limit, "Meta limit for arc size");
|
2011-03-31 05:59:17 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_meta_prune, int, 0644);
|
2011-12-23 00:20:43 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_meta_prune, "Bytes of meta data to prune");
|
2011-05-04 02:09:28 +04:00
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_grow_retry, int, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_grow_retry, "Seconds before growing arc size");
|
|
|
|
|
Disable aggressive arc_p growth by default
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constantly drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2013-12-11 21:40:13 +04:00
|
|
|
module_param(zfs_arc_p_aggressive_disable, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow");
|
|
|
|
|
2014-01-03 22:36:26 +04:00
|
|
|
module_param(zfs_arc_p_dampener_disable, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_p_dampener_disable, "disable arc_p adapt dampener");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_shrink_shift, int, 0644);
|
2011-05-04 02:09:28 +04:00
|
|
|
MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)");
|
|
|
|
|
2013-02-01 21:18:45 +04:00
|
|
|
module_param(zfs_disable_dup_eviction, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_disable_dup_eviction, "disable duplicate buffer eviction");
|
|
|
|
|
2013-02-01 21:33:04 +04:00
|
|
|
module_param(zfs_arc_memory_throttle_disable, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_memory_throttle_disable, "disable memory throttle");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(zfs_arc_min_prefetch_lifespan, int, 0644);
|
|
|
|
MODULE_PARM_DESC(zfs_arc_min_prefetch_lifespan, "Min life of prefetch block");
|
|
|
|
|
|
|
|
module_param(l2arc_write_max, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_write_max, "Max write bytes per interval");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_write_boost, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_write_boost, "Extra write bytes during device warmup");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_headroom, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_headroom, "Number of max device writes to precache");
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
module_param(l2arc_headroom_boost, ulong, 0644);
|
|
|
|
MODULE_PARM_DESC(l2arc_headroom_boost, "Compressed l2arc_headroom multiplier");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_feed_secs, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_feed_secs, "Seconds between L2ARC writing");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_feed_min_ms, ulong, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_feed_min_ms, "Min feed interval in milliseconds");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_noprefetch, int, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_noprefetch, "Skip caching prefetched buffers");
|
|
|
|
|
2013-08-02 00:02:10 +04:00
|
|
|
module_param(l2arc_nocompress, int, 0644);
|
|
|
|
MODULE_PARM_DESC(l2arc_nocompress, "Skip compressing L2ARC buffers");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_feed_again, int, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_feed_again, "Turbo L2ARC warmup");
|
|
|
|
|
2013-07-24 21:14:11 +04:00
|
|
|
module_param(l2arc_norw, int, 0644);
|
2011-07-08 23:41:57 +04:00
|
|
|
MODULE_PARM_DESC(l2arc_norw, "No reads during writes");
|
|
|
|
|
2010-08-26 22:49:16 +04:00
|
|
|
#endif
|