From 914158259225d9723bbb288457797936b41d3c9a Mon Sep 17 00:00:00 2001 From: Brian Behlendorf Date: Tue, 23 Jul 2013 15:33:23 -0700 Subject: [PATCH 01/11] Set zfs_arc_min to 4MB Decrease the minimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB. 1) Large systems with over 1TB of memory are being deployed and reserving 1/32 of this memory (32GB) as the minimum requirement is overkill. 2) Tiny systems like the Raspberry Pi may only have 256MB of memory, in which case 64MB is far too large. The ARC should be reclaimable if the VFS determines it needs the memory for some other purpose. If you want to ensure the ARC is never completely reclaimed due to memory pressure, you may still set a larger value with zfs_arc_min. Signed-off-by: Brian Behlendorf Signed-off-by: Prakash Surya Issue #2110 --- module/zfs/arc.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index cbe0a6028..11839af59 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -4025,8 +4025,8 @@ arc_init(void) spl_register_shrinker(&arc_shrinker); #endif - /* set min cache to 1/32 of all memory, or 64MB, whichever is more */ - arc_c_min = MAX(arc_c / 4, 64<<20); + /* set min cache to 4MB */ + arc_c_min = 4<<20; /* set max to 1/2 of all memory */ arc_c_max = arc_c * 4; @@ -4036,7 +4036,7 @@ arc_init(void) */ if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE) arc_c_max = zfs_arc_max; - if (zfs_arc_min > 64<<20 && zfs_arc_min <= arc_c_max) + if (zfs_arc_min > 0 && zfs_arc_min <= arc_c_max) arc_c_min = zfs_arc_min; arc_c = arc_c_max; @@ -4050,9 +4050,6 @@ arc_init(void) if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max) arc_meta_limit = zfs_arc_meta_limit; - if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0) - arc_c_min = arc_meta_limit / 2; - /* if kmem_flags are set, lets try to use less memory */ if (kmem_debugging()) arc_c = arc_c / 2; From 39e055c44b3025d9ef87f6c6178ef5ae5e2a55c6 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Wed, 11 Dec 2013 12:25:30 -0800 Subject: [PATCH 02/11] Adjust arc_p based on "bytes" in arc_shrink In an attempt to prevent arc_c from collapsing "too fast", the arc_shrink() function was updated to take a "bytes" parameter by this change: commit 302f753f1657c05a4287226eeda1f53ae431b8a7 Author: Brian Behlendorf Date: Tue Mar 13 14:29:16 2012 -0700 Integrate ARC more tightly with Linux Unfortunately, that change failed to make a similar change to the way that arc_p was updated. So, there still exists the possibility for arc_p to collapse to near 0 when the kernel starts calling the arc's shrinkers. This change attempts to fix this by decrementing arc_p by the "bytes" parameter in the same way that arc_c is updated. In addition, this change attempts to maintain a minimum value of arc_p, similar to the way a minimum arc_p value is maintained in arc_adapt(). Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 11839af59..cbc9b9fee 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2332,6 +2332,7 @@ void arc_shrink(uint64_t bytes) { if (arc_c > arc_c_min) { + uint64_t arc_p_min; uint64_t to_free; to_free = bytes ?
bytes : arc_c >> zfs_arc_shrink_shift; @@ -2341,7 +2342,14 @@ arc_shrink(uint64_t bytes) else arc_c = arc_c_min; - atomic_add_64(&arc_p, -(arc_p >> zfs_arc_shrink_shift)); + arc_p_min = (arc_c >> zfs_arc_p_min_shift); + to_free = bytes ? bytes : arc_p >> zfs_arc_shrink_shift; + + if (arc_p > arc_p_min + to_free) + atomic_add_64(&arc_p, -to_free); + else + arc_p = arc_p_min; + if (arc_c > arc_size) arc_c = MAX(arc_size, arc_c_min); if (arc_p > arc_c) From 89c8cac493687875eecc80a4a03f667d98dd82d0 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Wed, 11 Dec 2013 09:40:13 -0800 Subject: [PATCH 03/11] Disable aggressive arc_p growth by default For specific workloads consisting mainly of mfu data and new anon data buffers, the aggressive growth of arc_p found in the arc_get_data_buf() function can have detrimental effects on the mfu list size and ghost list hit rate. Running a workload consisting of two processes: * Process 1 is creating many small files * Process 2 is tar'ing a directory consisting of many small files I've seen arc_p and the mru grow to their maximum size, while the mru ghost list receives 100K times fewer hits than the mfu ghost list. Ideally, as the mfu ghost list receives hits, arc_p should be driven down and the size of the mfu should increase. Given the specific workload I was testing with, the mfu list size should grow to a point where almost no mfu ghost list hits would occur. Unfortunately, this does not happen because the newly dirtied anon buffers constantly drive arc_p to its maximum value and keep it there (effectively prioritizing the mru list and starving the mfu list down to a negligible size). The logic to increment arc_p from within the arc_get_data_buf() function was introduced many years ago in this upstream commit: commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc Author: maybee Date: Wed Dec 20 15:46:12 2006 -0800 6505658 target MRU size (arc.p) needs to be adjusted more aggressively and since I don't fully understand the motivation for the change, I am reluctant to completely remove it. As a way to test how its removal might affect performance, I've disabled that code by default, but left it tunable via a module option. Thus, if its removal is found to be grossly detrimental for certain workloads, it can be re-enabled on the fly, without a code change. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- man/man5/zfs-module-parameters.5 | 11 +++++++++++ module/zfs/arc.c | 9 ++++++++- 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/man/man5/zfs-module-parameters.5 b/man/man5/zfs-module-parameters.5 index d33da62f4..7f010c6cf 100644 --- a/man/man5/zfs-module-parameters.5 +++ b/man/man5/zfs-module-parameters.5 @@ -304,6 +304,17 @@ arc_c shift to calc min/max arc_p Default value: \fB4\fR. .RE +.sp +.ne 2 +.na +\fBzfs_arc_p_aggressive_disable\fR (int) +.ad +.RS 12n +Disable aggressive arc_p growth +.sp +Use \fB1\fR for yes (default) and \fB0\fR to disable.
+.RE + .sp .ne 2 .na diff --git a/module/zfs/arc.c b/module/zfs/arc.c index cbc9b9fee..eac6ea448 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -175,6 +175,9 @@ int zfs_arc_grow_retry = 5; /* shift of arc_c for calculating both min and max arc_p */ int zfs_arc_p_min_shift = 4; +/* disable anon data aggressively growing arc_p */ +int zfs_arc_p_aggressive_disable = 1; + /* log2(fraction of arc to reclaim) */ int zfs_arc_shrink_shift = 5; @@ -2798,7 +2801,8 @@ out: * If we are growing the cache, and we are adding anonymous * data, and we have outgrown arc_p, update arc_p */ - if (arc_size < arc_c && hdr->b_state == arc_anon && + if (!zfs_arc_p_aggressive_disable && + arc_size < arc_c && hdr->b_state == arc_anon && arc_anon->arcs_size + arc_mru->arcs_size > arc_p) arc_p = MIN(arc_c, arc_p + size); } @@ -5553,6 +5557,9 @@ MODULE_PARM_DESC(zfs_arc_meta_prune, "Bytes of meta data to prune"); module_param(zfs_arc_grow_retry, int, 0644); MODULE_PARM_DESC(zfs_arc_grow_retry, "Seconds before growing arc size"); +module_param(zfs_arc_p_aggressive_disable, int, 0644); +MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow"); + module_param(zfs_arc_shrink_shift, int, 0644); MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)"); From f521ce1b9c6102f9175f26548d4c521e115f8d60 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 10:20:21 -0800 Subject: [PATCH 04/11] Allow "arc_p" to drop to zero or grow to "arc_c" Setting a limit on the minimum value of "arc_p" has been shown to have detrimental effects on the arc hit rate for certain "metadata" intensive workloads. Specifically, this has been exhibited with a workload that constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). What is seen is that the new anon data throttles the mfu list to a negligible size (because arc_p > anon + mru in arc_get_data_buf), even though the mfu ghost list receives a constant stream of hits. To remedy this, arc_p is now allowed to drop to zero if the algorithm deems it necessary. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- man/man5/zfs-module-parameters.5 | 11 ----------- module/zfs/arc.c | 17 ++++------------- 2 files changed, 4 insertions(+), 24 deletions(-) diff --git a/man/man5/zfs-module-parameters.5 b/man/man5/zfs-module-parameters.5 index 7f010c6cf..2acf68383 100644 --- a/man/man5/zfs-module-parameters.5 +++ b/man/man5/zfs-module-parameters.5 @@ -293,17 +293,6 @@ Min life of prefetch block Default value: \fB100\fR. .RE -.sp -.ne 2 -.na -\fBzfs_arc_p_min_shift\fR (int) -.ad -.RS 12n -arc_c shift to calc min/max arc_p -.sp -Default value: \fB4\fR. -.RE - .sp .ne 2 .na diff --git a/module/zfs/arc.c b/module/zfs/arc.c index eac6ea448..f66eaa407 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -172,9 +172,6 @@ int arc_evict_iterations = 100; /* number of seconds before growing cache again */ int zfs_arc_grow_retry = 5; -/* shift of arc_c for calculating both min and max arc_p */ -int zfs_arc_p_min_shift = 4; - /* disable anon data aggressively growing arc_p */ int zfs_arc_p_aggressive_disable = 1; @@ -2335,7 +2332,6 @@ void arc_shrink(uint64_t bytes) { if (arc_c > arc_c_min) { - uint64_t arc_p_min; uint64_t to_free; to_free = bytes ? bytes : arc_c >> zfs_arc_shrink_shift; @@ -2345,13 +2341,12 @@ arc_shrink(uint64_t bytes) else arc_c = arc_c_min; - arc_p_min = (arc_c >> zfs_arc_p_min_shift); to_free = bytes ? 
bytes : arc_p >> zfs_arc_shrink_shift; - if (arc_p > arc_p_min + to_free) + if (arc_p > to_free) atomic_add_64(&arc_p, -to_free); else - arc_p = arc_p_min; + arc_p = 0; if (arc_c > arc_size) arc_c = MAX(arc_size, arc_c_min); @@ -2622,7 +2617,6 @@ static void arc_adapt(int bytes, arc_state_t *state) { int mult; - uint64_t arc_p_min = (arc_c >> zfs_arc_p_min_shift); if (state == arc_l2c_only) return; @@ -2641,7 +2635,7 @@ arc_adapt(int bytes, arc_state_t *state) 1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size)); mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ - arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult); + arc_p = MIN(arc_c, arc_p + bytes * mult); } else if (state == arc_mfu_ghost) { uint64_t delta; @@ -2650,7 +2644,7 @@ arc_adapt(int bytes, arc_state_t *state) mult = MIN(mult, 10); delta = MIN(bytes * mult, arc_p); - arc_p = MAX(arc_p_min, arc_p - delta); + arc_p = MAX(0, arc_p - delta); } ASSERT((int64_t)arc_p >= 0); @@ -5563,9 +5557,6 @@ MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow"); module_param(zfs_arc_shrink_shift, int, 0644); MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)"); -module_param(zfs_arc_p_min_shift, int, 0644); -MODULE_PARM_DESC(zfs_arc_p_min_shift, "arc_c shift to calc min/max arc_p"); - module_param(zfs_disable_dup_eviction, int, 0644); MODULE_PARM_DESC(zfs_disable_dup_eviction, "disable duplicate buffer eviction"); From 624227854e619d1bf555445f3bc38730fb9278c4 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 10:36:26 -0800 Subject: [PATCH 05/11] Disable arc_p adapt dampener by default It's unclear why adjustments to arc_p need to be dampened as they are in arc_adapt. With that said, its removal significantly improves the arc's ability to "warm up" to a given workload. Thus, I'm disabling it by default until its usefulness is better understood. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- man/man5/zfs-module-parameters.5 | 11 +++++++++++ module/zfs/arc.c | 14 ++++++++++++-- 2 files changed, 23 insertions(+), 2 deletions(-) diff --git a/man/man5/zfs-module-parameters.5 b/man/man5/zfs-module-parameters.5 index 2acf68383..e0d44d22c 100644 --- a/man/man5/zfs-module-parameters.5 +++ b/man/man5/zfs-module-parameters.5 @@ -304,6 +304,17 @@ Disable aggressive arc_p growth Use \fB1\fR for yes (default) and \fB0\fR to disable. .RE +.sp +.ne 2 +.na +\fBzfs_arc_p_dampener_disable\fR (int) +.ad +.RS 12n +Disable arc_p adapt dampener +.sp +Use \fB1\fR for yes (default) and \fB0\fR to disable. +.RE + .sp .ne 2 .na diff --git a/module/zfs/arc.c b/module/zfs/arc.c index f66eaa407..68496783d 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -175,6 +175,9 @@ int zfs_arc_grow_retry = 5; /* disable anon data aggressively growing arc_p */ int zfs_arc_p_aggressive_disable = 1; +/* disable arc_p adapt dampener in arc_adapt */ +int zfs_arc_p_dampener_disable = 1; + /* log2(fraction of arc to reclaim) */ int zfs_arc_shrink_shift = 5; @@ -2633,7 +2636,9 @@ arc_adapt(int bytes, arc_state_t *state) if (state == arc_mru_ghost) { mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size)); - mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ + + if (!zfs_arc_p_dampener_disable) + mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ arc_p = MIN(arc_c, arc_p + bytes * mult); } else if (state == arc_mfu_ghost) { @@ -2641,7 +2646,9 @@ arc_adapt(int bytes, arc_state_t *state) mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ? 1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size)); - mult = MIN(mult, 10); + + if (!zfs_arc_p_dampener_disable) + mult = MIN(mult, 10); delta = MIN(bytes * mult, arc_p); arc_p = MAX(0, arc_p - delta); @@ -5554,6 +5561,9 @@ MODULE_PARM_DESC(zfs_arc_grow_retry, "Seconds before growing arc size"); module_param(zfs_arc_p_aggressive_disable, int, 0644); MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow"); +module_param(zfs_arc_p_dampener_disable, int, 0644); +MODULE_PARM_DESC(zfs_arc_p_dampener_disable, "disable arc_p adapt dampener"); + module_param(zfs_arc_shrink_shift, int, 0644); MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)"); From 1e3cb67b53fba067fd7bf9a13d21b53de4626dc1 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 23 Dec 2013 11:34:20 -0800 Subject: [PATCH 06/11] Revert "Return -1 from arc_shrinker_func()" This reverts commit c11a12bc3b2e5ee9a6bd74e26f1a396b6025fbd4. Out of memory events were fixed by reverting this patch. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 68496783d..ad2e8a92d 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2583,8 +2583,10 @@ __arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc) */ if (pages > 0) { arc_kmem_reap_now(ARC_RECLAIM_AGGR, ptob(sc->nr_to_scan)); + pages = btop(arc_evictable_memory()); } else { arc_kmem_reap_now(ARC_RECLAIM_CONS, ptob(sc->nr_to_scan)); + pages = -1; } /* @@ -2604,7 +2606,7 @@ __arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc) mutex_exit(&arc_reclaim_thr_lock); - return (-1); + return (pages); } SPL_SHRINKER_CALLBACK_WRAPPER(arc_shrinker_func); From 94520ca4626c7b01340473bccdaa3ed038a85a8f Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 11:40:52 -0800 Subject: [PATCH 07/11] Prune metadata from ghost lists in arc_adjust_meta To maintain a strict limit on the metadata contained in the arc, while preventing the arc buffer headers from completely consuming the "arc_meta_used" space, we need to evict metadata buffers from the arc's ghost lists along with the regular lists. This change modifies arc_adjust_meta such that it more closely models the adjustments made in arc_adjust. "arc_meta_used" is used similarly to "arc_size", and "arc_meta_limit" is used similarly to "arc_c". Testing metadata intensive workloads (e.g. creating, copying, and removing millions of small files and/or directories) has shown this change to make a dramatic improvement to the hit rate maintained in the arc. While I think there is still room for improvement, this is a big step in the right direction. In addition, zpl_free_cached_objects was made into a no-op as I'm not yet sure how to properly implement that function. 
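To make the new control flow easier to follow before reading the diff, here is a minimal userspace sketch of the adjustment arithmetic described above. The list sizes are invented stand-ins for the ARC state structures, the evictions are reduced to printf calls (so the sizes are not decremented between passes as they would be by arc_evict), and only the shape of the calculation is taken from the change below:

    #include <stdio.h>
    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Example metadata bytes per list; stand-ins for arcs_lsize[] state. */
    static int64_t mru_meta = 600, mfu_meta = 300;
    static int64_t mru_ghost_meta = 400, mfu_ghost_meta = 200;
    static int64_t meta_used = 900, meta_limit = 500;

    int
    main(void)
    {
        int64_t adjustmnt, delta;

        /* First pass: the regular lists, mru before mfu. */
        adjustmnt = meta_used - meta_limit;
        if (adjustmnt > 0 && mru_meta > 0) {
            delta = MIN(mru_meta, adjustmnt);
            printf("evict %lld from mru\n", (long long)delta);
            adjustmnt -= delta;
        }
        if (adjustmnt > 0 && mfu_meta > 0) {
            delta = MIN(mfu_meta, adjustmnt);
            printf("evict %lld from mfu\n", (long long)delta);
        }

        /* Second pass: each ghost list is measured against the limit. */
        adjustmnt = mru_meta + mru_ghost_meta - meta_limit;
        if (adjustmnt > 0 && mru_ghost_meta > 0) {
            delta = MIN(adjustmnt, mru_ghost_meta);
            printf("evict %lld from mru ghost\n", (long long)delta);
        }

        adjustmnt = mru_ghost_meta + mfu_ghost_meta - meta_limit;
        if (adjustmnt > 0 && mfu_ghost_meta > 0) {
            delta = MIN(adjustmnt, mfu_ghost_meta);
            printf("evict %lld from mfu ghost\n", (long long)delta);
        }
        return (0);
    }

The in-kernel version in the diff below operates on the real arcs_lsize[] counters and actually shrinks the lists as it goes; the sketch only shows how each eviction target is derived.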
Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- include/sys/arc.h | 1 - module/zfs/arc.c | 67 ++++++++++++++++++++++++++++++------------ module/zfs/zpl_super.c | 2 +- 3 files changed, 49 insertions(+), 21 deletions(-) diff --git a/include/sys/arc.h b/include/sys/arc.h index 9d68d3b43..5c8c1c1a3 100644 --- a/include/sys/arc.h +++ b/include/sys/arc.h @@ -160,7 +160,6 @@ void arc_freed(spa_t *spa, const blkptr_t *bp); void arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private); int arc_buf_evict(arc_buf_t *buf); -void arc_adjust_meta(int64_t adjustment, boolean_t may_prune); void arc_flush(spa_t *spa); void arc_tempreserve_clear(uint64_t reserve); int arc_tempreserve_space(uint64_t reserve, uint64_t txg); diff --git a/module/zfs/arc.c b/module/zfs/arc.c index ad2e8a92d..9c2d0eaab 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2268,24 +2268,61 @@ arc_do_user_evicts(void) * This is only used to enforce the tunable arc_meta_limit, if we are * unable to evict enough buffers notify the user via the prune callback. */ -void -arc_adjust_meta(int64_t adjustment, boolean_t may_prune) +static void +arc_adjust_meta(void) { - int64_t delta; + int64_t adjustmnt, delta; - if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) { - delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment); + /* + * This slightly differs than the way we evict from the mru in + * arc_adjust because we don't have a "target" value (i.e. no + * "meta" arc_p). As a result, I think we can completely + * cannibalize the metadata in the MRU before we evict the + * metadata from the MFU. I think we probably need to implement a + * "metadata arc_p" value to do this properly. + */ + adjustmnt = arc_meta_used - arc_meta_limit; + + if (adjustmnt > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustmnt); arc_evict(arc_mru, 0, delta, FALSE, ARC_BUFC_METADATA); - adjustment -= delta; + adjustmnt -= delta; } - if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) { - delta = MIN(arc_mfu->arcs_lsize[ARC_BUFC_METADATA], adjustment); + /* + * We can't afford to recalculate adjustmnt here. If we do, + * new metadata buffers can sneak into the MRU or ANON lists, + * thus penalize the MFU metadata. Although the fudge factor is + * small, it has been empirically shown to be significant for + * certain workloads (e.g. creating many empty directories). As + * such, we use the original calculation for adjustmnt, and + * simply decrement the amount of data evicted from the MRU. 
+ */ + + if (adjustmnt > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(arc_mfu->arcs_lsize[ARC_BUFC_METADATA], adjustmnt); arc_evict(arc_mfu, 0, delta, FALSE, ARC_BUFC_METADATA); } + adjustmnt = arc_mru->arcs_lsize[ARC_BUFC_METADATA] + + arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] - arc_meta_limit; + + if (adjustmnt > 0 && arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(adjustmnt, + arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA]); + arc_evict_ghost(arc_mru_ghost, 0, delta, ARC_BUFC_METADATA); + } + + adjustmnt = arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] + + arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA] - arc_meta_limit; + + if (adjustmnt > 0 && arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(adjustmnt, + arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA]); + arc_evict_ghost(arc_mfu_ghost, 0, delta, ARC_BUFC_METADATA); + } + + if (arc_meta_used > arc_meta_limit) arc_do_user_prune(zfs_arc_meta_prune); } @@ -2405,7 +2442,6 @@ static void arc_adapt_thread(void) { callb_cpr_t cpr; - int64_t prune; CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG); @@ -2441,14 +2477,7 @@ arc_adapt_thread(void) if (arc_no_grow && ddi_get_lbolt() >= arc_grow_time) arc_no_grow = FALSE; - /* - * Keep meta data usage within limits, arc_shrink() is not - * used to avoid collapsing the arc_c value when only the - * arc_meta_limit is being exceeded. - */ - prune = (int64_t)arc_meta_used - (int64_t)arc_meta_limit; - if (prune > 0) - arc_adjust_meta(prune, B_TRUE); + arc_adjust_meta(); arc_adjust(); diff --git a/module/zfs/zpl_super.c b/module/zfs/zpl_super.c index b4e7b6ed0..45639a6dd 100644 --- a/module/zfs/zpl_super.c +++ b/module/zfs/zpl_super.c @@ -342,7 +342,7 @@ zpl_nr_cached_objects(struct super_block *sb) static void zpl_free_cached_objects(struct super_block *sb, int nr_to_scan) { - arc_adjust_meta(nr_to_scan * sizeof (znode_t), B_FALSE); + /* noop */ } #endif /* HAVE_FREE_CACHED_OBJECTS */ From 77765b540b79bdc42d4f25f174004bbbd06b0a32 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 10:11:14 -0800 Subject: [PATCH 08/11] Remove "arc_meta_used" from arc_adjust calculation Using "arc_meta_used" to determine if the arc's mru list is over its target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust.
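To see the practical difference, consider a hedged numeric sketch; the example sizes are invented, and only the two MIN() expressions are taken from the old and new code in the diff that follows:

    #include <stdio.h>
    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int
    main(void)
    {
        /* Invented example sizes, in bytes. */
        int64_t arc_size = 10LL << 30;  /* 10G cached */
        int64_t arc_c = 9LL << 30;      /* 9G target, so 1G over */
        int64_t anon = 1LL << 30, mru = 4LL << 30;
        int64_t meta_used = 3LL << 30, arc_p = 5LL << 30;
        int64_t old_adj, new_adj;

        /* Old: arc_meta_used inflates the mru eviction target. */
        old_adj = MIN(arc_size - arc_c, anon + mru + meta_used - arc_p);

        /* New: only the anon + mru footprint is compared to arc_p. */
        new_adj = MIN(arc_size - arc_c, anon + mru - arc_p);

        printf("old adjustment: %lld\n", (long long)old_adj);  /* 1G */
        printf("new adjustment: %lld\n", (long long)new_adj);  /* 0 */
        return (0);
    }

With these numbers, anon + mru sits exactly at arc_p, yet the old formula would still evict 1G of mru data purely because of metadata counted elsewhere in the arc.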
Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 9c2d0eaab..f28748423 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2144,8 +2144,7 @@ arc_adjust(void) */ adjustment = MIN((int64_t)(arc_size - arc_c), - (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used - - arc_p)); + (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size - arc_p)); if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) { delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment); From da8ccd0ee0d4eed08200e706cb2ca1da3bdfb5cf Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 30 Dec 2013 09:30:00 -0800 Subject: [PATCH 09/11] Prioritize "metadata" in arc_get_data_buf When the arc is at its size limit and a new buffer is added, data will be evicted (or recycled) from the arc to make room for this new buffer. As far as I can tell, this is to try to keep the arc from overstepping its bounds (i.e. keep it below the size limitation placed on it). This makes sense conceptually, but there appears to be a subtle flaw in its current implementation, resulting in metadata buffers being throttled. When it evicts from the arc's lists, it also passes in a "type" so as to remove a buffer of the same type that it is adding. The problem with this is that once the size limit is hit, the ratio of "metadata" to "data" contained in the arc essentially becomes fixed. For example, consider the following scenario: * the size of the arc is capped at 10G * the meta_limit is capped at 4G * 9G of the arc contains "data" * 1G of the arc contains "metadata" Now, every time a new "metadata" buffer is created and added to the arc, one or more older "metadata" buffers will be removed from the arc; preserving the 9G "data" to 1G "metadata" ratio that was in place when the size limit was reached. This occurs even though the amount of "metadata" is far below the "metadata" limit. This can result in the arc behaving pathologically for certain workloads. To fix this, the arc_get_data_buf function was modified to evict "data" from the arc even when adding a "metadata" buffer, unless it's at the "metadata" limit. In addition, arc_evict now more closely resembles arc_evict_ghost, such that when evicting "data" from the arc, it may make a second pass over the arc lists and evict "metadata" if it cannot meet the eviction size the first time around. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 62 +++++++++++++++++++++++++++++++----------------- 1 file changed, 40 insertions(+), 22 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index f28748423..242d0c8c5 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -1890,6 +1890,7 @@ arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle, evicted_state = (state == arc_mru) ?
arc_mru_ghost : arc_mfu_ghost; +top: mutex_enter(&state->arcs_mtx); mutex_enter(&evicted_state->arcs_mtx); @@ -2005,6 +2006,15 @@ arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle, mutex_exit(&evicted_state->arcs_mtx); mutex_exit(&state->arcs_mtx); + if (list == &state->arcs_list[ARC_BUFC_DATA] && + (bytes < 0 || bytes_evicted < bytes)) { + /* Prevent second pass from recycling metadata into data */ + recycle = FALSE; + type = ARC_BUFC_METADATA; + list = &state->arcs_list[type]; + goto top; + } + if (bytes_evicted < bytes) dprintf("only evicted %lld bytes from %x\n", (longlong_t)bytes_evicted, state); @@ -2146,16 +2156,9 @@ arc_adjust(void) adjustment = MIN((int64_t)(arc_size - arc_c), (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size - arc_p)); - if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) { - delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment); + if (adjustment > 0 && arc_mru->arcs_size > 0) { + delta = MIN(arc_mru->arcs_size, adjustment); (void) arc_evict(arc_mru, 0, delta, FALSE, ARC_BUFC_DATA); - adjustment -= delta; - } - - if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) { - delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment); - (void) arc_evict(arc_mru, 0, delta, FALSE, - ARC_BUFC_METADATA); } /* @@ -2164,17 +2167,9 @@ arc_adjust(void) adjustment = arc_size - arc_c; - if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) { - delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]); + if (adjustment > 0 && arc_mfu->arcs_size > 0) { + delta = MIN(arc_mfu->arcs_size, adjustment); (void) arc_evict(arc_mfu, 0, delta, FALSE, ARC_BUFC_DATA); - adjustment -= delta; - } - - if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) { - int64_t delta = MIN(adjustment, - arc_mfu->arcs_lsize[ARC_BUFC_METADATA]); - (void) arc_evict(arc_mfu, 0, delta, FALSE, - ARC_BUFC_METADATA); } /* @@ -2752,6 +2747,8 @@ arc_get_data_buf(arc_buf_t *buf) arc_state_t *state = buf->b_hdr->b_state; uint64_t size = buf->b_hdr->b_size; arc_buf_contents_t type = buf->b_hdr->b_type; + arc_buf_contents_t evict = ARC_BUFC_DATA; + boolean_t recycle = TRUE; arc_adapt(size, state); @@ -2792,7 +2789,24 @@ arc_get_data_buf(arc_buf_t *buf) mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu; } - if ((buf->b_data = arc_evict(state, 0, size, TRUE, type)) == NULL) { + /* + * Evict data buffers prior to metadata buffers, unless we're + * over the metadata limit and adding a metadata buffer. + */ + if (type == ARC_BUFC_METADATA) { + if (arc_meta_used >= arc_meta_limit) + evict = ARC_BUFC_METADATA; + else + /* + * In this case, we're evicting data while + * adding metadata. Thus, to prevent recycling a + * data buffer into a metadata buffer, recycling + * is disabled in the following arc_evict call. + */ + recycle = FALSE; + } + + if ((buf->b_data = arc_evict(state, 0, size, recycle, evict)) == NULL) { if (type == ARC_BUFC_METADATA) { buf->b_data = zio_buf_alloc(size); arc_space_consume(size, ARC_SPACE_DATA); @@ -2803,8 +2817,10 @@ arc_get_data_buf(arc_buf_t *buf) * via the prune callback to drop references. The * prune callback in run in the context of the reclaim * thread to avoid deadlocking on the hash_lock. + * Of course, only do this when recycle is true. 
*/ - cv_signal(&arc_reclaim_thr_cv); + if (recycle) + cv_signal(&arc_reclaim_thr_cv); } else { ASSERT(type == ARC_BUFC_DATA); buf->b_data = zio_data_buf_alloc(size); @@ -2812,7 +2828,9 @@ atomic_add_64(&arc_size, size); } - ARCSTAT_BUMP(arcstat_recycle_miss); + /* Only bump this if we tried to recycle and failed */ + if (recycle) + ARCSTAT_BUMP(arcstat_recycle_miss); } ASSERT(buf->b_data != NULL); out: From cc7f677c1610bd4ae11676ba02ec8987c347b83d Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 3 Feb 2014 12:41:47 -0800 Subject: [PATCH 10/11] Split "data_size" into "meta" and "data" Previously, the "data_size" field in the arcstats kstat contained the amount of cached "metadata" and "data" in the ARC. The problem is that this made it difficult to extract just the "metadata" size, or just the "data" size. To make it easier to distinguish the two values, "data_size" has been modified to count only buffers of type ARC_BUFC_DATA, and "meta_size" was added to count only buffers of type ARC_BUFC_METADATA. If one wants the old "data_size" value, simply sum the new "data_size" and "meta_size" values. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- include/sys/arc.h | 1 + module/zfs/arc.c | 38 ++++++++++++++++++++++++-------------- 2 files changed, 25 insertions(+), 14 deletions(-) diff --git a/include/sys/arc.h b/include/sys/arc.h index 5c8c1c1a3..005d07179 100644 --- a/include/sys/arc.h +++ b/include/sys/arc.h @@ -86,6 +86,7 @@ typedef enum arc_buf_contents { */ typedef enum arc_space_type { ARC_SPACE_DATA, + ARC_SPACE_META, ARC_SPACE_HDRS, ARC_SPACE_L2HDRS, ARC_SPACE_OTHER, diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 242d0c8c5..82b35bb62 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -308,6 +308,7 @@ typedef struct arc_stats { kstat_named_t arcstat_size; kstat_named_t arcstat_hdr_size; kstat_named_t arcstat_data_size; + kstat_named_t arcstat_meta_size; kstat_named_t arcstat_other_size; kstat_named_t arcstat_anon_size; kstat_named_t arcstat_anon_evict_data; @@ -395,6 +396,7 @@ static arc_stats_t arc_stats = { { "size", KSTAT_DATA_UINT64 }, { "hdr_size", KSTAT_DATA_UINT64 }, { "data_size", KSTAT_DATA_UINT64 }, + { "meta_size", KSTAT_DATA_UINT64 }, { "other_size", KSTAT_DATA_UINT64 }, { "anon_size", KSTAT_DATA_UINT64 }, { "anon_evict_data", KSTAT_DATA_UINT64 }, @@ -1367,6 +1369,9 @@ arc_space_consume(uint64_t space, arc_space_type_t type) case ARC_SPACE_DATA: ARCSTAT_INCR(arcstat_data_size, space); break; + case ARC_SPACE_META: + ARCSTAT_INCR(arcstat_meta_size, space); + break; case ARC_SPACE_OTHER: ARCSTAT_INCR(arcstat_other_size, space); break; @@ -1378,7 +1383,9 @@ arc_space_consume(uint64_t space, arc_space_type_t type) break; } - ARCSTAT_INCR(arcstat_meta_used, space); + if (type != ARC_SPACE_DATA) + ARCSTAT_INCR(arcstat_meta_used, space); + atomic_add_64(&arc_size, space); } @@ -1393,6 +1400,9 @@ arc_space_return(uint64_t space, arc_space_type_t type) case ARC_SPACE_DATA: ARCSTAT_INCR(arcstat_data_size, -space); break; + case ARC_SPACE_META: + ARCSTAT_INCR(arcstat_meta_size, -space); + break; case ARC_SPACE_OTHER: ARCSTAT_INCR(arcstat_other_size, -space); break; @@ -1404,10 +1414,13 @@ arc_space_return(uint64_t space, arc_space_type_t type) break; } - ASSERT(arc_meta_used >= space); - if (arc_meta_max < arc_meta_used) - arc_meta_max = arc_meta_used; - ARCSTAT_INCR(arcstat_meta_used, -space); + if (type != ARC_SPACE_DATA) { + ASSERT(arc_meta_used >= space); + if (arc_meta_max < arc_meta_used) +
arc_meta_max = arc_meta_used; + ARCSTAT_INCR(arcstat_meta_used, -space); + } + ASSERT(arc_size >= space); atomic_add_64(&arc_size, -space); } @@ -1604,12 +1617,11 @@ arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all) if (!recycle) { if (type == ARC_BUFC_METADATA) { arc_buf_data_free(buf, zio_buf_free); - arc_space_return(size, ARC_SPACE_DATA); + arc_space_return(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); arc_buf_data_free(buf, zio_data_buf_free); - ARCSTAT_INCR(arcstat_data_size, -size); - atomic_add_64(&arc_size, -size); + arc_space_return(size, ARC_SPACE_DATA); } } if (list_link_active(&buf->b_hdr->b_arc_node)) { @@ -2759,12 +2771,11 @@ arc_get_data_buf(arc_buf_t *buf) if (!arc_evict_needed(type)) { if (type == ARC_BUFC_METADATA) { buf->b_data = zio_buf_alloc(size); - arc_space_consume(size, ARC_SPACE_DATA); + arc_space_consume(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); buf->b_data = zio_data_buf_alloc(size); - ARCSTAT_INCR(arcstat_data_size, size); - atomic_add_64(&arc_size, size); + arc_space_consume(size, ARC_SPACE_DATA); } goto out; } @@ -2809,7 +2820,7 @@ arc_get_data_buf(arc_buf_t *buf) if ((buf->b_data = arc_evict(state, 0, size, recycle, evict)) == NULL) { if (type == ARC_BUFC_METADATA) { buf->b_data = zio_buf_alloc(size); - arc_space_consume(size, ARC_SPACE_DATA); + arc_space_consume(size, ARC_SPACE_META); /* * If we are unable to recycle an existing meta buffer @@ -2824,8 +2835,7 @@ } else { ASSERT(type == ARC_BUFC_DATA); buf->b_data = zio_data_buf_alloc(size); - ARCSTAT_INCR(arcstat_data_size, size); - atomic_add_64(&arc_size, size); + arc_space_consume(size, ARC_SPACE_DATA); } /* Only bump this if we tried to recycle and failed */ From 2b13331d62a58e3efaecd0b7ab884c45c52b8d3b Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 3 Feb 2014 12:21:51 -0800 Subject: [PATCH 11/11] Set "arc_meta_limit" to 3/4 arc_c_max by default Unfortunately, this change is a cheap attempt to work around a pathological workload for the ARC. A "real" solution still needs to be fleshed out, so this patch is intended to alleviate the situation in the meantime. Let me try to describe the problem. Data buffers residing in the dbuf hash table (dbuf cache) will keep a hold on their respective dnode; this dnode will in turn keep a hold on its backing dbuf (the physical block of the dnode object backing it). Since the dnode has a hold on its backing dbuf, the arc buffer for this dbuf is unevictable. What this essentially boils down to is that "data" buffers have the potential to pin "metadata" in the arc (as a result of these dnode object buffers being unevictable). This scenario becomes a real problem when the workload consists of many small files (e.g. creating millions of 4K files). With this workload, the arc's "arc_meta_used" space gets filled up with buffers for any resident directories as well as buffers for the objset's dnode object. Once the "arc_meta_limit" is reached, the directory buffers will be evicted and only the unevictable dnode object buffers will reside. If the workload is simply creating new small files, these dnode object buffers will never even be needed again, whereas the directory buffers will be used constantly until the creates move to a new directory. If "arc_c" and "arc_meta_limit" are sized appropriately, this situation won't occur.
This is because as the data buffers accumulate, "arc_size" will eventually approach "arc_c" (before "arc_meta_used" reaches "arc_meta_limit"); at that point the data buffers will be evicted, which releases the hold on the dnode, which releases the hold on the dnode object's dbuf, which allows that buffer to be evicted from the arc in preference to more "useful" metadata. So, to sidestep the issue, we simply need to ensure "arc_size" reaches "arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to pick a proper limit, we have to do some math. To make things a little easier to follow, it is assumed that there will only be a single data buffer per file (which is probably always the case for "small" files anyway). Based on the current internals of the arc, if N files residing in the dbuf cache all pin a single dnode buffer (i.e. their dnodes all share the same physical dnode object block), then the following amount of "arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [ 16384 bytes]
    - N * sizeof(dnode_t) -------------- [ N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of "arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is independent of the size of the files and depends only on the number of pinned dnodes sharing a physical block (i.e. N). For example, 32 512b files sharing a single dnode object block would consume the same "arc_meta_used" space as 32 4K files sharing a single dnode object block. Now, given a file size of S, we can determine the total amount of space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which gives the ratio of pinned metadata to total arc (meta + data) using different values of N (number of pinned dnodes per pinned physical dnode block) and S (size of the file).

                              File Sizes (S)
          |    512   |   1024   |   2048   |   4096   |   8192   |  16384   |
       ---+----------+----------+----------+----------+----------+----------+
        1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
        2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
    N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
        8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
       16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
       32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1 dnode per physical dnode block and 512b files, we would have to set the "arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At that point, it essentially defeats the purpose of having an "arc_meta_limit" at all. This patch changes the default value of "arc_meta_limit" to be 75% of "arc_c_max", which should be good enough for "most" workloads (I think).
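For what it's worth, the table can be regenerated from the two formulas with a few lines of userspace C; the 928/72/264/280 byte structure sizes are the ones quoted above and are specific to this code base at this point in time:

    #include <stdio.h>

    int
    main(void)
    {
        /* 17000 = 16384 + (72 + 264 + 280); 1544 = 928 + (72 + 264 + 280) */
        const double base = 17000.0, per_dnode = 1544.0;
        const int sizes[] = { 512, 1024, 2048, 4096, 8192, 16384 };
        int n, i;
        double meta, total;

        for (n = 1; n <= 32; n *= 2) {
            printf("%2d |", n);
            for (i = 0; i < 6; i++) {
                meta = base + n * per_dnode;
                total = meta + (double)sizes[i] * n;
                printf(" %f |", meta / total);
            }
            printf("\n");
        }
        return (0);
    }

Each printed row is the pinned-metadata ratio (meta / total) for one value of N across the six file sizes, matching the table row for row.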
Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 82b35bb62..ccc9510fd 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -4113,8 +4113,8 @@ arc_init(void) arc_c = arc_c_max; arc_p = (arc_c >> 1); - /* limit meta-data to 1/4 of the arc capacity */ - arc_meta_limit = arc_c_max / 4; + /* limit meta-data to 3/4 of the arc capacity */ + arc_meta_limit = (3 * arc_c_max) / 4; arc_meta_max = 0; /* Allow the tunable to override if it is reasonable */
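Taken together, the series leaves the ARC with a 4MB default floor, a nonzero-tunable override for the minimum, and a 3/4 metadata limit. The following is a condensed, hedged sketch of the resulting arc_init() defaults, simplified from the patched code above (physmem handling and the zfs_arc_max override are elided, and arc_init_defaults is an illustrative name, not a function in the tree):

    #include <stdint.h>
    #include <stdio.h>

    /* Module tunables; 0 means "use the computed default". */
    static uint64_t zfs_arc_min = 0;
    static uint64_t zfs_arc_meta_limit = 0;

    static uint64_t arc_c_min, arc_c_max, arc_meta_limit;

    /* arc_c is assumed to arrive here as 1/8 of physical memory. */
    static void
    arc_init_defaults(uint64_t arc_c)
    {
        arc_c_min = 4ULL << 20;     /* patch 01: 4MB, was 1/32 of memory */
        arc_c_max = arc_c * 4;      /* unchanged: 1/2 of all memory */

        /* patch 01: any nonzero zfs_arc_min is honored, not just >64MB */
        if (zfs_arc_min > 0 && zfs_arc_min <= arc_c_max)
            arc_c_min = zfs_arc_min;

        arc_meta_limit = (3 * arc_c_max) / 4;  /* patch 11: 3/4, was 1/4 */
        if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
            arc_meta_limit = zfs_arc_meta_limit;
    }

    int
    main(void)
    {
        arc_init_defaults(4ULL << 30);  /* e.g. 32GB system: arc_c = 4GB */
        printf("arc_c_min=%llu arc_c_max=%llu arc_meta_limit=%llu\n",
            (unsigned long long)arc_c_min, (unsigned long long)arc_c_max,
            (unsigned long long)arc_meta_limit);
        return (0);
    }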