From 914158259225d9723bbb288457797936b41d3c9a Mon Sep 17 00:00:00 2001 From: Brian Behlendorf Date: Tue, 23 Jul 2013 15:33:23 -0700 Subject: [PATCH 01/11] Set zfs_arc_min to 4MB Decrease the minimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB. 1) Large systems with over 1TB of memory are being deployed and reserving 1/32 of this memory (32GB) as the minimum requirement is overkill. 2) Tiny systems like the Raspberry Pi may only have 256MB of memory, in which case 64MB is far too large. The ARC should be reclaimable if the VFS determines it needs the memory for some other purpose. If you want to ensure the ARC is never completely reclaimed due to memory pressure, you may still set a larger value with zfs_arc_min. Signed-off-by: Brian Behlendorf Signed-off-by: Prakash Surya Issue #2110 --- module/zfs/arc.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index cbe0a6028..11839af59 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -4025,8 +4025,8 @@ arc_init(void) spl_register_shrinker(&arc_shrinker); #endif - /* set min cache to 1/32 of all memory, or 64MB, whichever is more */ - arc_c_min = MAX(arc_c / 4, 64<<20); + /* set min cache to 4MB */ + arc_c_min = 4<<20; /* set max to 1/2 of all memory */ arc_c_max = arc_c * 4; @@ -4036,7 +4036,7 @@ arc_init(void) */ if (zfs_arc_max > 64<<20 && zfs_arc_max < physmem * PAGESIZE) arc_c_max = zfs_arc_max; - if (zfs_arc_min > 64<<20 && zfs_arc_min <= arc_c_max) + if (zfs_arc_min > 0 && zfs_arc_min <= arc_c_max) arc_c_min = zfs_arc_min; arc_c = arc_c_max; @@ -4050,9 +4050,6 @@ arc_init(void) if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max) arc_meta_limit = zfs_arc_meta_limit; - if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0) - arc_c_min = arc_meta_limit / 2; - /* if kmem_flags are set, lets try to use less memory */ if (kmem_debugging()) arc_c = arc_c / 2; From 39e055c44b3025d9ef87f6c6178ef5ae5e2a55c6 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Wed, 11 Dec 2013 12:25:30 -0800 Subject: [PATCH 02/11] Adjust arc_p based on "bytes" in arc_shrink In an attempt to prevent arc_c from collapsing "too fast", the arc_shrink() function was updated to take a "bytes" parameter by this change: commit 302f753f1657c05a4287226eeda1f53ae431b8a7 Author: Brian Behlendorf Date: Tue Mar 13 14:29:16 2012 -0700 Integrate ARC more tightly with Linux Unfortunately, that change failed to make a similar change to the way that arc_p was updated. So, there still exists the possibility for arc_p to collapse to near 0 when the kernel starts calling the arc's shrinkers. This change attempts to fix this by decrementing arc_p by the "bytes" parameter in the same way that arc_c is updated. In addition, this change attempts to maintain a minimum value of arc_p, similar to the way a minimum arc_p value is maintained in arc_adapt(). Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 11839af59..cbc9b9fee 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2332,6 +2332,7 @@ void arc_shrink(uint64_t bytes) { if (arc_c > arc_c_min) { + uint64_t arc_p_min; uint64_t to_free; to_free = bytes ?
bytes : arc_c >> zfs_arc_shrink_shift; @@ -2341,7 +2342,14 @@ arc_shrink(uint64_t bytes) else arc_c = arc_c_min; - atomic_add_64(&arc_p, -(arc_p >> zfs_arc_shrink_shift)); + arc_p_min = (arc_c >> zfs_arc_p_min_shift); + to_free = bytes ? bytes : arc_p >> zfs_arc_shrink_shift; + + if (arc_p > arc_p_min + to_free) + atomic_add_64(&arc_p, -to_free); + else + arc_p = arc_p_min; + if (arc_c > arc_size) arc_c = MAX(arc_size, arc_c_min); if (arc_p > arc_c) From 89c8cac493687875eecc80a4a03f667d98dd82d0 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Wed, 11 Dec 2013 09:40:13 -0800 Subject: [PATCH 03/11] Disable aggressive arc_p growth by default For specific workloads consisting mainly of mfu data and new anon data buffers, the aggressive growth of arc_p found in the arc_get_data_buf() function can have detrimental effects on the mfu list size and ghost list hit rate. Running a workload consisting of two processes: * Process 1 is creating many small files * Process 2 is tar'ing a directory consisting of many small files I've seen arc_p and the mru grow to their maximum size, while the mru ghost list receives 100K times fewer hits than the mfu ghost list. Ideally, as the mfu ghost list receives hits, arc_p should be driven down and the size of the mfu should increase. Given the specific workload I was testing with, the mfu list size should grow to a point where almost no mfu ghost list hits would occur. Unfortunately, this does not happen because the newly dirtied anon buffers constantly drive arc_p to its maximum value and keep it there (effectively prioritizing the mru list and starving the mfu list down to a negligible size). The logic to increment arc_p from within the arc_get_data_buf() function was introduced many years ago in this upstream commit: commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc Author: maybee Date: Wed Dec 20 15:46:12 2006 -0800 6505658 target MRU size (arc.p) needs to be adjusted more aggressively and since I don't fully understand the motivation for the change, I am reluctant to completely remove it. As a way to test how its removal might affect performance, I've disabled that code by default, but left it tunable via a module option. Thus, if its removal is found to be grossly detrimental for certain workloads, it can be re-enabled on the fly, without a code change. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- man/man5/zfs-module-parameters.5 | 11 +++++++++++ module/zfs/arc.c | 9 ++++++++- 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/man/man5/zfs-module-parameters.5 b/man/man5/zfs-module-parameters.5 index d33da62f4..7f010c6cf 100644 --- a/man/man5/zfs-module-parameters.5 +++ b/man/man5/zfs-module-parameters.5 @@ -304,6 +304,17 @@ arc_c shift to calc min/max arc_p Default value: \fB4\fR. .RE +.sp +.ne 2 +.na +\fBzfs_arc_p_aggressive_disable\fR (int) +.ad +.RS 12n +Disable aggressive arc_p growth +.sp +Use \fB1\fR for yes (default) and \fB0\fR to disable.
+.RE + .sp .ne 2 .na diff --git a/module/zfs/arc.c b/module/zfs/arc.c index cbc9b9fee..eac6ea448 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -175,6 +175,9 @@ int zfs_arc_grow_retry = 5; /* shift of arc_c for calculating both min and max arc_p */ int zfs_arc_p_min_shift = 4; +/* disable anon data aggressively growing arc_p */ +int zfs_arc_p_aggressive_disable = 1; + /* log2(fraction of arc to reclaim) */ int zfs_arc_shrink_shift = 5; @@ -2798,7 +2801,8 @@ out: * If we are growing the cache, and we are adding anonymous * data, and we have outgrown arc_p, update arc_p */ - if (arc_size < arc_c && hdr->b_state == arc_anon && + if (!zfs_arc_p_aggressive_disable && + arc_size < arc_c && hdr->b_state == arc_anon && arc_anon->arcs_size + arc_mru->arcs_size > arc_p) arc_p = MIN(arc_c, arc_p + size); } @@ -5553,6 +5557,9 @@ MODULE_PARM_DESC(zfs_arc_meta_prune, "Bytes of meta data to prune"); module_param(zfs_arc_grow_retry, int, 0644); MODULE_PARM_DESC(zfs_arc_grow_retry, "Seconds before growing arc size"); +module_param(zfs_arc_p_aggressive_disable, int, 0644); +MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow"); + module_param(zfs_arc_shrink_shift, int, 0644); MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)"); From f521ce1b9c6102f9175f26548d4c521e115f8d60 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 10:20:21 -0800 Subject: [PATCH 04/11] Allow "arc_p" to drop to zero or grow to "arc_c" Setting a limit on the minimum value of "arc_p" has been shown to have detrimental effects on the arc hit rate for certain "metadata" intensive workloads. Specifically, this has been exhibited with a workload that constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). What is seen is that the new anon data throttles the mfu list to a negligible size (because arc_p > anon + mru in arc_get_data_buf), even though the mfu ghost list receives a constant stream of hits. To remedy this, arc_p is now allowed to drop to zero if the algorithm deems it necessary. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- man/man5/zfs-module-parameters.5 | 11 ----------- module/zfs/arc.c | 17 ++++------------- 2 files changed, 4 insertions(+), 24 deletions(-) diff --git a/man/man5/zfs-module-parameters.5 b/man/man5/zfs-module-parameters.5 index 7f010c6cf..2acf68383 100644 --- a/man/man5/zfs-module-parameters.5 +++ b/man/man5/zfs-module-parameters.5 @@ -293,17 +293,6 @@ Min life of prefetch block Default value: \fB100\fR. .RE -.sp -.ne 2 -.na -\fBzfs_arc_p_min_shift\fR (int) -.ad -.RS 12n -arc_c shift to calc min/max arc_p -.sp -Default value: \fB4\fR. -.RE - .sp .ne 2 .na diff --git a/module/zfs/arc.c b/module/zfs/arc.c index eac6ea448..f66eaa407 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -172,9 +172,6 @@ int arc_evict_iterations = 100; /* number of seconds before growing cache again */ int zfs_arc_grow_retry = 5; -/* shift of arc_c for calculating both min and max arc_p */ -int zfs_arc_p_min_shift = 4; - /* disable anon data aggressively growing arc_p */ int zfs_arc_p_aggressive_disable = 1; @@ -2335,7 +2332,6 @@ void arc_shrink(uint64_t bytes) { if (arc_c > arc_c_min) { - uint64_t arc_p_min; uint64_t to_free; to_free = bytes ? bytes : arc_c >> zfs_arc_shrink_shift; @@ -2345,13 +2341,12 @@ arc_shrink(uint64_t bytes) else arc_c = arc_c_min; - arc_p_min = (arc_c >> zfs_arc_p_min_shift); to_free = bytes ? 
bytes : arc_p >> zfs_arc_shrink_shift; - if (arc_p > arc_p_min + to_free) + if (arc_p > to_free) atomic_add_64(&arc_p, -to_free); else - arc_p = arc_p_min; + arc_p = 0; if (arc_c > arc_size) arc_c = MAX(arc_size, arc_c_min); @@ -2622,7 +2617,6 @@ static void arc_adapt(int bytes, arc_state_t *state) { int mult; - uint64_t arc_p_min = (arc_c >> zfs_arc_p_min_shift); if (state == arc_l2c_only) return; @@ -2641,7 +2635,7 @@ arc_adapt(int bytes, arc_state_t *state) 1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size)); mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ - arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult); + arc_p = MIN(arc_c, arc_p + bytes * mult); } else if (state == arc_mfu_ghost) { uint64_t delta; @@ -2650,7 +2644,7 @@ arc_adapt(int bytes, arc_state_t *state) mult = MIN(mult, 10); delta = MIN(bytes * mult, arc_p); - arc_p = MAX(arc_p_min, arc_p - delta); + arc_p = MAX(0, arc_p - delta); } ASSERT((int64_t)arc_p >= 0); @@ -5563,9 +5557,6 @@ MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow"); module_param(zfs_arc_shrink_shift, int, 0644); MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)"); -module_param(zfs_arc_p_min_shift, int, 0644); -MODULE_PARM_DESC(zfs_arc_p_min_shift, "arc_c shift to calc min/max arc_p"); - module_param(zfs_disable_dup_eviction, int, 0644); MODULE_PARM_DESC(zfs_disable_dup_eviction, "disable duplicate buffer eviction"); From 624227854e619d1bf555445f3bc38730fb9278c4 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 10:36:26 -0800 Subject: [PATCH 05/11] Disable arc_p adapt dampener by default It's unclear why adjustments to arc_p need to be dampened as they are in arc_adapt. With that said, its removal significantly improves the arc's ability to "warm up" to a given workload. Thus, I'm disabling it by default until its usefulness is better understood. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- man/man5/zfs-module-parameters.5 | 11 +++++++++++ module/zfs/arc.c | 14 ++++++++++++-- 2 files changed, 23 insertions(+), 2 deletions(-) diff --git a/man/man5/zfs-module-parameters.5 b/man/man5/zfs-module-parameters.5 index 2acf68383..e0d44d22c 100644 --- a/man/man5/zfs-module-parameters.5 +++ b/man/man5/zfs-module-parameters.5 @@ -304,6 +304,17 @@ Disable aggressive arc_p growth Use \fB1\fR for yes (default) and \fB0\fR to disable. .RE +.sp +.ne 2 +.na +\fBzfs_arc_p_dampener_disable\fR (int) +.ad +.RS 12n +Disable arc_p adapt dampener +.sp +Use \fB1\fR for yes (default) and \fB0\fR to disable. +.RE + .sp .ne 2 .na diff --git a/module/zfs/arc.c b/module/zfs/arc.c index f66eaa407..68496783d 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -175,6 +175,9 @@ int zfs_arc_grow_retry = 5; /* disable anon data aggressively growing arc_p */ int zfs_arc_p_aggressive_disable = 1; +/* disable arc_p adapt dampener in arc_adapt */ +int zfs_arc_p_dampener_disable = 1; + /* log2(fraction of arc to reclaim) */ int zfs_arc_shrink_shift = 5; @@ -2633,7 +2636,9 @@ arc_adapt(int bytes, arc_state_t *state) if (state == arc_mru_ghost) { mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size)); - mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ + + if (!zfs_arc_p_dampener_disable) + mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ arc_p = MIN(arc_c, arc_p + bytes * mult); } else if (state == arc_mfu_ghost) { @@ -2641,7 +2646,9 @@ arc_adapt(int bytes, arc_state_t *state) mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ? 1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size)); - mult = MIN(mult, 10); + + if (!zfs_arc_p_dampener_disable) + mult = MIN(mult, 10); delta = MIN(bytes * mult, arc_p); arc_p = MAX(0, arc_p - delta); @@ -5554,6 +5561,9 @@ MODULE_PARM_DESC(zfs_arc_grow_retry, "Seconds before growing arc size"); module_param(zfs_arc_p_aggressive_disable, int, 0644); MODULE_PARM_DESC(zfs_arc_p_aggressive_disable, "disable aggressive arc_p grow"); +module_param(zfs_arc_p_dampener_disable, int, 0644); +MODULE_PARM_DESC(zfs_arc_p_dampener_disable, "disable arc_p adapt dampener"); + module_param(zfs_arc_shrink_shift, int, 0644); MODULE_PARM_DESC(zfs_arc_shrink_shift, "log2(fraction of arc to reclaim)"); From 1e3cb67b53fba067fd7bf9a13d21b53de4626dc1 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 23 Dec 2013 11:34:20 -0800 Subject: [PATCH 06/11] Revert "Return -1 from arc_shrinker_func()" This reverts commit c11a12bc3b2e5ee9a6bd74e26f1a396b6025fbd4. Out of memory events were fixed by reverting this patch. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 68496783d..ad2e8a92d 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2583,8 +2583,10 @@ __arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc) */ if (pages > 0) { arc_kmem_reap_now(ARC_RECLAIM_AGGR, ptob(sc->nr_to_scan)); + pages = btop(arc_evictable_memory()); } else { arc_kmem_reap_now(ARC_RECLAIM_CONS, ptob(sc->nr_to_scan)); + pages = -1; } /* @@ -2604,7 +2606,7 @@ __arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc) mutex_exit(&arc_reclaim_thr_lock); - return (-1); + return (pages); } SPL_SHRINKER_CALLBACK_WRAPPER(arc_shrinker_func); From 94520ca4626c7b01340473bccdaa3ed038a85a8f Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 11:40:52 -0800 Subject: [PATCH 07/11] Prune metadata from ghost lists in arc_adjust_meta To maintain a strict limit on the metadata contained in the arc, while preventing the arc buffer headers from completely consuming the "arc_meta_used" space, we need to evict metadata buffers from the arc's ghost lists along with the regular lists. This change modifies arc_adjust_meta such that it more closely models the adjustments made in arc_adjust. "arc_meta_used" is used similarly to "arc_size", and "arc_meta_limit" is used similarly to "arc_c". Testing metadata intensive workloads (e.g. creating, copying, and removing millions of small files and/or directories) has shown this change to make a dramatic improvement to the hit rate maintained in the arc. While I think there is still room for improvement, this is a big step in the right direction. In addition, zpl_free_cached_objects was made into a no-op as I'm not yet sure how to properly implement that function. 
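To make the new control flow easier to follow before reading the diff, here is a minimal userspace sketch of the adjustment arithmetic described above. The list sizes are invented stand-ins for the ARC state structures, the evictions are reduced to printf calls (so the sizes are not decremented between passes as they would be by arc_evict), and only the shape of the calculation is taken from the change below:

    #include <stdio.h>
    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Example metadata bytes per list; stand-ins for arcs_lsize[] state. */
    static int64_t mru_meta = 600, mfu_meta = 300;
    static int64_t mru_ghost_meta = 400, mfu_ghost_meta = 200;
    static int64_t meta_used = 900, meta_limit = 500;

    int
    main(void)
    {
        int64_t adjustmnt, delta;

        /* First pass: the regular lists, mru before mfu. */
        adjustmnt = meta_used - meta_limit;
        if (adjustmnt > 0 && mru_meta > 0) {
            delta = MIN(mru_meta, adjustmnt);
            printf("evict %lld from mru\n", (long long)delta);
            adjustmnt -= delta;
        }
        if (adjustmnt > 0 && mfu_meta > 0) {
            delta = MIN(mfu_meta, adjustmnt);
            printf("evict %lld from mfu\n", (long long)delta);
        }

        /* Second pass: each ghost list is measured against the limit. */
        adjustmnt = mru_meta + mru_ghost_meta - meta_limit;
        if (adjustmnt > 0 && mru_ghost_meta > 0) {
            delta = MIN(adjustmnt, mru_ghost_meta);
            printf("evict %lld from mru ghost\n", (long long)delta);
        }

        adjustmnt = mru_ghost_meta + mfu_ghost_meta - meta_limit;
        if (adjustmnt > 0 && mfu_ghost_meta > 0) {
            delta = MIN(adjustmnt, mfu_ghost_meta);
            printf("evict %lld from mfu ghost\n", (long long)delta);
        }
        return (0);
    }

The in-kernel version in the diff below operates on the real arcs_lsize[] counters and actually shrinks the lists as it goes; the sketch only shows how each eviction target is derived.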
Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- include/sys/arc.h | 1 - module/zfs/arc.c | 67 ++++++++++++++++++++++++++++++------------ module/zfs/zpl_super.c | 2 +- 3 files changed, 49 insertions(+), 21 deletions(-) diff --git a/include/sys/arc.h b/include/sys/arc.h index 9d68d3b43..5c8c1c1a3 100644 --- a/include/sys/arc.h +++ b/include/sys/arc.h @@ -160,7 +160,6 @@ void arc_freed(spa_t *spa, const blkptr_t *bp); void arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private); int arc_buf_evict(arc_buf_t *buf); -void arc_adjust_meta(int64_t adjustment, boolean_t may_prune); void arc_flush(spa_t *spa); void arc_tempreserve_clear(uint64_t reserve); int arc_tempreserve_space(uint64_t reserve, uint64_t txg); diff --git a/module/zfs/arc.c b/module/zfs/arc.c index ad2e8a92d..9c2d0eaab 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2268,24 +2268,61 @@ arc_do_user_evicts(void) * This is only used to enforce the tunable arc_meta_limit, if we are * unable to evict enough buffers notify the user via the prune callback. */ -void -arc_adjust_meta(int64_t adjustment, boolean_t may_prune) +static void +arc_adjust_meta(void) { - int64_t delta; + int64_t adjustmnt, delta; - if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) { - delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment); + /* + * This slightly differs than the way we evict from the mru in + * arc_adjust because we don't have a "target" value (i.e. no + * "meta" arc_p). As a result, I think we can completely + * cannibalize the metadata in the MRU before we evict the + * metadata from the MFU. I think we probably need to implement a + * "metadata arc_p" value to do this properly. + */ + adjustmnt = arc_meta_used - arc_meta_limit; + + if (adjustmnt > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustmnt); arc_evict(arc_mru, 0, delta, FALSE, ARC_BUFC_METADATA); - adjustment -= delta; + adjustmnt -= delta; } - if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) { - delta = MIN(arc_mfu->arcs_lsize[ARC_BUFC_METADATA], adjustment); + /* + * We can't afford to recalculate adjustmnt here. If we do, + * new metadata buffers can sneak into the MRU or ANON lists, + * thus penalize the MFU metadata. Although the fudge factor is + * small, it has been empirically shown to be significant for + * certain workloads (e.g. creating many empty directories). As + * such, we use the original calculation for adjustmnt, and + * simply decrement the amount of data evicted from the MRU. 
+ */ + + if (adjustmnt > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(arc_mfu->arcs_lsize[ARC_BUFC_METADATA], adjustmnt); arc_evict(arc_mfu, 0, delta, FALSE, ARC_BUFC_METADATA); } + adjustmnt = arc_mru->arcs_lsize[ARC_BUFC_METADATA] + + arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] - arc_meta_limit; + + if (adjustmnt > 0 && arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(adjustmnt, + arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA]); + arc_evict_ghost(arc_mru_ghost, 0, delta, ARC_BUFC_METADATA); + } + + adjustmnt = arc_mru_ghost->arcs_lsize[ARC_BUFC_METADATA] + + arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA] - arc_meta_limit; + + if (adjustmnt > 0 && arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA] > 0) { + delta = MIN(adjustmnt, + arc_mfu_ghost->arcs_lsize[ARC_BUFC_METADATA]); + arc_evict_ghost(arc_mfu_ghost, 0, delta, ARC_BUFC_METADATA); + } + + if (arc_meta_used > arc_meta_limit) arc_do_user_prune(zfs_arc_meta_prune); } @@ -2405,7 +2442,6 @@ static void arc_adapt_thread(void) { callb_cpr_t cpr; - int64_t prune; CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG); @@ -2441,14 +2477,7 @@ arc_adapt_thread(void) if (arc_no_grow && ddi_get_lbolt() >= arc_grow_time) arc_no_grow = FALSE; - /* - * Keep meta data usage within limits, arc_shrink() is not - * used to avoid collapsing the arc_c value when only the - * arc_meta_limit is being exceeded. - */ - prune = (int64_t)arc_meta_used - (int64_t)arc_meta_limit; - if (prune > 0) - arc_adjust_meta(prune, B_TRUE); + arc_adjust_meta(); arc_adjust(); diff --git a/module/zfs/zpl_super.c b/module/zfs/zpl_super.c index b4e7b6ed0..45639a6dd 100644 --- a/module/zfs/zpl_super.c +++ b/module/zfs/zpl_super.c @@ -342,7 +342,7 @@ zpl_nr_cached_objects(struct super_block *sb) static void zpl_free_cached_objects(struct super_block *sb, int nr_to_scan) { - arc_adjust_meta(nr_to_scan * sizeof (znode_t), B_FALSE); + /* noop */ } #endif /* HAVE_FREE_CACHED_OBJECTS */ From 77765b540b79bdc42d4f25f174004bbbd06b0a32 Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Fri, 3 Jan 2014 10:11:14 -0800 Subject: [PATCH 08/11] Remove "arc_meta_used" from arc_adjust calculation Using "arc_meta_used" to determine if the arc's mru list is over its target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust.
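To see the practical difference, consider a hedged numeric sketch; the example sizes are invented, and only the two MIN() expressions are taken from the old and new code in the diff that follows:

    #include <stdio.h>
    #include <stdint.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int
    main(void)
    {
        /* Invented example sizes, in bytes. */
        int64_t arc_size = 10LL << 30;  /* 10G cached */
        int64_t arc_c = 9LL << 30;      /* 9G target, so 1G over */
        int64_t anon = 1LL << 30, mru = 4LL << 30;
        int64_t meta_used = 3LL << 30, arc_p = 5LL << 30;
        int64_t old_adj, new_adj;

        /* Old: arc_meta_used inflates the mru eviction target. */
        old_adj = MIN(arc_size - arc_c, anon + mru + meta_used - arc_p);

        /* New: only the anon + mru footprint is compared to arc_p. */
        new_adj = MIN(arc_size - arc_c, anon + mru - arc_p);

        printf("old adjustment: %lld\n", (long long)old_adj);  /* 1G */
        printf("new adjustment: %lld\n", (long long)new_adj);  /* 0 */
        return (0);
    }

With these numbers, anon + mru sits exactly at arc_p, yet the old formula would still evict 1G of mru data purely because of metadata counted elsewhere in the arc.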
Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 9c2d0eaab..f28748423 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -2144,8 +2144,7 @@ arc_adjust(void) */ adjustment = MIN((int64_t)(arc_size - arc_c), - (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used - - arc_p)); + (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size - arc_p)); if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) { delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment); From da8ccd0ee0d4eed08200e706cb2ca1da3bdfb5cf Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 30 Dec 2013 09:30:00 -0800 Subject: [PATCH 09/11] Prioritize "metadata" in arc_get_data_buf When the arc is at its size limit and a new buffer is added, data will be evicted (or recycled) from the arc to make room for this new buffer. As far as I can tell, this is to try to keep the arc from overstepping its bounds (i.e. keep it below the size limitation placed on it). This makes sense conceptually, but there appears to be a subtle flaw in its current implementation, resulting in metadata buffers being throttled. When it evicts from the arc's lists, it also passes in a "type" so as to remove a buffer of the same type that it is adding. The problem with this is that once the size limit is hit, the ratio of "metadata" to "data" contained in the arc essentially becomes fixed. For example, consider the following scenario: * the size of the arc is capped at 10G * the meta_limit is capped at 4G * 9G of the arc contains "data" * 1G of the arc contains "metadata" Now, every time a new "metadata" buffer is created and added to the arc, one or more older "metadata" buffers will be removed from the arc; preserving the 9G "data" to 1G "metadata" ratio that was in place when the size limit was reached. This occurs even though the amount of "metadata" is far below the "metadata" limit. This can result in the arc behaving pathologically for certain workloads. To fix this, the arc_get_data_buf function was modified to evict "data" from the arc even when adding a "metadata" buffer, unless it's at the "metadata" limit. In addition, arc_evict now more closely resembles arc_evict_ghost, such that when evicting "data" from the arc, it may make a second pass over the arc lists and evict "metadata" if it cannot meet the eviction size the first time around. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 62 +++++++++++++++++++++++++++++++----------------- 1 file changed, 40 insertions(+), 22 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index f28748423..242d0c8c5 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -1890,6 +1890,7 @@ arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle, evicted_state = (state == arc_mru) ?
arc_mru_ghost : arc_mfu_ghost; +top: mutex_enter(&state->arcs_mtx); mutex_enter(&evicted_state->arcs_mtx); @@ -2005,6 +2006,15 @@ arc_evict(arc_state_t *state, uint64_t spa, int64_t bytes, boolean_t recycle, mutex_exit(&evicted_state->arcs_mtx); mutex_exit(&state->arcs_mtx); + if (list == &state->arcs_list[ARC_BUFC_DATA] && + (bytes < 0 || bytes_evicted < bytes)) { + /* Prevent second pass from recycling metadata into data */ + recycle = FALSE; + type = ARC_BUFC_METADATA; + list = &state->arcs_list[type]; + goto top; + } + if (bytes_evicted < bytes) dprintf("only evicted %lld bytes from %x\n", (longlong_t)bytes_evicted, state); @@ -2146,16 +2156,9 @@ arc_adjust(void) adjustment = MIN((int64_t)(arc_size - arc_c), (int64_t)(arc_anon->arcs_size + arc_mru->arcs_size - arc_p)); - if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) { - delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment); + if (adjustment > 0 && arc_mru->arcs_size > 0) { + delta = MIN(arc_mru->arcs_size, adjustment); (void) arc_evict(arc_mru, 0, delta, FALSE, ARC_BUFC_DATA); - adjustment -= delta; - } - - if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) { - delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment); - (void) arc_evict(arc_mru, 0, delta, FALSE, - ARC_BUFC_METADATA); } /* @@ -2164,17 +2167,9 @@ arc_adjust(void) adjustment = arc_size - arc_c; - if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) { - delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]); + if (adjustment > 0 && arc_mfu->arcs_size > 0) { + delta = MIN(arc_mfu->arcs_size, adjustment); (void) arc_evict(arc_mfu, 0, delta, FALSE, ARC_BUFC_DATA); - adjustment -= delta; - } - - if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) { - int64_t delta = MIN(adjustment, - arc_mfu->arcs_lsize[ARC_BUFC_METADATA]); - (void) arc_evict(arc_mfu, 0, delta, FALSE, - ARC_BUFC_METADATA); } /* @@ -2752,6 +2747,8 @@ arc_get_data_buf(arc_buf_t *buf) arc_state_t *state = buf->b_hdr->b_state; uint64_t size = buf->b_hdr->b_size; arc_buf_contents_t type = buf->b_hdr->b_type; + arc_buf_contents_t evict = ARC_BUFC_DATA; + boolean_t recycle = TRUE; arc_adapt(size, state); @@ -2792,7 +2789,24 @@ arc_get_data_buf(arc_buf_t *buf) mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu; } - if ((buf->b_data = arc_evict(state, 0, size, TRUE, type)) == NULL) { + /* + * Evict data buffers prior to metadata buffers, unless we're + * over the metadata limit and adding a metadata buffer. + */ + if (type == ARC_BUFC_METADATA) { + if (arc_meta_used >= arc_meta_limit) + evict = ARC_BUFC_METADATA; + else + /* + * In this case, we're evicting data while + * adding metadata. Thus, to prevent recycling a + * data buffer into a metadata buffer, recycling + * is disabled in the following arc_evict call. + */ + recycle = FALSE; + } + + if ((buf->b_data = arc_evict(state, 0, size, recycle, evict)) == NULL) { if (type == ARC_BUFC_METADATA) { buf->b_data = zio_buf_alloc(size); arc_space_consume(size, ARC_SPACE_DATA); @@ -2803,8 +2817,10 @@ arc_get_data_buf(arc_buf_t *buf) * via the prune callback to drop references. The * prune callback in run in the context of the reclaim * thread to avoid deadlocking on the hash_lock. + * Of course, only do this when recycle is true. 
*/ - cv_signal(&arc_reclaim_thr_cv); + if (recycle) + cv_signal(&arc_reclaim_thr_cv); } else { ASSERT(type == ARC_BUFC_DATA); buf->b_data = zio_data_buf_alloc(size); @@ -2812,7 +2828,9 @@ atomic_add_64(&arc_size, size); } - ARCSTAT_BUMP(arcstat_recycle_miss); + /* Only bump this if we tried to recycle and failed */ + if (recycle) + ARCSTAT_BUMP(arcstat_recycle_miss); } ASSERT(buf->b_data != NULL); out: From cc7f677c1610bd4ae11676ba02ec8987c347b83d Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 3 Feb 2014 12:41:47 -0800 Subject: [PATCH 10/11] Split "data_size" into "meta" and "data" Previously, the "data_size" field in the arcstats kstat contained the amount of cached "metadata" and "data" in the ARC. The problem is that this made it difficult to extract just the "metadata" size, or just the "data" size. To make it easier to distinguish the two values, "data_size" has been modified to count only buffers of type ARC_BUFC_DATA, and "meta_size" was added to count only buffers of type ARC_BUFC_METADATA. If one wants the old "data_size" value, simply sum the new "data_size" and "meta_size" values. Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- include/sys/arc.h | 1 + module/zfs/arc.c | 38 ++++++++++++++++++++++++-------------- 2 files changed, 25 insertions(+), 14 deletions(-) diff --git a/include/sys/arc.h b/include/sys/arc.h index 5c8c1c1a3..005d07179 100644 --- a/include/sys/arc.h +++ b/include/sys/arc.h @@ -86,6 +86,7 @@ typedef enum arc_buf_contents { */ typedef enum arc_space_type { ARC_SPACE_DATA, + ARC_SPACE_META, ARC_SPACE_HDRS, ARC_SPACE_L2HDRS, ARC_SPACE_OTHER, diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 242d0c8c5..82b35bb62 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -308,6 +308,7 @@ typedef struct arc_stats { kstat_named_t arcstat_size; kstat_named_t arcstat_hdr_size; kstat_named_t arcstat_data_size; + kstat_named_t arcstat_meta_size; kstat_named_t arcstat_other_size; kstat_named_t arcstat_anon_size; kstat_named_t arcstat_anon_evict_data; @@ -395,6 +396,7 @@ static arc_stats_t arc_stats = { { "size", KSTAT_DATA_UINT64 }, { "hdr_size", KSTAT_DATA_UINT64 }, { "data_size", KSTAT_DATA_UINT64 }, + { "meta_size", KSTAT_DATA_UINT64 }, { "other_size", KSTAT_DATA_UINT64 }, { "anon_size", KSTAT_DATA_UINT64 }, { "anon_evict_data", KSTAT_DATA_UINT64 }, @@ -1367,6 +1369,9 @@ arc_space_consume(uint64_t space, arc_space_type_t type) case ARC_SPACE_DATA: ARCSTAT_INCR(arcstat_data_size, space); break; + case ARC_SPACE_META: + ARCSTAT_INCR(arcstat_meta_size, space); + break; case ARC_SPACE_OTHER: ARCSTAT_INCR(arcstat_other_size, space); break; @@ -1378,7 +1383,9 @@ arc_space_consume(uint64_t space, arc_space_type_t type) break; } - ARCSTAT_INCR(arcstat_meta_used, space); + if (type != ARC_SPACE_DATA) + ARCSTAT_INCR(arcstat_meta_used, space); + atomic_add_64(&arc_size, space); } @@ -1393,6 +1400,9 @@ arc_space_return(uint64_t space, arc_space_type_t type) case ARC_SPACE_DATA: ARCSTAT_INCR(arcstat_data_size, -space); break; + case ARC_SPACE_META: + ARCSTAT_INCR(arcstat_meta_size, -space); + break; case ARC_SPACE_OTHER: ARCSTAT_INCR(arcstat_other_size, -space); break; @@ -1404,10 +1414,13 @@ arc_space_return(uint64_t space, arc_space_type_t type) break; } - ASSERT(arc_meta_used >= space); - if (arc_meta_max < arc_meta_used) - arc_meta_max = arc_meta_used; - ARCSTAT_INCR(arcstat_meta_used, -space); + if (type != ARC_SPACE_DATA) { + ASSERT(arc_meta_used >= space); + if (arc_meta_max < arc_meta_used) +
arc_meta_max = arc_meta_used; + ARCSTAT_INCR(arcstat_meta_used, -space); + } + ASSERT(arc_size >= space); atomic_add_64(&arc_size, -space); } @@ -1604,12 +1617,11 @@ arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all) if (!recycle) { if (type == ARC_BUFC_METADATA) { arc_buf_data_free(buf, zio_buf_free); - arc_space_return(size, ARC_SPACE_DATA); + arc_space_return(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); arc_buf_data_free(buf, zio_data_buf_free); - ARCSTAT_INCR(arcstat_data_size, -size); - atomic_add_64(&arc_size, -size); + arc_space_return(size, ARC_SPACE_DATA); } } if (list_link_active(&buf->b_hdr->b_arc_node)) { @@ -2759,12 +2771,11 @@ arc_get_data_buf(arc_buf_t *buf) if (!arc_evict_needed(type)) { if (type == ARC_BUFC_METADATA) { buf->b_data = zio_buf_alloc(size); - arc_space_consume(size, ARC_SPACE_DATA); + arc_space_consume(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); buf->b_data = zio_data_buf_alloc(size); - ARCSTAT_INCR(arcstat_data_size, size); - atomic_add_64(&arc_size, size); + arc_space_consume(size, ARC_SPACE_DATA); } goto out; } @@ -2809,7 +2820,7 @@ arc_get_data_buf(arc_buf_t *buf) if ((buf->b_data = arc_evict(state, 0, size, recycle, evict)) == NULL) { if (type == ARC_BUFC_METADATA) { buf->b_data = zio_buf_alloc(size); - arc_space_consume(size, ARC_SPACE_DATA); + arc_space_consume(size, ARC_SPACE_META); /* * If we are unable to recycle an existing meta buffer @@ -2824,8 +2835,7 @@ } else { ASSERT(type == ARC_BUFC_DATA); buf->b_data = zio_data_buf_alloc(size); - ARCSTAT_INCR(arcstat_data_size, size); - atomic_add_64(&arc_size, size); + arc_space_consume(size, ARC_SPACE_DATA); } /* Only bump this if we tried to recycle and failed */ From 2b13331d62a58e3efaecd0b7ab884c45c52b8d3b Mon Sep 17 00:00:00 2001 From: Prakash Surya Date: Mon, 3 Feb 2014 12:21:51 -0800 Subject: [PATCH 11/11] Set "arc_meta_limit" to 3/4 arc_c_max by default Unfortunately, this change is a cheap attempt to work around a pathological workload for the ARC. A "real" solution still needs to be fleshed out, so this patch is intended to alleviate the situation in the meantime. Let me try to describe the problem. Data buffers residing in the dbuf hash table (dbuf cache) will keep a hold on their respective dnode; this dnode will in turn keep a hold on its backing dbuf (the physical block of the dnode object backing it). Since the dnode has a hold on its backing dbuf, the arc buffer for this dbuf is unevictable. What this essentially boils down to is that "data" buffers have the potential to pin "metadata" in the arc (as a result of these dnode object buffers being unevictable). This scenario becomes a real problem when the workload consists of many small files (e.g. creating millions of 4K files). With this workload, the arc's "arc_meta_used" space gets filled up with buffers for any resident directories as well as buffers for the objset's dnode object. Once the "arc_meta_limit" is reached, the directory buffers will be evicted and only the unevictable dnode object buffers will reside. If the workload is simply creating new small files, these dnode object buffers will never even be needed again, whereas the directory buffers will be used constantly until the creates move to a new directory. If "arc_c" and "arc_meta_limit" are sized appropriately, this situation won't occur.
This is because as the data buffers accumulate, "arc_size" will eventually approach "arc_c" (before "arc_meta_used" reaches "arc_meta_limit"); at that point the data buffers will be evicted, which releases the hold on the dnode, which releases the hold on the dnode object's dbuf, which allows that buffer to be evicted from the arc in preference to more "useful" metadata. So, to sidestep the issue, we simply need to ensure "arc_size" reaches "arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to pick a proper limit, we have to do some math. To make things a little easier to follow, it is assumed that there will only be a single data buffer per file (which is probably always the case for "small" files anyway). Based on the current internals of the arc, if N files residing in the dbuf cache all pin a single dnode buffer (i.e. their dnodes all share the same physical dnode object block), then the following amount of "arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [ 16384 bytes]
    - N * sizeof(dnode_t) -------------- [ N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of "arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is independent of the size of the files and depends only on the number of pinned dnodes sharing a physical block (i.e. N). For example, 32 512b files sharing a single dnode object block would consume the same "arc_meta_used" space as 32 4K files sharing a single dnode object block. Now, given a file size of S, we can determine the total amount of space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which gives the ratio of pinned metadata to total arc (meta + data) using different values of N (number of pinned dnodes per pinned physical dnode block) and S (size of the file).

                              File Sizes (S)
          |    512   |   1024   |   2048   |   4096   |   8192   |  16384   |
       ---+----------+----------+----------+----------+----------+----------+
        1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
        2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
    N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
        8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
       16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
       32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1 dnode per physical dnode block and 512b files, we would have to set the "arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At that point, it essentially defeats the purpose of having an "arc_meta_limit" at all. This patch changes the default value of "arc_meta_limit" to be 75% of "arc_c_max", which should be good enough for "most" workloads (I think).
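For what it's worth, the table can be regenerated from the two formulas with a few lines of userspace C; the 928/72/264/280 byte structure sizes are the ones quoted above and are specific to this code base at this point in time:

    #include <stdio.h>

    int
    main(void)
    {
        /* 17000 = 16384 + (72 + 264 + 280); 1544 = 928 + (72 + 264 + 280) */
        const double base = 17000.0, per_dnode = 1544.0;
        const int sizes[] = { 512, 1024, 2048, 4096, 8192, 16384 };
        int n, i;
        double meta, total;

        for (n = 1; n <= 32; n *= 2) {
            printf("%2d |", n);
            for (i = 0; i < 6; i++) {
                meta = base + n * per_dnode;
                total = meta + (double)sizes[i] * n;
                printf(" %f |", meta / total);
            }
            printf("\n");
        }
        return (0);
    }

Each printed row is the pinned-metadata ratio (meta / total) for one value of N across the six file sizes, matching the table row for row.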
Signed-off-by: Prakash Surya Signed-off-by: Brian Behlendorf Issue #2110 --- module/zfs/arc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/module/zfs/arc.c b/module/zfs/arc.c index 82b35bb62..ccc9510fd 100644 --- a/module/zfs/arc.c +++ b/module/zfs/arc.c @@ -4113,8 +4113,8 @@ arc_init(void) arc_c = arc_c_max; arc_p = (arc_c >> 1); - /* limit meta-data to 1/4 of the arc capacity */ - arc_meta_limit = arc_c_max / 4; + /* limit meta-data to 3/4 of the arc capacity */ + arc_meta_limit = (3 * arc_c_max) / 4; arc_meta_max = 0; /* Allow the tunable to override if it is reasonable */
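Taken together, the series leaves the ARC with a 4MB default floor, a nonzero-tunable override for the minimum, and a 3/4 metadata limit. The following is a condensed, hedged sketch of the resulting arc_init() defaults, simplified from the patched code above (physmem handling and the zfs_arc_max override are elided, and arc_init_defaults is an illustrative name, not a function in the tree):

    #include <stdint.h>
    #include <stdio.h>

    /* Module tunables; 0 means "use the computed default". */
    static uint64_t zfs_arc_min = 0;
    static uint64_t zfs_arc_meta_limit = 0;

    static uint64_t arc_c_min, arc_c_max, arc_meta_limit;

    /* arc_c is assumed to arrive here as 1/8 of physical memory. */
    static void
    arc_init_defaults(uint64_t arc_c)
    {
        arc_c_min = 4ULL << 20;     /* patch 01: 4MB, was 1/32 of memory */
        arc_c_max = arc_c * 4;      /* unchanged: 1/2 of all memory */

        /* patch 01: any nonzero zfs_arc_min is honored, not just >64MB */
        if (zfs_arc_min > 0 && zfs_arc_min <= arc_c_max)
            arc_c_min = zfs_arc_min;

        arc_meta_limit = (3 * arc_c_max) / 4;  /* patch 11: 3/4, was 1/4 */
        if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
            arc_meta_limit = zfs_arc_meta_limit;
    }

    int
    main(void)
    {
        arc_init_defaults(4ULL << 30);  /* e.g. 32GB system: arc_c = 4GB */
        printf("arc_c_min=%llu arc_c_max=%llu arc_meta_limit=%llu\n",
            (unsigned long long)arc_c_min, (unsigned long long)arc_c_max,
            (unsigned long long)arc_meta_limit);
        return (0);
    }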