Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold

Historically, while doing performance testing, we've noticed that IOPS
can be significantly reduced when all vdevs in the pool are hitting
the zfs_mg_fragmentation_threshold percentage. Specifically, in a
hypothetical pool with two vdevs, the following can happen:
vdev A goes above the threshold, so only vdev B is used.
Then vdev B passes the threshold while vdev A drops below it
(we've been freeing from A while allocating to B). Allocations then
go back and forth, using one vdev at a time, and IOPS take a hit.

Empirically, we've seen that our vdev selection for allocations is
good enough that fragmentation increases uniformly across all vdevs
the majority of the time. Thus we set the threshold percentage high
enough to avoid hitting the speed bump on pools that are being pushed
to the edge. This effectively disables the threshold in the majority
of cases, but we don't remove it (at least for now) in case we hit
any unexpected behavior in the future.
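
For illustration, the eligibility rule that this threshold controls can
be sketched as below. This is a minimal sketch under stated assumptions,
not the actual metaslab code: the helper names all_groups_over_threshold()
and group_is_eligible(), and the frag[] array of per-group fragmentation
percentages, are hypothetical. Only the tunable name and its new default
of 95 come from this change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Default after this change (previously 85). */
static int zfs_mg_fragmentation_threshold = 95;

/*
 * Hypothetical helper: returns true only if every group's fragmentation
 * percentage in frag[] is above the threshold.
 */
static bool
all_groups_over_threshold(const int *frag, size_t ngroups)
{
	assert(frag != NULL);
	for (size_t i = 0; i < ngroups; i++) {
		if (frag[i] <= zfs_mg_fragmentation_threshold)
			return (false);
	}
	return (true);
}

/*
 * Hypothetical helper: a group is eligible if it is at or below the
 * threshold, or if all groups in the class have crossed it.
 */
static bool
group_is_eligible(const int *frag, size_t ngroups, size_t idx)
{
	assert(frag != NULL && idx < ngroups);
	if (frag[idx] <= zfs_mg_fragmentation_threshold)
		return (true);
	return (all_groups_over_threshold(frag, ngroups));
}
```

With two groups at 90% and 80% fragmentation, the old value of 85 leaves
only the second group eligible in this sketch, while the new value of 95
keeps both eligible, so allocations are not funneled to a single vdev.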

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8859
Serapheim Dimitropoulos 2019-06-06 13:08:41 -07:00 committed by Brian Behlendorf
parent 627f5117a3
commit cb020f0d86
2 changed files with 21 additions and 6 deletions


@@ -1817,7 +1817,7 @@ this value. If a metaslab group exceeds this threshold then it will be
 skipped unless all metaslab groups within the metaslab class have also
 crossed this threshold.
 .sp
-Default value: \fB85\fR.
+Default value: \fB95\fR.
 .RE
 .sp


@@ -103,12 +103,27 @@ int zfs_mg_noalloc_threshold = 0;
 /*
  * Metaslab groups are considered eligible for allocations if their
- * fragmenation metric (measured as a percentage) is less than or equal to
- * zfs_mg_fragmentation_threshold. If a metaslab group exceeds this threshold
- * then it will be skipped unless all metaslab groups within the metaslab
- * class have also crossed this threshold.
+ * fragmenation metric (measured as a percentage) is less than or
+ * equal to zfs_mg_fragmentation_threshold. If a metaslab group
+ * exceeds this threshold then it will be skipped unless all metaslab
+ * groups within the metaslab class have also crossed this threshold.
+ *
+ * This tunable was introduced to avoid edge cases where we continue
+ * allocating from very fragmented disks in our pool while other, less
+ * fragmented disks, exists. On the other hand, if all disks in the
+ * pool are uniformly approaching the threshold, the threshold can
+ * be a speed bump in performance, where we keep switching the disks
+ * that we allocate from (e.g. we allocate some segments from disk A
+ * making it bypassing the threshold while freeing segments from disk
+ * B getting its fragmentation below the threshold).
+ *
+ * Empirically, we've seen that our vdev selection for allocations is
+ * good enough that fragmentation increases uniformly across all vdevs
+ * the majority of the time. Thus we set the threshold percentage high
+ * enough to avoid hitting the speed bump on pools that are being pushed
+ * to the edge.
  */
-int zfs_mg_fragmentation_threshold = 85;
+int zfs_mg_fragmentation_threshold = 95;
 /*
  * Allow metaslabs to keep their active state as long as their fragmentation