Only examine best metaslabs on each vdev

On a system with very high fragmentation, we may need to do lots of gang
allocations (e.g. most indirect block allocations (~50KB) may need to
gang). Before failing a "normal" allocation and resorting to ganging, we
try every metaslab.  This means loading every metaslab (not a huge deal
since we now typically keep all metaslabs loaded), and also iterating
over every metaslab for every failing allocation. If there are many
metaslabs (more than the typical ~200, e.g. due to vdev expansion or
very large vdevs), the CPU cost of this iteration can be significant.
This iteration is done with the mg_lock held, creating long
hold times and high lock contention for concurrent allocations,
ultimately causing long txg sync times and poor application performance.

To address this, this commit changes the behavior of "normal" (not
try_hard, not ZIL) allocations.  These will now only examine the best
metaslabs, as determined by their ms_weight (100 of them by default,
tunable via zfs_metaslab_find_max_tries).  If none of these have a
large enough free segment, then the allocation will fail and we'll fall
back on ganging.
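
As a rough illustration of the bounded search described above, here is a
minimal sketch.  It is not the code from this commit: `sketch_ms_t`,
`sketch_find_metaslab` and their fields are hypothetical stand-ins, and
the array is assumed to be already sorted best-first by weight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a metaslab; not the real metaslab_t. */
typedef struct {
	uint64_t weight;       /* stands in for ms_weight; higher is better */
	uint64_t max_free_seg; /* largest contiguous free segment, in bytes */
} sketch_ms_t;

/*
 * Scan metaslabs that are assumed to be sorted best-first by weight.
 * Return the index of the first one whose largest free segment can hold
 * `size` bytes, or -1 if none qualifies.  Unless try_hard is nonzero,
 * at most max_tries metaslabs are examined.
 */
static int
sketch_find_metaslab(const sketch_ms_t *ms, size_t count, uint64_t size,
    int try_hard, size_t max_tries)
{
	assert(ms != NULL || count == 0);

	size_t limit = (try_hard || max_tries > count) ? count : max_tries;

	for (size_t i = 0; i < limit; i++) {
		if (ms[i].max_free_seg >= size)
			return ((int)i);
	}
	return (-1); /* caller would fall back on ganging */
}
```

With a small max_tries, a failing allocation costs a bounded number of
comparisons no matter how many metaslabs the vdev has, which is the
point of the change.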

To accomplish this, we will now (by default) gang before doing a
`try_hard` allocation.  The 100-metaslab limit applies to each vdev
separately.  In summary, unless zfs_metaslab_try_hard_before_gang is
set, we will first try normal allocation.  If that fails then we will
do a gang allocation.  If that fails then we will do a "try hard" gang
allocation.  If that fails then we will have a multi-layer gang block.
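
The fallback order can be sketched as below.  This only models the
sequence of stages, not the allocator itself; `sketch_stage_t` and
`sketch_alloc_order` are hypothetical names, and the flag mirrors the
zfs_metaslab_try_hard_before_gang tunable documented in this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the allocation stages named in the text. */
typedef enum {
	STAGE_NORMAL,
	STAGE_TRY_HARD,
	STAGE_GANG,
	STAGE_TRY_HARD_GANG,
	STAGE_MULTILAYER_GANG
} sketch_stage_t;

#define	SKETCH_MAX_STAGES	5

/*
 * Fill `out` with the stages in the order they would be attempted, each
 * one tried only if the previous stage failed.  Returns the number of
 * stages written.
 */
static size_t
sketch_alloc_order(int try_hard_before_gang,
    sketch_stage_t out[SKETCH_MAX_STAGES])
{
	size_t n = 0;

	assert(out != NULL);

	out[n++] = STAGE_NORMAL;
	if (try_hard_before_gang)
		out[n++] = STAGE_TRY_HARD; /* only when the tunable is set */
	out[n++] = STAGE_GANG;
	out[n++] = STAGE_TRY_HARD_GANG;
	out[n++] = STAGE_MULTILAYER_GANG; /* last resort */
	return (n);
}
```

With the flag clear (the default) the non-gang "try hard" stage is
skipped, so a failing normal allocation goes straight to ganging.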

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11327
Matthew Ahrens
2020-12-16 14:40:05 -08:00
committed by GitHub
parent f8020c9363
commit be5c6d9653
4 changed files with 95 additions and 55 deletions
@@ -526,6 +526,40 @@ memory that is the threshold.
Default value: \fB25 percent\fR
.RE
.sp
.ne 2
.na
\fBzfs_metaslab_try_hard_before_gang\fR (int)
.ad
.RS 12n
If not set (the default), we will first try normal allocation.
If that fails then we will do a gang allocation.
If that fails then we will do a "try hard" gang allocation.
If that fails then we will have a multi-layer gang block.
.sp
If set, we will first try normal allocation.
If that fails then we will do a "try hard" allocation.
If that fails we will do a gang allocation.
If that fails we will do a "try hard" gang allocation.
If that fails then we will have a multi-layer gang block.
.sp
Default value: \fB0 (false)\fR
.RE
.sp
.ne 2
.na
\fBzfs_metaslab_find_max_tries\fR (int)
.ad
.RS 12n
When not trying hard, we only consider this number of the best metaslabs.
This improves performance, especially when there are many metaslabs per vdev
and the allocation can't actually be satisfied (so we would otherwise iterate
over all the metaslabs).
.sp
Default value: \fB100\fR
.RE
.sp
.ne 2
.na