Unified allocation throttling (#17020)

Existing allocation throttling had a goal to improve write speed by allocating more data to vdevs that are able to write it faster. But in the process it completely broke the original mechanism, designed to balance vdev space usage. With a severe vdev space use imbalance it is possible that vdevs with higher use start growing fragmentation sooner than others, and after getting full will stop any writes at all. Also, after a vdev addition it might take a very long time for the pool to restore the balance, since the new vdev does not have any real preference, unless the old one is already much slower due to fragmentation.

Also, the old throttling was request-based, which was unpredictable with block sizes varying from 512B to 16MB. Nor did it make much sense with I/O aggregation, where its 32-100 requests could be aggregated into a few, leaving the device underutilized by submitting fewer and/or shorter requests, or, at the other extreme, it could try to queue up to 1.6GB of writes per device.

This change presents a completely new throttling algorithm. Unlike the request-based old one, this one measures the allocation queue in bytes. That makes it possible to integrate with the reworked allocation quota (aliquot) mechanism, which is also byte-based. Unlike the original code, which balanced the vdevs' amounts of free space, this one balances their free/used space fractions. It should result in lower and more uniform fragmentation in the long run.

This algorithm still allows improving write speed by allocating more data to faster vdevs, but does it in a more controllable way. On top of the space-based allocation quota, it also calculates the minimum queue depth that a vdev is allowed to maintain, and, respectively, the amount of extra allocations it can receive if it appears faster. That amount is based on the vdev's capacity and space usage, and is applied only when the pool is busy. This way the code can choose between faster writes when needed and better vdev balance when not, with that choice gradually narrowing as free space shrinks.

This change also makes allocation queues per-class, allowing them to throttle independently and in parallel. Allocations that are bounced between classes due to allocation errors will be able to properly throttle in the new class. Allocations that should not be throttled (ZIL, gang, copies) are not, but may still follow the rotor and allocation quota mechanism of the class without disrupting it.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
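
Two points from the commit message can be illustrated with a small, self-contained sketch. It is not the OpenZFS implementation: the `toy_*` names, the struct layout, and the numbers are all hypothetical and exist only for this example. The first helper shows why a queue limited by request count spans a huge byte range when blocks vary from 512B to 16MB. The other helpers show how comparing absolute free space and comparing free-space fractions can prefer different vdevs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of a vdev; sizes are in GiB to keep numbers small. */
typedef struct toy_vdev {
	uint64_t size;	/* total capacity */
	uint64_t alloc;	/* space already allocated */
} toy_vdev_t;

/* Bytes queued when the limit is expressed as a number of requests. */
static inline uint64_t
toy_queue_bytes(uint64_t nrequests, uint64_t block_size)
{
	return (nrequests * block_size);
}

/* Absolute free space, the quantity the commit says the original code balanced. */
static inline uint64_t
toy_free_bytes(const toy_vdev_t *v)
{
	return (v->size - v->alloc);
}

/* Free space as a fraction of capacity, in parts per thousand. */
static inline uint64_t
toy_free_permille(const toy_vdev_t *v)
{
	assert(v->size > 0);
	return ((v->size - v->alloc) * 1000 / v->size);
}

/* Pick the vdev with more free space in absolute terms. */
static inline const toy_vdev_t *
toy_prefer_by_bytes(const toy_vdev_t *a, const toy_vdev_t *b)
{
	return (toy_free_bytes(a) >= toy_free_bytes(b) ? a : b);
}

/* Pick the vdev with the larger free fraction. */
static inline const toy_vdev_t *
toy_prefer_by_fraction(const toy_vdev_t *a, const toy_vdev_t *b)
{
	return (toy_free_permille(a) >= toy_free_permille(b) ? a : b);
}
```

With a hypothetical 10240 GiB vdev that has 8192 GiB allocated and a 2048 GiB vdev that has 512 GiB allocated, `toy_prefer_by_bytes()` picks the large vdev (2048 GiB free against 1536 GiB), even though it is already 80% full, while `toy_prefer_by_fraction()` picks the small one (75% free against 20%). Likewise, `toy_queue_bytes(100, 512)` and `toy_queue_bytes(100, 16ULL << 20)` differ by a factor of 32768, which is the unpredictability of request-based limits that the message describes.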
2026-05-23 19:04:45 +03:00 · 2025-03-24 12:25:01 -04:00
parent 3862ebbf1f
commit 94a3fabcb0
12 changed files with 536 additions and 786 deletions
@@ -75,18 +75,13 @@ uint64_t metaslab_largest_allocatable(metaslab_t *);
 /*
 * metaslab alloc flags
 */
-#define	METASLAB_HINTBP_FAVOR		0x0
-#define	METASLAB_HINTBP_AVOID		0x1
+#define	METASLAB_ZIL			0x1
 #define	METASLAB_GANG_HEADER		0x2
 #define	METASLAB_GANG_CHILD		0x4
 #define	METASLAB_ASYNC_ALLOC		0x8
-#define	METASLAB_DONT_THROTTLE		0x10
-#define	METASLAB_MUST_RESERVE		0x20
-#define	METASLAB_ZIL			0x80

-int metaslab_alloc(spa_t *, metaslab_class_t *, uint64_t,
-    blkptr_t *, int, uint64_t, blkptr_t *, int, zio_alloc_list_t *, zio_t *,
-	int);
+int metaslab_alloc(spa_t *, metaslab_class_t *, uint64_t, blkptr_t *, int,
+    uint64_t, blkptr_t *, int, zio_alloc_list_t *, int, const void *);
 int metaslab_alloc_dva(spa_t *, metaslab_class_t *, uint64_t,
    dva_t *, int, dva_t *, uint64_t, int, zio_alloc_list_t *, int);
 void metaslab_free(spa_t *, const blkptr_t *, uint64_t, boolean_t);
@@ -103,15 +98,17 @@ void metaslab_stat_fini(void);
 void metaslab_trace_init(zio_alloc_list_t *);
 void metaslab_trace_fini(zio_alloc_list_t *);

-metaslab_class_t *metaslab_class_create(spa_t *, const metaslab_ops_t *);
+metaslab_class_t *metaslab_class_create(spa_t *, const metaslab_ops_t *,
+    boolean_t);
 void metaslab_class_destroy(metaslab_class_t *);
-int metaslab_class_validate(metaslab_class_t *);
+void metaslab_class_validate(metaslab_class_t *);
+void metaslab_class_balance(metaslab_class_t *mc, boolean_t onsync);
 void metaslab_class_histogram_verify(metaslab_class_t *);
 uint64_t metaslab_class_fragmentation(metaslab_class_t *);
 uint64_t metaslab_class_expandable_space(metaslab_class_t *);
-boolean_t metaslab_class_throttle_reserve(metaslab_class_t *, int, int,
-    zio_t *, int);
-void metaslab_class_throttle_unreserve(metaslab_class_t *, int, int, zio_t *);
+boolean_t metaslab_class_throttle_reserve(metaslab_class_t *, int, zio_t *,
+    boolean_t, boolean_t *);
+boolean_t metaslab_class_throttle_unreserve(metaslab_class_t *, int, zio_t *);
 void metaslab_class_evict_old(metaslab_class_t *, uint64_t);
 uint64_t metaslab_class_get_alloc(metaslab_class_t *);
 uint64_t metaslab_class_get_space(metaslab_class_t *);
@@ -130,9 +127,8 @@ uint64_t metaslab_group_get_space(metaslab_group_t *);
 void metaslab_group_histogram_verify(metaslab_group_t *);
 uint64_t metaslab_group_fragmentation(metaslab_group_t *);
 void metaslab_group_histogram_remove(metaslab_group_t *, metaslab_t *);
-void metaslab_group_alloc_decrement(spa_t *, uint64_t, const void *, int, int,
-    boolean_t);
-void metaslab_group_alloc_verify(spa_t *, const blkptr_t *, const void *, int);
+void metaslab_group_alloc_decrement(spa_t *, uint64_t, int, int, uint64_t,
+    const void *);
 void metaslab_recalculate_weight_and_sort(metaslab_t *);
 void metaslab_disable(metaslab_t *);
 void metaslab_enable(metaslab_t *, boolean_t, boolean_t);