Unified allocation throttling (#17020)

The existing allocation throttling aimed to improve write speed by allocating more data to vdevs that are able to write it faster. In the process, however, it completely broke the original mechanism, which was designed to balance vdev space usage. With a severe imbalance in vdev space use, vdevs with higher usage may start accumulating fragmentation sooner than others and, once full, will stop accepting writes altogether. Also, after a vdev is added it might take a very long time for the pool to restore the balance, since the new vdev does not get any real preference unless the old one is already much slower due to fragmentation.

The old throttling was also request-based, which made it unpredictable with block sizes varying from 512B to 16MB. It made little sense with I/O aggregation either: its 32-100 requests could be aggregated into a few, leaving the device underutilized with fewer and/or shorter requests, or, at the opposite extreme, it could try to queue up to 1.6GB of writes per device.

This change introduces a completely new throttling algorithm. Unlike the old request-based one, it measures the allocation queue in bytes. This makes it possible to integrate with the reworked allocation quota (aliquot) mechanism, which is also byte-based. Unlike the original code, which balanced the vdevs' amounts of free space, this one balances their free/used space fractions. This should result in lower and more uniform fragmentation in the long run.

The algorithm still improves write speed by allocating more data to faster vdevs, but does so in a more controllable way. On top of the space-based allocation quota, it also calculates the minimum queue depth that a vdev is allowed to maintain and, correspondingly, the amount of extra allocations it can receive if it appears faster. That amount is based on the vdev's capacity and space usage, and is applied only when the pool is busy. This way the code can choose between faster writes when needed and better vdev balance when not, with the room for that choice gradually shrinking together with the free space.

This change also makes allocation queues per-class, allowing them to throttle independently and in parallel. Allocations that are bounced between classes due to allocation errors will be able to throttle properly in the new class. Allocations that should not be throttled (ZIL, gang, copies) are not, but may still follow the rotor and allocation quota mechanism of the class without disrupting it.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
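
The two central ideas of the message, byte-based reservations and balancing by free-space fraction rather than free-space amount, can be illustrated with a minimal sketch. Everything named sketch_* below is hypothetical and is not the OpenZFS code; in particular, the admission rule (admit while the reserved total is still below the cap) is an assumption chosen for simplicity. For scale, 100 queued requests of 16MB each presumably account for the 1.6GB figure quoted for the old request-based scheme.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical byte-based throttle state; not the OpenZFS structures. */
typedef struct sketch_throttle {
	uint64_t st_reserved;	/* bytes currently reserved */
	uint64_t st_max;	/* soft cap on reserved bytes */
} sketch_throttle_t;

/*
 * Try to reserve `size` bytes.  Assumed rule: admit while the reserved
 * total is still below the cap, so a single reservation may overshoot it.
 * `must` models allocations that are never throttled.
 */
static bool
sketch_reserve(sketch_throttle_t *st, uint64_t size, bool must)
{
	if (!must && st->st_reserved >= st->st_max)
		return (false);
	st->st_reserved += size;
	return (true);
}

/* Release a previous reservation of `size` bytes. */
static void
sketch_unreserve(sketch_throttle_t *st, uint64_t size)
{
	assert(st->st_reserved >= size);
	st->st_reserved -= size;
}

/* Hypothetical vdev summary, sized in GiB to keep the arithmetic small. */
typedef struct sketch_vdev {
	uint64_t sv_size_gib;
	uint64_t sv_free_gib;
} sketch_vdev_t;

/* Free space as parts per thousand of the vdev size. */
static uint64_t
sketch_free_permille(const sketch_vdev_t *sv)
{
	return (sv->sv_free_gib * 1000 / sv->sv_size_gib);
}
```

With a 10TiB vdev that is half full and a 2TiB vdev that is a quarter full, the larger vdev has more free bytes, but the smaller one has the higher free fraction, so a fraction-based policy would prefer the smaller vdev where an amount-based policy would prefer the larger.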
2026-05-24 11:18:52 +03:00 · 2025-03-24 12:25:01 -04:00
parent 3862ebbf1f
commit 94a3fabcb0
12 changed files with 536 additions and 786 deletions
@@ -141,23 +141,24 @@ typedef enum trace_alloc_type {
 * Per-allocator data structure.
 */
 typedef struct metaslab_class_allocator {
+	kmutex_t		mca_lock;
+	avl_tree_t		mca_tree;
+
 	metaslab_group_t	*mca_rotor;
 	uint64_t		mca_aliquot;

 	/*
 	 * The allocation throttle works on a reservation system. Whenever
 	 * an asynchronous zio wants to perform an allocation it must
-	 * first reserve the number of blocks that it wants to allocate.
+	 * first reserve the number of bytes that it wants to allocate.
 	 * If there aren't sufficient slots available for the pending zio
 	 * then that I/O is throttled until more slots free up. The current
-	 * number of reserved allocations is maintained by the mca_alloc_slots
-	 * refcount. The mca_alloc_max_slots value determines the maximum
-	 * number of allocations that the system allows. Gang blocks are
-	 * allowed to reserve slots even if we've reached the maximum
-	 * number of allocations allowed.
+	 * size of reserved allocations is maintained by mca_reserved.
+	 * The maximum total size of reserved allocations is determined by
+	 * mc_alloc_max in the metaslab_class_t.  Gang blocks are allowed
+	 * to reserve for their headers even if we've reached the maximum.
 	 */
-	uint64_t		mca_alloc_max_slots;
-	zfs_refcount_t		mca_alloc_slots;
+	uint64_t		mca_reserved;
 } ____cacheline_aligned metaslab_class_allocator_t;

 /*
@@ -190,10 +191,10 @@ struct metaslab_class {
 	 */
 	uint64_t		mc_groups;

-	/*
-	 * Toggle to enable/disable the allocation throttle.
-	 */
+	boolean_t		mc_is_log;
 	boolean_t		mc_alloc_throttle_enabled;
+	uint64_t		mc_alloc_io_size;
+	uint64_t		mc_alloc_max;

 	uint64_t		mc_alloc_groups; /* # of allocatable groups */

@@ -216,11 +217,10 @@ struct metaslab_class {
 * Per-allocator data structure.
 */
 typedef struct metaslab_group_allocator {
-	uint64_t	mga_cur_max_alloc_queue_depth;
-	zfs_refcount_t	mga_alloc_queue_depth;
+	zfs_refcount_t	mga_queue_depth;
 	metaslab_t	*mga_primary;
 	metaslab_t	*mga_secondary;
-} metaslab_group_allocator_t;
+} ____cacheline_aligned metaslab_group_allocator_t;

 /*
 * Metaslab groups encapsulate all the allocatable regions (i.e. metaslabs)
@@ -235,6 +235,7 @@ struct metaslab_group {
 	kmutex_t		mg_lock;
 	avl_tree_t		mg_metaslab_tree;
 	uint64_t		mg_aliquot;
+	uint64_t		mg_queue_target;
 	boolean_t		mg_allocatable;		/* can we allocate? */
 	uint64_t		mg_ms_ready;

@@ -246,40 +247,12 @@ struct metaslab_group {
 	 */
 	boolean_t		mg_initialized;

-	uint64_t		mg_free_capacity;	/* percentage free */
-	int64_t			mg_bias;
 	int64_t			mg_activation_count;
 	metaslab_class_t	*mg_class;
 	vdev_t			*mg_vd;
 	metaslab_group_t	*mg_prev;
 	metaslab_group_t	*mg_next;

-	/*
-	 * In order for the allocation throttle to function properly, we cannot
-	 * have too many IOs going to each disk by default; the throttle
-	 * operates by allocating more work to disks that finish quickly, so
-	 * allocating larger chunks to each disk reduces its effectiveness.
-	 * However, if the number of IOs going to each allocator is too small,
-	 * we will not perform proper aggregation at the vdev_queue layer,
-	 * also resulting in decreased performance. Therefore, we will use a
-	 * ramp-up strategy.
-	 *
-	 * Each allocator in each metaslab group has a current queue depth
-	 * (mg_alloc_queue_depth[allocator]) and a current max queue depth
-	 * (mga_cur_max_alloc_queue_depth[allocator]), and each metaslab group
-	 * has an absolute max queue depth (mg_max_alloc_queue_depth).  We
-	 * add IOs to an allocator until the mg_alloc_queue_depth for that
-	 * allocator hits the cur_max. Every time an IO completes for a given
-	 * allocator on a given metaslab group, we increment its cur_max until
-	 * it reaches mg_max_alloc_queue_depth. The cur_max resets every txg to
-	 * help protect against disks that decrease in performance over time.
-	 *
-	 * It's possible for an allocator to handle more allocations than
-	 * its max. This can occur when gang blocks are required or when other
-	 * groups are unable to handle their share of allocations.
-	 */
-	uint64_t		mg_max_alloc_queue_depth;
-
 	/*
 	 * A metalab group that can no longer allocate the minimum block
 	 * size will set mg_no_free_space. Once a metaslab group is out
@@ -288,8 +261,6 @@ struct metaslab_group {
 	 */
 	boolean_t		mg_no_free_space;

-	uint64_t		mg_allocations;
-	uint64_t		mg_failed_allocations;
 	uint64_t		mg_fragmentation;
 	uint64_t		mg_histogram[ZFS_RANGE_TREE_HISTOGRAM_SIZE];

@@ -508,7 +479,7 @@ struct metaslab {
 	 */
 	hrtime_t	ms_load_time;	/* time last loaded */
 	hrtime_t	ms_unload_time;	/* time last unloaded */
-	hrtime_t	ms_selected_time; /* time last allocated from */
+	uint64_t	ms_selected_time; /* time last allocated from (secs) */

 	uint64_t	ms_alloc_txg;	/* last successful alloc (debug only) */
 	uint64_t	ms_max_size;	/* maximum allocatable size	*/