mirror_zfs/include/sys/dmu_zfetch.h

89 lines
2.7 KiB
C
Raw Normal View History

2008-11-20 23:01:55 +03:00
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or https://opensource.org/licenses/CDDL-1.0.
2008-11-20 23:01:55 +03:00
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
2008-11-20 23:01:55 +03:00
* Use is subject to license terms.
*/
/*
* Copyright (c) 2014, 2017 by Delphix. All rights reserved.
*/
#ifndef _DMU_ZFETCH_H
#define _DMU_ZFETCH_H
2008-11-20 23:01:55 +03:00
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
Cleanup: 64-bit kernel module parameters should use fixed width types Various module parameters such as `zfs_arc_max` were originally `uint64_t` on OpenSolaris/Illumos, but were changed to `unsigned long` for Linux compatibility because Linux's kernel default module parameter implementation did not support 64-bit types on 32-bit platforms. This caused problems when porting OpenZFS to Windows because its LLP64 memory model made `unsigned long` a 32-bit type on 64-bit, which created the undesireable situation that parameters that should accept 64-bit values could not on 64-bit Windows. Upon inspection, it turns out that the Linux kernel module parameter interface is extensible, such that we are allowed to define our own types. Rather than maintaining the original type change via hacks to to continue shrinking module parameters on 32-bit Linux, we implement support for 64-bit module parameters on Linux. After doing a review of all 64-bit kernel parameters (found via the man page and also proposed changes by Andrew Innes), the kernel module parameters fell into a few groups: Parameters that were originally 64-bit on Illumos: * dbuf_cache_max_bytes * dbuf_metadata_cache_max_bytes * l2arc_feed_min_ms * l2arc_feed_secs * l2arc_headroom * l2arc_headroom_boost * l2arc_write_boost * l2arc_write_max * metaslab_aliquot * metaslab_force_ganging * zfetch_array_rd_sz * zfs_arc_max * zfs_arc_meta_limit * zfs_arc_meta_min * zfs_arc_min * zfs_async_block_max_blocks * zfs_condense_max_obsolete_bytes * zfs_condense_min_mapping_bytes * zfs_deadman_checktime_ms * zfs_deadman_synctime_ms * zfs_initialize_chunk_size * zfs_initialize_value * zfs_lua_max_instrlimit * zfs_lua_max_memlimit * zil_slog_bulk Parameters that were originally 32-bit on Illumos: * zfs_per_txg_dirty_frees_percent Parameters that were originally `ssize_t` on Illumos: * zfs_immediate_write_sz Note that `ssize_t` is `int32_t` on 32-bit and `int64_t` on 64-bit. It has been upgraded to 64-bit. Parameters that were `long`/`unsigned long` because of Linux/FreeBSD influence: * l2arc_rebuild_blocks_min_l2size * zfs_key_max_salt_uses * zfs_max_log_walking * zfs_max_logsm_summary_length * zfs_metaslab_max_size_cache_sec * zfs_min_metaslabs_to_flush * zfs_multihost_interval * zfs_unflushed_log_block_max * zfs_unflushed_log_block_min * zfs_unflushed_log_block_pct * zfs_unflushed_max_mem_amt * zfs_unflushed_max_mem_ppm New parameters that do not exist in Illumos: * l2arc_trim_ahead * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_arc_sys_free * zfs_deadman_ziotime_ms * zfs_delete_blocks * zfs_history_output_max * zfs_livelist_max_entries * zfs_max_async_dedup_frees * zfs_max_nvlist_src_size * zfs_rebuild_max_segment * zfs_rebuild_vdev_limit * zfs_unflushed_log_txg_max * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift * zfs_vnops_read_chunk_size * zvol_max_discard_blocks Rather than clutter the lists with commentary, the module parameters that need comments are repeated below. A few parameters were defined in Linux/FreeBSD specific code, where the use of ulong/long is not an issue for portability, so we leave them alone: * zfs_delete_blocks * zfs_key_max_salt_uses * zvol_max_discard_blocks The documentation for a few parameters was found to be incorrect: * zfs_deadman_checktime_ms - incorrectly documented as int * zfs_delete_blocks - not documented as Linux only * zfs_history_output_max - incorrectly documented as int * zfs_vnops_read_chunk_size - incorrectly documented as long * zvol_max_discard_blocks - incorrectly documented as ulong The documentation for these has been fixed, alongside the changes to document the switch to fixed width types. In addition, several kernel module parameters were percentages or held ashift values, so being 64-bit never made sense for them. They have been downgraded to 32-bit: * vdev_file_logical_ashift * vdev_file_physical_ashift * zfs_arc_dnode_limit_percent * zfs_arc_dnode_reduce_percent * zfs_arc_meta_limit_percent * zfs_per_txg_dirty_frees_percent * zfs_unflushed_log_block_pct * zfs_vdev_max_auto_ashift * zfs_vdev_min_auto_ashift Of special note are `zfs_vdev_max_auto_ashift` and `zfs_vdev_min_auto_ashift`, which were already defined as `uint64_t`, and passed to the kernel as `ulong`. This is inherently buggy on big endian 32-bit Linux, since the values would not be written to the correct locations. 32-bit FreeBSD was unaffected because its sysctl code correctly treated this as a `uint64_t`. Lastly, a code comment suggests that `zfs_arc_sys_free` is Linux-specific, but there is nothing to indicate to me that it is Linux-specific. Nothing was done about that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Original-patch-by: Andrew Innes <andrew.c12@gmail.com> Original-patch-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #13984 Closes #14004
2022-10-03 22:06:54 +03:00
extern uint64_t zfetch_array_rd_sz;
2008-11-20 23:01:55 +03:00
struct dnode; /* so we can reference dnode */
typedef struct zfetch {
kmutex_t zf_lock; /* protects zfetch structure */
list_t zf_stream; /* list of zstream_t's */
struct dnode *zf_dnode; /* dnode that owns this zfetch */
int zf_numstreams; /* number of zstream_t's */
} zfetch_t;
2008-11-20 23:01:55 +03:00
typedef struct zstream {
uint64_t zs_blkid; /* expect next access at this blkid */
unsigned int zs_pf_dist; /* data prefetch distance in bytes */
unsigned int zs_ipf_dist; /* L1 prefetch distance in bytes */
uint64_t zs_pf_start; /* first data block to prefetch */
uint64_t zs_pf_end; /* data block to prefetch up to */
uint64_t zs_ipf_start; /* first data block to prefetch L1 */
uint64_t zs_ipf_end; /* data block to prefetch L1 up to */
list_node_t zs_node; /* link for zf_stream */
Split dmu_zfetch() speculation and execution parts To make better predictions on parallel workloads dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads same time on completion, that can significantly reorder the requests, making the stream look like random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years it was a big problem for storage servers, handling deeper request queues from their clients, having to either serialize consequential reads to make ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, comparing to almost zero before. While there, I also removed per-stream zs_lock as useless, completely covered by parent zf_lock. Also I reused zs_blocks refcount to track zf_stream linkage of the stream, since I believe previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. First cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11652
2021-03-20 08:56:11 +03:00
hrtime_t zs_atime; /* time last prefetch issued */
zfetch_t *zs_fetch; /* parent fetch */
Split dmu_zfetch() speculation and execution parts To make better predictions on parallel workloads dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads same time on completion, that can significantly reorder the requests, making the stream look like random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years it was a big problem for storage servers, handling deeper request queues from their clients, having to either serialize consequential reads to make ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, comparing to almost zero before. While there, I also removed per-stream zs_lock as useless, completely covered by parent zf_lock. Also I reused zs_blocks refcount to track zf_stream linkage of the stream, since I believe previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. First cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11652
2021-03-20 08:56:11 +03:00
boolean_t zs_missed; /* stream saw cache misses */
boolean_t zs_more; /* need more distant prefetch */
Split dmu_zfetch() speculation and execution parts To make better predictions on parallel workloads dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads same time on completion, that can significantly reorder the requests, making the stream look like random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years it was a big problem for storage servers, handling deeper request queues from their clients, having to either serialize consequential reads to make ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, comparing to almost zero before. While there, I also removed per-stream zs_lock as useless, completely covered by parent zf_lock. Also I reused zs_blocks refcount to track zf_stream linkage of the stream, since I believe previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. First cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11652
2021-03-20 08:56:11 +03:00
zfs_refcount_t zs_callers; /* number of pending callers */
/*
* Number of stream references: dnode, callers and pending blocks.
* The stream memory is freed when the number returns to zero.
*/
zfs_refcount_t zs_refs;
2008-11-20 23:01:55 +03:00
} zstream_t;
void zfetch_init(void);
void zfetch_fini(void);
2008-11-20 23:01:55 +03:00
void dmu_zfetch_init(zfetch_t *, struct dnode *);
void dmu_zfetch_fini(zfetch_t *);
Split dmu_zfetch() speculation and execution parts To make better predictions on parallel workloads dmu_zfetch() should be called as early as possible to reduce possible request reordering. In particular, it should be called before dmu_buf_hold_array_by_dnode() calls dbuf_hold(), which may sleep waiting for indirect blocks, waking up multiple threads same time on completion, that can significantly reorder the requests, making the stream look like random. But we should not issue prefetch requests before the on-demand ones, since they may get to the disks first despite the I/O scheduler, increasing on-demand request latency. This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare() and dmu_zfetch_run(). The first can be executed as early as needed. It only updates statistics and makes predictions without issuing any I/Os. The I/O issuance is handled by dmu_zfetch_run(), which can be called later when all on-demand I/Os are already issued. It even tracks the activity of other concurrent threads, issuing the prefetch only when _all_ on-demand requests are issued. For many years it was a big problem for storage servers, handling deeper request queues from their clients, having to either serialize consequential reads to make ZFS prefetcher usable, or execute the incoming requests as-is and get almost no prefetch from ZFS, relying only on deep enough prefetch by the clients. Benefits of those ways varied, but neither was perfect. With this patch deeper queue sequential read benchmarks with CrystalDiskMark from Windows via iSCSI to FreeBSD target show me much better throughput with almost 100% prefetcher hit rate, comparing to almost zero before. While there, I also removed per-stream zs_lock as useless, completely covered by parent zf_lock. Also I reused zs_blocks refcount to track zf_stream linkage of the stream, since I believe previous zs_fetch == NULL check in dmu_zfetch_stream_done() was racy. Delete prefetch streams when they reach ends of files. It saves up to 1KB of RAM per file, plus reduces searches through the stream list. Block data prefetch (speculation and indirect block prefetch is still done since they are cheaper) if all dbufs of the stream are already in DMU cache. First cache miss immediately fires all the prefetch that would be done for the stream by that time. It saves some CPU time if same files within DMU cache capacity are read over and over. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11652
2021-03-20 08:56:11 +03:00
zstream_t *dmu_zfetch_prepare(zfetch_t *, uint64_t, uint64_t, boolean_t,
boolean_t);
void dmu_zfetch_run(zstream_t *, boolean_t, boolean_t);
void dmu_zfetch(zfetch_t *, uint64_t, uint64_t, boolean_t, boolean_t,
boolean_t);
2008-11-20 23:01:55 +03:00
#ifdef __cplusplus
}
#endif
#endif /* _DMU_ZFETCH_H */