The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex. Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#9062
Currently, sequential async write workloads spend a lot of time
contending on the dn_struct_rwlock. This lock is responsible for
protecting the entire block tree below it; this naturally results
in some serialization during heavy write workloads. This can be
resolved by having per-dbuf locking, which will allow multiple
writers in the same object at the same time.
We introduce a new rwlock, the db_rwlock. This lock is responsible
for protecting the contents of the dbuf that it is a part of; when
reading a block pointer from a dbuf, you hold the lock as a reader.
When writing data to a dbuf, you hold it as a writer. This allows
multiple threads to write to different parts of a file at the same
time.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens matt@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-52564
External-issue: DLPX-53085
External-issue: DLPX-57384
Closes#8946
For quite some time I was thinking about possibility to prefetch
ZFS indirection tables while doing sequential reads or writes.
Recent changes in predictive prefetcher made that much easier to
do. My tests on zvol with 16KB block size on 5x striped and 2x
mirrored pool of 10 disks show almost double throughput on sequential
read, and almost tripple on sequential rewrite. While for read alike
effect can be received from increasing maximal prefetch distance
(though at higher memory cost), for rewrite there is no other
solution so far.
Authored by: Alexander Motin <mav@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6322
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413Closes#5040
Porting notes:
- Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due
to commit 5f6d0b6 'Handle block pointers with a corrupt logical size'
- Difference from upstream in module/zfs/dmu_zfetch.c,
uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance
- Variables have been initialized at the beginning of the function
(void dmu_zfetch) to resemble the order of occurrence and account
for C99, C11 mode errors.
The DMU zfetch code organizes streams with lists not avl trees. A
avl_node_t was mistakenly used for a list_node_t in the zstream_t
type. This is incorrect (but harmless) and when unnoticed because:
1) The list functions explicitly cast the value preventing a warning,
2) sizeof(avl_node_t) >= sizeof(list_node_t) so no overrun occurs, and
3) The calculated offset is the same regardless of the type.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1946
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visable
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.