Commit Graph

48 Commits

Author SHA1 Message Date
Alexander Motin
1af240f3b5 Add separate aggregation limit for non-rotating media
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, supposedly to reduce number of
head seeks for spinning disks.  But for SSDs it makes little to no
sense, especially on FreeBSD, where due to MAXPHYS limitation device
will likely still see bunch of 128KB I/Os instead of one large.
Having more strict aggregation limit for SSDs allows to avoid
allocation of large memory buffer and copy to/from it, that is a
serious problem when throughput reaches gigabytes per second.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Closes #8494
2019-03-13 12:00:10 -07:00
George Wilson
619f097693 OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========

The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.

SOLUTION
=========

This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
        - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
                - start, suspend, or cancel initialization
        - Creates new open-context thread for each vdev
        - Thread iterates through all metaslabs in this vdev
        - Each metaslab:
                - select a metaslab
                - load the metaslab
                - mark the metaslab as being zeroed
                - walk all free ranges within that metaslab and translate
                  them to ranges on the leaf vdev
                - issue a "zeroing" I/O on the leaf vdev that corresponds to
                  a free range on the metaslab we're working on
                - continue until all free ranges for this metaslab have been
                  "zeroed"
                - reset/unmark the metaslab being zeroed
                - if more metaslabs exist, then repeat above tasks.
                - if no more metaslabs, then we're done.

        - progress for the initialization is stored on-disk in the vdev’s
          leaf zap object. The following information is stored:
                - the last offset that has been initialized
                - the state of the initialization process (i.e. active,
                  suspended, or canceled)
                - the start time for the initialization

        - progress is reported via the zpool status command and shows
          information for each of the vdevs that are initializing

Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
  written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2019-01-07 10:37:26 -08:00
John Gallagher
d12614521a Fixes for procfs files backed by linked lists
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):

* We don't account for the fact that seq_file sometimes visits a node
  multiple times, which results in missing messages when read through
  procfs.
* We don't keep separate state for each reader of a file, so concurrent
  readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
  the list between read syscalls, so reading from these files in procfs
  can cause the system to crash.

This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes #7819
2018-09-26 11:08:12 -07:00
Paul Dagnelie
492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
Tony Hutter
c26cf0966d Fix zio->io_priority failed (7 < 6) assert
This fixes an assert in vdev_queue_change_io_priority():

  VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
  PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7566 
Closes #7542
2018-05-29 18:13:48 -07:00
Brian Behlendorf
93ce2b4ca5 Update build system and packaging
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.

Build system and packaging:
  * Renamed SPL_* autoconf m4 macros to ZFS_*.
  * Removed redundant SPL_* autoconf m4 macros.
  * Updated the RPM spec files to remove SPL package dependency.
  * The zfs package obsoletes the spl package, and the zfs-kmod
    package obsoletes the spl-kmod package.
  * The zfs-kmod-devel* packages were updated to add compatibility
    symlinks under /usr/src/spl-x.y.z until all dependent packages
    can be updated.  They will be removed in a future release.
  * Updated copy-builtin script for in-kernel builds.
  * Updated DKMS package to include the spl.ko.
  * Updated stale AUTHORS file to include all contributors.
  * Updated stale COPYRIGHT and included the SPL as an exception.
  * Renamed README.markdown to README.md
  * Renamed OPENSOLARIS.LICENSE to LICENSE.
  * Renamed DISCLAIMER to NOTICE.

Required code changes:
  * Removed redundant HAVE_SPL macro.
  * Removed _BOOT from nvpairs since it doesn't apply for Linux.
  * Initial header cleanup (removal of empty headers, refactoring).
  * Remove SPL repository clone/build from zimport.sh.
  * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
    to build issues when forcing C99 compilation.
  * Replaced legacy ACCESS_ONCE with READ_ONCE.
  * Include needed headers for `current` and `EXPORT_SYMBOL`.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7556
2018-05-29 16:00:33 -07:00
Matthew Ahrens
a1d477c24c OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility, the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refectoring is
  merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2018-04-14 12:16:17 -07:00
Brian Behlendorf
a76f3d0437
Fix deadlock in IO pipeline
In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock.  This can result in a deadlock
as shown in the stack traces below.

Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed.  This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.

---  THREAD 1 ---
arc_read()
  zio_nowait()
    zio_vdev_io_start()
      vdev_queue_io() <--- mutex_enter(vq->vq_lock)
        vdev_queue_io_to_issue()
          vdev_queue_aggregate()
            zio_execute()
              zio_vdev_io_assess()
                zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
  zio_change_priority() <- mutex_enter(zio->zio_lock)
    vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7307
2018-03-16 16:46:06 -07:00
Tom Caputi
a8b2e30685 Support re-prioritizing asynchronous prefetches
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after the issue. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.

This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6921 
Closes #6926
2017-12-21 09:13:06 -08:00
Tom Caputi
d4a72f2386 Sequential scrub and resilvers
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #3625 
Closes #6256
2017-11-15 17:27:01 -08:00
Don Brady
1c27024e22 Undo c89 workarounds to match with upstream
With PR 5756 the zfs module now supports c99 and the
remaining past c89 workarounds can be undone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #6816
2017-11-04 13:25:13 -07:00
Gvozden Neskovic
d6c6590c5d vdev_mirror: load balancing fixes
vdev_queue:
- Track the last position of each vdev, including the io size,
  in order to detect linear access of the following zio.
- Remove duplicate `vq_lastoffset`

vdev_mirror:
- Correctly calculate the zio offset (signedness issue)
- Deprecate `vdev_queue_register_lastoffset()`
- Add `VDEV_LABEL_START_SIZE` to zio offset of leaf vdevs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6461
2017-08-21 10:05:16 -07:00
Brian Behlendorf
2d678f779a Cap maximum aggregate IO size
Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit.  Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6259 
Closes #6270
2017-06-27 10:09:16 -07:00
DHE
06226b5936 Increase zfs_vdev_async_write_min_active to 2
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #4825
Closes #5926
2017-04-14 14:03:44 -07:00
Matthew Ahrens
8542ef852a OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations
Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Ported-by: Matt Ahrens <mahrens@delphix.com>

RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity.  Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4.  A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.

To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks.  The contents of these pad sectors are not used, so we do not
need to read or write these sectors.  However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors.  If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance.  The gap-filling performance improvement was introduced in
July 2009.

Writing the optional zio is done by the io aggregation code in
vdev_queue.c.  The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB.  For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.

The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks).  It
cannot occur with the default recordsize=128KB.  If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.

The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors).  Therefore
this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem".

The problem also occurs with the following configurations:

With recordsize=512KB or 256KB, compression=off, the problem occurs only
in rarely-used configurations:
* 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors)
* 4-wide RAIDZ2 (either recordsize, either ashift)
* 5-wide RAIDZ2 with recordsize=512KB (either ashift)
* 6-wide RAIDZ2 with recordsize=512KB (either ashift)

With recordsize=1MB, compression=off, ashift=9 (512B sectors)
* RAIDZ1 with 4 or 8 disks
* RAIDZ2 with 4, 8, or 10 disks
* RAIDZ3 with 6, 8, 9, or 10 disks

With recordsize=1MB, compression=off, ashift=12 (4KB sectors)
* RAIDZ1 with 7 or 8 disks
* RAIDZ2 with 4, 5, or 10 disks
* RAIDZ3 with 6, 9, or 10 disks

With recordsize=2MB and larger (which can only be selected by changing
kernel tunables), many configurations are affected, including with
higher numbers of disks (up to 18 disks with recordsize=2MB).

Increase zfs_vdev_aggregation_limit to allow the optional zio to be
aggregated, thus eliminating the problem.  Setting it to 256KB fixes all
commonly-used configurations.

The solution is to aggregate optional zio's regardless of the
aggregation size limit.

Analysis sponsored by Intel Corp.

OpenZFS-issue: https://www.illumos.org/issues/8005
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/321
Closes #5931
2017-04-10 15:21:45 -07:00
Brian Behlendorf
02730c333c Use cstyle -cpP in make cstyle check
Enable picky cstyle checks and resolve the new warnings.  The vast
majority of the changes needed were to handle minor issues with
whitespace formatting.  This patch contains no functional changes.

Non-whitespace changes are as follows:

* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5465
2016-12-12 10:46:26 -08:00
David Quigley
a6255b7fce DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
Don Brady
3dfb57a35e OpenZFS 7090 - zfs should throttle allocations
OpenZFS 7090 - zfs should throttle allocations

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>

When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.

In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.

The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issued time and offset and would provide an initial amount of
allocation of work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully completed it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs metaslab groups and round
robin across all eligible metaslab groups to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7
Closes #5258 

Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
- Added module params and documentation
- Updated to use optimize AVL cmp macros
2016-10-13 17:59:18 -07:00
Gvozden Neskovic
ee36c709c3 Performance optimization of AVL tree comparator functions
perf: 2.75x faster ddt_entry_compare()
    First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).

Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently.  Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:

old	6.85789 s
new	2.49089 s

perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
    Compute the result directly instead of using conditionals

perf: zfs_range_compare()
    Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.

perf: spa_error_entry_compare()
    `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5033
2016-08-31 14:35:34 -07:00
GeLiXin
b7faa7aabd Fix self-healing IO prior to dsl_pool_init() completion
Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization.  This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.

However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes().  This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.

Signed-off-by: GeLiXin <47034221@qq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4652
2016-05-27 14:11:25 -07:00
Tony Hutter
26ef0cc7db OpenZFS 6531 - Provide mechanism to artificially limit disk performance
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
2016-05-26 10:11:51 -07:00
smh
9f500936c8 FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by:	gibbs, mav, will
MFC after:	2 weeks
Sponsored by:	Multiplay

References:
  https://github.com/freebsd/freebsd@5c7a6f5d
  https://github.com/freebsd/freebsd@31b7f68d
  https://github.com/freebsd/freebsd@e186f564

Performance Testing:
  https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141

Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc by my
  collegue Andriy Gapon has been included. It applied perfectly, but
  added a cstyle regression.
- This replaces 556011dbec entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4334
2016-02-26 11:24:35 -08:00
Brian Behlendorf
a58df6f536 Fix zfs_vdev_aggregation_limit bounds checking
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.

Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation.  For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-18 13:32:06 -08:00
Brian Behlendorf
6fe53787f3 Fix vdev_queue_aggregate() deadlock
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline.  This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread.  However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced which
allocation flags to be passed.  This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer.  Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly.  Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_cache_grow_work
  kv_alloc
    spl_vmalloc
      ...
      evict
        zpl_evict_inode
          zfs_inactive
            dmu_tx_wait
              txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3808
Closes #3867
2015-12-18 13:27:12 -08:00
loli10K
3757bff3b1 Fix small typo
Add a missing space to the zfs_vdev_sync_write_min_active module
parameter description.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3714
2015-08-30 10:10:16 -07:00
Tim Chase
36b454ab4c Initialize the taskq entry embedded within struct vdev
As part of the stack reduction effort in
50b25b2187, a zio_t containing a taskq_ent
was added to struct vdev_queue which itself is part of struct vdev.
The taskq entry should be initialized as is currently done in zio_create()
for newly-created bare zio_t object.  The rationale is the same as is
described in f467b05a26.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3709
2015-08-30 10:04:56 -07:00
Matthew Ahrens
f1512ee61e Illumos 5027 - zfs large block support
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  https://github.com/illumos/illumos-gate/commit/b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this one the ABD patches are merged.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #354
2015-05-11 12:23:16 -07:00
Justin T. Gibbs
ec8501ee12 5313 Allow I/Os to be aggregated across ZIO priority classes
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5313
  https://github.com/illumos/illumos-gate/commit/fe319232

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3280
2015-04-24 15:16:56 -07:00
Brian Behlendorf
285b29d959 Revert "Pre-allocate vdev I/O buffers"
Commit 86dd0fd added preallocated I/O buffers.  This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos.  A
deadlock in this situation is no longer possible.

However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky.  Either case is acceptable
here because we can safely abort the aggregation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Alex Reece
acbad6ff67 Illumos 4753 - increase number of outstanding async writes when sync task is waiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
    https://www.illumos.org/issues/4753
    https://github.com/illumos/illumos-gate/commit/73527f4

Comments by Matt Ahrens from the issue tracker:
    When a sync task is waiting for a txg to complete, we should hurry
    it along by increasing the number of outstanding async writes
    (i.e. make vdev_queue_max_async_writes() return a larger number).
    Initially we might just have a tunable for "minimum async writes
    while a synctask is waiting" and set it to 3.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2716
2014-09-23 13:50:55 -07:00
Brian Behlendorf
50b25b2187 Avoid dynamic allocation of 'search zio'
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage.  Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.

To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure.  All
access must be serialized through the vq_lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2572
2014-08-11 08:44:54 -07:00
Michael Kjorling
d1d7e2689d cstyle: Resolve C style issues
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written.  Others were
caused when the common code was slightly adjusted for Linux.

This patch contains no functional changes.  It only refreshes
the code to conform to style guide.

Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request.  The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1821
2013-12-18 16:46:35 -08:00
Matthew Ahrens
e8b96c6007 Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work

1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver.  The scheduler
issues a number of concurrent i/os from each class to the device.  Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes).  The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is.  See the block comment in vdev_queue.c (reproduced
below) for more details.

2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load.  The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system.  When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount.  This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens.  One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync().  Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes.  See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.

This diff has several other effects, including:

 * the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.

 * the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently.  There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.

 * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc.  This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).

--matt

APPENDIX: problems with the current i/o scheduler

The current ZFS i/o scheduler (vdev_queue.c) is deadline based.  The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.

For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due".  One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).

If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os.  This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future.  If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due.  Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).

Notes on porting to ZFS on Linux:

- zio_t gained new members io_physdone and io_phys_children.  Because
  object caches in the Linux port call the constructor only once at
  allocation time, objects may contain residual data when retrieved
  from the cache. Therefore zio_create() was updated to zero out the two
  new fields.

- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
  (vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
  This tree has been replaced by vq->vq_active_tree which is now used
  for the same purpose.

- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
  the number of vdev I/O buffers to pre-allocate.  That global no longer
  exists, so we instead use the sum of the *_max_active values for each of
  the five I/O classes described above.

- The Illumos implementation of dmu_tx_delay() delays a transaction by
  sleeping in condition variable embedded in the thread
  (curthread->t_delay_cv).  We do not have an equivalent CV to use in
  Linux, so this change replaced the delay logic with a wrapper called
  zfs_sleep_until(). This wrapper could be adopted upstream and in other
  downstream ports to abstract away operating system-specific delay logic.

- These tunables are added as module parameters, and descriptions added
  to the zfs-module-parameters.5 man page.

  spa_asize_inflation
  zfs_deadman_synctime_ms
  zfs_vdev_max_active
  zfs_vdev_async_write_active_min_dirty_percent
  zfs_vdev_async_write_active_max_dirty_percent
  zfs_vdev_async_read_max_active
  zfs_vdev_async_read_min_active
  zfs_vdev_async_write_max_active
  zfs_vdev_async_write_min_active
  zfs_vdev_scrub_max_active
  zfs_vdev_scrub_min_active
  zfs_vdev_sync_read_max_active
  zfs_vdev_sync_read_min_active
  zfs_vdev_sync_write_max_active
  zfs_vdev_sync_write_min_active
  zfs_dirty_data_max_percent
  zfs_delay_min_dirty_percent
  zfs_dirty_data_max_max_percent
  zfs_dirty_data_max
  zfs_dirty_data_max_max
  zfs_dirty_data_sync
  zfs_delay_scale

  The latter four have type unsigned long, whereas they are uint64_t in
  Illumos.  This accommodates Linux's module_param() supported types, but
  means they may overflow on 32-bit architectures.

  The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
  likely to overflow on 32-bit systems, since they express physical RAM
  sizes in bytes.  In fact, Illumos initializes zfs_dirty_data_max_max to
  2^32 which does overflow. To resolve that, this port instead initializes
  it in arc_init() to 25% of physical RAM, and adds the tunable
  zfs_dirty_data_max_max_percent to override that percentage.  While this
  solution doesn't completely avoid the overflow issue, it should be a
  reasonable default for most systems, and the minority of affected
  systems can work around the issue by overriding the defaults.

- Fixed reversed logic in comment above zfs_delay_scale declaration.

- Clarified comments in vdev_queue.c regarding when per-queue minimums take
  effect.

- Replaced dmu_tx_write_limit in the dmu_tx kstat file
  with dmu_tx_dirty_delay and dmu_tx_dirty_over_max.  The first counts
  how many times a transaction has been delayed because the pool dirty
  data has exceeded zfs_delay_min_dirty_percent.  The latter counts how
  many times the pool dirty data has exceeded zfs_dirty_data_max (which
  we expect to never happen).

- The original patch would have regressed the bug fixed in
  zfsonlinux/zfs@c418410, which prevented users from setting the
  zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
  A similar fix is added to vdev_queue_aggregate().

- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
  heap instead of the stack.  In Linux we can't afford such large
  structures on the stack.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  http://www.illumos.org/issues/4045
  illumos/illumos-gate@69962b5647

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-12-06 09:32:43 -08:00
Will Andrews
d3cc8b152e Illumos #3742
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3742
  illumos/illumos-gate@f717074149

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The change to zfs_vfsops.c was dropped because it involves
   zfs_mount_label_policy, which does not exist in the Linux port.
2013-11-04 10:55:25 -08:00
Matthew Ahrens
330847ff36 Illumos #3537
3537 want pool io kstats

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Sa?o Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  http://www.illumos.org/issues/3537
  illumos/illumos-gate@c3a6601

Ported by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:

1. The patch was restructured to take advantage of the existing
   spa statistics infrastructure.  To accomplish this the kstat
   was moved in to spa->io_stats and the init/destroy code moved
   to spa_stats.c.

2. The I/O kstat was simply named <pool> which conflicted with the
   pool directory we had already created.  Therefore it was renamed
   to <pool>/io

3. An update handler was added to allow the kstat to be zeroed.
2013-10-31 09:16:03 -07:00
Matthew Ahrens
cb682a173a Illumos #3618 ::zio dcmd does not show timestamp data
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  http://www.illumos.org/issues/3618
  illumos/illumos-gate@c55e05cb35

Notes on porting to ZFS on Linux:

The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.

One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:

    zfs_vdev_time_shift = 6
        to
    zfs_vdev_time_shift = 29

The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.

(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)

Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1643
2013-08-12 16:46:50 -07:00
George.Wilson
cc92e9d0c3 3246 ZFS I/O deadman thread
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation

*) Usage of the cyclic interface was replaced by the delayed taskq
   interface.  This avoids the need to implement new compatibility
   code and allows us to rely on the existing taskq implementation.

*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
   because declaring externs in source files as was done in the
   original patch is just plain wrong.

*) Instead of panicing the system when the deadman triggers a
   zevent describing the blocked vdev and the first pending I/O
   is posted.  If the panic behavior is desired Linux provides
   other generic methods to panic the system when threads are
   observed to hang.

*) For reference, to delay zios by 30 seconds for testing you can
   use zinject as follows: 'zinject -d <vdev> -D30 <pool>'

References:
  illumos/illumos-gate@283b84606b
  https://www.illumos.org/issues/3246

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1396
2013-05-01 17:05:52 -07:00
Brian Behlendorf
c418410393 Limit zfs_vdev_aggregation_limit to SPA_MAXBLOCKSIZE
Prevent users from setting the zfs_vdev_aggregation_limit tuning
larger than SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520
2012-10-15 09:28:43 -07:00
Brian Behlendorf
86dd0fd922 Pre-allocate vdev I/O buffers
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests.  Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...

1) These buffers are short lived.  They are only required for
the life of a single I/O at which point they can be used by
the next I/O.

2) The maximum number of concurrent buffers needed by a vdev is
small.  It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.

By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it.  This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock.  This is
particularly important when memory is already low on the system.

It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation.  The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:01:37 -07:00
Brian Behlendorf
c409e4647f Add missing ZFS tunables
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values.  However, in practice sometimes you do need to tweak these
values for one reason or another.  In those cases it's nice not to
have to resort to rebuilding from source.  All tunables are visable
to modinfo and the list is as follows:

$ modinfo module/zfs/zfs.ko
filename:       module/zfs/zfs.ko
license:        CDDL
author:         Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description:    ZFS
srcversion:     8EAB1D71DACE05B5AA61567
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm:           zvol_major:Major number for zvol device (uint)
parm:           zvol_threads:Number of threads for zvol device (uint)
parm:           zio_injection_enabled:Enable fault injection (int)
parm:           zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm:           zio_delay_max:Max zio millisec delay before posting event (int)
parm:           zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm:           zil_replay_disable:Disable intent logging replay (int)
parm:           zfs_nocacheflush:Disable cache flushes (bool)
parm:           zfs_read_chunk_size:Bytes to read per chunk (long)
parm:           zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm:           zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm:           zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm:           zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm:           zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm:           zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm:           zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm:           zfs_vdev_scheduler:I/O scheduler (charp)
parm:           zfs_vdev_cache_max:Inflate reads small than max (int)
parm:           zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm:           zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm:           zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm:           zfs_recover:Set to attempt to recover from fatal errors (int)
parm:           spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm:           zfs_zevent_len_max:Max event queue length (int)
parm:           zfs_zevent_cols:Max event column width (int)
parm:           zfs_zevent_console:Log events to the console (int)
parm:           zfs_top_maxinflight:Max I/Os per top-level (int)
parm:           zfs_resilver_delay:Number of ticks to delay resilver (int)
parm:           zfs_scrub_delay:Number of ticks to delay scrub (int)
parm:           zfs_scan_idle:Idle window in clock ticks (int)
parm:           zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm:           zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm:           zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm:           zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm:           zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm:           zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm:           zfs_no_write_throttle:Disable write throttling (int)
parm:           zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm:           zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm:           zfs_write_limit_min:Min tgx write limit (ulong)
parm:           zfs_write_limit_max:Max tgx write limit (ulong)
parm:           zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm:           zfs_write_limit_override:Override tgx write limit (ulong)
parm:           zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm:           zfetch_max_streams:Max number of streams per zfetch (uint)
parm:           zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm:           zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm:           zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm:           zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm:           zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
2011-05-04 10:02:37 -07:00
Brian Behlendorf
c28b227942 Add linux kernel module support
Setup linux kernel module support, this includes:
- zfs context for kernel/user
- kernel module build system integration
- kernel module macros
- kernel module symbol export
- kernel module options

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-31 13:41:58 -07:00
Brian Behlendorf
d6320ddb78 Fix gcc c90 compliance warnings
Fix non-c90 compliant code, for the most part these changes
simply deal with where a particular variable is declared.
Under c90 it must alway be done at the very start of a block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-27 15:28:32 -07:00
Brian Behlendorf
428870ff73 Update core ZFS code from build 121 to build 141. 2010-05-28 13:45:14 -07:00
Brian Behlendorf
45d1cae3b8 Rebase master to b121 2009-08-18 11:43:27 -07:00
Brian Behlendorf
9babb37438 Rebase master to b117 2009-07-02 15:44:48 -07:00
Brian Behlendorf
d164b20935 Rebase master to b108 2009-02-18 12:51:31 -08:00
Brian Behlendorf
fb5f0bc833 Rebase master to b105 2009-01-15 13:59:39 -08:00
Brian Behlendorf
172bb4bd5e Move the world out of /zfs/ and seperate out module build tree 2008-12-11 11:08:09 -08:00