Commit Graph

1332 Commits

Author SHA1 Message Date
Richard Yao b22e279797 Only trigger SET_ERROR tracepoint event on error
Currently, the SET_ERROR tracepoint triggers regardless of whether there
is an error or not. On Illumos, SET_ERROR only triggers on an actual
error, which is avoids irrelevant noise. Linux 2.6.38 added support for
conditional tracepoints, so we modify SET_ERROR to use them when they
are avaliable for functionality equivalent to the Illumos functionality.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4043
2015-12-02 17:16:41 -08:00
Tim Chase e5f9a9afd2 Additional dkio support for TRIM/Discard
Replace DKIOCTRIM with DKIOCFREE and add additional support required
for Nextenta's TRIM support.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #469
2015-12-02 13:44:35 -08:00
Chunwei Chen 61d482f7cd Linux 4.4 compat: xattr operations takes xattr_handler
The xattr_hander->{list,get,set} were changed to take a xattr_handler,
and handler_flags argument was removed and should be accessed by
handler->flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
2015-12-01 16:48:25 -08:00
Jason Zaman 8fc851b7b5 sysmacros: Make P2ROUNDUP not trigger int overflow
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.

Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.

Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.

Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2505
Closes #488
2015-11-13 15:21:52 -08:00
Justin T. Gibbs bc4501f75a Illumos 6267 - dn_bonus evicted too early
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6267
  https://github.com/illumos/illumos-gate/commit/d205810

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Ned Bass bass6@llnl.gov
Issue #3865
Issue #3443
2015-10-13 14:12:02 -07:00
Brian Behlendorf 9b13f65d28 Fix CPU hotplug
Allocate a kmem cache magazine for every possible CPU which might
be added to the system.  This ensures that when one of these CPUs
is enabled it can be safely used immediately.

For many systems the number of online CPUs is identical to the
number of present CPUs so this does imply an increased memory
footprint.  In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #482
2015-10-13 09:50:40 -07:00
Chunwei Chen 374303a3c9 Use tab indent in rwlock.h
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
2015-10-02 11:21:35 -07:00
Chunwei Chen a00b3eb58f rwsem use kernel provided owner when possible
If CONFIG_RWSEM_SPIN_ON_OWNER is defined, rw_semaphore will have an owner
field, so we don't need our own.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
2015-10-02 11:21:32 -07:00
Chunwei Chen 4f8e643afe Don't take spin lock on rwlock owner
The spin lock around rw_owner is completely unnecessary. The reason is that it
is only modified in the down_write context. If you race against another thread
modifying it, that means that you aren't holding the rwlock, so taking the
spin lock don't eliminate the race.

Also, we only check rw_owner in RW_WRITE_HELD because spl_rwsem_is_locked
is unnecessary and might need to take spin lock.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
2015-10-02 11:20:55 -07:00
Lukas Wunner 784a7fe5d9 Linux 4.3 compat: bio_end_io_t / BIO_UPTODATE
Commit torvalds/linux@4246a0b63b
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.

bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592 ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Issue #3799
2015-09-25 12:44:54 -07:00
Don Brady 56b3986316 Add large block support to zpios(1) benchmark
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.

  `zpios ... -S size | --blocksize size ...`

This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`

Example run below
(non realistic results from a VM and output abbreviated for space)

```
 --regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
 --threaddelay=0 --cleanup --human-readable --verbose --cleanup
 --blocksize=128K,256K,512K,1M

 th-cnt  rg-cnt  rg-sz  ch-sz  blksz  wr-data wr-bw   rd-data rd-bw
---------------------------------------------------------------------
 4       750     8m     1m     128k   5g      90.06m  5g      93.37m
 4       750     8m     1m     256k   5g      79.71m  5g      99.81m
 4       750     8m     1m     512k   5g      42.20m  5g      93.14m
 4       750     8m     1m     1m     5g      35.51m  5g      89.36m
 8       750     8m     1m     128k   5g      85.49m  5g      90.81m
 8       750     8m     1m     256k   5g      61.42m  5g      99.24m
 8       750     8m     1m     512k   5g      49.09m  5g     108.78m
 16      750     8m     1m     128k   5g      86.28m  5g      88.73m
 16      750     8m     1m     256k   5g      64.34m  5g      93.47m
 16      750     8m     1m     512k   5g      68.84m  5g     124.47m
 16      750     8m     1m     1m     5g      53.97m  5g      97.20m
---------------------------------------------------------------------
```

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3795
Closes #2071
2015-09-22 09:13:20 -07:00
Ned Bass 3af56fd95f Honor xattr=sa dataset property
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:

* Libzfs has historically included "xattr" in its list of default mount
  options. This overrides the dataset property, so the dataset is always
  configured to use directory-based xattrs even when the xattr dataset
  property is set to off or sa. Address this by removing "xattr" from
  the set of default mount options in libzfs.

* There was no way to enable system attribute-based extended attributes
  using temporary mount options. Add the mount options "saxattr" and
  "dirxattr" which enable the xattr behavior their names suggest.  This
  approach has the advantages of mirroring the valid xattr dataset
  property values and following existing conventions for mount option
  names.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3787
2015-09-19 14:04:14 -07:00
Arne Jansen 4e0f33ffe0 Illumos 6214 - zpools going south
6214 zpools going south
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>

References:
  https://www.illumos.org/issues/6214
  http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/

Porting Notes:

Reintroduce b_compress to the l2arc_buf_hdr_t.  In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t.  This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held.  See Illumos 6214
for a detailed analysis of the race.

HDR_GET_COMPRESS() macro was removed from arc_buf_info().

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3757
2015-09-11 11:14:38 -07:00
Richard Yao 8198d18ca7 Reintroduce IO accounting on zvols on Linux 3.19+
zfsonlinux/zfs@e20cd6f7a8 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503b provided
suitable symbols for restoring this functionality 4 months later.  We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3741
2015-09-09 09:29:24 -07:00
Brian Behlendorf 3b36f8319d Add dbgmsg kstat
Internally ZFS keeps a small log to facilitate debugging.  By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1.  The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.

$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp    message
1441141408   spa=tank async request task=1
1441141408   txg 70 open pool version 5000; software version 5000/5; ...
1441141409   spa=tank async request task=32
1441141409   txg 72 import pool version 5000; software version 5000/5; ...
1441141414   command: lt-zpool import tank

Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log.  As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat.  For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.

$ ZFS_DEBUG=on ./cmd/ztest/ztest -V

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3728
2015-09-04 16:08:14 -07:00
Brian Behlendorf e20cd6f7a8 Merge branch 'zvol'
Performance improvements for zvols.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3720
2015-09-04 13:14:21 -07:00
Richard Yao 7d6e2adb4e Remove blk_rq_bytes()/blk_rq_sectors autotools checks
Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao f952eaa7ec Remove blk_rq_pos() autotools check
Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao f8c56b405d Remove blk_fetch_request() autotools check
Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao e8c6be131c Remove blk_requeue_request() autotools check
Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao dd6f9fe61b Remove blk_end_request() autotools check.
Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao 65f340e725 Remove rq_is_sync() autotools check
Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao 9ddf9b8e15 Remove rq_for_each_segment() autotools check
Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao 37f9dac592 zvol processing should use struct bio
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.

Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.

Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:

1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.

An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat.  Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.

Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:30:24 -04:00
Brian Behlendorf 0282c4137e Add temporary mount options
Add the required kernel side infrastructure to parse arbitrary
mount options.  This enables us to support temporary mount
options in largely the same way it is handled on other platforms.

See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #985
Closes #3351
2015-09-03 14:14:55 -07:00
Richard Yao 782b2c326e VDEV_REQ_FUA should be mapped to REQ_FUA
Pre-2.6.37 kernels support REQ_FUA in request flags, but not in BIO
flags. zvols are the only consumer of VDEV_REQ_FUA and since they are
passed requests, they should be obey the REQ_FUA flag like later
kernels. This optimization will only matter on 2.6.36 and 2.6.37 because
the zvol rework changes things to use bio, where we no longer are able
to distinguish on earlier kernels

Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-02 12:39:08 -04:00
Tim Chase 69de34219a Dbuf hash table should be sized as is the arc hash table
Commit 49ddb31506 added the
zfs_arc_average_blocksize parameter to allow control over the size of
the arc hash table.  The dbuf hash table's size should be determined
similarly.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3721
2015-09-02 09:33:02 -07:00
Richard Yao fb40095f5f Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
2015-09-01 15:22:07 -07:00
Richard Yao 97771edaca Remove blk_queue_io_opt() autotools check
This is needed for supporting kernels earlier than 2.6.30. Support for
those kernels was dropped, so we can safely remove this check.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-01 09:33:18 -07:00
Richard Yao 3c119330a6 Remove blk_queue_physical_block_size() autotools check
This is needed for supporting kernels earlier than 2.6.30. Support for
those kernels was dropped, so we can safely remove this check.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-01 09:33:18 -07:00
Brian Behlendorf 278bee9319 Linux 3.18 compat: Snapshot auto-mounting
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels.  And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients.  This patch makes
the following core improvements.

* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.

* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process.  This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.

* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.

* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds.  This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.

* Snapshots in active use will not be automatically unmounted.  As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.

* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated.  This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots.  This patch modifies the
behavior such that only negative dentries are invalidated.  This solves
the issue and may result in a performance improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3589
Closes #3344
Closes #3295
Closes #3257
Closes #3243
Closes #3030
Closes #2841
2015-08-31 13:54:39 -07:00
Brian Behlendorf 4cb7b9c5d4 Check large block feature flag on volumes
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create.  Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.

In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3591
2015-08-28 09:25:03 -07:00
Chunwei Chen 17888ae30d Add compatibility layer for {kmap,kunmap}_atomic
Starting from linux-2.6.37, {kmap,kunmap}_atomic takes 1 argument instead of 2.
We use zfs_{kmap,kunmap}_atomic as wrappers and always take 2 argument, but
ignore the 2nd for newer kernel.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-08-24 10:13:25 -07:00
Chunwei Chen ae89cf0f34 Restructure uio to accommodate bio_vec
Starting from Linux 4.1, bio_vec will be allowed to pass into filesystem via
iter_read/iter_write, so we add a bio_vec field in uio_t to hold it, and use
UIO_BVEC in segflg to determine which "vec".

Also, to be consistent to newer kernel, we make iovec and bio_vec immutable,
and make uio act as an iterator with the new uio_skip field indicating number
of bytes to skip in the first segment.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3511
Issue zfsonlinux/zfs#3640
Closes #468
2015-08-24 10:10:21 -07:00
Tim Chase 851a549305 Include other sources of freeable memory in the freemem calculation
Prevents ARC collapse when non-ZFS filesystems, the block layer or other
memory consumers use a lot of reclaimable memory in the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#3680
Closes #471
2015-08-19 09:25:30 -07:00
Chris Dunlop 9d4f86e825 Fix build failure with Linux 4.1 and FTRACE
See also #3546, commit c1718e9

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3673
2015-08-18 16:47:21 -07:00
Brian Behlendorf 8ac6ffecaf Remove needfree, desfree, lotsfree #defines
This patch reverts 77ab5dd.  This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places.  Those places have subsequently been updated to use
the Linux equivalent Linux functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3637
2015-07-30 11:45:24 -07:00
Frédéric VANNIÈRE c1718e9580 Fix build failure with Linux 4.1 and FTRACE
Signed-off-by: Frédéric VANNIÈRE <f.vanniere@planet-work.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3546
2015-07-29 07:35:06 -07:00
Brian Behlendorf 9dc5ffbec8 Invert minclsyspri and maxclsyspri
On Linux the meaning of a processes priority is inverted with respect
to illumos.  High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.

In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted.  This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.

Note this change also reverts some of the priorities changes in prior
commit 62aa81a.  The rational is as follows:

spl_kmem_cache    - High priority may result in blocked memory allocs
spl_system_taskq  - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607
2015-07-28 13:59:03 -07:00
Brian Behlendorf 1229323d5f Align thread priority with Linux defaults
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority.  Non-I/O filesystem
processes run with the default priority.  ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active.  The priorities have been updated to the following:

$ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
     -  TS 10743  19 -20 [spl_kmem_cache]
     -  TS 10744  19 -20 [spl_system_task]
     -  TS 10745  19 -20 [spl_dynamic_tas]
     -  TS 10764  19   0 [dbu_evict]
     -  TS 10765  19   0 [arc_prune]
     -  TS 10766  19   0 [arc_reclaim]
     -  TS 10767  19   0 [arc_user_evicts]
     -  TS 10768  19   0 [l2arc_feed]
     -  TS 10769  39   0 [z_unmount]
     -  TS 10770  39 -20 [zvol]
     -  TS 11011  39 -20 [z_null_iss]
     -  TS 11012  39 -20 [z_null_int]
     -  TS 11013  39 -20 [z_rd_iss]
     -  TS 11014  39 -20 [z_rd_int_0]
     -  TS 11022  38 -19 [z_wr_iss]
     -  TS 11023  39 -20 [z_wr_iss_h]
     -  TS 11024  39 -20 [z_wr_int_0]
     -  TS 11032  39 -20 [z_wr_int_h]
     -  TS 11033  39 -20 [z_fr_iss_0]
     -  TS 11041  39 -20 [z_fr_int]
     -  TS 11042  39 -20 [z_cl_iss]
     -  TS 11043  39 -20 [z_cl_int]
     -  TS 11044  39 -20 [z_ioctl_iss]
     -  TS 11045  39 -20 [z_ioctl_int]
     -  TS 11046  39 -20 [metaslab_group_]
     -  TS 11050  19   0 [z_iput]
     -  TS 11121  38 -19 [z_wr_iss]

Note that under Linux the meaning of a processes priority is inverted
with respect to illumos.  High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.

In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted.  This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.

This patch depends on https://github.com/zfsonlinux/spl/pull/466.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3607
2015-07-28 13:36:47 -07:00
Brian Behlendorf 62aa81a577 Add defclsyspri macro
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority.  Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority.  This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.

All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri.  The vast majority of callers were
part of the test suite which won't have an external impact.  The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri.  This makes it more likely the process will be scheduled
which may help performance.

To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added.  When disabled (0) all newly created kernel
threads will use the default kernel thread priority.  When enabled (1)
the specified taskq priority will be used.  By default this value is
enabled (1).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-07-23 13:25:49 -07:00
Prakash Surya 36da08ef9b Illumos 5817 - change type of arcs_size from uint64_t to refcount_t
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5817
  https://github.com/illumos/illumos-gate/commit/2fd872a

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:28 -07:00
Matthew Ahrens ca67b33aba Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_grow
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5376
  https://github.com/illumos/illumos-gate/commit/2ec99e3

Porting Notes:

The good news is that many of the recent changes made upstream to the
ARC tackled issues previously observed by ZoL with similar solutions.
The bad news is those solution weren't identical to the ones we applied.
This patch is designed to split the difference and apply as much of the
upstream work as possible.

* The arc_available_memory() function was removed previous in ZoL but
due to the upstream changes it makes sense to add it back.  This function
has been customized for Linux so that it can be used to determine a low
memory.  This provides the same basic functionality as the illumos version
allowing us to minimize changes through the rest of the code base.  The
exact mechanism used to detect a low memory state remains unchanged so
this change isn't a significant as it might first appear.

* This patch includes the long standing fix for arc_shrink() which was
originally proposed in #2167.  Since there were related changes to this
function it made sense to include that work.

* The arc_init() function has been re-factored.  As before it sets sane
default values for the ARC but then calls arc_tuning_update() to apply
user specific tuning made via module options.  The arc_tuning_update()
function is then called periodically by the arc_reclaim_thread() to
apply changes to the tunings made during normal operation.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3616
Closes #2167
2015-07-23 09:41:28 -07:00
Turbo Fredriksson 37d7cd94f3 Support parallel build trees (VPATH builds)
Build products from an out of tree build should be written
relative to the build directory.  Sources should be referred
to by their locations in the source directory.

This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile.  This enables the following:

  $ mkdir build
  $ cd build
  $ ../configure
  $ make -s

This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.

  Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
  Makefile.am:00: but option 'subdir-objects' is disabled

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082
2015-07-17 12:53:11 -07:00
Brian Behlendorf e80da86447 Linux 4.2 compat: bdi_setup_and_register()
The vfs_compat.h header should include the linux/backing-dev.h header
because it depends on the bdi_* functions defined there.  In previous
kernels this header was being indirectly included which prevented a
build failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3596
2015-07-17 09:15:43 -07:00
Matthew Ahrens 905edb405d Illumos 5347 - idle pool may run itself out of space
5347 idle pool may run itself out of space
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/231aab8
  https://github.com/illumos/illumos-gate/commit/4a92375 3642
  https://www.illumos.org/issues/5347
  https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix)
  https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390
  https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643

Porting notes:
This is completing the partial fix from FreeBSD

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3586
2015-07-14 10:35:21 -07:00
Justin T. Gibbs 99197f034e Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://github.com/illumos/illumos-gate/commit/db1741f
  https://www.illumos.org/issues/5661

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3571
2015-07-10 12:11:45 -07:00
Josef 'Jeff' Sipek 411bf201f5 Illumos 4745 - fix AVL code misspellings
4745 fix AVL code misspellings
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://github.com/illumos/illumos-gate/commit/6907ca4
  https://www.illumos.org/issues/4745

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3565
2015-07-10 11:58:37 -07:00
Alexander Motin e16b3fcc61 Illumos 5008 - lock contention (rrw_exit) while running a read only load
5008 lock contention (rrw_exit) while running a read only load
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Porting notes:

This patch ported perfectly cleanly to ZoL.  During testing 100% cached
small-block reads, extreme contention was noticed on rrl->rr_lock from
rrw_exit() due to the frequent entering and leaving ZPL.  Illumos picked
up this patch from FreeBSD and it also helps under Linux.

On a 1-minute 4K cached read test with 10 fio processes pinned to a single
socket on a 4-socket (10 thread per socket) NUMA system, contentions on
rrl->rr_lock were reduced from 508799 to 43085.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3555
2015-07-06 09:34:13 -07:00
Matthew Ahrens 4bda3bd0e7 Illumos 5911 - ZFS "hangs" while deleting file
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5911
  https://github.com/illumos/illumos-gate/commit/46e1baa

Porting notes:

Resolved ISO C90 forbids mixed declarations and code wanting in
the dnode_free_range() function.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3554
2015-07-06 09:31:42 -07:00