Commit Graph

3517 Commits

Author SHA1 Message Date
AndCycle 256fa983f4 Obey arc_meta_limit default size when changing arc_max
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit.  Otherwise, the arc_meta_limit
may be set the same as arc_max.

Signed-off-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4001
2015-11-13 15:45:22 -08:00
loli10K 31f24932a4 Remove superfluous newline character
Remove superfluous `newline` character from spl_kmem_cache_magazine_size
module parameter description.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #499
2015-11-13 15:27:45 -08:00
tuxoko f5f2b87df0 Fix taskq dynamic spawning
Currently taskq_dispatch() will spawn new task with a condition that the caller
is also a member of the taskq. However, under this condition, it will still
cause deadlock where a task on tq1 is waiting another thread, who is trying to
dispatch a task on tq1. So this patch removes the check.

For example when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy

This will easily deadlock before this patch.

Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #496
2015-11-13 15:02:55 -08:00
Chunwei Chen 3e7e6f34d0 Don't call kmem_cache_shrink from shrinker
Linux slab will automatically free empty slab when number of partial slab is
over min_partial, so we don't need to explicitly shrink it. In fact, calling
kmem_cache_shrink from shrinker will cause heavy contention on
kmem_cache_node->list_lock, to the point that it might cause __slab_free to
livelock (see zfsonlinux/zfs#3936)

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3936
Closes #487
2015-11-11 13:48:31 -08:00
Chunwei Chen 07d63f0cb9 Fix fail path in zfs_znode_alloc
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().

Since we never call insert_inode_locked in fail path, I_NEW is never set, the
inode is never hashed. So unlock_new_inode() can be safely remove it.

We set z_sa_hdl to NULL in fail path so that iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent double sa_handle_destroy().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
2015-10-13 15:57:17 -07:00
Chunwei Chen aa159afb56 Fix use-after-free in vdev_disk_physio_completion
Currently, vdev_disk_physio_completion will try to wake up an waiter without
first checking the existence. This creates a race window in which complete is
called after dr is freed.

We add dr_wait in dio_request to indicate the existence of waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact in 64bit.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3917
Issue #3880
2015-10-13 15:25:33 -07:00
Justin T. Gibbs bc4501f75a Illumos 6267 - dn_bonus evicted too early
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6267
  https://github.com/illumos/illumos-gate/commit/d205810

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Ned Bass bass6@llnl.gov
Issue #3865
Issue #3443
2015-10-13 14:12:02 -07:00
Brian Behlendorf 9b13f65d28 Fix CPU hotplug
Allocate a kmem cache magazine for every possible CPU which might
be added to the system.  This ensures that when one of these CPUs
is enabled it can be safely used immediately.

For many systems the number of online CPUs is identical to the
number of present CPUs so this does imply an increased memory
footprint.  In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #482
2015-10-13 09:50:40 -07:00
Brian Behlendorf 935434ef01 Fix 'arc_c < arc_c_min' panic
Strictly enforce keeping 'arc_c >= arc_c_min'.  The ASSERTs are
left in place to catch this in a debug build but logic has been
added to gracefully handle in a production build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3904
2015-10-13 09:23:35 -07:00
Richard Yao 919efe93cb zfs_inode_update should not call dmu_object_size_from_db under spinlock
We should never block when holding a spin lock, but zfs_inode_update can
block in the critical section of a spin lock in zfs_inode_update:

zfs_inode_update -> dmu_object_size_from_db -> zrl_add -> mutex_enter

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
2015-09-30 10:47:40 -07:00
Richard Yao bc8ffb2d08 Remove obsolete zv_lock
All users of zv_lock were removed by 37f9dac, but we forgot to remove
it.  Lets remove it as clean up.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
2015-09-30 10:43:19 -07:00
Chunwei Chen 45838e3a41 Fix uioskip crash when skip to end
When doing uioskip to skip an iovec to the very end, the current loop
condition will falsely check pass the end of iovec. We fix this checking
uio_iovcnt first.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3806
Closes #3850
2015-09-29 10:06:58 -07:00
Brian Behlendorf 2ebe396046 Fix PAX Patch/Grsec SLAB_USERCOPY panic
Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS are enabled.  When this kernel option
is enabled slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.

Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.

Verified-by: Wuffleton <null@wuffleton.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2977
Issue #3796
2015-09-28 09:18:29 -07:00
Richard Yao b815ec32b3 Userspace can pass zero length segments via writev/readv
Userspace can trigger an assertion by passing a zero-length segment
when assertions are enabled:

[27961.614792] VERIFY3(skip < iov->iov_len) failed (0 < 0)
[27961.614795] PANIC at zfs_uio.c:187:uio_prefaultpages()
[27961.614805] Call Trace:
[27961.614811]   dump_stack+0x45/0x57
[27961.614830]   spl_dumpstack+0x44/0x50 [spl]
[27961.614834]   spl_panic+0xbb/0x100 [spl]
[27961.614908]   uio_prefaultpages+0x134/0x140 [zcommon]
[27961.614930]   zfs_write+0x1fd/0xe80 [zfs]
[27961.615014]   zpl_write_common_iovec+0x7f/0x110 [zfs]
[27961.615035]   zpl_iter_write+0xa0/0xd0 [zfs]
[27961.615037]   do_iter_readv_writev+0x59/0x80
[27961.615063]   do_readv_writev+0x11b/0x260
[27961.615098]   vfs_writev+0x39/0x50
[27961.615100]   SyS_writev+0x4a/0xe0
[27961.615103]   system_call_fastpath+0x16/0x6e

The solution is to delete the assertion. This could potentially
occur in uiomove as well, which contains analogous assertions
that appear similarly unnecessary, so we remove those as well.

Reported-by: Jonathan Vasquez <jvasquez1011@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3792
2015-09-25 12:51:16 -07:00
Brian Behlendorf a3000f9358 Revert "dmu_objset_userquota_get_ids uses dn_bonus unsafely"
This reverts commit 5f8e1e8505.  It
was determined that this patch introduced the quota regression
described in #3789.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
Issue #3789
2015-09-25 12:50:24 -07:00
Brian Behlendorf 5592404784 Fix synchronous behavior in __vdev_disk_physio()
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in a RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
2015-09-25 12:47:31 -07:00
Brian Behlendorf ef5b2e1048 Avoid blocking in arc_reclaim_thread()
As described in the comment above arc_reclaim_thread() it's critical
that the reclaim thread be careful about blocking.  Just like it must
never wait on a hash lock, it must never wait on a task which can in
turn wait on the CV in arc_get_data_buf().  This will deadlock, see
issue #3822 for full backtraces showing the problem.

To resolve this issue arc_kmem_reap_now() has been updated to use the
asynchronous arc prune function.  This means that arc_prune_async()
may now be called while there are still outstanding arc_prune_tasks.
However, this isn't a problem because arc_prune_async() already
keeps a reference count preventing multiple outstanding tasks per
registered consumer.  Functionally, this behavior is the same as
the counterpart illumos function dnlc_reduce_cache().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3808
Issue #3834
Issue #3822
2015-09-25 12:45:47 -07:00
Brian Behlendorf 04870568e6 Disable zpl_nr_cached_objects() callback
The zpl_nr_cached_objects() function has been disabled because in the
current code it doesn't provide any critical functionality and it may
result in a deadlock under certain circumstances.  However, because
we expect to need these hooks in the future this code has not been
entirely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3719
2015-09-25 12:45:42 -07:00
Brian Behlendorf d4787d55ad Allow NFS activity to defer snapshot unmounts
Accessing a snapshot via NFS should cause an auto-unmount of that
snapshot to be deferred until such as time as the snapshot is idle.
This is analogous to the zpl_revalidate logic employed by locally
mounted snapshots.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3794
2015-09-25 12:45:38 -07:00
Lukas Wunner 784a7fe5d9 Linux 4.3 compat: bio_end_io_t / BIO_UPTODATE
Commit torvalds/linux@4246a0b63b
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.

bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592 ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Issue #3799
2015-09-25 12:44:54 -07:00
Don Brady 56b3986316 Add large block support to zpios(1) benchmark
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.

  `zpios ... -S size | --blocksize size ...`

This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`

Example run below
(non realistic results from a VM and output abbreviated for space)

```
 --regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
 --threaddelay=0 --cleanup --human-readable --verbose --cleanup
 --blocksize=128K,256K,512K,1M

 th-cnt  rg-cnt  rg-sz  ch-sz  blksz  wr-data wr-bw   rd-data rd-bw
---------------------------------------------------------------------
 4       750     8m     1m     128k   5g      90.06m  5g      93.37m
 4       750     8m     1m     256k   5g      79.71m  5g      99.81m
 4       750     8m     1m     512k   5g      42.20m  5g      93.14m
 4       750     8m     1m     1m     5g      35.51m  5g      89.36m
 8       750     8m     1m     128k   5g      85.49m  5g      90.81m
 8       750     8m     1m     256k   5g      61.42m  5g      99.24m
 8       750     8m     1m     512k   5g      49.09m  5g     108.78m
 16      750     8m     1m     128k   5g      86.28m  5g      88.73m
 16      750     8m     1m     256k   5g      64.34m  5g      93.47m
 16      750     8m     1m     512k   5g      68.84m  5g     124.47m
 16      750     8m     1m     1m     5g      53.97m  5g      97.20m
---------------------------------------------------------------------
```

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3795
Closes #2071
2015-09-22 09:13:20 -07:00
Ned Bass 3af56fd95f Honor xattr=sa dataset property
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:

* Libzfs has historically included "xattr" in its list of default mount
  options. This overrides the dataset property, so the dataset is always
  configured to use directory-based xattrs even when the xattr dataset
  property is set to off or sa. Address this by removing "xattr" from
  the set of default mount options in libzfs.

* There was no way to enable system attribute-based extended attributes
  using temporary mount options. Add the mount options "saxattr" and
  "dirxattr" which enable the xattr behavior their names suggest.  This
  approach has the advantages of mirroring the valid xattr dataset
  property values and following existing conventions for mount option
  names.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3787
2015-09-19 14:04:14 -07:00
Brian Behlendorf 66aad10ce8 Fix NULL as mount(2) syscall data parameter
Passing NULL for the mount data should not result in EINVAL.  It
should be treated as if an empty string were passed and succeed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3771
2015-09-19 14:03:01 -07:00
Richard Yao f52ebcb3eb Discard on zvols should not exceed the length of a block
37f9dac592 replaced the end-start
calculation with a cached value, but neglected to update it on discard
operations. This can cause us to discard data not requested, causing
data loss on zvols.

Reported-by: Richard Connon <richard.connon@zynstra.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3798
2015-09-19 14:00:14 -07:00
Arne Jansen 4e0f33ffe0 Illumos 6214 - zpools going south
6214 zpools going south
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>

References:
  https://www.illumos.org/issues/6214
  http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/

Porting Notes:

Reintroduce b_compress to the l2arc_buf_hdr_t.  In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t.  This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held.  See Illumos 6214
for a detailed analysis of the race.

HDR_GET_COMPRESS() macro was removed from arc_buf_info().

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3757
2015-09-11 11:14:38 -07:00
Brian Behlendorf 9965059ab9 Prefetch start and end of volumes
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume.  Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
2015-09-09 14:38:29 -07:00
Richard Yao d4bf6d8429 Disable direct reclaim in taskq worker threads on Linux 3.9+
Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes #474
2015-09-09 14:30:47 -07:00
Richard Yao 8198d18ca7 Reintroduce IO accounting on zvols on Linux 3.19+
zfsonlinux/zfs@e20cd6f7a8 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503b provided
suitable symbols for restoring this functionality 4 months later.  We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3741
2015-09-09 09:29:24 -07:00
Brian Behlendorf 3b36f8319d Add dbgmsg kstat
Internally ZFS keeps a small log to facilitate debugging.  By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1.  The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.

$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp    message
1441141408   spa=tank async request task=1
1441141408   txg 70 open pool version 5000; software version 5000/5; ...
1441141409   spa=tank async request task=32
1441141409   txg 72 import pool version 5000; software version 5000/5; ...
1441141414   command: lt-zpool import tank

Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log.  As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat.  For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.

$ ZFS_DEBUG=on ./cmd/ztest/ztest -V

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3728
2015-09-04 16:08:14 -07:00
Brian Behlendorf 0500e835af Support accessing .zfs/snapshot via NFS
This patch is based on the previous work done by @andrey-ve and
@yshui.  It triggers the automount by using kern_path() to traverse
to the known snapshout mount point.  Once the snapshot is mounted
NFS can access the contents of the snapshot.

Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server.  To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally.  System administators desiring this
functionally must explicitly enable it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2797
Closes #1655
Closes #616
2015-09-04 13:23:53 -07:00
Andrey Vesnovaty aa9b27080b Fix invalid fileid for snapshot root dentry
Prevents NFS client from detection of different fileids of snapshot root dentry
before & after snapshot mount.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-04 13:23:06 -07:00
Brian Behlendorf e20cd6f7a8 Merge branch 'zvol'
Performance improvements for zvols.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3720
2015-09-04 13:14:21 -07:00
Richard Yao fa56567630 Support secure discard on zvols
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao 37f9dac592 zvol processing should use struct bio
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.

Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.

Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:

1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.

An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat.  Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.

Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:30:24 -04:00
Tim Chase dca8c34da4 Prevent reclaim in the traverse prefetch thread
Reclaim in the traverse prefetch thread, which is run on the system
taskq, can overrun the stack.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3733
2015-09-04 08:43:28 -07:00
Brian Behlendorf 0282c4137e Add temporary mount options
Add the required kernel side infrastructure to parse arbitrary
mount options.  This enables us to support temporary mount
options in largely the same way it is handled on other platforms.

See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #985
Closes #3351
2015-09-03 14:14:55 -07:00
Tim Chase 69de34219a Dbuf hash table should be sized as is the arc hash table
Commit 49ddb31506 added the
zfs_arc_average_blocksize parameter to allow control over the size of
the arc hash table.  The dbuf hash table's size should be determined
similarly.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3721
2015-09-02 09:33:02 -07:00
Brian Behlendorf 6cde64351e Add spa_slop_shift module option
Allow for easy turning of a pools reserved free space.  Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity
in reserve.  Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3724
2015-09-02 09:30:18 -07:00
Richard Yao fb40095f5f Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
2015-09-01 15:22:07 -07:00
tuxoko cafbd2aca3 Check for RW_WRITE_HELD in zfs_inactive
Before read locking z_teardown_inactive_lock, we need to check if we have
already had write lock on it. Otherwise, we would deadlock on ourself when
doing rollback:

zfs_ioc_rollback
->zfs_suspend_fs (z_teardown_inactive_lock, RW_WRITER)
->zfs_resume_fs->zfs_rezget->zfs_iput_async->iput-> ...
  ->zfs_inactive (z_teardown_inactive_lock, RW_READER)

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2869
2015-09-01 10:17:57 -07:00
Brian Behlendorf 324dcd3733 Linux 4.2 compat: misc_deregister()
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels.  It was only used to log
an informational error message of no real value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-01 09:33:18 -07:00
Brian Behlendorf 4fa4cab972 Linux 4.2 compat: misc_deregister()
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels.  It was only used to log
an informational error message of no real value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-01 09:20:45 -07:00
Tim Chase a64e55752f Create a new thread during recursive taskq dispatch if necessary
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.

This patch attempts to create a new thread for recursive dispatch when
none are available.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472
2015-09-01 08:46:41 -07:00
Brian Behlendorf 801b56090b Revert "Create a new thread during recursive taskq dispatch if necessary"
This reverts commit 076821e due to a locking issue uncovered in
subsequent testing.  An ASSERT is hit due to tq->tq_nspawn being
updated outside the lock.  The patch will need to be reworked.

VERIFY3(0 == tq->tq_nspawn) failed (0 == -1)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #472
2015-08-31 17:03:01 -07:00
Tim Chase 076821eaff Create a new thread during recursive taskq dispatch if necessary
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.

This patch attempts to create a new thread for recursive dispatch when
none are available.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472
2015-08-31 15:52:06 -07:00
Brian Behlendorf 278bee9319 Linux 3.18 compat: Snapshot auto-mounting
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels.  And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients.  This patch makes
the following core improvements.

* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.

* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process.  This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.

* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.

* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds.  This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.

* Snapshots in active use will not be automatically unmounted.  As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.

* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated.  This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots.  This patch modifies the
behavior such that only negative dentries are invalidated.  This solves
the issue and may result in a performance improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3589
Closes #3344
Closes #3295
Closes #3257
Closes #3243
Closes #3030
Closes #2841
2015-08-31 13:54:39 -07:00
Andrey Vesnovaty b23975cbe0 zfsctl: No need to sync ctldir inodes
There's no metadata to write to disk for ctldir inodes. So we check if
a inode belongs to the ctldir in zpl_commit_metadata, and returns
immediately if it is.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2797
2015-08-31 13:54:39 -07:00
Richard Yao c6a3a222d3 Clear QUEUE_FLAG_ADD_RANDOM on zvols
zvols should not be an entropy source for the kernel. Disable it to be
consistent with the upstream kernel.

torvalds/linux@b277da0a8a

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3713
2015-08-30 10:11:57 -07:00
loli10K 3757bff3b1 Fix small typo
Add a missing space to the zfs_vdev_sync_write_min_active module
parameter description.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3714
2015-08-30 10:10:16 -07:00
Tim Chase 36b454ab4c Initialize the taskq entry embedded within struct vdev
As part of the stack reduction effort in
50b25b2187, a zio_t containing a taskq_ent
was added to struct vdev_queue which itself is part of struct vdev.
The taskq entry should be initialized as is currently done in zio_create()
for newly-created bare zio_t object.  The rationale is the same as is
described in f467b05a26.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3709
2015-08-30 10:04:56 -07:00