Commit Graph

1062 Commits

Author SHA1 Message Date
Matthew Ahrens
5a28a9737a Illumos 6288 - dmu_buf_will_dirty could be faster
6288 dmu_buf_will_dirty could be faster
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6288
  https://github.com/illumos/illumos-gate/commit/0f2e7d0

Porting notes:
- [module/zfs/dbuf.c]
  - Fix 'warning: ISO C90 forbids mixed declarations and code'
    by moving 'dbuf_dirty_record_t *dr' to start of code block.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 09:13:52 -08:00
George Wilson
2e8efe1bef Illumos 6292 - exporting a pool while an async destroy
6292 exporting a pool while an async destroy is running can leave
entries in the deferred tree
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Fabian Keil <fk@fabiankeil.de>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/6292
  https://github.com/illumos/illumos-gate/commit/a443cc8

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 09:10:52 -08:00
Matthew Ahrens
5511754b4f Illumos 6319 - assertion failed in zio_ddt_write: bp->blk_birth == txg
6319 assertion failed in zio_ddt_write: bp->blk_birth == txg
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/6319
  https://github.com/illumos/illumos-gate/commit/b39b744

Porting notes:
- Re-enabled ztest for CentOS test slaves.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3449
2016-01-12 09:10:52 -08:00
Matthew Ahrens
7f60329a26 Illumos 5987 - zfs prefetch code needs work
5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/5987 zfs prefetch code needs work
  illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work

Porting notes:
- [module/zfs/dbuf.c]
  - 5f6d0b6 Handle block pointers with a corrupt logical size
- [module/zfs/dmu_zfetch.c]
  - c65aa5b Fix gcc missing parenthesis warnings
  - 428870f Update core ZFS code from build 121 to build 141.
  - 79c76d5 Change KM_PUSHPAGE -> KM_SLEEP
  - b8d06fc Switch KM_SLEEP to KM_PUSHPAGE
  - Account for ISO C90 - mixed declarations and code - warnings
  - Module parameters (new/changed):
    - Replaced zfetch_block_cap with zfetch_max_distance
      (Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024))
    - Preserved zfs_prefetch_disable as 'int' for consistency with
      existing Linux module options.
- [include/sys/trace_arc.h]
  - Added new tracepoints
    - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async);
    - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch);
- [man/man5/zfs-module-parameters.5]
  - Updated man page

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-12 09:02:33 -08:00
Brian Behlendorf
ab5cbbd107 Illumos 6293 - ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6293
  https://github.com/illumos/illumos-gate/commit/8fe00bf

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 14:10:31 -08:00
Brian Behlendorf
b870c7e5f4 Revert "Illumos 3749 - zfs event processing should work on R/O root filesystems"
This reverts commit b47637ecdc which
introduced a regression in ztest.

$ ./cmd/ztest/ztest -V
5 vdevs, 7 datasets, 23 threads, 300 seconds...
*** Error in `/rpool/home/behlendo/src/git/zfs/cmd/ztest/.libs/lt-ztest':
double free or corruption (fasttop): 0x0000000000d339f0 ***
2016-01-11 14:10:30 -08:00
Brian Behlendorf
b858767a31 Fix 'prevsnap property' build failure
Fix build failure accidentally introduced by 1715493.  This only
results in a failure when debugging is disabled.

dsl_dataset.c: In function 'dsl_dataset_stats':
dsl_dataset.c:1698:45: error: 'dp' undeclared (first use in this function)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 13:35:40 -08:00
Matthew Ahrens
1715493f38 Illumos 4929 - want prevsnap property
4929 want prevsnap property
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4929
  https://github.com/illumos/illumos-gate/commit/b461c74

Porting notes:
- [include/sys/fs/zfs.h]
  - f67d70 Create an 'overlay' property
  - 11b9ec Add full SELinux support
- [fs/zfs/dsl_dataset.c]
  - This increases the stack size of dsl_dataset_stats() but
    nothing has been changed until this is shown to be an issue.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 11:58:26 -08:00
Marcel Telka
f3c9dca093 Illumos 4638 - Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4638
  https://github.com/illumos/illumos-gate/commit/2144b12

Porting notes:
- [module/zfs/zfs_vnops.c]
  - 3558fd7 Prototype/structure update for Linux
  - 2cf7f52 Linux compat 2.6.39: mount_nodev()
  - Use zfs_is_readonly() wrapper
  - Remove first line of comment which doesn't apply

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 10:29:48 -08:00
Will Andrews
b47637ecdc Illumos 3749 - zfs event processing should work on R/O root filesystems
3749 zfs event processing should work on R/O root filesystems
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3749
  https://github.com/illumos/illumos-gate/commit/3cb69f7

Porting notes:
- [include/sys/spa_impl.h]
  - ffe9d38 Add generic errata infrastructure
  - 1421c89 Add visibility in to arc_read
- [include/sys/fm/fs/zfs.h]
  - 2668527 Add linux events
  - 6283f55 Support custom build directories and move includes
- [module/zfs/spa_config.c]
  - Updated spa_config_sync() to match illumos with the exception
    of a Linux specific block.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-11 09:23:37 -08:00
Brian Behlendorf
e9e3d31d2c Allow 16M send/recv blocks
Fix an off by one error introduced by fcff0f3 which triggers an
assertion when 16M blocks are used with send/recv.  This fix was
intentionally not folder in to the Illumos commit so it can be
easily cherry-picked by upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-08 20:23:23 -05:00
Paul Dagnelie
fcff0f35bd Illumos 5960, 5925
5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/5960
  https://www.illumos.org/issues/5925
  https://github.com/illumos/illumos-gate/commit/a2cdcdd

Porting notes:
- [lib/libzfs/libzfs_sendrecv.c]
  - b8864a2 Fix gcc cast warnings
  - 325f023 Add linux kernel device support
  - 5c3f61e Increase Linux pipe buffer size on 'zfs receive'
- [module/zfs/zfs_vnops.c]
  - 3558fd7 Prototype/structure update for Linux
  - c12e3a5 Restructure zfs_readdir() to fix regressions
- [module/zfs/zvol.c]
  - Function @zvol_map_block() isn't needed in ZoL
  - 9965059 Prefetch start and end of volumes
- [module/zfs/dmu.c]
  - Fixed ISO C90 - mixed declarations and code
  - Function dmu_prefetch() 'int i' is initialized before
    the following code block (c90 vs. c99)
- [module/zfs/dbuf.c]
  - fc5bb51 Fix stack dbuf_hold_impl()
  - 9b67f60 Illumos 4757, 4913
  - 34229a2 Reduce stack usage for recursive traverse_visitbp()
- [module/zfs/dmu_send.c]
  - Fixed ISO C90 - mixed declarations and code
  - b58986e Use large stacks when available
  - 241b541 Illumos 5959 - clean up per-dataset feature count code
  - 77aef6f Use vmem_alloc() for nvlists
  - 00b4602 Add linux kernel memory support

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-08 15:08:19 -08:00
Richard Sharpe
c5d0287011 Fix casesensitivity=insensitive deadlock
When casesensitivity=insensitive is set for the
file system, we can deadlock in a rename if the user uses different case
for each path. For example rename("A/some-file.txt", "a/some-file.txt").

The simple test for this is:

1. mkdir some-dir in a ZFS file system
2. touch some-dir/some-file.txt
3. mv Some-dir/some-file.txt some-dir/some-other-file.txt

This last request deadlocks trying to relock the i_mutex on the inode for
the parent directory.

The solution is to use d_add_ci in zpl_lookup if we are on a file system
that has the casesensitivity=insensitive attribute set.

This patch checks if we are working on a case insensitive file system and if
so, allocates storage for the case insensitive name and passes it to
zfs_lookup and then calls d_add_ci instead of d_splice_alias.

The performance impact seems to be minimal even though we have introduced a
kmalloc and kfree in the lookup path.

The problem was found when running Microsoft's FSCT against Samba on top of
ZFS On Linux.

Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4136
2016-01-08 11:05:07 -08:00
Jeremy Jones
b23ad7f350 Illumos 3139 - zdb dies when it tries to determine path of unlinked file
3139 zdb dies when it tries to determine path of unlinked file
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://github.com/illumos/illumos-gate/commit/1ce39b5
  https://www.illumos.org/issues/3139

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-05 11:25:41 -08:00
Matthew Ahrens
37f8a8835a Illumos 5746 - more checksumming in zfs send
5746 more checksumming in zfs send
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>

References:
  https://www.illumos.org/issues/5746
  https://github.com/illumos/illumos-gate/commit/98110f0
  https://github.com/zfsonlinux/zfs/issues/905

Porting notes:
- Minor conflicts due to:
  - https://github.com/zfsonlinux/zfs/commit/2024041
  - https://github.com/zfsonlinux/zfs/commit/044baf0
  - https://github.com/zfsonlinux/zfs/commit/88904bb
- Fix ISO C90 warnings (-Werror=declaration-after-statement)
  - arc_buf_t *abuf;
  - dmu_buf_t *bonus;
  - zio_cksum_t cksum_orig;
  - zio_cksum_t *cksump;
- Fix format '%llx' format specifier warning
- Align message in zstreamdump safe_malloc() with upstream

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3611
2015-12-30 14:24:14 -08:00
Ned Bass
43b4935e53 Prevent SA length overflow
The function sa_update() accepts a 32-bit length parameter and
assigns it to a 16-bit field in sa_bulk_attr_t, potentially
truncating the passed-in value. This could lead to corrupt system
attribute (SA) records getting written to the pool. Add a VERIFY to
sa_update() to detect cases where overflow would occur. The SA length
is limited to 16-bit values by the on-disk format defined by
sa_hdr_phys_t.

The function zfs_sa_set_xattr() is vulnerable to this bug if the
unpacked nvlist of xattrs is less than 64k in size but the packed
size is greater than 64k. Fix this by appropriately checking the
size of the packed nvlist before calling sa_update(). Add error
handling to zpl_xattr_set_sa() to keep the cached list of SA-based
xattrs consistent with the data on disk.

Lastly, zfs_sa_set_xattr() calls dmu_tx_abort() on an assigned
transaction if sa_update() returns an error, but the DMU only allows
unassigned transactions to be aborted. Wrap the sa_update() call in a
VERIFY0, remove the transaction abort, and call dmu_tx_commit()
unconditionally. This is consistent practice with other callers
of sa_update().

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4150
2015-12-30 13:20:12 -08:00
Chunwei Chen
f5f087eb88 Make xattr dir truncate and remove in one tx
We need truncate and remove be in the same tx when doing zfs_rmnode on xattr
dir. Otherwise, if we truncate and crash, we'll end up with inconsistent zap
object on the delete queue. We do this by skipping dmu_free_long_range and let
zfs_znode_delete to do the work.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4114
Issue #4052
Issue #4006
Issue #3018
Issue #2861
2015-12-28 09:48:26 -08:00
Chunwei Chen
29572ccdef Fix empty xattr dir causing lockup
During zfs_rmnode on a xattr dir, if the system crash just after
dmu_free_long_range, we would get empty xattr dir in delete queue. This would
cause blkid=0 be passed into zap_get_leaf_byblk when doing zfs_purgedir during
mount, and would try to do rw_enter on a wrong structure and cause system
lockup.

We fix this by returning ENOENT when blkid is zero in zap_get_leaf_byblk.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4114
Closes #4052
Closes #4006
Closes #3018
Closes #2861
2015-12-28 09:41:30 -08:00
Brian Behlendorf
2ebc7b72b3 Fix z_xattr_lock/z_teardown_lock inversion
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock.  Resolve this by taking the z_teardown_lock in
all registered xattr callbacks prior to taking the z_xattr_lock.
This ensures the locks are always taken is the same order thus
preventing a deadlock.  Note the z_teardown_lock is taken again
in zfs_lookup() and this is safe because the z_teardown lock is
a re-entrant read reader/writer lock.

* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
  __zpl_xattr_get
    zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro

* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
  zfs_resume_fs
    zfs_rezget -> Takes zp->z_xattr_lock

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #3943
Closes #3969
Closes #4121
2015-12-22 16:59:20 -08:00
Brian Behlendorf
228b461b56 Revert "Fix z_xattr_lock/z_teardown_lock lock inversion"
This reverts commit 6b32ef572f754efc3f9edb20d022450f8e6b02d9.
2015-12-22 16:58:43 -08:00
Brian Behlendorf
151f84e2c3 Fix ztest truncated cache file
Commit efc412b updated spa_config_write() for Linux 4.2 kernels to
truncate and overwrite rather than rename the cache file.  This is
the correct fix but it should have only been applied for the kernel
build.  In user space rename(2) is needed because ztest depends on
the cache file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4129
2015-12-22 10:40:40 -08:00
Olaf Faaland
448d7aaabc Identify locks flagged by lockdep
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.

This patch uses a mutex type defined in SPL, MUTEX_NOLOCKDEP, to mark
such mutexes when they are initialized.  This mutex type causes
attempts to take or release those locks to be wrapped in lockdep_off()
and lockdep_on() calls to silence the dependency checker and allow the
use of lock_stats to examine contention.

For RW locks, it uses an analogous lock type, RW_NOLOCKDEP.

The goal is that these locks are ultimately changed back to type
MUTEX_DEFAULT or RW_DEFAULT, after the locks are annotated to reflect
their relationship (e.g. z_name_lock below) or any real problem with the
lock dependencies are fixed.

Some of the affected locks are:

tc_open_lock:
=============
This is an array of locks, all with same name, which txg_quiesce must
take all of in order to move txg to next state.  All default to the same
lockdep class, and so to lockdep appears recursive.

zp->z_name_lock:
================
In zfs_rmdir,
        dzp = znode for the directory (input to zfs_dirent_lock)
        zp  = znode for the entry being removed (output of zfs_dirent_lock)

zfs_rmdir()->zfs_dirent_lock() takes z_name_lock in dzp
zfs_rmdir() takes z_name_lock in zp

Since both dzp and zp are type znode_t, the locks have the same default
class, and lockdep considers it a possible recursive lock attempt.

l->l_rwlock:
============
zap_expand_leaf() sometimes creates two new zap leaf structures, via
these call paths:

zap_deref_leaf()->zap_get_leaf_byblk()->zap_leaf_open()
zap_expand_leaf()->zap_create_leaf()->zap_expand_leaf()->zap_create_leaf()

Because both zap_leaf_open() and zap_create_leaf() initialize
l->l_rwlock in their (separate) leaf structures, the lockdep class is
the same, and the linux kernel believes these might both be the same
lock, and emits a possible recursive lock warning.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3895
2015-12-22 10:21:33 -08:00
DHE
dcb6bed1df Make zio_taskq_batch_pct user configurable
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4110
2015-12-18 13:46:23 -08:00
Brian Behlendorf
a58df6f536 Fix zfs_vdev_aggregation_limit bounds checking
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.

Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation.  For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-18 13:32:06 -08:00
Brian Behlendorf
6fe53787f3 Fix vdev_queue_aggregate() deadlock
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline.  This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread.  However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced which
allocation flags to be passed.  This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer.  Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly.  Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_cache_grow_work
  kv_alloc
    spl_vmalloc
      ...
      evict
        zpl_evict_inode
          zfs_inactive
            dmu_tx_wait
              txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3808
Closes #3867
2015-12-18 13:27:12 -08:00
Brian Behlendorf
a8ad3bf02c Fix z_xattr_lock/z_teardown_lock lock inversion
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock.  Detect this case and return EBUSY so zfs_resume_fs()
will mark the inode stale and it can be safely revalidated on next
access.

* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
  __zpl_xattr_get
    zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro

* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
  zfs_resume_fs
    zfs_rezget -> Takes zp->z_xattr_lock

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3969
2015-12-18 13:17:44 -08:00
Chunwei Chen
2727b9d3b6 Use uio for zvol_{read,write}
Since uio now supports bvec, we can convert bio into uio and reuse
dmu_{read,write}_uio. This way, we can remove some duplicate code.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4078
2015-12-15 16:21:43 -08:00
Chunwei Chen
502923bb44 Fix uio_prefaultpages for 0 length iovec
Userspace can freely pass in whatever iovec it feels like, and it's perfectly
legal to pass an iovec which contains a zero length segment. In the current
implementation, uio_prefaultpages would touch an out of bound byte in the
"last byte" logic. While this probably wouldn't cause any critical error, we
would like uio_prefaultpages to be able to continue gracefully.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4078
2015-12-15 16:19:55 -08:00
Brian Behlendorf
eba9e745dc Handle damaged blk_birth in dsl_deadlist_insert()
If a bit were cleared in `bp->blk_birth` such that the txg birth
was now lower than any other txg_birth in the deadlist, then there
will be no entry before this in the tree.

This should be impossible but regardless error handling code has
been added for this case.  By default this is left as a fatal case
and the blk_birth is logged.  However, setting `zfs_recover=1` will
cause the bp to be placed at the start of the deadlist even though
it contains an invalid blk_birth.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4086
Closes #4089
2015-12-15 16:12:31 -08:00
Brian Behlendorf
1cdb86cba2 Handle block pointers with a corrupt logical size
Commit 5f6d0b6 was originally added to gracefully handle block
pointers with a damaged logical size.  However, it incorrectly
assumed that all passed arc_done_func_t could handle a NULL
arc_buf_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4069
Closes #4080
2015-12-15 16:11:44 -08:00
Brian Behlendorf
245b7ab3d1 Hold the zfs_snapentry_t before dispatch
While exceptionally unlikely to cause a problem the zfs_snapentry_t
hold should be taken before the dispatch to prevent any possibility
of the task being processed before the hold.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-14 12:06:31 -08:00
Chunwei Chen
1997660170 Fix snapshot automount race cause EREMOTE
When a concorrent mount finishes just before calling to
zfsctl_snapshot_ismounted, if we return EISDIR, the VFS will return
with EREMOTE. We should instead just return 0, so VFS may retry and
would likely notice the dentry is alreadly mounted. This will be
inline with when usermode helper return EBUSY.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-14 12:06:31 -08:00
Brian Behlendorf
5ed27c572c Change zfs_snapshot_lock from mutex to rw lock
By changing the zfs_snapshot_lock from a mutex to a rw lock the
zfsctl_lookup_objset() function can be allowed to run concurrently.
This should reduce the latency of fh_to_dentry lookups in ZFS
snapshots which are being accessed over NFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-14 12:06:31 -08:00
Brian Behlendorf
f22f900f15 Fix zfsctl_lookup_objset() deadlock
The zfsctl_snapshot_unmount_delay() function must not be called
from zfsctl_lookup_objset() while it is currently holding the
zfs_snapshot_lock.  This will result in a deadlock.  It is safe
to call zfsctl_snapshot_unmount_delay_impl() directly because the
function already has a reference on the zfs_snapentry_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3997
2015-12-14 12:05:52 -08:00
Brian Behlendorf
5e94284fe5 Set 'zfs_expire_snapshot=0' to disable auto-unmount
There are cases where it's desirable that auto-mounted snapshots
not expire after a fixed duration.  They should be unmounted only
when the filesystem they are a snapshot of is unmounted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2015-12-14 11:02:32 -08:00
Chunwei Chen
24ef51f660 Use spa as key besides objsetid for snapentry
objsetid is not unique across pool, so using it solely as key would cause
panic when automounting two snapshot on different pools with the same
objsetid. We fix this by adding spa pointer as additional key.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3948
Issue #3786
Issue #3887
2015-12-08 16:38:56 -08:00
Brian Behlendorf
b58986eebf Use large stacks when available
While stack size will vary by architecture it has historically defaulted to
8K on x86_64 systems.  However, as of Linux 3.15 the default thread stack
size was increased to 16K.  These kernels are now the default in most non-
enterprise distributions which means we no longer need to assume 8K stacks.

This patch takes advantage of that fact by appropriately reverting stack
conservation changes which were made to ensure stability.  Changes which
may have had a negative impact on performance for certain workloads.  This
also has the side effect of bringing the code slightly more in line with
upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4059
2015-12-07 12:20:43 -08:00
Matthew Ahrens
241b541574 Illumos 5959 - clean up per-dataset feature count code
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5959
  https://github.com/illumos/illumos-gate/commit/ca0cc39

Porting notes:

illumos code doesn't check for feature_get_refcount() returning
ENOTSUP (which means feature is disabled) in zdb. zfsonlinux added
a check in https://github.com/zfsonlinux/zfs/commit/784652c
due to #3468. The check was reintroduced here.

Ported-by: Witaut Bajaryn <vitaut.bayaryn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3965
2015-12-04 14:20:20 -08:00
Brian Behlendorf
072484504f Add zap_prefetch() interface
Provide a generic interface to prefetch ZAP entries by name.  This
functionality is being added for external consumers such as Lustre.
It is based of the existing zap_prefetch_uint64() version which is
used by the deduplication code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #4061
2015-12-04 09:39:20 -08:00
tuxoko
b0fe1adeb1 Prevent rm modules.* when make install
This was originally in fe0ed8f910, but somehow
was changed and not working anymore. And it will cause the following error:

modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4027
2015-12-02 14:39:12 -08:00
Chunwei Chen
61d482f7cd Linux 4.4 compat: xattr operations takes xattr_handler
The xattr_hander->{list,get,set} were changed to take a xattr_handler,
and handler_flags argument was removed and should be accessed by
handler->flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
2015-12-01 16:48:25 -08:00
Chunwei Chen
1a09371678 Linux 4.4 compat: make_request_fn returns blk_qc_t
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
2015-12-01 16:48:08 -08:00
tuxoko
43518d92fd Fix zfs_dirty_data_max overflow on 32-bit
On 32 bit, the calculation of zfs_dirty_data_max from phymem will overflow,
causing it to be smaller than zfs_dirty_data_sync, and will cause txg being
delayed while no one write to disk. The end result is horrendous write speed.

On 4G ram 32-bit VM, before this patch, simple dd results in ~7MB/s. Now it
can reach speed on par with 64-bit VM.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3973
2015-11-19 16:02:47 -08:00
tuxoko
d0c614ecf9 Fix null pointer in arc_kmem_reap_now on 32-bit
On 32 bit system, zio_buf_cache is limit to 1M. Larger than that is all NULL.
So we need to avoid reaping them.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3973
2015-11-19 16:01:47 -08:00
Chunwei Chen
d287880afd Fix snapshot automount behavior when concurrent or fail
When concurrent threads accessing the snapdir, one will succeed the user
helper mount while others will get EBUSY. However, the original code treats
those EBUSY threads as success and goes on to do zfsctl_snapshot_add, which
causes repeated avl_add and thus panic.

Also, if the snapshot is already mounted somewhere else, a thread accessing
the snapdir will also get EBUSY from user helper mount. And it will cause
strange things as doing follow_down_one will fail and then follow_up will jump
up to the mountpoint of the filesystem and confuse the hell out of VFS.

The patch fix both behavior by returning 0 immediately for the EBUSY threads.
Note, this will have a side effect for the second case where the VFS will
retry several times before returning ELOOP.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4018
2015-11-19 15:36:59 -08:00
Brian Behlendorf
3d8d245fb3 Follow 0/-E convention for module load errors
Because errors during module load are so rare it went unnoticed that
it was possible that a positive errno was returned.  This would result
in the module being loaded, nothing being initialized, and a system
panic shortly thereafter.  This is what was causing the hard failures
in the automated testing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-11-16 16:10:06 -08:00
AndCycle
256fa983f4 Obey arc_meta_limit default size when changing arc_max
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit.  Otherwise, the arc_meta_limit
may be set the same as arc_max.

Signed-off-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4001
2015-11-13 15:45:22 -08:00
Chunwei Chen
07d63f0cb9 Fix fail path in zfs_znode_alloc
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().

Since we never call insert_inode_locked in fail path, I_NEW is never set, the
inode is never hashed. So unlock_new_inode() can be safely remove it.

We set z_sa_hdl to NULL in fail path so that iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent double sa_handle_destroy().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
2015-10-13 15:57:17 -07:00
Chunwei Chen
aa159afb56 Fix use-after-free in vdev_disk_physio_completion
Currently, vdev_disk_physio_completion will try to wake up an waiter without
first checking the existence. This creates a race window in which complete is
called after dr is freed.

We add dr_wait in dio_request to indicate the existence of waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact in 64bit.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3917
Issue #3880
2015-10-13 15:25:33 -07:00
Justin T. Gibbs
bc4501f75a Illumos 6267 - dn_bonus evicted too early
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6267
  https://github.com/illumos/illumos-gate/commit/d205810

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Ned Bass bass6@llnl.gov
Issue #3865
Issue #3443
2015-10-13 14:12:02 -07:00