Commit Graph

1290 Commits

Author SHA1 Message Date
tuxoko
d0c614ecf9 Fix null pointer in arc_kmem_reap_now on 32-bit
On 32 bit system, zio_buf_cache is limit to 1M. Larger than that is all NULL.
So we need to avoid reaping them.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3973
2015-11-19 16:01:47 -08:00
Chunwei Chen
d287880afd Fix snapshot automount behavior when concurrent or fail
When concurrent threads accessing the snapdir, one will succeed the user
helper mount while others will get EBUSY. However, the original code treats
those EBUSY threads as success and goes on to do zfsctl_snapshot_add, which
causes repeated avl_add and thus panic.

Also, if the snapshot is already mounted somewhere else, a thread accessing
the snapdir will also get EBUSY from user helper mount. And it will cause
strange things as doing follow_down_one will fail and then follow_up will jump
up to the mountpoint of the filesystem and confuse the hell out of VFS.

The patch fix both behavior by returning 0 immediately for the EBUSY threads.
Note, this will have a side effect for the second case where the VFS will
retry several times before returning ELOOP.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4018
2015-11-19 15:36:59 -08:00
Brian Behlendorf
3d8d245fb3 Follow 0/-E convention for module load errors
Because errors during module load are so rare it went unnoticed that
it was possible that a positive errno was returned.  This would result
in the module being loaded, nothing being initialized, and a system
panic shortly thereafter.  This is what was causing the hard failures
in the automated testing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-11-16 16:10:06 -08:00
AndCycle
256fa983f4 Obey arc_meta_limit default size when changing arc_max
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit.  Otherwise, the arc_meta_limit
may be set the same as arc_max.

Signed-off-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4001
2015-11-13 15:45:22 -08:00
Chunwei Chen
07d63f0cb9 Fix fail path in zfs_znode_alloc
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursive deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().

Since we never call insert_inode_locked in fail path, I_NEW is never set, the
inode is never hashed. So unlock_new_inode() can be safely remove it.

We set z_sa_hdl to NULL in fail path so that iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent double sa_handle_destroy().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
2015-10-13 15:57:17 -07:00
Chunwei Chen
aa159afb56 Fix use-after-free in vdev_disk_physio_completion
Currently, vdev_disk_physio_completion will try to wake up an waiter without
first checking the existence. This creates a race window in which complete is
called after dr is freed.

We add dr_wait in dio_request to indicate the existence of waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact in 64bit.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3917
Issue #3880
2015-10-13 15:25:33 -07:00
Justin T. Gibbs
bc4501f75a Illumos 6267 - dn_bonus evicted too early
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/6267
  https://github.com/illumos/illumos-gate/commit/d205810

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Ned Bass bass6@llnl.gov
Issue #3865
Issue #3443
2015-10-13 14:12:02 -07:00
Brian Behlendorf
935434ef01 Fix 'arc_c < arc_c_min' panic
Strictly enforce keeping 'arc_c >= arc_c_min'.  The ASSERTs are
left in place to catch this in a debug build but logic has been
added to gracefully handle in a production build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3904
2015-10-13 09:23:35 -07:00
Richard Yao
919efe93cb zfs_inode_update should not call dmu_object_size_from_db under spinlock
We should never block when holding a spin lock, but zfs_inode_update can
block in the critical section of a spin lock in zfs_inode_update:

zfs_inode_update -> dmu_object_size_from_db -> zrl_add -> mutex_enter

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
2015-09-30 10:47:40 -07:00
Richard Yao
bc8ffb2d08 Remove obsolete zv_lock
All users of zv_lock were removed by 37f9dac, but we forgot to remove
it.  Lets remove it as clean up.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
2015-09-30 10:43:19 -07:00
Brian Behlendorf
a3000f9358 Revert "dmu_objset_userquota_get_ids uses dn_bonus unsafely"
This reverts commit 5f8e1e8505.  It
was determined that this patch introduced the quota regression
described in #3789.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
Issue #3789
2015-09-25 12:50:24 -07:00
Brian Behlendorf
5592404784 Fix synchronous behavior in __vdev_disk_physio()
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which would result in a RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
2015-09-25 12:47:31 -07:00
Brian Behlendorf
ef5b2e1048 Avoid blocking in arc_reclaim_thread()
As described in the comment above arc_reclaim_thread() it's critical
that the reclaim thread be careful about blocking.  Just like it must
never wait on a hash lock, it must never wait on a task which can in
turn wait on the CV in arc_get_data_buf().  This will deadlock, see
issue #3822 for full backtraces showing the problem.

To resolve this issue arc_kmem_reap_now() has been updated to use the
asynchronous arc prune function.  This means that arc_prune_async()
may now be called while there are still outstanding arc_prune_tasks.
However, this isn't a problem because arc_prune_async() already
keeps a reference count preventing multiple outstanding tasks per
registered consumer.  Functionally, this behavior is the same as
the counterpart illumos function dnlc_reduce_cache().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3808
Issue #3834
Issue #3822
2015-09-25 12:45:47 -07:00
Brian Behlendorf
04870568e6 Disable zpl_nr_cached_objects() callback
The zpl_nr_cached_objects() function has been disabled because in the
current code it doesn't provide any critical functionality and it may
result in a deadlock under certain circumstances.  However, because
we expect to need these hooks in the future this code has not been
entirely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3719
2015-09-25 12:45:42 -07:00
Brian Behlendorf
d4787d55ad Allow NFS activity to defer snapshot unmounts
Accessing a snapshot via NFS should cause an auto-unmount of that
snapshot to be deferred until such as time as the snapshot is idle.
This is analogous to the zpl_revalidate logic employed by locally
mounted snapshots.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3794
2015-09-25 12:45:38 -07:00
Lukas Wunner
784a7fe5d9 Linux 4.3 compat: bio_end_io_t / BIO_UPTODATE
Commit torvalds/linux@4246a0b63b
("block: add a bi_error field to struct bio") dropped the error
argument from bio_endio in favor of newly introduced bio->bi_error.
This also replaces bio->bi_flags value BIO_UPTODATE.

bio_endio was a 3 argument function until Linux 2.6.24, which made it
a 2 argument function, and now the prototype has changed yet again to
a 1 argument function. Support for pre 2.6.24 kernels was already
dropped with 37f9dac592 ("zvol processing should use struct bio")
which assumed the 2 argument version in zvol_request(). Remaining code
to support the 3 argument version is hereby removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Issue #3799
2015-09-25 12:44:54 -07:00
Ned Bass
3af56fd95f Honor xattr=sa dataset property
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:

* Libzfs has historically included "xattr" in its list of default mount
  options. This overrides the dataset property, so the dataset is always
  configured to use directory-based xattrs even when the xattr dataset
  property is set to off or sa. Address this by removing "xattr" from
  the set of default mount options in libzfs.

* There was no way to enable system attribute-based extended attributes
  using temporary mount options. Add the mount options "saxattr" and
  "dirxattr" which enable the xattr behavior their names suggest.  This
  approach has the advantages of mirroring the valid xattr dataset
  property values and following existing conventions for mount option
  names.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3787
2015-09-19 14:04:14 -07:00
Brian Behlendorf
66aad10ce8 Fix NULL as mount(2) syscall data parameter
Passing NULL for the mount data should not result in EINVAL.  It
should be treated as if an empty string were passed and succeed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3771
2015-09-19 14:03:01 -07:00
Richard Yao
f52ebcb3eb Discard on zvols should not exceed the length of a block
37f9dac592 replaced the end-start
calculation with a cached value, but neglected to update it on discard
operations. This can cause us to discard data not requested, causing
data loss on zvols.

Reported-by: Richard Connon <richard.connon@zynstra.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3798
2015-09-19 14:00:14 -07:00
Arne Jansen
4e0f33ffe0 Illumos 6214 - zpools going south
6214 zpools going south
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>

References:
  https://www.illumos.org/issues/6214
  http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/

Porting Notes:

Reintroduce b_compress to the l2arc_buf_hdr_t.  In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t.  This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held.  See Illumos 6214
for a detailed analysis of the race.

HDR_GET_COMPRESS() macro was removed from arc_buf_info().

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3757
2015-09-11 11:14:38 -07:00
Brian Behlendorf
9965059ab9 Prefetch start and end of volumes
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume.  Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
2015-09-09 14:38:29 -07:00
Richard Yao
8198d18ca7 Reintroduce IO accounting on zvols on Linux 3.19+
zfsonlinux/zfs@e20cd6f7a8 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503b provided
suitable symbols for restoring this functionality 4 months later.  We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3741
2015-09-09 09:29:24 -07:00
Brian Behlendorf
3b36f8319d Add dbgmsg kstat
Internally ZFS keeps a small log to facilitate debugging.  By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1.  The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.

$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp    message
1441141408   spa=tank async request task=1
1441141408   txg 70 open pool version 5000; software version 5000/5; ...
1441141409   spa=tank async request task=32
1441141409   txg 72 import pool version 5000; software version 5000/5; ...
1441141414   command: lt-zpool import tank

Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log.  As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat.  For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.

$ ZFS_DEBUG=on ./cmd/ztest/ztest -V

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3728
2015-09-04 16:08:14 -07:00
Brian Behlendorf
0500e835af Support accessing .zfs/snapshot via NFS
This patch is based on the previous work done by @andrey-ve and
@yshui.  It triggers the automount by using kern_path() to traverse
to the known snapshout mount point.  Once the snapshot is mounted
NFS can access the contents of the snapshot.

Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server.  To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally.  System administators desiring this
functionally must explicitly enable it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2797
Closes #1655
Closes #616
2015-09-04 13:23:53 -07:00
Andrey Vesnovaty
aa9b27080b Fix invalid fileid for snapshot root dentry
Prevents NFS client from detection of different fileids of snapshot root dentry
before & after snapshot mount.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-04 13:23:06 -07:00
Brian Behlendorf
e20cd6f7a8 Merge branch 'zvol'
Performance improvements for zvols.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3720
2015-09-04 13:14:21 -07:00
Richard Yao
fa56567630 Support secure discard on zvols
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:37:24 -04:00
Richard Yao
37f9dac592 zvol processing should use struct bio
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.

Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.

Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:

1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.

An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat.  Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.

Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2015-09-04 15:30:24 -04:00
Tim Chase
dca8c34da4 Prevent reclaim in the traverse prefetch thread
Reclaim in the traverse prefetch thread, which is run on the system
taskq, can overrun the stack.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3733
2015-09-04 08:43:28 -07:00
Brian Behlendorf
0282c4137e Add temporary mount options
Add the required kernel side infrastructure to parse arbitrary
mount options.  This enables us to support temporary mount
options in largely the same way it is handled on other platforms.

See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #985
Closes #3351
2015-09-03 14:14:55 -07:00
Tim Chase
69de34219a Dbuf hash table should be sized as is the arc hash table
Commit 49ddb31506 added the
zfs_arc_average_blocksize parameter to allow control over the size of
the arc hash table.  The dbuf hash table's size should be determined
similarly.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3721
2015-09-02 09:33:02 -07:00
Brian Behlendorf
6cde64351e Add spa_slop_shift module option
Allow for easy turning of a pools reserved free space.  Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity
in reserve.  Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3724
2015-09-02 09:30:18 -07:00
Richard Yao
fb40095f5f Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
2015-09-01 15:22:07 -07:00
tuxoko
cafbd2aca3 Check for RW_WRITE_HELD in zfs_inactive
Before read locking z_teardown_inactive_lock, we need to check if we have
already had write lock on it. Otherwise, we would deadlock on ourself when
doing rollback:

zfs_ioc_rollback
->zfs_suspend_fs (z_teardown_inactive_lock, RW_WRITER)
->zfs_resume_fs->zfs_rezget->zfs_iput_async->iput-> ...
  ->zfs_inactive (z_teardown_inactive_lock, RW_READER)

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2869
2015-09-01 10:17:57 -07:00
Brian Behlendorf
324dcd3733 Linux 4.2 compat: misc_deregister()
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels.  It was only used to log
an informational error message of no real value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-01 09:33:18 -07:00
Brian Behlendorf
278bee9319 Linux 3.18 compat: Snapshot auto-mounting
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels.  And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients.  This patch makes
the following core improvements.

* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.

* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process.  This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.

* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.

* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds.  This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.

* Snapshots in active use will not be automatically unmounted.  As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.

* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated.  This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots.  This patch modifies the
behavior such that only negative dentries are invalidated.  This solves
the issue and may result in a performance improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3589
Closes #3344
Closes #3295
Closes #3257
Closes #3243
Closes #3030
Closes #2841
2015-08-31 13:54:39 -07:00
Andrey Vesnovaty
b23975cbe0 zfsctl: No need to sync ctldir inodes
There's no metadata to write to disk for ctldir inodes. So we check if
a inode belongs to the ctldir in zpl_commit_metadata, and returns
immediately if it is.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2797
2015-08-31 13:54:39 -07:00
Richard Yao
c6a3a222d3 Clear QUEUE_FLAG_ADD_RANDOM on zvols
zvols should not be an entropy source for the kernel. Disable it to be
consistent with the upstream kernel.

torvalds/linux@b277da0a8a

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3713
2015-08-30 10:11:57 -07:00
loli10K
3757bff3b1 Fix small typo
Add a missing space to the zfs_vdev_sync_write_min_active module
parameter description.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3714
2015-08-30 10:10:16 -07:00
Tim Chase
36b454ab4c Initialize the taskq entry embedded within struct vdev
As part of the stack reduction effort in
50b25b2187, a zio_t containing a taskq_ent
was added to struct vdev_queue which itself is part of struct vdev.
The taskq entry should be initialized as is currently done in zio_create()
for newly-created bare zio_t object.  The rationale is the same as is
described in f467b05a26.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3709
2015-08-30 10:04:56 -07:00
Tim Chase
d439f63ff5 Allow recovery from corrupted snapshot maps
If the ZAP object containing a snapshot map is corrupted due to an
unrecoverable checksum error or otherwise, dsl_dataset_name() will
normally panic the system due to its VERIFY.

This patch attempts to allow a recovery avenue from such situations by
manufacturing a descriptive snapshot name and then ignoring the error.
Scrubbing a pool with this type of corruption will then show the affected
object in the error list rather than panicking.

The recovery code is only enabled when the zfs_recover module parameter
is set.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3705
2015-08-28 11:56:32 -07:00
Brian Behlendorf
4cb7b9c5d4 Check large block feature flag on volumes
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create.  Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.

In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3591
2015-08-28 09:25:03 -07:00
Brian Behlendorf
c495fe2c1c Limit max_hw_sectors_kb to 16M
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.

This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
64K array for the dbufs.  The solution is to restore the maximum
size to ~10M.  This patch specifically changes it to 16M which is
close enough.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3684
2015-08-28 09:16:59 -07:00
Chunwei Chen
5475aada94 Linux 4.1 compat: loop device on ZFS
Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into
iter_read/iter_write. Notably, the loop device will pass bio_vec to backend
filesystem. However, current ZFS code assumes iovec without any check, so it
will always crash when using loop device.

With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC
set. The uio* functions are modified to handle bio_vec case separately.

The const uio_iov causes some warning in xuio related stuff, so explicit
convert them to non const.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3511
Closes #3640
2015-08-24 10:17:06 -07:00
Brian Behlendorf
efc412b645 Linux 4.2 compat: vfs_rename()
The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it in to place.  This way there can never
exist a partial version of the file, it's all or nothing.

Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms.  Even doing
basic I/O to a file system from within the kernel is strongly
discouraged.  In order to support this at all the vn_rename()
implementation ends up being complex and fragile.  So fragile
that recent Linux 4.2 changes have broken it.

While it is possible to update vn_rename() to work with the
latest kernels a better long term strategy is to stop using
vn_rename() entirely.  Then all this complex, fragile code can
be removed.  Achieving this is straight forward because
config_write() is the only consumer of vn_rename().

This patch reworks spa_config_write() to update the cache file
in place.  The file will be truncated, written out, and then
synced to disk.  If an error is encountered the file will be
unlinked leaving the system in a consistent state.

This does expose a tiny tiny tiny window where a system could
crash at exactly the wrong moment could leave a partially written
cache file.  However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.

If this were to somehow happen it poses no risk to pool.  Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.

Bottom line not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3653
2015-08-19 16:04:33 -07:00
Brian Behlendorf
ff9b1d0725 Handle zap_lookup() failure in ddt_object_load()
Failing to lookup a name in the spa_ddt_stat_object should not result
in a panic in ddt_object_load().  The error can be safely returned to
the caller for handling resulting in a useful user error message.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3370
2015-08-19 14:32:50 -07:00
tuxoko
6d79eabf9f Add parenthesis to the ternary operator
Without the parenthesis, this particular ASSERT will evaluate to
"(RW_READER == (!zap->zap_ismicro && fatreader)) ? RW_READER : lti"

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3685
2015-08-19 11:28:41 -07:00
Brian Behlendorf
7e8bddd019 Update arc_memory_throttle() to check pageout
This brings the behavior of arc_memory_throttle() back in sync with
illumos.  The updated memory throttling policy roughly goes like this:

* Never throttle if more than 10% of memory is free.  This threshold
  is configurable with the zfs_arc_lotsfree_percent module option.

* Minimize any throttling of kswapd even when free memory is below
  the set threshold.  Allow it to write out pages as quickly as
  possible to help alleviate the memory pressure.

* Delay all other threads when free memory is below the set threshold
  in order to avoid compounding the memory pressure.  Buffers will be
  evicted from the ARC to reduce the issue.

The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3637
2015-07-30 11:52:12 -07:00
Brian Behlendorf
11f552fa90 Update arc_available_memory() to check freemem
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages.  This information should
be incorporated in to the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim.  Conceptually
this brings arc_available_memory() back in sync with illumos.

It is also desirable that the target amount of free memory be tunable
on a system.  While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.

zfs_arc_sys_free - The target number of bytes the ARC should leave
                   as free memory on the system.  This value can
                   checked in /proc/spl/kstat/zfs/arcstats and
                   setting this module option will override the
                   default value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3637
2015-07-30 11:50:22 -07:00
Brian Behlendorf
6339c1b9dc Bound zvol_threads module option
The zvol_threads module option should be bounded to a reasonable
range.  The taskq must have at least 1 thread and shouldn't have
more than 1,024 at most.  The default value of 32 is a reasonable
default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3614
2015-07-29 07:42:11 -07:00
Chunwei Chen
21a96fb635 Fix "BUG: Bad page state" caused by writeback flag
Commit d958324 fixed the deadlock between page lock and range lock by
unlocking the page lock before acquiring the range lock. However,
this created a new issue #3075.

The problem is that if we can't set the write back bit before releasing
the page lock.  Then other processes will be unaware that the page is
under active write back.  They may therefore truncate the page,
invalidate the page, or not honor the sync semantics.

To workaround this problem we re-dirty the page before dropping the
page lock.  While this doesn't prevent the page from being truncated
it does ensure it won't be invalidated.  Then the range lock and the
page lock are reacquired in the correct deadlock-free order.

Once both locks are safely held the page state can be rechecked.  If
all is well and the page is in the expect state the dirty bit can be
removed, the write back bit set, and the page removed from the skip
count.  If not the page will be handled as appropriate.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3075
2015-07-29 07:38:15 -07:00
Brian Behlendorf
1229323d5f Align thread priority with Linux defaults
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority.  Non-I/O filesystem
processes run with the default priority.  ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active.  The priorities have been updated to the following:

$ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
     -  TS 10743  19 -20 [spl_kmem_cache]
     -  TS 10744  19 -20 [spl_system_task]
     -  TS 10745  19 -20 [spl_dynamic_tas]
     -  TS 10764  19   0 [dbu_evict]
     -  TS 10765  19   0 [arc_prune]
     -  TS 10766  19   0 [arc_reclaim]
     -  TS 10767  19   0 [arc_user_evicts]
     -  TS 10768  19   0 [l2arc_feed]
     -  TS 10769  39   0 [z_unmount]
     -  TS 10770  39 -20 [zvol]
     -  TS 11011  39 -20 [z_null_iss]
     -  TS 11012  39 -20 [z_null_int]
     -  TS 11013  39 -20 [z_rd_iss]
     -  TS 11014  39 -20 [z_rd_int_0]
     -  TS 11022  38 -19 [z_wr_iss]
     -  TS 11023  39 -20 [z_wr_iss_h]
     -  TS 11024  39 -20 [z_wr_int_0]
     -  TS 11032  39 -20 [z_wr_int_h]
     -  TS 11033  39 -20 [z_fr_iss_0]
     -  TS 11041  39 -20 [z_fr_int]
     -  TS 11042  39 -20 [z_cl_iss]
     -  TS 11043  39 -20 [z_cl_int]
     -  TS 11044  39 -20 [z_ioctl_iss]
     -  TS 11045  39 -20 [z_ioctl_int]
     -  TS 11046  39 -20 [metaslab_group_]
     -  TS 11050  19   0 [z_iput]
     -  TS 11121  38 -19 [z_wr_iss]

Note that under Linux the meaning of a processes priority is inverted
with respect to illumos.  High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.

In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted.  This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.

This patch depends on https://github.com/zfsonlinux/spl/pull/466.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3607
2015-07-28 13:36:47 -07:00
Brian Behlendorf
c97d30691c Check for NULL in dmu_free_long_range_impl()
A NULL should never be passed as the dnode_t pointer to the function
dmu_free_long_range_impl().  Regardless, because we have a reported
occurrence of this let's add some error handling to catch this.
Better to report a reasonable error to caller than panic the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3445
2015-07-28 13:30:53 -07:00
Brian Behlendorf
96c080cb9c Minor style cleanup
Address minor differences in style between upstream and ZoL.  This
patch contains no functional differences and is solely designed to
minimize the delta from upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:54 -07:00
Brian Behlendorf
3056818343 Remove double counting HDR_L2ONLY_SIZE
Commit d962d5d didn't quite properly resolve the HDR_L2ONLY_SIZE
accounting.  Accounting is now performed only in the constructor
and destructor which is a nice simplification.  It should have
been removed the from create and destroy functions.  This brings
up back in sync with upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:44 -07:00
Brian Behlendorf
8c8af9d807 Add hdr_recl() reclaim callback
Originally removed because it wasn't required under Linux.  However,
there may still be some utility in signaling the arc reclaim thread
under Linux via reclaim.  This should already have happened by other
means but it's not harmless and reduces another point of divergence
with upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:40 -07:00
Brian Behlendorf
728d6ae91e Reinstate zfs_arc_p_min_shift
Commit f521ce1 removed the minimum value for "arc_p" allowing it to
drop to zero or grow to "arc_c".  This was done to improve specific
workload which constantly dirties new "metadata" but also frequently
touches a "small" amount of mfu data (e.g. mkdir's).

This change may still be desirable but it needs to be re-investigated.
in the context of the recent ARC changes from upstream.  Therefore
this code is being restored to facilitate benchmarking.  By setting
"zfs_arc_p_min_shift=64" we easily compare the performance.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:32 -07:00
Prakash Surya
36da08ef9b Illumos 5817 - change type of arcs_size from uint64_t to refcount_t
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5817
  https://github.com/illumos/illumos-gate/commit/2fd872a

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:28 -07:00
Prakash Surya
500445c046 Illumos 5445 - Add more visibility via arcstats
5445 Add more visibility via arcstats; specifically arc_state_t
stats and differentiate between "data" and "metadata"
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5445
  https://github.com/illumos/illumos-gate/commit/4076b1b

Porting Notes:

This patch is an improved version of cc7f677 which was previously
merged in ZoL.  This patch incorporates the additional improvements
which were made upstream.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:06 -07:00
Matthew Ahrens
ca67b33aba Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_grow
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5376
  https://github.com/illumos/illumos-gate/commit/2ec99e3

Porting Notes:

The good news is that many of the recent changes made upstream to the
ARC tackled issues previously observed by ZoL with similar solutions.
The bad news is those solution weren't identical to the ones we applied.
This patch is designed to split the difference and apply as much of the
upstream work as possible.

* The arc_available_memory() function was removed previous in ZoL but
due to the upstream changes it makes sense to add it back.  This function
has been customized for Linux so that it can be used to determine a low
memory.  This provides the same basic functionality as the illumos version
allowing us to minimize changes through the rest of the code base.  The
exact mechanism used to detect a low memory state remains unchanged so
this change isn't a significant as it might first appear.

* This patch includes the long standing fix for arc_shrink() which was
originally proposed in #2167.  Since there were related changes to this
function it made sense to include that work.

* The arc_init() function has been re-factored.  As before it sets sane
default values for the ARC but then calls arc_tuning_update() to apply
user specific tuning made via module options.  The arc_tuning_update()
function is then called periodically by the arc_reclaim_thread() to
apply changes to the tunings made during normal operation.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3616
Closes #2167
2015-07-23 09:41:28 -07:00
Brian Behlendorf
53b1d9794e Add logic to try and recover an inode with an invalid mode
When an inode is detected with invalid mode bits the safe thing to
do is panic the system.  This indicates a problem with the contents
of a dnode and it should never be possible.  This is the default
behavior.

Unfortunately, due to flaws in the system attribute (SA) implementation
(on all platforms) it was possible that ZFS could create a damaged dnode.
This was a rare issue which only impacted dnodes which used a spill
block.  Normally only symlinks and files with ACLs would require a
spill block.  However, if the dataset had the xattr=sa property set
and extended attributes were used this problem could occur.

As of the 0.6.4 tag the root cause of this issue has been fixed.  For
pools which are exhibiting this damage the 'zfs_recover=1' module option
may be set.  This will cause ZFS to interpret the dnode with invalid
mode bits as a normal file.  This may allow the files to be accessed
for recovery purposes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3548
2015-07-17 15:33:35 -07:00
Turbo Fredriksson
47a4a6fd5f Support parallel build trees (VPATH builds)
Build products from an out of tree build should be written
relative to the build directory.  Sources should be referred
to by their locations in the source directory.

This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile.  This enables the following:

  $ mkdir build
  $ cd build
  $ ../configure \
    --with-spl=$HOME/src/git/spl/ \
    --with-spl-obj=$HOME/src/git/spl/build
  $ make -s

This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.

  Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
  Makefile.am:00: but option 'subdir-objects' is disabled

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1082
2015-07-17 13:42:51 -07:00
Brian Behlendorf
2a53e2dacc Update inode under range lock
After a successful write the inode must be updated under the range
lock.  If it is updated after dropping the lock there exists a race
where the znode and inode wile disagree about the file size.  This
could result in narrow window of time where read(2) is able to access
data beyond what fstat(2) reports as the file size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3601
2015-07-17 09:18:22 -07:00
Brian Behlendorf
bd29109f1a Linux 4.2 compat: follow_link() / put_link()
As of Linux 4.2 the kernel has completely retired the nameidata
structure.  One of the few remaining consumers of this interface
were the follow_link() and put_link() callbacks.

This patch adds the required checks to configure to detect the
interface change and updates the functions accordingly.  Migrating
to the simple_follow_link() interface was considered but was decided
against ironically due to the increased complexity.

It also should be noted that the kernel follow_link() and put_link()
interfaces changes several times after 4.1 and but before 4.2.  This
means there is a narrow range of kernel commits which never appear
in an official tag of the Linux kernel which ZoL will not build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3596
2015-07-17 09:18:16 -07:00
Brian Behlendorf
7eb333fbdd Linux 4.2 compat: remove bio->bi_cnt access
Linux 4.2 commit torvalds/linux@dac5621 renamed bio->bi_cnt to
bio->__bi_cnt.  Because this value is only used once in a block of
debug code it simplest just to remove the PANIC.  To my knowledge
this debugging has never been hit or proved useful so this is no
great loss.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3596
2015-07-17 09:16:08 -07:00
Matthew Ahrens
905edb405d Illumos 5347 - idle pool may run itself out of space
5347 idle pool may run itself out of space
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/231aab8
  https://github.com/illumos/illumos-gate/commit/4a92375 3642
  https://www.illumos.org/issues/5347
  https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix)
  https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390
  https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643

Porting notes:
This is completing the partial fix from FreeBSD

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3586
2015-07-14 10:35:21 -07:00
Alexander Eremin
1cddb8c9ff Illumos 5610 - zfs clone from different source and target pools produces coredump
5610 zfs clone from different source and target pools produces coredump
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/03b1c29
  https://www.illumos.org/issues/5610
  https://www.illumos.org/issues/5824
  https://github.com/zfsonlinux/zfs/issues/2911
  https://github.com/zfsonlinux/zfs/commit/9063f65

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3584
2015-07-14 10:27:46 -07:00
Richard Yao
0de7c552b6 Failure of userland copy should return EFAULT
Many key internal functions pass system return codes that are safe to
return to userland. In the case of ddi_copyin(9F), an error passes -1
and the documentation states very clearly that drivers should pass
EFAULT to userland when this happens.

http://illumos.org/man/9F/ddi_copyin

This does not happen in the ZFS source code. I believe it should be
changed to pass EFAULT. I caught this when writing man pages for the
libzfs_core API.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3575
2015-07-14 10:20:35 -07:00
Boris Protopopov
b39c22b73c Translate sync zio to sync bio
Translate zio requests with ZIO_PRIORITY_SYNC_READ and
ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting
READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns
out to have a pronounced effect when writing to an SSD-based SLOG.

When WRITE_SYNC is not set (WRITE is set instead), the block trace
for a SLOG device looks as follows:
...
130,96   0        3     0.008968390     0  C   W 830464 + 136 [0]
130,96   0        4     0.011999161     0  C   W 830720 + 136 [0]
130,96   0        5     0.023955549     0  C   W 831744 + 136 [0]
130,96   0        6     0.024337663 19775  A   W 832000 + 136 <- (130,97) 829952
130,96   0        7     0.024338823 19775  Q   W 832000 + 136 [z_wr_iss/6]
130,96   0        8     0.024340523 19775  G   W 832000 + 136 [z_wr_iss/6]
130,96   0        9     0.024343187 19775  P   N [z_wr_iss/6]
130,96   0       10     0.024344120 19775  I   W 832000 + 136 [z_wr_iss/6]
130,96   0       11     0.026784405     0 UT   N [swapper] 1
130,96   0       12     0.026805339   202  U   N [kblockd/0] 1
130,96   0       13     0.026807199   202  D   W 832000 + 136 [kblockd/0]
130,96   0       14     0.026966948     0  C   W 832000 + 136 [0]
130,96   3        1     0.000449358 19788  A   W 829952 + 136 <- (130,97) 827904
130,96   3        2     0.000450951 19788  Q   W 829952 + 136 [z_wr_iss/19]
130,96   3        3     0.000453212 19788  G   W 829952 + 136 [z_wr_iss/19]
130,96   3        4     0.000455956 19788  P   N [z_wr_iss/19]
130,96   3        5     0.000457076 19788  I   W 829952 + 136 [z_wr_iss/19]
130,96   3        6     0.002786349     0 UT   N [swapper] 1
...

Here the 130,197 is the partition created on the log device when adding it
to the pool, whereas the base device is 130,96. As one can see, the writes
to the SLOG are not marked synchronous (the S is missing next to W), and
the queue unplugs occur based on the timer (UT event) resulting in slightly
over 2 msec latency of writes. This results in a sub-par performance of
single stream synchronous writes (limited by latency of the SLOG).

When the WRITE_SYNC is set, a similar trace looks as follows:
...
130,96   4        1     0.000000000 70714  A  WS 4280576 + 136 <- (130,97) 4278528
130,96   4        2     0.000000832 70714  Q  WS 4280576 + 136 [(null)]
130,96   4        3     0.000002109 70714  G  WS 4280576 + 136 [(null)]
130,96   4        4     0.000003394 70714  P   N [(null)]
130,96   4        5     0.000003846 70714  I  WS 4280576 + 136 [(null)]
130,96   4        6     0.000004854 70714  D  WS 4280576 + 136 [(null)]
130,96   5        1     0.000354487 70713  A  WS 4280832 + 136 <- (130,97) 4278784
130,96   5        2     0.000355072 70713  Q  WS 4280832 + 136 [(null)]
130,96   5        3     0.000356383 70713  G  WS 4280832 + 136 [(null)]
130,96   5        4     0.000357635 70713  P   N [(null)]
130,96   5        5     0.000358088 70713  I  WS 4280832 + 136 [(null)]
130,96   5        6     0.000359191 70713  D  WS 4280832 + 136 [(null)]
130,96   0       76     0.000159539     0  C  WS 4280576 + 136 [0]
130,96  16       85     0.000742108 70718  A  WS 4281088 + 136 <- (130,97) 4279040
130,96  16       86     0.000743197 70718  Q  WS 4281088 + 136 [z_wr_iss/15]
130,96  16       87     0.000744450 70718  G  WS 4281088 + 136 [z_wr_iss/15]
130,96  16       88     0.000745817 70718  P   N [z_wr_iss/15]
130,96  16       89     0.000746705 70718  I  WS 4281088 + 136 [z_wr_iss/15]
130,96  16       90     0.000747848 70718  D  WS 4281088 + 136 [z_wr_iss/15]
130,96   0       77     0.000604063     0  C  WS 4280832 + 136 [0]
130,96   0       78     0.000899858     0  C  WS 4281088 + 136 [0]

As one can see, all the writes are synchronous (WS), and I/O completions
(e.g. from issue I to completion C) take 160-250 usec, or about 10x faster.

Since WRITE_SYNC or READ_SYNC flags are among several factors that are
considered when processing bio requests, it seems prudent to mark all the
zio requests of synchronous priority with the READ/WRITE_SYNC flags to make
them eligible for consideration as such by the Linux block I/O layer.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3529
2015-07-13 14:28:50 -07:00
Brian Behlendorf
2b7b78fa5d Fix switch-bool warning
As of gcc version 5.1.1 a new warning has been added to detect the
use of a boolean in a switch statement (-Wswitch-bool).  Resolve the
warning by explicitly casting the value to an integer type.

  zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request':
  error: switch condition has boolean value [-Werror=switch-bool]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-07-13 13:03:01 -07:00
Justin T. Gibbs
99197f034e Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://github.com/illumos/illumos-gate/commit/db1741f
  https://www.illumos.org/issues/5661

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3571
2015-07-10 12:11:45 -07:00
Tim Chase
1cd777340b Prevent reclaim in metaslab preload threads
Reclaim during metaslab preloading can cause deadlocks involving znode
z_lock and ARC buffer header ht_lock.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3532.
2015-07-06 09:36:13 -07:00
Alexander Motin
e16b3fcc61 Illumos 5008 - lock contention (rrw_exit) while running a read only load
5008 lock contention (rrw_exit) while running a read only load
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Porting notes:

This patch ported perfectly cleanly to ZoL.  During testing 100% cached
small-block reads, extreme contention was noticed on rrl->rr_lock from
rrw_exit() due to the frequent entering and leaving ZPL.  Illumos picked
up this patch from FreeBSD and it also helps under Linux.

On a 1-minute 4K cached read test with 10 fio processes pinned to a single
socket on a 4-socket (10 thread per socket) NUMA system, contentions on
rrl->rr_lock were reduced from 508799 to 43085.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3555
2015-07-06 09:34:13 -07:00
Matthew Ahrens
4bda3bd0e7 Illumos 5911 - ZFS "hangs" while deleting file
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5911
  https://github.com/illumos/illumos-gate/commit/46e1baa

Porting notes:

Resolved ISO C90 forbids mixed declarations and code wanting in
the dnode_free_range() function.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3554
2015-07-06 09:31:42 -07:00
Arne Jansen
5e8cd5d17f Illumos 5981 - Deadlock in dmu_objset_find_dp
5981 Deadlock in dmu_objset_find_dp
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5981
  https://github.com/illumos/illumos-gate/commit/1d3f896

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3553
2015-07-06 09:31:35 -07:00
Andriy Gapon
71e2fe41be Illumos 5946, 5945
5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots
5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/5946
  https://www.illumos.org/issues/5945
  https://github.com/illumos/illumos-gate/commit/24218be

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3552
2015-07-06 09:31:30 -07:00
Andriy Gapon
b6640117f0 Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch
5870 dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5870
  https://github.com/illumos/illumos-gate/commit/beddaa9

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3551
2015-07-06 09:22:18 -07:00
Andriy Gapon
fec417097b Illumos 5909 - ensure that shared snap names don't become too long after promotion
5909 ensure that shared snap names don't become too long after promotion
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5909
  https://github.com/illumos/illumos-gate/commit/cb5842f

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3550
2015-07-06 09:21:30 -07:00
Andriy Gapon
cf50a2b08f Illumos 5912 - full stream can not be force-received into a dataset if it has a snapshot
5912 full stream can not be force-received into a dataset if it has a snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5912
  https://github.com/illumos/illumos-gate/commit/5bae108

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3549
2015-07-06 09:20:18 -07:00
Alek Pinchuk
a7b10a9319 Illumos 6033 - arc_adjust() should search MFU lists
6033 arc_adjust() should search MFU lists for oldest
buffer when adjusting MFU size
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Xin Li <delphij@delphij.net>
Reviewed by: Prakash Surya <me@prakashsurya.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/6033
  https://github.com/illumos/illumos-gate/commit/31c46cf

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3545
2015-07-01 11:09:15 -07:00
Matthew Ahrens
804e050457 Illumos 5175 - implement dmu_read_uio_dbuf() to improve cached read performance
5175 implement dmu_read_uio_dbuf() to improve cached read performance
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5175
  https://github.com/illumos/illumos-gate/commit/f8554bb

Porting notes:

This patch doesn't include the changes for the COMSTAR (Common
Multiprotocol SCSI Target) - since it's not available for ZoL.

http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3392
2015-06-29 14:33:23 -07:00
Matthew Ahrens
c52fca13a0 Illumos 5368 - ARC should cache more metadata
5368 ARC should cache more metadata
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5368
  https://github.com/illumos/illumos-gate/commit/3a5286a

Porting Notes:

The vast majority of this patch was already merged in the context
of the 06358ea changes.  This is just a small hunk which was missed.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-25 08:58:17 -07:00
George Wilson
669dedb33f Illumos 5163 - arc should reap range_seg_cache
5163 arc should reap range_seg_cache
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5163
  https://github.com/illumos/illumos-gate/commit/83803b5

Porting Notes:

Added umem_cache_reap_now() wrapped to suppress unused variable
warning for user space build in arc_kmem_reap_now().

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-25 08:58:16 -07:00
Brian Behlendorf
aa9af22cdf Update all default taskq settings
Over the years the default values for the taskqs used on Linux have
differed slightly from illumos.  In the vast majority of cases this
was done to avoid creating an obnoxious number of idle threads which
would pollute the process listing.

With the addition of support for dynamic taskqs all multi-threaded
queues should be created as dynamic taskqs.  This allows us to get
the best of both worlds.

* The illumos default values for the I/O pipeline can be restored.
These values are known to work well for most workloads.  The only
exception is the zio write interrupt taskq which is changed to
ZTI_P(12, 8).  At least under Linux more threads has been shown
to improve performance, see commit 7e55f4e.

* Reduces the number of idle threads on the system when it's not
under heavy load.  The maximum number of threads will only be
created when they are required.

* Remove the vdev_file_taskq and rely on the system_taskq instead
which is now dynamic and may have up to 64-threads.  Again this
brings us back inline with upstream.

* Tasks dispatched with taskq_dispatch_ent() are allowed to use
dynamic taskqs.  The Linux taskq implementation supports this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3507
2015-06-25 08:58:16 -07:00
Andriy Gapon
ef56b0780c Account for ashift when gathering buffers to be written to l2arc device
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a9.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a9
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3400
Closes #3433
Closes #3451
2015-06-25 08:57:16 -07:00
Prakash Surya
d962d5dad9 Illumos 5701 - zpool list reports incorrect "alloc" value for cache devices
5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5701
  https://github.com/illumos/illumos-gate/commit/a52fc31

Porting Notes:

arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr).

Ported by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-25 08:51:44 -07:00
Richard Yao
72540ea314 zfsdev_getminor() should check for invalid file handles
Unit testing at ClusterHQ found that passing an invalid file handle to
zfs_ioc_hold results in a NULL pointer dereference on a system without
assertions:

IP: [<ffffffffa0218aa0>] zfsdev_getminor+0x10/0x20 [zfs]
Call Trace:
[<ffffffffa021b4b0>] zfs_onexit_fd_hold+0x20/0x40 [zfs]
[<ffffffffa0214043>] zfs_ioc_hold+0x93/0xd0 [zfs]
[<ffffffffa0215890>] zfsdev_ioctl+0x200/0x500 [zfs]

An assertion would have caught this had they been enabled, but this is
something that the kernel module should handle without failing.  We
resolve this by searching the linked list to ensure that the file
handle's private_data points to a valid zfsdev_state_t.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3506
2015-06-22 17:02:13 -07:00
Etienne Dechamps
99b14de421 Make metaslab_aliquot a module parameter.
This seems generally useful. metaslab_aliquot is the ZFS allocation
granularity, which is roughly equivalent to what is called the stripe
size in traditional RAID arrays. It seems relevant to performance
tuning.

Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-22 14:19:38 -07:00
Etienne Dechamps
e8fe6684a5 Document metaslab_aliquot.
Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-22 14:19:31 -07:00
Etienne Dechamps
bb3250d07e Allocate disk space fairly in the presence of vdevs of unequal size.
The metaslab allocator device selection algorithm contains a bias
mechanism whose goal is to achieve roughly equal disk space usage across
all top-level vdevs.

It seems that the initial rationale for this code was to allow newly
added (empty) vdevs to "come up to speed" faster in an attempt to make
the pool quickly converge to a steady state where all vdevs are equally
utilized.

While the code seems to work reasonably well for this use case, there
is another scenario in which this algorithm fails miserably: the case
where top-level vdevs don't have the same sizes (capacities). ZFS
allows this, and it is a good feature to have, so that users who simply
want to build a pool with the disks they happen to have lying around can
do so even if the disks have heteregenous sizes.

Here's a script that simulates a pool with two vdevs, with one 4X larger
than the other:

    dd if=/dev/zero of=/tmp/d1 bs=1 count=1 seek=134217728
    dd if=/dev/zero of=/tmp/d2 bs=1 count=1 seek=536870912
    zpool create testspace /tmp/d1 /tmp/d2
    dd if=/dev/zero of=/testspace/foobar bs=1M count=256
    zpool iostat -v testspace

Before this commit, the script would output the following:

                   capacity
    pool        alloc   free
    ----------  -----  -----
    testspace    252M   375M
      /tmp/d1    104M  18.5M
      /tmp/d2    148M   356M
    ----------  -----  -----

This demonstrates that the current code handles this situation very
poorly: d1 shows 85% usage despite the pool itself being only 40% full.
d1 is quite saturated at this point, and is slowing down the entire pool
due to saturation, fragmentation and the like.

In contrast, here's the result with the code in this commit:

                   capacity
    pool        alloc   free
    ----------  -----  -----
    testspace    252M   375M
      /tmp/d1   56.7M  66.3M
      /tmp/d2    195M   309M
   ----------  -----  ------

This looks much better. d1 is 46% used, which is close to the overall
pool utilization (40%). The code still doesn't result in perfectly
balanced allocation, probably because of the way mg_bias is applied
which does not guarantee perfect accuracy, but this is still much better
than before.

Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3389
2015-06-22 14:18:29 -07:00
Brian Behlendorf
218b4e0a76 Add zfs_sb_prune_aliases() function
For kernels which do not implement a per-suberblock shrinker,
those older than Linux 3.1, the shrink_dcache_parent() function
was used to attempt to reclaim dentries.  This was found not be
entirely reliable and could lead to performance issues on older
kernels running meta-data heavy workloads.

To address this issue a zfs_sb_prune_aliases() function has been
added to implement this functionality.  It relies on traversing
the list of znodes for a filesystem and adding them to a private
list with a reference held.  The private list can then be safely
walked outside the z_znodes_lock to prune dentires and drop the
last reference so the inode can be freed.

This provides the same synchronous behavior as the per-filesystem
shrinker and has the advantage of depending on only long standing
interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3501
2015-06-22 10:22:49 -07:00
Brian Behlendorf
4c6a700910 Increase the number of iput taskq threads
The number of threads in the iput taskq has been increased to speed
up the number of iputs which can be handled.  This has been observed
to improve the  meta data reclaim regardless of zfs_sb_prune()
implementation in use.

The taskq has also been renamed z_iput to for consistency with the
rest of the I/O pipeline taskqs which are all named z_*.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
2015-06-22 10:22:10 -07:00
Matus Kral
57ae840077 Linux 4.1 compat: use read_iter() / write_iter()
Linux 3.15 commit torvalds/linux@293bc98 introduced two new methods.
The ->read_iter() and ->write_iter() methods were designed to replace
the ->aio_read() and ->aio_write() interfaces.  Both interfaces were
preserved for several kernel releases in order to migrate all existing
consumers to the new interfaces.  But as of Linux 4.1 the legacy
interface has been retired and the ZFS code must be updated to use
the new interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3352
2015-06-18 12:06:59 -07:00
Tim Chase
90947b2357 3.12 compat, NUMA-aware per-superblock shrinker
Kernels >= 3.12 have a NUMA-aware superblock shrinker which is used in
ZoL by zfs_sb_prune().  This patch calls the shrinker for each on-line
NUMA node in order that memory be freed for each one.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3495
2015-06-17 10:43:13 -07:00
Brian Behlendorf
8e70975f90 Wait interruptibly in prefetch thread
The Linux kernel watchdog will automatically dump a backtrace for
any process while sleeps for over 120s in an uninterruptible state.

The solution is for the prefetch thread to sleep in an interruptible
state.  The way the existing code was written this is safe because
when woken it will always reevaluate its conditional.  As a general
rule it is preferable to sleep in an interruptible when possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3450
Closes #3402
2015-06-16 16:18:11 -07:00
Brian Behlendorf
b64ccd6c52 Rename cv_wait_interruptible() to cv_wait_sig()
This is the counterpart to zfsonlinux/spl@2345368 which replaces the
cv_wait_interruptible() function with cv_wait_sig().  There is no
functional change to patch merely brings the function names in to
sync to maximize portability.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3450
Issue #3402
2015-06-11 10:50:47 -07:00
Tim Chase
121b3cae74 Increase arc_c_min to allow safe operation of arc_adapt()
ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny
systems such as the raspberry pi, however, as of addition of large block
support, the arc_adapt() function depends on arc_c being >= 32MiB (2 *
SPA_MAXBLOCKSIZE).

This patch raises the minimum ARC size to 32MiB and adds a VERIFY test
to arc_adapt() for future-proofing.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Brian Behlendorf
f604673836 Make arc_prune() asynchronous
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock.  This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous.  Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.

To address this the arc_do_user_prune() function is has been reworked
in to two new functions as follows:

* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq.  This makes it
suitable to use in the context of the arc_adapt_thread().

* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete.  This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.

This patch additionally adds the zfs_arc_meta_strategy module option
while allows the meta reclaim strategy to be configured.  It defaults
to a balanced strategy which has been proved to work well under Linux
but the illumos meta-only strategy can be enabled.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Brian Behlendorf
c5528b9ba6 Use taskq_wait_outstanding() function
Replace taskq_wait() with taskq_wait_oustanding().  This way callers
will only block until previously submitted tasks have been completed.
This was the previous behavior of task_wait() prior to the introduction
of taskq_wait_outstanding() so this isn't really a functionalty change
for these callers.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Prakash Surya
ca0bf58d65 Illumos 5497 - lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

Porting notes and other significant code changes:

The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.

Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.

The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.

Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:

    302f753 - Integrate ARC more tightly with Linux
    39e055c - Adjust arc_p based on "bytes" in arc_shrink
    f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
    77765b5 - Remove "arc_meta_used" from arc_adjust calculation
    94520ca - Prune metadata from ghost lists in arc_adjust_meta

Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:

    fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
    fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }

The following arcstats have been removed:

    recycle_miss - Used by arcstat.py and arc_summary.py, both of which
    have been updated appropriately.

    l2_writes_hdr_miss

The following arcstats have been added:

    evict_not_enough - Number of times arc_evict_state() was unable to
    evict enough buffers to reach its target amount.

    evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
    because it was being written to the l2arc.

    l2_writes_lock_retry - Replaces l2_writes_hdr_miss.  Number of times
    l2arc_write_done() failed to acquire hash_lock (and re-tries).

    arc_meta_min - Shows the value of the zfs_arc_meta_min module
    parameter (see below).

The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme.  Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.

The following module parameters have been added:

    zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
    before moving on to the next sub-list.

    zfs_arc_meta_min - Enforce a floor on the amount of metadata in
    the ARC.

    zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
    ARC state.

    zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
    the target size to be considered "overflowing".

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
2015-06-11 10:27:25 -07:00
Chris Williamson
b9541d6b7d Illumos 5408 - managing ZFS cache devices requires lots of RAM
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Porting notes:

Due to the restructuring of the ARC-related structures, this
patch conflicts with at least the following existing ZoL commits:

    6e1d7276c9
    Fix inaccurate arcstat_l2_hdr_size calculations

        The ARC_SPACE_HDRS constant no longer exists and has been
        somewhat equivalently replaced by HDR_L2ONLY_SIZE.

    e0b0ca983d
    Add visibility in to cached dbufs

        The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr
        struct requires additional structure member names to be used
        when referencing the inner items.  Also, the presence of L1 or L2
        inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR
        macros.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
George Wilson
2a4324141f Illumos 5369 - arc flags should be an enum
5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Porting notes:

ZoL has moved some ARC definitions into arc_impl.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Tim Chase <tim@chase2k.com>
2015-06-11 10:27:25 -07:00
Tim Chase
ad4af89561 Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches"
This reverts only the l2arc_hdr part of commit
ecf3d9b8e6 in preparation for the illumos
5497 "lock contention on arcs_mtx" patch which does the same thing
but uses the newer two-level ARC structure following the Illumos 5408
"managing ZFS cache devices requires lots of RAM" patch.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Tim Chase
97639d0a52 Revert "Allow arc_evict_ghost() to only evict meta data"
Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates
the need for this.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Tim Chase
f6b3b1f5d6 Revert "fix l2arc compression buffers leak"
This reverts commit 037763e44e in
preparation for the illumos 5497 "lock contention on arcs_mtx" patch
which includes a fix for this very problem.

ZoL had picked up a subset of the illumos 5497 patch to deal with the
l2arc compression buffer leak.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:24 -07:00
Tim Chase
7807028ccd Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc"
This reverts commit 16fcdea363 in preparation
for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates
"marker" within the ARC code.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:24 -07:00
Brian Behlendorf
44de2f02d6 Remove unused variable in vdev_add_child()
Commit c3520e7 restructured vdev_add_child() in such a way that
the spa variable was unused during non-debug builds.  This is
consistent with the upstream illumos code but because ZoL, unlike
illumos, is built with all compiler warnings enabled this causes
a legitimate warning.  Revert this hunk of the patch to keep the
build clean.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3432
2015-06-11 10:22:38 -07:00
Matthew Ahrens
c3520e7f1f Illumos 5818 - zfs {ref}compressratio is incorrect with 4k sector size
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Albert Lee <trisk@omniti.com>

References:
  https://www.illumos.org/issues/5818
  https://github.com/illumos/illumos-gate/commit/81cd5c5

Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3432
2015-06-10 16:24:01 -07:00
Arne Jansen
9c43027b3f Illumos 5269 - zpool import slow
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5269
  https://github.com/illumos/illumos-gate/commit/12380e1e

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3396
2015-06-09 13:48:02 -07:00
Ned Bass
5f8e1e8505 dmu_objset_userquota_get_ids uses dn_bonus unsafely
The function dmu_objset_userquota_get_ids() checks and uses dn->dn_bonus
outside of dn_struct_rwlock. If the dnode is being freed then the bonus
dbuf may be in the process of getting evicted. In this case there is a
race that may cause dmu_objset_userquota_get_ids() to access the dbuf
after it has been destroyed. To prevent this, ensure that when we are
using the bonus dbuf we are either holding a reference on it or have
taken dn_struct_rwlock.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3443
2015-06-05 12:41:17 -07:00
Ned Bass
d617648c7f dbuf_try_add_ref minor bug fixes
- Don't check db->bb_blkid, but use the blkid argument instead.
  Checking db->db_blkid may be unsafe since we doesn't yet have a
  hold on the dbuf so its validity is unknown.

- Call mutex_exit() on found_db, not db, since it's not certain that
  they point to the same dbuf, and the mutex was taken on found_db.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
2015-06-05 12:40:38 -07:00
Matthew Ahrens
e5fd1dd682 Illumos 5243 - zdb -b could be much faster
5243 zdb -b could be much faster
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5243
  https://github.com/illumos/illumos-gate/commit/f7950bf

Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3414
2015-05-15 11:14:54 -07:00
Jan Sanislo
79065ed5a4 Return -ESTALE to force lookup for missing NFS file handles
There seems to be a annoying problem using NFSv4 to access ZFS
file systems under certain circumstances.  It's easily reproduced:

    nfs_client1:  mount server:/export /mnt
    nfs_client1:  cd /mnt
    nfs_client1:  echo foo >junk
    nfs_client1:  cat junk
    foo

Now on a different NFSv4 client:

    nfs_client2:  mount server:/export /mnt
    nfs_client2:  cd /mnt
    nfs_client2:  vi junk
    # Make some changes to /mnt/junk and save
    # This change the inode associated with /mnt/junk

Now back to the original client:

    nfs_client1:  cat junk
    cat: junk: No such file or directory

Admittedly NFSv4 is not advertised as a cluster file system that
maintains a completely coherent view of data across multiple nodes.
But it does have some mechanisms built in that try to deal with
situations like the above.  Namely, it employs specialized file
handle lookup routines that return ESTALE when a file handle contains
a non-existant inode value.  The ESTALE return triggers a return
full file path lookup from the client to determine if the file has
actually gone away or if the cached file handle is no longer valid.
ZFS behavior can be brought into line with other file systems
(e.g., ext4) by applying the following patch:

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3224
2015-05-14 11:16:52 -07:00
Antonio Russo
7290cd3c4e Relax restriction on zfs_ioc_next_obj() iteration
Per the documentation for dnode_next_offset in dnode.c, the "txg"
parameter specifies a lower bound on which transaction the dnode can
be found in. We are interested in all dnodes that are removed between
the first and last transaction in the snapshot. It doesn't need to be
created in that snapshot to correspond to a removed file.

In fact, the behavior of zfs diff in the test case exactly matches
this: the transaction that created the data that was deleted in snapshot
"2" was produced before, in snapshot "1", definitely predating the first
transaction in snapshot "2".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <Tim Chase <tim@onlight.com>
Closes #2081
2015-05-14 11:16:08 -07:00
Brian Behlendorf
fd0fd6467b Remove unused 'dsl_pool_t *dp' variable
When ASSERTs are compiled out by using the --disable-debug configure
option.  Then the local variable 'dsl_pool_t *dp' will be unused and
generate a compiler warning.  Since this variable is only used once
in the ASSERT replace it with 'ds->ds_dir->dd_pool'.

This has the additional advantage of potentially saving a few bytes
on the stack depending on how gcc decides to compile the function.

This issue was not noticed immediately because the automated builders
use --enable-debug to make the testing as rigorous as possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes #3410
2015-05-14 11:09:47 -07:00
Max Grossman
5dc8b7365f Illumos 5765 - add support for estimating send stream size with lzc_send_space when source is a bookmark
5765 add support for estimating send stream size with lzc_send_space when source is a bookmark
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>

References:
  https://www.illumos.org/issues/5765
  https://github.com/illumos/illumos-gate/commit/643da460

Porting notes:
* Unused variable 'recordsize' in dmu_send_estimate() dropped

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3397
2015-05-13 09:03:59 -07:00
Justin T. Gibbs
19b3b1d2a2 Illumos 5393 - spurious failures from dsl_dataset_hold_obj()
5393 spurious failures from dsl_dataset_hold_obj()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5393
  https://github.com/illumos/illumos-gate/commit/e1f3c20

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3403
2015-05-13 08:56:04 -07:00
Justin T. Gibbs
63b33e878a Illumos 5562 - ZFS sa_handle's violate kmem invariants, debug kernels panic on boot
5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5562
  https://github.com/illumos/illumos-gate/commit/0fda3cc5

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3388
2015-05-11 15:10:57 -07:00
Matthew Ahrens
252e1a54ab Illumos 5810 - zdb should print details of bpobj
5810 zdb should print details of bpobj
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5810
  https://github.com/illumos/illumos-gate/commit/732885fc

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3387
2015-05-11 15:10:24 -07:00
Matthew Ahrens
10400bfeac Illumos 5351, 5352 - scrub pauses
5351 scrub goes for an extra second each txg
5352 scrub should pause when there is some dirty data

Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5351
  https://github.com/illumos/illumos-gate/commit/6f6a76a

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3383
2015-05-11 15:09:56 -07:00
Matthew Ahrens
08dc1b2ddd Illumos 5350 - clean up code in dnode_sync()
Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5350
  https://github.com/illumos/illumos-gate/commit/e651831

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3382
2015-05-11 15:09:51 -07:00
Alex Reece
7224c67fea Illumos 5422 - preserve AVL invariants in dn_dbufs
Author: Alex Reece <alex@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5422
  https://github.com/illumos/illumos-gate/commit/a846f19

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3381
2015-05-11 15:09:29 -07:00
David Lamparter
214806c7e9 Safely handle security / ACL failures
The security and ACL operations should all be performed atomically.
To accomplish this there would need to significant invasive changes
made to the common code base.  For the moment it's desirable for
compatibility reasons to avoid this.  Therefore the code has been
updated to attempt to unwind the operation in case of failure
rather than panic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2445
2015-05-11 14:35:14 -07:00
Matthew Ahrens
f1512ee61e Illumos 5027 - zfs large block support
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  https://github.com/illumos/illumos-gate/commit/b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this one the ABD patches are merged.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #354
2015-05-11 12:23:16 -07:00
Brian Behlendorf
f9cab37291 Remove metaslab_min_alloc_size module option
The metaslab_min_alloc_size option is no longer used in the  code.
This functionality was removed by commit f3a7f66 and the module
options should have been dropped at that time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-11 09:51:57 -07:00
Chris Dunlop
16fcdea363 arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc
With debugging enabled and depending on your kernel config, the size of
arc_buf_hdr_t can blow out the stack of arc_evict() and arc_evict_ghost()
to greater than 1024 bytes. Let's avoid this.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3377
2015-05-08 14:14:35 -07:00
Matthew Ahrens
63e3a8616b Illumos 5349 - verify that block pointer is plausible before reading
5349 verify that block pointer is plausible before reading
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Xin Li <delphij@FreeBSD.org>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5349
  https://github.com/illumos/illumos-gate/commit/f63ab3d5

Porting notes:
* Several variable declarations were moved due to C style needs

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3373
2015-05-08 14:09:15 -07:00
Chris Dunlop
f0da4d1508 Wait for all znodes to be released before tearing down the superblock
By the time we're tearing down our superblock the VFS has started releasing
all our inodes/znodes. Some of this work may have been handed off to our
iput taskq so we need to wait for that work to complete. However the iput
from the taskq can itself result in additional work being added to the
taskq:

dsl_pool_iput_taskq
 iput
  iput_final
   evict
    destroy_inode
     zpl_inode_destroy
      zfs_inode_destroy
       zfs_iput_async(ZTOI(zp->z_xattr_parent))
        taskq_dispatch(dsl_pool_iput_taskq..., iput, ...)

Let's wait until all our znodes have been released.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3281
2015-05-06 14:13:19 -07:00
Matthew Ahrens
7a3066ffdd Illumos 5348 - zio_checksum_error() only fills in info if ECKSUM
5348 zio_checksum_error() only fills in info if ECKSUM
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5348
  https://github.com/illumos/illumos-gate/commit/373dc1cf

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3372
2015-05-05 11:05:37 -07:00
Matthew Ahrens
f3c517d814 Illumos 5820 - verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)
5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5820
  https://github.com/illumos/illumos-gate/commit/34e8acef00

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3364
2015-05-04 10:49:49 -07:00
Matthew Ahrens
36c6ffb6b6 Illumos 5808 - spa_check_logs is not necessary on readonly pools
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5808
  https://github.com/illumos/illumos-gate/commit/23367a2f

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3369
2015-05-04 10:45:42 -07:00
Will Andrews
50f9ea0149 Illumos 5814 - bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
5814 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5814
  https://github.com/illumos/illumos-gate/commit/b67dde11

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3368
2015-05-04 10:28:18 -07:00
Christopher Siden
0c60cc326b Illumos 4951 - ZFS administrative commands (fix)
4951 ZFS administrative commands should use reserved space, not fail with ENOSPC
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/4951
  https://github.com/illumos/illumos-gate/commit/c39f2c8

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:10 -07:00
Matthew Ahrens
3d45fdd6c0 Illumos 4951 - ZFS administrative commands should use reserved space
4951 ZFS administrative commands should use reserved space, not with ENOSPC
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4373
  https://github.com/illumos/illumos-gate/commit/7d46dc6

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:10 -07:00
Jerry Jelinek
a0c9a17aef Illumos 4901 - zfs filesystem/snapshot limit leaks
4901 zfs filesystem/snapshot limit leaks
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4901
  https://github.com/illumos/illumos-gate/commit/adf3407

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:09 -07:00
Matthew Ahrens
83017311e4 Illumos 3654,3656
3654 zdb should print number of ganged blocks
3656 remove unused function zap_cursor_move_to_key()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3654
  https://www.illumos.org/issues/3656
  https://github.com/illumos/illumos-gate/commit/d5ee8a1

Porting Notes:

3655 and 3657 were part of this commit but those hunks were dropped
since they apply to mdb.

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:09 -07:00
tuxoko
6102d0376e Add cond_resched to zfs_zget to prevent infinite loop
It's been reported that threads would loop infinitely inside zfs_zget. The
speculated cause for this is that if an inode is marked for evict, zfs_zget
would see that and loop. However, if the looping thread doesn't yield, the
inode may not have a chance to finish evict, thus causing a infinite loop.

This patch solve this issue by add cond_resched to zfs_zget, making the
looping thread to yield when needed.

Tested-by: jlavoy <jalavoy@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3349
2015-05-04 09:12:41 -07:00
Jason Zaman
c9520ecc0f dmu: fix integer overflows
The params to the functions are uint64_t, but the offsets to memcpy
/ bcopy are calculated using 32bit ints. This patch changes them to
also be uint64_t so there isnt an overflow. PaX's Size Overflow
caught this when formatting a zvol.

Gentoo bug: #546490

PAX: offset: 1ffffb000 db->db_offset: 1ffffa000 db->db_size: 2000 size: 5000
PAX: size overflow detected in function dmu_read /var/tmp/portage/sys-fs/zfs-kmod-0.6.3-r1/work/zfs-zfs-0.6.3/module/zfs/../../module/zfs/dmu.c:781 cicus.366_146 max, count: 15
CPU: 1 PID: 2236 Comm: zvol/10 Tainted: P           O   3.17.7-hardened-r1 #1
Call Trace:
 [<ffffffffa0382ee8>] ? dsl_dataset_get_holds+0x9d58/0x343ce [zfs]
 [<ffffffff81a59c88>] dump_stack+0x4e/0x7a
 [<ffffffffa0393c2a>] ? dsl_dataset_get_holds+0x1aa9a/0x343ce [zfs]
 [<ffffffff81206696>] report_size_overflow+0x36/0x40
 [<ffffffffa02dba2b>] dmu_read+0x52b/0x920 [zfs]
 [<ffffffffa0373ad1>] zrl_is_locked+0x7d1/0x1ce0 [zfs]
 [<ffffffffa0364cd2>] zil_clean+0x9d2/0xc00 [zfs]
 [<ffffffffa0364f21>] zil_commit+0x21/0x30 [zfs]
 [<ffffffffa0373fe1>] zrl_is_locked+0xce1/0x1ce0 [zfs]
 [<ffffffff81a5e2c7>] ? __schedule+0x547/0xbc0
 [<ffffffffa01582e6>] taskq_cancel_id+0x2a6/0x5b0 [spl]
 [<ffffffff81103eb0>] ? wake_up_state+0x20/0x20
 [<ffffffffa0158150>] ? taskq_cancel_id+0x110/0x5b0 [spl]
 [<ffffffff810f7ff4>] kthread+0xc4/0xe0
 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170
 [<ffffffff81a62fa4>] ret_from_fork+0x74/0xa0
 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170

Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3333
2015-05-04 09:12:00 -07:00
George Wilson
98b254188a Illumos #5244 - zio pipeline callers should explicitly invoke next stage
5244 zio pipeline callers should explicitly invoke next stage
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5244
  https://github.com/illumos/illumos-gate/commit/738f37b

Porting Notes:

1. The unported "2932 support crash dumps to raidz, etc. pools"
   caused a merge conflict due to a copyright difference in
   module/zfs/vdev_raidz.c.
2. The unported "4128 disks in zpools never go away when pulled"
   and additional Linux-specific changes caused merge conflicts in
   module/zfs/vdev_disk.c.

Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2828
2015-04-30 15:07:47 -07:00
Matthew Ahrens
8dd86a10cf Illumos 5812 - assertion failed in zrl_tryenter(): zr_owner==NULL
5812 assertion failed in zrl_tryenter(): zr_owner==NULL
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5812
  https://github.com/illumos/illumos-gate/commit/8df1730

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3357
2015-04-30 14:43:40 -07:00
Justin T. Gibbs
6186e29753 Illumos 5592 - NULL pointer dereference in dsl_prop_notify_all_cb()
5592 NULL pointer dereference in dsl_prop_notify_all_cb()
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5592
  https://github.com/illumos/illumos-gate/commit/9d47dec

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:58 -07:00
Justin T. Gibbs
6ebebaceb1 Illumos 5531 - NULL pointer dereference in dsl_prop_get_ds()
5531 NULL pointer dereference in dsl_prop_get_ds()
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5531
  https://github.com/illumos/illumos-gate/commit/e57a022

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:44 -07:00
Justin T. Gibbs
0c66c32d1d Illumos 5056 - ZFS deadlock on db_mtx and dn_holds
5056 ZFS deadlock on db_mtx and dn_holds
Author: Justin Gibbs <justing@spectralogic.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5056
  https://github.com/illumos/illumos-gate/commit/bc9014e

Porting Notes:

sa_handle_get_from_db():
  - the original patch includes an otherwise unmentioned fix for a
    possible usage of an uninitialised variable

dmu_objset_open_impl():
  - Under Illumos list_link_init() is the same as filling a list_node_t
    with NULLs, so they don't notice if they miss doing list_link_init()
    on a zero'd containing structure (e.g. allocated with kmem_zalloc as
    here). Under Linux, not so much: an uninitialised list_node_t goes
    "Boom!" some time later when it's used or destroyed.

dmu_objset_evict_dbufs():
  - reduce stack usage using kmem_alloc()

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:34 -07:00
Justin T. Gibbs
d683ddbb72 Illumos 5314 - Remove "dbuf phys" db->db_data pointer aliases in ZFS
5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5314
  https://github.com/illumos/illumos-gate/commit/c137962

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:20 -07:00
Justin T. Gibbs
945dd93525 Illumos 5310 - Remove always true tests for non-NULL ds->ds_phys
5310 Remove always true tests for non-NULL ds->ds_phys
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5310
  https://github.com/illumos/illumos-gate/commit/d808a4f

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:08 -07:00
Alex Reece
9925c28cde Illumos 5095 - panic when adding a duplicate dbuf to dn_dbufs
5095 panic when adding a duplicate dbuf to dn_dbufs
Author: Alex Reece <alex@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5095
  https://github.com/illumos/illumos-gate/commit/86bb58a

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:24:49 -07:00
Justin T. Gibbs
5aea3644d6 Illumos 5038 - Remove "old-style" flexible array usage in ZFS.
5038 Remove "old-style" flexible array usage in ZFS.
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5038
  https://github.com/illumos/illumos-gate/commit/7f18da4

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:24:24 -07:00
Alex Reece
8951cb8dfb Illumos 4873 - zvol unmap calls can take a very long time for larger datasets
4873 zvol unmap calls can take a very long time for larger datasets
Author: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4873
  https://github.com/illumos/illumos-gate/commit/0f6d88a

Porting Notes:

dbuf_free_range():
  - reduce stack usage using kmem_alloc()
  - the sorted AVL tree will handle the spill block case correctly
    without all the special handling in the for() loop

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:24:03 -07:00
Jorgen Lundman
58c4aa00c6 Illumos 4975 - missing mutex_destroy() calls in zfs
4975 missing mutex_destroy() calls in zfs
Author: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Seth Nimbosa <darth.Serious@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4975
  https://github.com/illumos/illumos-gate/commit/d2b3cbb

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:23:38 -07:00
Alex Reece
ca227e54a8 Illumos 3897 - zfs filesystem and snapshot limits (fix leak)
3897 zfs filesystem and snapshot limits (fix leak)
Author: Alex Reece <alex.reece@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3897
  https://github.com/illumos/illumos-gate/commit/fb7001f

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:23:14 -07:00
Jerry Jelinek
788eb90c4c Illumos 3897 - zfs filesystem and snapshot limits
3897 zfs filesystem and snapshot limits
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3897
  https://github.com/illumos/illumos-gate/commit/a2afb61

Porting Notes:

dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc().

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:22:51 -07:00
tuxoko
ecfb0b5f42 Fix misuse of input argument in traverse_visitbp
In traverse_visitbp(), the input argument dnp is modified in the middle to
point to a temporary buffer. Originally this doesn't matter, because no user
of TRAVERSE_POST dereferences it. However, in fbeddd6 a piece of code is added
dereferencing dnp after the modification, creating a possible bug.

We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case,
so we don't modify the input argument. Also we introduce different local
variables in the DMU_OT_OBJSET case to prevent confusion between the input
argument.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2060
2015-04-28 09:43:50 -07:00
Isaac Huang
0336f3d001 Remove useless variable spa_active_count
This isn't required for the Linux port because the kernel tracks
if a module is busy.  The prototype for spa_busy() is also removed
since its definition was already removed.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3262
2015-04-27 09:22:05 -07:00
Justin T. Gibbs
ec8501ee12 5313 Allow I/Os to be aggregated across ZIO priority classes
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5313
  https://github.com/illumos/illumos-gate/commit/fe319232

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3280
2015-04-24 15:16:56 -07:00
Ned Bass
4eb30c6864 Serialize access to spa->spa_feat_stats nvlist
The function spa_add_feature_stats() manipulates the shared nvlist
spa->spa_feat_stats in an unsafe concurrent manner. Add a mutex to
protect the list.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3335
2015-04-24 15:04:43 -07:00
Chunwei Chen
07012da668 Fix kernel panic due to tsd_exit in ZFS_EXIT(zsb)
The following panic would occur under certain heavy load:
[ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held
[ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P           OE  3.18.10 #7

The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which
would purge all tsd data for the thread. However, ZFS_ENTER is designed to be
reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data.

Instead, we rely on the new behavior of tsd_set. When NULL is passed as the
new value to tsd_set, it will automatically remove the tsd entry specified the
the key for the current thread.

rrw_tsd_key and zfs_allow_log_key already calls tsd_set(key, NULL) when
they're done. The zfs_fsyncer_key relied on ZFS_EXIT(zsb) to call tsd_exit() to
do clean up. Now we explicitly call tsd_set(key, NULL) on them.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3247
2015-04-24 14:57:54 -07:00
Brian Behlendorf
a438ff0e85 Extend PF_FSTRANS critical regions
Additional testing has shown that the region covered by PF_FSTRANS
needs to be extended to cover the  zpl_xattr_security_init() and
init_acl() functions.  The zpl_mark_dirty() function can also recurse
and therefore must always be protected.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3331
2015-04-24 09:54:22 -07:00
Brian Behlendorf
7fad6290eb Mark additional functions as PF_FSTRANS
Prevent deadlocks by disabling direct reclaim during all NFS, xattr,
ctldir, and super function calls.  This is related to 40d06e3.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3225
2015-04-17 09:35:24 -07:00
Tim Chase
5074bfe8ad Allocate zfs_znode_cache on the Linux slab
The Linux slab, in general, performs better than the SPl slab in cases
where a lot of objects are allocated and fragmentation is likely present.

This patch fixes pathologically bad behavior in cases where the ARC is
filled with mostly metadata and a user program needs to allocate and
dirty enough memory which would require an insignificant amount of the
ARC to be reclaimed.

If zfs_znode_cache is on the SPL slab, the system may spin for a very
long time trying to reclaim sufficient memory.  If it is on the Linux
slab, the behavior has been observed to be much more predictible; the
memory is reclaimed more efficiently.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3283
2015-04-14 12:19:22 -07:00
Brian Behlendorf
f42d7f4111 Use vmem_alloc() in spa_config_write()
The packed nvlist allocated in spa_config_write() may exceed the
warning threshold for large configurations.  Use the vmem interfaces
for this short lived allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3251
2015-04-07 15:10:19 -07:00
Tim Chase
40d06e3c78 Mark all ZPL and ioctl functions as PF_FSTRANS
Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl
calls as well as the l2arc and adapt ARC threads.

This obviates the need for MUTEX_FSTRANS so its previous uses and
definition have been eliminated.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3225
2015-04-03 11:38:59 -07:00
Matthew Ahrens
0f7d2a4b3d Illumus 5693 - ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override
5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5693
  https://github.com/illumos/illumos-gate/commit/7f7ace3

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3231
2015-03-27 15:02:56 -07:00
George Wilson
b738bc5a0f Illumos 5694 - traverse_prefetcher does not prefetch enough
5694 traverse_prefetcher does not prefetch enough
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5694
  https://github.com/illumos/illumos-gate/commit/34d7ce05

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3230
2015-03-27 15:02:50 -07:00
Chris Dunlop
ee2f17aa2a Align code with Illumos
Align code in traverse_visitbp() with that in Illumos in preparation for
applying Illumos-5694.

No functional change: use a temporary variable pd to replace multiple
occurrences of td->td_pfd.  This increases our stack use slightly more
then normal because the function is called recursively.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3230
2015-03-27 14:52:45 -07:00
Prakash Surya
a4069eef2e Illumos 5695 - dmu_sync'ed holes do not retain birth time
5695 dmu_sync'ed holes do not retain birth time
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5695
  https://github.com/illumos/illumos-gate/commit/70163ac

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3229
2015-03-27 14:51:34 -07:00
Ned Bass
58806b4cdc dbuf_free_range() overzealously frees dbufs
When called to free a spill block from a dnode, dbuf_free_range() has a
bug that results in all dbufs for the dnode getting freed.  A variety of
problems may result from this bug, but a common one was a zap lookup
tripping an ASSERT because the zap buffers had been zeroed out.  This
could happen on a dataset with xattr=sa set when extended attributes are
written and removed on a directory concurrently with I/O to files in
that directory.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #3195
Fixes #3204
Fixes #3222
2015-03-25 14:48:22 -07:00
Tim Chase
ded576e28f Set the maximum ZVOL transfer size correctly
ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19, it
the kernel artifically capped it at 1024 (BLK_DEF_MAX_SECTORS).
This cap was removed in torvalds/linux@34b48db.  This patch changes
it to DMU_MAX_ACCESS (in sectors) and also changes the ASSERT in
dmu_tx_hold_write() to allow the maximum transfer size.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3212
2015-03-25 11:58:42 -07:00
Isaac Huang
e89bd69775 zio_injection_enabled should not be a module option
The zio_inject.c keeps zio_injection_enabled as a counter of
fault handlers, so it should not be exported to user space as
a module option.

Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c,
where the symbols are defined.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3199
2015-03-24 13:22:03 -07:00
Chris Dunlop
d07b7c7f21 Reduce size of zfs_sb_t: allocate z_hold_mtx separately
zfs_sb_t has grown to the point where using kmem_zalloc() for allocations
is triggering the 32k warning threshold.

We can't safely convert this entire allocation to use vmem_alloc() instead
of kmem_alloc() because the backing_dev_info structure is embedded here.
It depends on the bit_waitqueue() function which won't behave properly
when given a virtual address.

Instead, use vmem_alloc() to allocate the z_hold_mtx array separately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #3178
2015-03-24 13:17:44 -07:00
Brian Behlendorf
bc88866657 Fix arc_adjust_meta() behavior
The goal of this function is to evict enough meta data buffers from the
ARC in order to enforce the arc_meta_limit.  Achieving this is slightly
more complicated than it appears because it is common for data buffers
to have holds on meta data buffers.  In addition, dnode meta data buffers
will be held by the dnodes in the block preventing them from being freed.
This means we can't simply traverse the ARC and expect to always find
enough unheld meta data buffer to release.

Therefore, this function has been updated to make alternating passes
over the ARC releasing data buffers and then newly unheld meta data
buffers.  This ensures forward progress is maintained and arc_meta_used
will decrease.  Normally this is sufficient, but if required the ARC
will call the registered prune callbacks causing dentry and inodes to
be dropped from the VFS cache.  This will make dnode meta data buffers
available for reclaim.  The number of total restarts in limited by
zfs_arc_meta_adjust_restarts to prevent spinning in the rare case
where all meta data is pinned.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Brian Behlendorf
2cbb06b561 Restructure per-filesystem reclaim
Originally when the ARC prune callback was introduced the idea was
to register a single callback for the ZPL.  The ARC could invoke this
call back if it needed the ZPL to drop dentries, inodes, or other
cache objects which might be pinning buffers in the ARC.  The ZPL
would iterate over all ZFS super blocks and perform the reclaim.

For the most part this design has worked well but due to limitations
in 2.6.35 and earlier kernels there were some problems.  This patch
is designed to address those issues.

1) iterate_supers_type() is not provided by all kernels which makes
it impossible to safely iterate over all zpl_fs_type filesystems in
a single callback.  The most straight forward and portable way to
resolve this is to register a callback per-filesystem during mount.
The arc_*_prune_callback() functions have always supported multiple
callbacks so this is functionally a very small change.

2) Commit 050d22b removed the non-portable shrink_dcache_memory()
and shrink_icache_memory() functions and didn't replace them with
equivalent functionality.  This meant that for Linux 3.1 and older
kernels the ARC had no mechanism to drop dentries and inodes from
the caches if needed.  This patch adds that missing functionality
by calling shrink_dcache_parent() to release dentries which may be
pinning inodes.  This will result in all unused cache entries being
dropped which is a bit heavy handed but it's the only interface
available for old kernels.

3) A zpl_drop_inode() callback is registered for kernels older than
2.6.35 which do not support the .evict_inode callback.  This ensures
that when the last reference on an inode is dropped it is immediately
removed from the cache.  If this isn't done than inode can end up on
the global unused LRU with no mechanism available to ZFS to drop them.
Since the ARC buffers are not dropped the hottest inodes can still
be recreated without performing disk IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Brian Behlendorf
596a8935a1 Fix arc_meta_max accounting
The arc_meta_max value should be increased when space it consumed not when
it is returned.  This ensure's that arc_meta_max is always up to date.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Chunwei Chen
40749aa7a6 Use MUTEX_FSTRANS on l2arc_buflist_mtx
Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock
scenario:
1. arc_release() -> hash_lock -> l2arc_buflist_mtx
2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) ->
   arc_buf_remove_ref() -> hash_lock

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3160
2015-03-18 09:29:38 -07:00
Justin T. Gibbs
4c7b7eedcd Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption
5630 stale bonus buffer in recycled dnode_t leads to data corruption
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5630
  https://github.com/illumos/illumos-gate/commit/cd485b4

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
2015-03-12 15:40:39 -07:00
Josef 'Jeff' Sipek
73ad4a9f3c Illumos 5047 - don't use atomic_*_nv if you discard the return value
5047 don't use atomic_*_nv if you discard the return value
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5047
  https://github.com/illumos/illumos-gate/commit/640c167

Porting Notes:

Several hunks from the original patch where not specific to ZFS
and thus were dropped.

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
2015-03-12 15:40:33 -07:00
Brian Behlendorf
7f3e466283 Mark zfs_inactive() with PF_FSTRANS
Allowing direct reclaim to re-enter the VFS in the zfs_inactive()
call path has historically been problematic for ZoL.  Therefore,
in order to avoid an entire class of current and future issues
caused by this PF_FSTRANS is set for all zfs_inactive() callers.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3163
2015-03-10 09:21:48 -07:00
Ned Bass
417104bdd3 Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa.  It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082
2015-03-05 14:11:10 -08:00
Brian Behlendorf
989fd514b1 Change ASSERT(!"...") to cmn_err(CE_PANIC, ...)
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible.  This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1445
2015-03-03 13:22:21 -08:00
Brian Behlendorf
8c45def24a Linux 4.0 compat: bdi_setup_and_register()
The 'capabilities' argument which was passed to bdi_setup_and_register()
has been removed.  File systems should no longer pass BDI_CAP_MAP_COPY.
For our purposes this means there are now three different interfaces
which must be handled.  A zpl_bdi_setup_and_register() wrapper function
has been introduced to provide a single interface to the ZPL code.

* 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported.
* 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments.
* 4.0 - x.y, bdi_setup_and_register() takes 2 arguments.

I've also taken this opportunity to remove HAVE_BDI because kernels
older then 2.6.32 are no longer supported.  All kernels newer than
this will have one of the above interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #3128
2015-03-03 10:49:45 -08:00
Brian Behlendorf
4ec15b8dcf Use MUTEX_FSTRANS mutex type
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have an future change introduce a
   new one.

Both of these concerns can be addressed by using a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch
'kmem-rework'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3050
Closes #3055
Closes #3062
Closes #3132
Closes #3142
Closes #2983
2015-03-03 10:46:40 -08:00
Isaac Huang
d14cfd83da Fix deadlock between zpool export and zfs list
Pool reference count is NOT checked in spa_export_common()
if the pool has been imported readonly=on, i.e. spa->spa_sync_on
is FALSE. Then zpool export and zfs list may deadlock:

1. Pool A is imported readonly.
2. zpool export A and zfs list are run concurrently.
3. zfs command gets reference on the spa, which holds a dbuf on
   on the MOS meta dnode.
4. zpool command grabs spa_namespace_lock, and tries to evict dbufs
   of the MOS meta dnode. The dbuf held by zfs command can't be
   evicted as its reference count is not 0.
5. zpool command blocks in dnode_special_close() waiting for the
   MOS meta dnode reference count to drop to 0, with
   spa_namespace_lock held.
6. zfs command tries to get the spa_namespace_lock with a reference
   on the spa held, which holds a dbuf on the MOS meta dnode.
7. Now zpool command and zfs command deadlock each other.

Also any further zfs/zpool command will block on spa_namespace_lock
forever.

The fix is to always check pool reference count in spa_export_common(),
no matter whether the pool was imported readonly or not.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2034
2015-03-02 11:50:06 -08:00
Brian Behlendorf
87a63dd702 Prevent "zpool destroy|export" when suspended
Cleanly destroying or exporting a pool requires that the pool
not be suspended.  Therefore, set the POOL_CHECK_SUSPENDED flag
for these ioctls so the utilities will output a descriptive
error message rather than block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2878
2015-03-02 11:50:06 -08:00
Brian Behlendorf
b4f3666a16 Retire spl_module_init()/spl_module_fini()
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup.  This was done to abstract
away any compatibility code which might be needed for the SPL.

As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2985
2015-02-24 11:37:44 -08:00
Brian Behlendorf
1efdc45ea8 Fix O_APPEND open(2) flag
As described in flags section of open(2):

  O_APPEND:
    The  file  is  opened in append mode.  Before each write(2), the
    file offset is positioned at the end of the  file,  as  if  with
    lseek(2).   O_APPEND may lead to corrupted files on NFS filesys-
    tems if more than one process appends data to a  file  at  once.
    This is because NFS does not support appending to a file, so the
    client kernel has to simulate it, which can't be done without  a
    race condition.

This issue was originally overlooked because normally the generic
VFS code handles this for a filesystem.  However, because ZFS explictly
registers a zpl_write() function it's responsible for the seek.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3124
2015-02-24 11:21:54 -08:00
Dan Swartzendruber
1611bb7b4f Set zfs_autoimport_disable default value to 1
When loading the ZFS kernel modules they should not populate the
spa namespace using the cache file.  This behavior isn't consistent
with other Linux kernel modules and we need to move away from it.
Removing this makes the whole startup process predictable with four
basic steps which are driven by the init system.

1) modprobe
2) zpool import
3) zfs mount
4) zfs share

This change also helps lay the groundwork for eventually removing
the kobj_* compatibility code on the kernel side.  It may need to
be preserved in userspace because libzfs_init() depends on it.
This is why the conditional must be wrapped with an #ifdef _KERNEL.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2820
2015-02-17 16:09:41 -08:00
Brian Behlendorf
7d2868d5fc Skip bad DVAs during free by setting zfs_recover=1
When a bad DVA is encountered in metaslab_free_dva() the system
should treat it as fatal.  This indicates that somehow a damaged
DVA was written to disk and that should be impossible.

However, we have seen a handful of reports over the years of pools
somehow being damaged in this way.  Since this damage can render
otherwise intact pools unimportable, and the consequence of skipping
the bad DVA is only leaked free space, it makes sense to provide
a mechanism to ignore the bad DVA.  Setting the zfs_recover=1 module
option will cause the DVA to be ignored which may allow the pool to
be imported.

Since zfs_recover=0 by default any pool attempting to free a bad DVA
will treat it as a fatal error preserving the current behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3099
Issue #3090
Issue #2720
2015-02-13 16:02:04 -08:00
Andrey Vesnovaty
5f15fa2216 Fix readdir for .zfs/snapshot directory
dmu_snapshot_list_next stores the index of the next snapshot entry to the offp
argument, which zpl_snapdir_iterate then uses for the dir_emit. This
result in an off-by-one error. Therefore a temporary variable should be
used.

This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2930
2015-02-10 16:34:30 -08:00
Brian Behlendorf
3941503c0a Retire zio_cons()/zio_dest()
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue #2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:09:49 -08:00
Brian Behlendorf
6442f3cfe3 Retire zio_bulk_flags
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches.  Today
this code works well and there's no compelling reason to keep
this functionality.  In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:08:49 -08:00
Jörg Thalheim
534759fad3 Linux 3.19 compat: file_inode was added
struct access f->f_dentry->d_inode was replaced by accessor function
file_inode(f)

Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3084
2015-02-10 11:24:51 -08:00
Brian Behlendorf
77aef6f60e Use vmem_alloc() for nvlists
Several of the nvlist functions may perform allocations larger than
the 32k warning threshold.  Convert them to use vmem_alloc() so the
best allocator is used.

Commit efcd79a retired KM_NODEBUG which was used to suppress large
allocation warnings.  Concurrently the large allocation warning threshold
was increased from 8k to 32k.  The goal was to identify the remaining
locations, such as this one, where the allocation can be larger than
32k.  This patch is expected fine tuning resulting for the kmem-rework
changes, see commit 6e9710f.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3057
Closes #3079
Closes #3081
2015-02-10 11:00:08 -08:00
Brian Behlendorf
afe373260e Revert "Don't read space maps during import for readonly pools"
This reverts commit 7fc8c33ede which
accidentally introduced a ztest failure.

ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2
child exited with code 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-02-09 16:56:59 -08:00
Brian Behlendorf
7fc8c33ede Don't read space maps during import for readonly pools
Normally when importing a pool the space maps for all top level
vdevs are read from disk.  The space maps will be required latter
when an allocation is performed and free blocks need to be located.

However, if the pool is imported readonly then we are guaranteed
that no allocations can occur.  In this case the space maps need
not be loaded..   A similar argument can be made for the DTLs
(dirty time logs).

Because a pool import will fail if the space maps cannot be read.
The ability to safely ignore them makes it more likely that a
damaged pool can be imported readonly to recover its contents.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2831
2015-02-09 16:43:03 -08:00
Justin T. Gibbs
33b4de513e Illumos 5311 - traverse_dnode may report success when it should not
5311 traverse_dnode may report success when it should not
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/2a89c2c
  https://www.illumos.org/issues/5311

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2970
2015-02-06 12:07:15 -08:00
Ned Bass
a62d1b02e3 Fix SA header size accounting
The functions sa_find_sizes() and sa_build_layout() fail to account
for the additional 2 bytes of SA header space when calculating whether
a variable size attribute might spill over. They may consequently
determine that an attribute will fit in the bonus buffer along with a
spill block pointer, when in reality the attribute would be partially
overwritten by the spill block pointer if spill over occurs. This also
causes an inconsistency between the SA header size and the number of
variable size attributes in the layout, tripping an assertion when
debugging is on. The following reproducer demonstrates the problem.

  ln -s $(perl -e 'print "z" x 20') file
  setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file

Even though sa_find_sizes() computes the index of the attribute where
spill-over will occur, sa_build_layouts() discards the result and
recomputes it itself. As it turns out, both functions get it wrong.
Since this computation is awkward and, as history has shown, easy to
screw up, let's just do it in one place. This patch fixes the bug in
sa_find_sizes() and updates sa_build_layout() to use the result
computed there.

Also improve the comments in sa_find_sizes().

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3070
2015-02-06 09:26:46 -08:00
Brian Behlendorf
e2c4acde55 Skip evicting dbufs when walking the dbuf hash
When a dbuf is in the DB_EVICTING state it may no longer be on the
dn_dbufs list.  In which case it's unsafe to call DB_DNODE_ENTER.
Therefore, any dbuf which is found in this safe must be skipped.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2553
Closes #2495
2015-02-06 09:24:28 -08:00
Tim Chase
aa2ef419e4 Spurious ENOMEM returns when reading dbufs kstat
Commit 7b2d78a046 fixed some improper uses
of snprintf(), however, in __dbuf_stats_hash_table_data() the return
value of snprintf is propagated to the caller.  This caused spurious
ENOMEM errors when reading the dbufs kstat.

This commit causes the actual number of characters written to be returned.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3072
2015-02-04 16:35:26 -08:00
avg
037763e44e fix l2arc compression buffers leak
Commit log from FreeBSD:

    We have observed that arc_release() can be called concurrently with a
    l2arc in-flight write.  Also, we have observed that arc_hdr_destroy()
    can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE
    flag in similar circumstances.

    Previously the l2arc headers would be freed while leaking their
    associated compression buffers.  Now the buffers are placed on
    l2arc_free_on_write list for delayed freeing.  This is similar to
    what was already done to arc buffers that were supposed to be freed
    concurrently with in-flight writes of those buffers.

    In addition to fixing the discovered leaks this change also adds
    some protective code to assert that a compression buffer associated
    with a l2arc header is never leaked.

    A new kstat l2_cdata_free_on_write is added.  It keeps a count
    of delayed compression buffer frees which previously would have
    been leaks.

    Tested by:  Vitalij Satanivskij <satan@ukr.net> et al
    Requested by:       many
    MFC after:  2 weeks
    Sponsored by:       HybridCluster / ClusterHQ

References:
    https://illumos.org/issues/5222
    https://github.com/freebsd/freebsd/commit/b98f85d
    http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781
    http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html
    http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3029
2015-02-03 16:54:16 -08:00
Brian Behlendorf
19ea3d25df Use zio buffers in zil_itx_create()
The zil_itx_create() function uses the vmem_alloc() allocator for
its buffers because when logging a write that buffer may be as large
as 64K.  This is non-optimal because we may need to allocate many of
of these buffers and this interface has the potential to be slow.
Instead, use zio_data_buf_alloc() which is specifically designed to
be able to efficiently allocate a wide range of buffer sizes.

In addition, do some cleanup and use the zil_itx_destroy() function
to always free an itx structure.  This way we're always sure the
right allocation functions are used.  Notice that in the current
code kmem_free() and vmem_free() were both used.  This happened to
work because these wrappers map to the same internal SPL function.

This was identified as a potential problem when a low-end memory
constrained system began logging the following warnings.  There
was no deadlock here just repeated allocation failures resulting
in increased latency.

Possible memory allocation deadlock: size=65792 lflags=0x42d0
Pid: 20118, comm: kvm Tainted: P           O 3.2.0-0.bpo.4-amd64
Call Trace:
 [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl]
 [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl]
 [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs]
 [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs]
 [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs]
 [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs]
 [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde
 [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125
 [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a
 [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3059
2015-02-02 11:20:41 -08:00
Brian Behlendorf
0365064a97 Handle closing an unopened ZVOL
Thank to commit a4430fce69 we're
now correctly returning EROFS when opening a zvol on a read-only
pool.  Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().

In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:14 -08:00
Brian Behlendorf
a127e841de Add zvol_open() error handling for readonly property
Rather than ASSERT when for some reason the readonly property of
a zvol can't be read cleanly handle the failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:06 -08:00
Tim Chase
b0cf0676c0 Fix removal of SA in sa_modify_attrs()
The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR().  The write index, "j" is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch remove the increment for SA_REMOVE operations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3028
2015-01-21 16:35:14 -08:00
Richard Yao
841c9d43c7 Use kmem_vasprintf() in log_internal()
An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_asprintf(). It is not clear that switching to
kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_asprintf() is cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.

It also brings this function almost back in sync with Illumos.  This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2791
Issue #2781
2015-01-21 15:30:24 -08:00
Tim Chase
3c832b8cc1 Linux 3.12 compat: split shrinker has s_shrink
The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.

This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2975
2015-01-20 14:07:59 -08:00
Brian Behlendorf
81971b137a Revert "SA spill block cache"
The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations.  Instead a small dedicated
cache of preallocated SA buffers was kept.

This solution was viable while the maximum block size was limited
to 128K.  But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc().  However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.

Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.

This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf
285b29d959 Revert "Pre-allocate vdev I/O buffers"
Commit 86dd0fd added preallocated I/O buffers.  This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos.  A
deadlock in this situation is no longer possible.

However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky.  Either case is acceptable
here because we can safely abort the aggregation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf
79c76d5b65 Change KM_PUSHPAGE -> KM_SLEEP
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory.  This is again the
same as upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:26 -08:00
Brian Behlendorf
efcd79a883 Retire KM_NODEBUG
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate.  The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.

A careful reader will notice that not all callers have been changed
to vmem_alloc().  Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k.  This is desirable because it minimizes the need
for Linux specific code changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:40:32 -08:00
Richard Yao
71f8548ea4 Use is_vmalloc_addr() in vdev_disk.c
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf
92119cc259 Mark IO pipeline with PF_FSTRANS
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim.  This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction.  For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf
d958324f97 Fix zfs_putpage() lock inversion (again)
This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range().  Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read().  The page must be unlocked before taking
the range lock.  This patch corrects that issue.

In addition, because the locking rules here are subtle a block comment
has been added clearly explaining why the ordering here is critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976
2015-01-08 16:09:41 -08:00
Ned Bass
33b6dbbc51 Document zfs_flags module parameter
Add a table describing the debugging flags that can be set in the zfs_flags
module parameter.  Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.

zfs_flags (int)
            Set  additional debugging flags. The following flags may be
            bitwise-or'd together.

            +-------------------------------------------------------+
            |Value   Symbolic Name                                  |
            |        Description                                    |
            +-------------------------------------------------------+
            |    1   ZFS_DEBUG_DPRINTF                              |
            |        Enable dprintf entries in the debug log.       |
            +-------------------------------------------------------+
            |    2   ZFS_DEBUG_DBUF_VERIFY *                        |
            |        Enable extra dbuf verifications.               |
            +-------------------------------------------------------+
            |    4   ZFS_DEBUG_DNODE_VERIFY *                       |
            |        Enable extra dnode verifications.              |
            +-------------------------------------------------------+
            |    8   ZFS_DEBUG_SNAPNAMES                            |
            |        Enable snapshot name verification.             |
            +-------------------------------------------------------+
            |   16   ZFS_DEBUG_MODIFY                               |
            |        Check for illegally modified ARC buffers.      |
            +-------------------------------------------------------+
            |   32   ZFS_DEBUG_SPA                                  |
            |        Enable spa_dbgmsg entries in the debug log.    |
            +-------------------------------------------------------+
            |   64   ZFS_DEBUG_ZIO_FREE                             |
            |        Enable verification of block frees.            |
            +-------------------------------------------------------+
            |  128   ZFS_DEBUG_HISTOGRAM_VERIFY                     |
            |        Enable extra spacemap histogram verifications. |
            +-------------------------------------------------------+
            * Requires debug build.

            Default value: 0.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2988
2015-01-07 15:50:49 -08:00
Ned Bass
49ee64e5e6 Remove duplicate typedefs from trace.h
Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate
typedef declarations with the same type. The trace.h header contains
some typedefs to avoid 'unknown type' errors for C files that haven't
declared the type in question. But this causes build failures for C
files that have already declared the type. Newer versions of GCC (e.g.
v4.6) allow duplicate typedefs with the same type unless pedantic error
checking is in force. To support the older versions we need to remove
the duplicate typedefs.

Removal of the typedefs means we can't built tracepoints code using
those types unless the required headers have been included. To
facilitate this, all tracepoint event declarations have been moved out
of trace.h into separate headers. Each new header is explicitly included
from the C file that uses the events defined therein. The trace.h header
is still indirectly included form zfs_context.h and provides the
implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces.
This makes those interfaces readily available throughout the code base.
The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also
still provided by trace.h, so it is a prerequisite for the other
trace_*.h headers.

These new Linux implementation-specific headers do introduce a small
divergence from upstream ZFS in several core C files, but this should
not present a significant maintenance burden.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
2015-01-06 16:53:24 -08:00
Brian Behlendorf
74328ee18f Fix zfs_putpage() lock inversion
There exists a lock inversions involving the zfs range lock and the
individual page writeback bits which can result in a deadlock.  To
prevent this we must always manipulate the writeback bit while
holding the range lock.  The exact deadlock is as follows:

------ Process A ------        ------ Process B ------
zpl_writepages                 zpl_fallocate
write_cache_pages              zpl_fallocate_common
zpl_putpage                    zfs_space
zfs_putpage (set bit)          zfs_freesp
zfs_range_lock (wait on lock)  zfs_free_range (take lock)
[has not yet initiated I/O,    truncate_inode_pages_range
the bit will not be cleared]   wait_on_page_writeback (wait on bit)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976
2014-12-22 09:31:56 -08:00
Boris Protopopov
9063f65476 Correct error returns to unify cross-pool operation error handling
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2911
2014-12-19 10:51:24 -08:00
Brian Behlendorf
c944be5d7e Fix snapshots with dirty inodes
Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty and inode.  This
will result in a write which will correctly cause a kernel panic
because these filesystem are (and must be) immutable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2812
2014-11-20 10:38:16 -08:00
Isaac Huang
29b763cd2c bio_alloc() with __GFP_WAIT never returns NULL
Mark the error handling branch as unlikely() because the current
kernel interface can never return NULL.  However, we want to keep
the error handling in case this behavior changes in the futre.

Plus fix a small style issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2703
2014-11-19 12:50:49 -05:00
Ned Bass
aaed7c408c Explicitly include SPL compat headers
Inclusion of SPL compatibility headers was moved out of the public
header sys/types.h to avoid conflicts with external packages.  Include a
few compatiblity headers explicitly to cope with that change.  Also,
sort some linux-specific inclusions alphabetically.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2898
2014-11-19 12:30:39 -05:00
Ned Bass
7b2d78a046 Fix improper null-byte termination handling
Fix a few cases where null-byte termination of strings was done
unnecessarily or incorrectly.

- The snprintf() function always produces a null-byte terminated string
  for non-negative return values, so it is not necessary to write out a
  null-byte as a separate step.

- Also, it is unsafe to use the return value of snprintf() as an offset
  for placing a null-byte, because if the output was truncated the return
  value is the number of bytes that _would_ have been written had enough
  space been available. Therefore the return value may index beyond the
  array boundaries.

- Finally, snprintf() accounts for the null-byte when limiting its output
  size, so there is no need to pass it a size parameter that is one less
  than the buffer size.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2875
2014-11-17 15:28:59 -08:00
smh
89b1cd6581 Prevent ZFS leaking pool free space
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.

In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query
the pool status during the hung import would not return as the import
holds the pool lock.

The only way to import such a pool would be to specify -o readonly=on
to the zpool import.

zdb -bb <pool> can be used to check for "deferred free" size which is
where this lost space will be counted.

References:
  https://github.com/freebsd/freebsd/commit/48431b7
  http://svnweb.freebsd.org/base?view=revision&revision=273158
  https://reviews.csiden.org/r/132/

Porting notes:

This issue was filed as illumos 5347 and a more comprehensive fix is
under review.  Once that change is finalized it will be integrated, in
the meanwhile the FreeBSD fix has been merged to prevent the issue.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2896
2014-11-17 11:35:38 -08:00
Tim Chase
4254acb057 Undirty freed spill blocks.
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

	setfattr -n user.test -v very_very_very..._long_value  <file>
	setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884
2014-11-17 11:25:48 -08:00
Prakash Surya
0b39b9f96f Swap DTRACE_PROBE* with Linux tracepoints
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.

The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:

    * A number of external tools can make use of our tracepoints
      "automatically" (e.g. perf, systemtap)
    * Tracepoints are designed to be extremely cheap when disabled
    * It's one of the "accepted" ways to export this kind of
      information; many other kernel subsystems use tracepoints too.

Unfortunately, though, there are a few caveats as well:

    * Linux tracepoints appear to only be available to GPL licensed
      modules due to the way certain kernel functions are exported.
      Thus, to actually make use of the tracepoints introduced by this
      patch, one might have to patch and re-compile the kernel;
      exporting the necessary functions to non-GPL modules.

    * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
      tracepoints are not available for unsigned kernel modules
      (tracepoints will get disabled due to the module's 'F' taint).
      Thus, one either has to sign the zfs kernel module prior to
      loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
      newer.

Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):

    # list all zfs tracepoints available

    $ ls /sys/kernel/debug/tracing/events/zfs
    enable              filter              zfs_arc__delete
    zfs_arc__evict      zfs_arc__hit        zfs_arc__miss
    zfs_l2arc__evict    zfs_l2arc__hit      zfs_l2arc__iodone
    zfs_l2arc__miss     zfs_l2arc__read     zfs_l2arc__write
    zfs_new_state__mfu  zfs_new_state__mru

    # enable all zfs tracepoints, clear the tracepoint ring buffer

    $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
    $ echo 0 > /sys/kernel/debug/tracing/trace

    # import zpool called 'tank', inspect tracepoint data (each line was
    # truncated, they're too long for a commit message otherwise)

    $ zpool import tank
    $ cat /sys/kernel/debug/tracing/trace | head -n35
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 1219/1219   #P:8
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
            lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
          z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
          z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
          z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
          z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
          z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
          z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
          z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...

To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:

    lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
        hdr {
            dva 0x1:0x40082
            birth 15491
            cksum0 0x163edbff3a
            flags 0x640
            datacnt 1
            type 1
            size 2048
            spa 3133524293419867460
            state_type 0
            access 0
            mru_hits 0
            mru_ghost_hits 0
            mfu_hits 0
            mfu_ghost_hits 0
            l2_hits 0
            refcount 1
        } bp {
            dva0 0x1:0x40082
            dva1 0x1:0x3000e5
            dva2 0x1:0x5a006e
            cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
            lsize 2048
        } zb {
            objset 0
            object 0
            level -1
            blkid 0
        }

For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.

For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:

    * http://lwn.net/Articles/379903/
    * http://lwn.net/Articles/381064/
    * http://lwn.net/Articles/383362/

I should also node the more "boring" aspects of this patch:

    * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
       support a sixth paramter. This parameter is used to populate the
       contents of the new conftest.h file. If no sixth parameter is
       provided, conftest.h will be empty.

    * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
      This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
      except it has support for a fifth option that is then passed as
      the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.

These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).

    * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
      is to determine if we can safely enable the Linux tracepoint
      functionality. We need to selectively disable the tracepoint code
      due to the kernel exporting certain functions as GPL only. Without
      this check, the build process will fail at link time.

In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:55 -08:00
Ned Bass
29e57d15c8 Fix dprintf format specifiers
Fix a few dprintf format specifiers that disagreed with their argument
types.  These came to light as compiler errors when converting dprintf
to use the Linux trace buffer.  Previously this wasn't a problem,
presumably because the SPL debug logging uses vsnprintf which must
perform automatic type conversion.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:45 -08:00
Ned Bass
59ec819a0c Move a few internal ARC strucutres to arc_impl.h
Add a new file named arc_impl.h and move a few internal
ARC structure definitions into this file. This is
needed in order to allow the Linux tracepoint functions to grub
around in the internals of these structures.

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:27 -08:00
Prakash Surya
fb42a49328 Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO
5213 panic in metaslab_init due to space_map_open returning ENXIO
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com

References:
  https://www.illumos.org/issues/5213
  https://reviews.csiden.org/r/110

Porting notes:

For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2745
2014-11-14 15:37:45 -08:00
Chris Wedgwood
b31d8ea77c Reduce buf/dbuf mutex contention
Due to evidence of contention both the buf_hash_table and the
dbuf_hash_table sizes have been increased from 256 to 8192.

This increase in hash table size adds approximating 0.5M to
our fixed memory footprint.  This relatively small increase
is not expected to cause problems even on low memory machines.
This footprint will also become dynamic when the persistent
L2ARC support is finalized.  In the meanwhile, this small
change significantly reduces contention for certain workloads.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #1291
2014-11-14 14:59:21 -08:00
Alex Zhuravlev
0f69910833 Export symbols for ZIL interface
These symbols are needed by consumers (i.e. Lustre) who wish to
integrate with the ZIL.  In addition the zil_rollback_destroy()
prototype was removed because the implementation of this function
was removed long ago.

Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2892
2014-11-14 14:39:43 -08:00
Tim Chase
ed6e9cc235 Linux 3.12 compat: shrinker semantics
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects.  It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.

This patch adds support for the new @scan_objects return value semantics.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2837
2014-10-28 09:34:51 -07:00
Matthew Ahrens
9635861742 Illumos 5164-5165 - space map fixes
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
     space map_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5164
  https://www.illumos.org/issues/5165
  https://github.com/illumos/illumos-gate/commit/b1be289

Porting Notes:

The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.

The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz.  The upstream commit incorrectly
used space_map_blksize.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697
2014-10-23 15:30:32 -07:00
Alex Reece
b02fe35d37 Illumos 4958 zdb trips assert on pools with ashift >= 0xe
4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4958
  https://github.com/illumos/illumos-gate/commit/2a104a5

Porting notes:

Keep the ZIO_FLAG_FASTWRITE define.  This is for a feature present
in Linux but not yet in *BSD.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697
2014-10-23 15:30:32 -07:00
Brian Behlendorf
5f6d0b6f5a Handle block pointers with a corrupt logical size
The general strategy used by ZFS to verify that blocks are valid is
to checksum everything.  This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block.  If a blocks checksum is valid then its contents are
trusted by the higher layers.

This system works exceptionally well as long as bad data is never
written with a valid checksum.  If this does somehow occur due to
a software bug or a memory bit-flip on a non-ECC system it may
result in kernel panic.

One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size causing a system panic.

To prevent this from happening the arc_read() function has been
updated to detect this specific case.  If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2678
2014-10-23 09:20:52 -07:00
Ned Bass
bc151f7b31 Remove checks for mandatory locks
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp().  Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose.  Rather than emulating those interfaces we remove the
redundant checks.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2804
2014-10-22 11:06:53 -07:00
Matthew Ahrens
88904bb3e3 Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy
5162 zfs recv should use loaned arc buffer to avoid copy
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5162
  https://github.com/illumos/illumos-gate/commit/8a90470

Porting notes:
  Fix spelling error 's/arena/area/' in dmu.c.
  In restore_write() declare bonus and abuf at the top of the function.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2696
2014-10-21 16:32:11 -07:00
Matthew Ahrens
4b20a6f509 Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness
When a clone is created of a snapshot that has been marked for
deferred destroy (with "zfs destroy -d"), the clone "inherits" the
defer_destroy flag from the origin, and any snapshots of the clone
"inherit" the defer_destroy flag from the clone. This causes a strange
situation where the clone's snapshots are marked for defer_destroy but
they have no holds or clones. If the clone's snapshot gets a hold or
clone, which is then deleted, we will honor the incorrectly-set
defer_destroy flag and delete the snapshot!

Steps to reproduce:

  * zpool create test c1t1d0
  * zfs create test/fs
  * zfs snapshot test/fs@a
  * zfs clone test/fs@a test/clone
  * zfs destroy -d test/fs@a
  * zfs clone test/fs@a test/clone2
  * zfs snapshot test/clone2@a
  * zfs hold hld test/clone2@a
  * zfs release hld test/clone2@a
  * zfs list -r -t all test

  <test/clone2@a has been destroyed>

We noticed that this causes dcenter to get very confused, because it
treats snapshots that are marked defer_destroy as not existing. So it
won't see any snapshots of the clone that's marked defer_destroy.

5150 - zfs clone of a defer_destroy snapshot causes strangeness
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/projects/illumos-gate//issues/5150
  https://github.com/illumos/illumos-gate/commit/42fcb65

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2690
2014-10-21 15:26:58 -07:00
Matthew Ahrens
6c59307a3c Illumos 3693 - restore_object uses at least two transactions to restore an object
Restore_object should not use two transactions to restore an object:
  * one transaction is used for dmu_object_claim
  * another transaction is used to set compression, checksum and most
    importantly bonus data
  * furthermore dmu_object_reclaim internally uses multiple transactions
  * dmu_free_long_range frees chunks in separate transactions
  * dnode_reallocate is executed in a distinct transaction

The fact the dnode_allocate/dnode_reallocate are executed in one
transaction and bonus (re-)population is executed in a different
transaction may lead to violation of ZFS consistency assertions if the
transactions are assigned to different transaction groups.  Also, if
the first transaction group is successfully written to a permanent
storage, but the second transaction is lost, then an invalid dnode may
be created on the stable storage.

3693 restore_object uses at least two transactions to restore an object
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Original authors: Matthew Ahrens and Andriy Gapon

References:
  https://www.illumos.org/issues/3693
  https://github.com/illumos/illumos-gate/commit/e77d42e

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2689
2014-10-21 15:26:50 -07:00
Tim Chase
356d9ed4c8 Don't perform ACL-to-mode translation on empty ACL
In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to
create a traditional mode value based on an ACL.  If no ACL exists, this
processing shouldn't be done.  Problems caused by this were most evident
on version 4 filesystems which not only don't have system attributes,
and also frequently have empty ACLs. On such filesystems, performing a
chown() operation could have the effect of dirtying the mode bits in
memory but not on the file system as follows:

	# create a file with typical mode of 664
	echo test > test
	chown anyuser test
	ls -l test

and the mode will show up as all zeroes.  Unmounting/mounting and/or
exporting/importing the filesystem will reveal the proper mode again.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1264
2014-10-21 09:23:27 -07:00
Daniil Lunev
62bdd5eb7a Illumos 4924 - LZ4 Compression for metadata
Reviewed by Matthew Ahrens <mahrens@delphix.com>
Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://github.com/illumos/illumos-gate/commit/b8289d2
  https://www.illumos.org/issues/3756

Porting notes:

The static function zfs_prop_activate_feature() was removed because
this change removes the only caller.  The function was not removed
from Illumos but instead left as dead code.  However, to keep gcc
happy it was removed from Linux and may be easily restored if needed.

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1540
2014-10-20 16:17:49 -07:00
Brian Behlendorf
ba232d8aea Suppress AIO kmem warnings
The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc()
to allocate enough memory to hold the vectorized IO.  While this
allocation will be small it's been observed in practice to sometimes
slightly exceed the 8K warning threshold by a few kilobytes.
Therefore, the KM_NODEBUG flag has been added to suppress warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2774
2014-10-20 16:10:25 -07:00
Brian Behlendorf
33074f2254 Handle NULL mirror child vdev
When selecting a mirror child it's possible that map allocated by
vdev_mirror_map_allc() contains a NULL for the child vdev.  In
this case the child should be skipped and the read issues to
another member of the mirror.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1744
2014-10-17 14:59:05 -07:00
Brian Behlendorf
f0e324f25d Update utsname support
Modify the code to use the utsname() kernel function rather than
a global variable.  This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2).  This means that it
will behave consistently in both contexts.

This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions.  And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.

Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
2014-10-17 14:58:57 -07:00
Brian Behlendorf
050d22b068 Remove shrink_dcache_memory() and shrink_icache_memory()
This functionality is optional and until Linux 3.0, which
provided per-filesystem shinkers, they was never a reasonable
interface.  Therefore, this functionality is being dropped
for earlier kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
2014-10-17 14:58:50 -07:00
Brian Behlendorf
e6659763c6 Improve VERIFY() error in dmu_write()
This is a debug patch designed to ensure an error code is logged
to the console when this VERIFY() is hit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #1440
2014-10-08 09:18:14 -07:00
Brian Behlendorf
8878261fc9 Fix CPU_SEQID use in preemptible context
Commit e022864 introduced a regression for kernels which are built
with CONFIG_DEBUG_PREEMPT.  The use of CPU_SEQID in a preemptible
context causes zio_nowait() to trigger the BUG.  Since CPU_SEQID
is simply being used as a random index the usage here is safe. To
resolve the issue preempt is disable while calling CPU_SEQID.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2769
2014-10-07 16:40:29 -07:00
Matthew Ahrens
e022864d19 Illumos 5176 - lock contention on godfather zio
5176 lock contention on godfather zio
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5176
  https://github.com/illumos/illumos-gate/commit/6f834bc

Porting notes:

Under Linux max_ncpus is defined as num_possible_cpus().  This is
largest number of cpu ids which might be available during the life
time of the system boot.  This value can be larger than the number
of present cpus if CONFIG_HOTPLUG_CPU is defined.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2711
2014-10-07 11:24:24 -07:00
Richard Yao
83e9986f6e Implement -t option to zpool create for temporary pool names
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.

26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.

This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
2014-09-30 10:46:59 -07:00
Tim Chase
cb08f06307 Perform whole-page page truncation for hole-punching under a range lock
As an attempt to perform the page truncation more optimally, the
hole-punching support added in 223df0161f
truncated performed the operation in two steps: first, sub-page "stubs"
were zeroed under the range lock in zfs_free_range() using the new
zfs_zero_partial_page() function and then the whole pages were truncated
within zfs_freesp().  This left a window of opportunity during which
the full pages could be touched.

This patch closes the window by moving the whole-page truncation into
zfs_free_range() under the range lock.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2733
2014-09-29 09:22:03 -07:00
Max Grossman
36283ca233 Illumos 5138 - add tunable for maximum number of blocks freed in one txg
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5138
  https://github.com/illumos/illumos-gate/commit/af3465d

Porting notes:

Because support for exposing a uint64_t parameter wasn't added
until v3.17-rc1 the zfs_free_max_blocks variable has been declared
as a unsigned long.  This is already far larger than required and
it allows us to avoid additional autoconf compatibility code.

The default value has been set to 100,000 on Linux instead of
ULONG_MAX which is used on Illumos.  This was done to limit the
number of outstanding IOs in the system when snapshots are destroyed.
This helps ensure individual TXG sync times are kept reasonable and
memory isn't wasted managing a huge backlog of outstanding IOs.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2675
Closes #2581
2014-09-23 14:26:34 -07:00
Alex Reece
acbad6ff67 Illumos 4753 - increase number of outstanding async writes when sync task is waiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
    https://www.illumos.org/issues/4753
    https://github.com/illumos/illumos-gate/commit/73527f4

Comments by Matt Ahrens from the issue tracker:
    When a sync task is waiting for a txg to complete, we should hurry
    it along by increasing the number of outstanding async writes
    (i.e. make vdev_queue_max_async_writes() return a larger number).
    Initially we might just have a tunable for "minimum async writes
    while a synctask is waiting" and set it to 3.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2716
2014-09-23 13:50:55 -07:00
Matthew Ahrens
d97aa48f7c Illumos 5139 - SEEK_HOLE failed to report a hole at end of file
5139 SEEK_HOLE failed to report a hole at end of file
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5139
  https://github.com/illumos/illumos-gate/commit/0fbc0cd

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2714
2014-09-23 10:38:45 -07:00
Richard Yao
485c581c41 Fix function call with uninitialized value in vdev_inuse
LLVM's static analyzer reported that we could pass an uninitialized
pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct.
An attempt to repurpose a spare or L2ARC drive from an exported pool
will cause the pool_guid passed to spa_by_guid() to be unintialized
information from the stack. This will cause non-deterministic behavior.
Since there is no reason why we cannot repurpose such disks, we modify
vdev_inuse() to avoid calling spa_by_guid() when they are detected.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
2014-09-23 10:32:45 -07:00
Matthew Ahrens
b8bcca18f7 Illumos 5161 - add tunable for number of metaslabs per vdev
5161 add tunable for number of metaslabs per vdev
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5161
  https://github.com/illumos/illumos-gate/commit/bf3e216

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2698
2014-09-23 10:00:02 -07:00
Matthew Ahrens
ebcf49365a Illumos 5177 - remove dead code from dsl_scan.c
5177 remove dead code from dsl_scan.c
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5177
  https://github.com/illumos/illumos-gate/commit/5f37736

Porting notes:

The local variable 'buf' was removed from dsl_scan_visitbp().
This wasn't part of the original patch but it should have been.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2712
2014-09-22 15:52:58 -07:00
Adam Leventhal
64dbba3679 Illumos 5174 - add sdt probe for blocked read in dbuf_read()
5174 add sdt probe for blocked read in dbuf_read()
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5174
  https://github.com/illumos/illumos-gate/commit/f6164ad

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2710
2014-09-22 14:20:25 -07:00
Matthew Ahrens
6d9036f350 Illumos 5140 - message about "%recv could not be opened" is printed when booting after crash
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/projects/illumos-gate//issues/5140
  https://github.com/illumos/illumos-gate/commit/2243853

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2676
2014-09-18 15:04:59 -07:00
Brian Behlendorf
2d50158343 Fix z_teardown_inactive_lock deadlock
When rolling back a mounted filesystem zfs_suspend() is called
which acquires the z_teardown_inactive_lock.  This lock can not
be dropped until the filesystem has been rolled back and resumed
in zfs_resume_fs().

Therefore, we must not call iput() under this lock because it
may result in the inode->evict() handler being called which also
takes this lock.  Instead use zfs_iput_async() to ensure dropping
the last reference is deferred and runs in a safe context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2670
2014-09-11 09:11:45 -07:00
Tim Chase
223df0161f Implement fallocate FALLOC_FL_PUNCH_HOLE
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2619
2014-09-08 13:52:25 -07:00
George Wilson
4f68d7878f Illumos 5117 - spacemap reallocation can cause corruption
5117 space map reallocation can cause corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/projects/illumos-gate/issues/5117
  https://github.com/illumos/illumos-gate/commit/e503a68

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2662
2014-09-08 09:42:39 -07:00
Brian Behlendorf
ceb49b0acd Add object type checking to zap_lockdir()
If a non-ZAP object is passed to zap_lockdir() it will be treated
as a valid ZAP object.  This can result in zap_lockdir() attempting
to read what it believes are leaf blocks from invalid disk locations.
The SCSI layer will eventually generate errors for these bogus IOs
but the caller will hang in zap_get_leaf_byblk().

The good news is that is a situation which can not occur unless the
pool has been damaged.  The bad news is that there are reports from
both FreeBSD and Solaris of damaged pools.  Specifically, there are
normal files in the filesystem which reference another normal file
as their parent.

Since pools like this are known to exist the zap_lockdir() function
has been updated to verify the type of the object.  If a non-ZAP
object has been passed it EINVAL will be returned immediately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2597
Issue #2602
2014-09-08 09:15:38 -07:00
Richard Yao
cd3939c5f0 Linux AIO Support
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is non sequitur.

Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility.  We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #223
Closes #2373
2014-09-05 15:11:43 -07:00
Alex Reece
f38dfec3fd Illumos 5049 - panic when removing log device
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Approved by: Rich Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5049
  https://github.com/illumos/illumos-gate/commit/2986efa

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2636
2014-09-05 09:06:27 -07:00
Stanislav Seletskiy
2078f21015 Fix invalid locking order in rename operation
This commit should prevent a deadlock on dp_config_rwlock when
running `zfs rename` by ensuring zvol_rename_minors() is not
called under this lock.

Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2652.
Closes #2525.
2014-09-04 09:50:46 -07:00
Alexey Smirnoff
0dfc732416 Change the default 'zfs_dedup_prefetch' value to '0'
This gives a huge performance improvement in operations with deduped
datasets especially when the bottleneck is the amount of ram
available for zfs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2639
2014-09-04 09:50:45 -07:00
Dan Swartzendruber
287be44f53 Improve handling of filesystem versions
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.

Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
2014-09-03 09:17:14 -07:00
Matthew Ahrens
dea377c0d9 Illumos 4970-4974 - extreme rewind enhancements
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4970
  https://www.illumos.org/issues/4971
  https://www.illumos.org/issues/4972
  https://www.illumos.org/issues/4973
  https://www.illumos.org/issues/4974
  https://github.com/illumos/illumos-gate/commit/e42d205

Notes:
    This set of patches adds a set of tunable parameters for the
    "extreme rewind" mode of pool import which allows control over
    the traversal performed during such an import.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2598
2014-08-26 16:29:57 -07:00
Matthew Ahrens
49ddb31506 Illumos 5034 - ARC's buf_hash_table is too small
5034 ARC's buf_hash_table is too small

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5034
  https://github.com/illumos/illumos-gate/commit/63e911b

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2615
2014-08-26 16:14:49 -07:00
Isaac Huang
0426c16804 Fixed memory leaks in zevent handling
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2158
2014-08-20 10:45:16 -07:00
Matthew Ahrens
bd089c5477 Illumos 4631 - zvol_get_stats triggering too many reads
4631 zvol_get_stats triggering too many reads

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4631
  https://github.com/illumos/illumos-gate/commit/bbfa8ea

Ported-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2612
Closes #2480
2014-08-20 09:17:00 -07:00
Tim Chase
8b0a0840b4 Don't upgrade a metaslab when the pool is not writable
Illumos 4982 added code to metaslab_fragmentation() to proactively update
space maps when the spacemap_histogram feature is enabled.  This should
only happen when the pool is writeable.

References:
  https://www.illumos.org/issues/4982
  https://github.com/illumos/illumos-gate/commit/2e4c998

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:47:19 -07:00
George Wilson
f3a7f6610f Illumos 4976-4984 - metaslab improvements
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4976
  https://www.illumos.org/issues/4978
  https://www.illumos.org/issues/4979
  https://www.illumos.org/issues/4980
  https://www.illumos.org/issues/4981
  https://www.illumos.org/issues/4982
  https://www.illumos.org/issues/4983
  https://www.illumos.org/issues/4984
  https://github.com/illumos/illumos-gate/commit/2e4c998

Notes:
    The "zdb -M" option has been re-tasked to display the new metaslab
    fragmentation metric and the new "zdb -I" option is used to control
    the maximum number of in-flight I/Os.

    The new fragmentation metric is derived from the space map histogram
    which has been rolled up to the vdev and pool level and is presented
    to the user via "zpool list".

    Add a number of module parameters related to the new metaslab weighting
    logic.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:40:49 -07:00
Brian Behlendorf
0d5c500d6c Revert "Revert "Revert "Fix unlink/xattr deadlock"""
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations.  This
was possible due to the following related changes.

e89260a Directory xattr znodes hold a reference on their parent
6f9548c Fix deadlock in zfs_zget()
0a50679 Add zfs_iput_async() interface
4dd1893 Avoid 128K kmem allocations in mzap_upgrade()

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
Closes #2058
Closes #2128
Closes #2240
2014-08-11 16:12:36 -07:00
Brian Behlendorf
0a50679ce9 Add zfs_iput_async() interface
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks.  When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
2014-08-11 16:11:43 -07:00
Brian Behlendorf
4dd18932ba Avoid 128K kmem allocations in mzap_upgrade()
As originally implemented the mzap_upgrade() function will
perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc().
These large allocations can potentially block indefinitely
if contiguous memory is not available.  Since this allocation
is done under the zap->zap_rwlock it can appear as if there is
a deadlock in zap_lockdir().  This is shown below.

The optimal fix for this would be to rework mzap_upgrade()
such that no large allocations are required.  This could be
done but it would result in us diverging further from the other
implementations.  Therefore I've opted against doing this
unless it becomes absolutely necessary.

Instead mzap_upgrade() has been updated to use zio_buf_alloc()
which can reliably provide buffers of up to SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Close #2580
2014-08-11 16:10:32 -07:00
Brian Behlendorf
50b25b2187 Avoid dynamic allocation of 'search zio'
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage.  Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.

To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure.  All
access must be serialized through the vq_lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2572
2014-08-11 08:44:54 -07:00
Brian Behlendorf
ab6f407faa Use KM_PUSHPAGE in dsl_dataset_rollback_check()
The dsl_dataset_rollback_check() function is executed in the
txg_sync context.  To prevent a potential deadlock due to direct
memory reclaim it must use KM_PUSHPAGE.  This was introduced by
the recent 'zfs bookmark' features, commit da53684.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Dillmann <eric@jave.fr>
Closes #2569
2014-08-06 16:09:28 -07:00
Matthew Ahrens
5dbd68a352 Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_t
4914 zfs on-disk bookmark structure should be named *_phys_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4914
  https://github.com/illumos/illumos-gate/commit/7802d7b

Porting notes:

There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated.  This should reduce the likelihood of further
problems like issue #2094 from occurring.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2558
2014-08-06 14:48:41 -07:00
Matthew Ahrens
1fa8f795d5 Illumos 4881 - zfs send performance regression with embedded data
4881 zfs send performance degradation when embedded block pointers
     are encountered

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4881
  https://github.com/illumos/illumos-gate/commit/06315b7

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2547
2014-08-06 14:44:10 -07:00
Saso Kiselkov
3bec585e6c Illumos 4897 - Space accounting mismatch in L2ARC/zpool
4897 Space accounting mismatch in L2ARC/zpool

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

From the illumos issue tracker:

	L2ARC vdev space usage statistics are calculated as the delta
	between the maximum and minimum vdev offset ever written to
	by the L2ARC fill thread, but do not inform the user of how
	much space in between these two offsets is actually taken up by
	cached buffers. This fix changes that so that vdev space usage
	stats on L2ARC devices accurately track the volume of buffers
	stored on them, allowing users to see the exact L2ARC usage in
	"zpool iostat -v".

References:
  https://www.illumos.org/issues/4897
  https://github.com/illumos/illumos-gate/commit/3038a2b

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2555
2014-08-06 13:44:10 -07:00
Matthew Ahrens
fbeddd60b7 Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4390
  https://github.com/illumos/illumos-gate/commit/7fd05ac

Porting notes:

Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code.  This patch should reduce its
stack footprint a bit more.

The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.

The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global.  Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2545
2014-08-04 11:50:52 -07:00
Matthew Ahrens
9b67f60560 Illumos 4757, 4913
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4757
  https://www.illumos.org/issues/4913
  https://github.com/illumos/illumos-gate/commit/5d7b4d4

Porting notes:

For compatibility with the fastpath code the zio_done() function
needed to be updated.  Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2544
2014-08-01 14:28:05 -07:00
Matthew Ahrens
faf0f58c69 Illumos 3835 zfs need not store 2 copies of all metadata
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Description from Matt Ahrens's bug report at Delphix:

    Add a new zfs property, "redundant_metadata" which can have values
    "all" or "most".  The default will be "all", which is the current
    behavior.  Setting to "most" will cause us to only store 1 copy of
    level-1 indirect blocks of user data files.

Additional notes:

    The new man page section for this property states

        "The exact behavior of which metadata blocks
         are stored redundantly may change in future releases."

    and:

        "When set to most, ZFS stores an extra copy of most types of
         metadata. This can improve performance of random writes,
         because less metadata must be written."

    The current implementation is as described above in Matt's blog.
    It is controlled by a new global integer
    "zfs_redundant_metadata_most_ditto_level", currently initialized
    to 2. When "redundant_metadata" is set to "most", only indirect
    blocks of the specified level and higher will have additional ditto
    blocks created.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2542
2014-07-31 09:49:34 -07:00
George Wilson
672692c7b7 Illumos 4754, 4755
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4754
  https://www.illumos.org/issues/4755
  https://github.com/illumos/illumos-gate/commit/b6240e8

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2533
2014-07-30 10:30:05 -07:00
Matthew Ahrens
9bd274ddd8 Illumos #4374
4374 dn_free_ranges should use range_tree_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4374
  https://github.com/illumos/illumos-gate/commit/bf16b11

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2531
2014-07-30 09:20:35 -07:00
Matthew Ahrens
da536844d5 Illumos 4368, 4369.
4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4369
  https://www.illumos.org/issues/4368
  https://github.com/illumos/illumos-gate/commit/78f1710

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2530
2014-07-29 10:55:29 -07:00
Max Grossman
b0bc7a84d9 Illumos 4370, 4371
4370 avoid transmitting holes during zfs send
4371 DMU code clean up

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4370
  https://www.illumos.org/issues/4371
  https://github.com/illumos/illumos-gate/commit/43466aa

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2529
2014-07-28 14:29:58 -07:00
Matthew Ahrens
fa86b5dbb6 Illumos 4171, 4172
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features

Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4171
  https://www.illumos.org/issues/4172
  https://github.com/illumos/illumos-gate/commit/2acef22

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2528
2014-07-25 16:40:07 -07:00
Brian Behlendorf
7a8f0e80ea zfs_trunc() should use dmu_tx_assign(tx, TXG_WAIT)
As part of the write throttle & i/o schedule performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2488
2014-07-22 09:41:38 -07:00
George Wilson
080b310015 Illumos #4756 Fix metaslab_group_preload deadlock
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.

References:
  https://www.illumos.org/issues/4756
  https://github.com/illumos/illumos-gate/commit/30beaff

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:41:32 -07:00
George Wilson
3c51c5cb1f Illumos #4730 destroy metaslab group taskq
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()

Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4730
  https://github.com/illumos/illumos-gate/commit/be08211

Porting notes:

Under ZFSonlinux, one of the effects of not destroying the taskq is that
zdb would never exit (due to the SPL taskq implementation).

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:41:06 -07:00
George Wilson
93cf20764a Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.

This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.

The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram

In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:

    * 4K sector devices will not see any compression benefit
    * large space_maps require more metadata on-disk
    * large space_maps require more time to load (typically random reads)

Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.

A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.

References:
  https://www.illumos.org/issues/4101
  https://www.illumos.org/issues/4102
  https://www.illumos.org/issues/4103
  https://www.illumos.org/issues/4105
  https://www.illumos.org/issues/4106
  https://github.com/illumos/illumos-gate/commit/0713e23

Porting notes:

A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:39:16 -07:00
Prakash Surya
1be627f5c2 Move metaslab_group_alloc_update() call
This changes moves the called to metaslab_group_alloc_update() to the
metaslab_sync_reassess() function. The original placement of the call
within metaslab_sync_done() appears to have been a simple mistake,
introduced by ac72fac3ea.

This aligns us more closely to the upstream illumos code base.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-07-22 09:38:16 -07:00
Brian Behlendorf
1e8db77102 Fix zil_commit() NULL dereference
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot.  If they do somehow get
dirtied an attempt will make made to write them to disk.  In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2405
2014-07-17 15:15:07 -07:00
George Wilson
2fbc542ebd Illumos 4168, 4169, 4170: ztest, zdb and zhack fixes
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
    https://www.illumos.org/issues/4168
    https://www.illumos.org/issues/4169
    https://www.illumos.org/issues/4170
    https://github.com/illumos/illumos-gate/commit/7fdd916

Porting notes:

Of particular interest when troubleshooting corrupted pools, the
commonly-used "zdb -e" operation may perform verbatim imports and
furthermore, it will soon have direct support for verbatim imports via
a new "-V" option.  The 4169 fix eliminates a common segfault case in
which spa_history_log_version() tries to access an un-opened dsl_pool_t.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2451
Closes #2283
Closes #2467
2014-07-17 11:37:57 -07:00
Tim Chase
f4a4046bd6 Convert zfs_mg_noalloc_threshold to a module parameter and document
The parameter was added as illumos issue 4081 which was committed to
zfsonlinux in ac72fac3ea.  This patch
documents the parameter and allows for it to be set as a module parameter.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2483
2014-07-16 16:49:25 -07:00
Andrew Barnes
61e99a73bc Preserve asize when last mirror child promoted to top-level vdev
If the smaller of 2 different sized child vdev's of a mirrored vdev is
detached, and the pool has the autoexpand property set to off, as the
remaining larger vdev is promoted to a top level vdev it fails to retain
the asize of the original top level mirror vdev and therefore partially
autoexpands.

This partially autoexpanded state leaves the new vdev too large to
re-mirror by adding the smaller vdev back in, and the pool fails to
utilize the space until next imported.

If the autoexpand property is set to on, the child vdev grows
in size after it has been promoted to a top level vdev as expected.

This commit causes the remaining child mirror to retain the asize of its
old parent mirror vdev if the autoexpand property is set to off,
this allows the smaller vdev to be re-added if required the vdev
can then be told to expand if required by the usual using zpool online -e.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #1208
2014-07-02 14:04:29 -07:00
Dan McDonald
ee4712284c Illumos #4936 fix potential overflow in lz4
4936 lz4 could theoretically overflow a pointer with a certain input
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported by: Tim Chase <tim@chase2k.com>

References:
  https://illumos.org/issues/4936
  https://github.com/illumos/illumos-gate/commit/58d0718

Porting notes:

This fixes the widely-reported "20-year-old vulnerability" in
LZO/LZ4 implementations which inherited said bug from the reference
implementation.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2429
2014-07-01 14:10:47 -07:00
Tim Chase
4240dc332d Comment the lack of real_LZ4_uncompress()
Added several comments regarding the removal of real_LZ4_uncompress()
which exists in the reference implementation but has been removed here
since it's not used.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-07-01 14:09:56 -07:00
Brian Behlendorf
7f6884f419 Revert "Fix __zio_execute() asynchronous dispatch"
This reverts commit 91579709fc which
limited the asynchronous dispatch to kernel space.  We want to do
this for two reasons:

1) While we have slightly more headroom in user space excessively
   deep stacks have been observed while running ztest, see #2293.

2) Removing this conditional makes the pipeline behave consistently
   regardless of if it's executing in kernel space or user space.
   This way we're more likely to uncover subtle issues with ztest.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2384
2014-06-11 16:32:57 -07:00
Brian Behlendorf
6795a698f4 Use default slab types
We should not override the default memory type of the kmem cache.  This
was done previously to force certain objects which were slightly over
object size limit cut off in to KMC_KMEM caches for better performance.

The zfsonlinux/spl#356 patch slightly increases the default cut off
from 511 bytes 1024 bytes for x86_64.  This means there is long longer
a need to override the default for the caches.  And since the default
values are now being used the new spl_kmem_cache_slab_limit and
spl_kmem_cache_kmem_limit tunables will apply to all kmem caches.

The following is a list of caches that will be impacted:

                  | object size   | forced type   | default type
----------------- | ------------- | ------------- | --------------
dnode_t           | 936 bytes     | KMC_KMEM      | KMC_KMEM
zio_cache         | 1104 bytes    | *KMC_KMEM     | *KMC_VMEM
zio_link_cache    | 48 bytes      | KMC_KMEM      | KMC_KMEM
zio_vdev_cache    | 131088 bytes  | KMC_VMEM      | KMC_VMEM
zio_buf_512       | 512 bytes     | KMC_KMEM      | KMC_KMEM
zio_data_buf_512  | 512 bytes     | KMC_KMEM      | KMC_KMEM
zio_buf_1024      | 1024 bytes    | KMC_KMEM      | KMC_KMEM
zio_data_buf_1024 | 1024 bytes    | +KMC_VMEM     | +KMC_KMEM

* Cache memory type will change from KMC_KMEM to KMC_VMEM.
+ Cache memory type will change from KMC_VMEM to KMC_KMEM.

This patch removes another slight point of divergence between ZoL
and Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2337
2014-05-22 10:39:52 -07:00
HC
f9a1ac4d59 Honor zfs_nocacheflush for file vdevs
For consistency with disk vdevs honor the zfs_nocacheflush tunable.
This setting is available primarily for debugging and performance
analysis.

Signed-off-by: HC <mmttdebbcc@yahoo.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2336
2014-05-19 13:30:48 -07:00
Tim Chase
83021b47c2 Calculate header size correctly in sa_find_sizes()
In the case where a variable-sized SA overlaps the spill block pointer and
a new variable-sized SA is being added, the header size was improperly
calculated to include the to-be-moved SA.  This problem could be
reproduced when xattr=sa enabled as follows:

	ln -s $(perl -e 'print "x" x 120') blah
	setfattr -n security.selinux -v blahblah -h blah

The symlink is large enough to interfere with the spill block pointer and
has a typical SA registration as follows (shown in modified "zdb -dddd"
<SA attr layout obj> format):

	[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ]

Adding the SA xattr will attempt to extend the registration to:

	[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ]

but since the ZPL_SYMLINK SA interferes with the spill block pointer, it
must also be moved to the spill block which will have a registration of:

	[ ZPL_SYMLINK ZPL_DXATTR ]

This commit updates extra_hdrsize when this condition occurs, allowing
hdrsize to be subsequently decreased appropriately.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2214
Issue #2228
Issue #2316
Issue #2343
2014-05-19 11:55:50 -07:00
Tim Chase
3937ab20f3 Allow for lock-free reading zfsdev_state_list.
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed.  It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.

This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.

The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.

Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2301
2014-05-19 11:45:11 -07:00
Chunwei Chen
bc25c9325b Use a dedicated taskq for vdev_file
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.

We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2270
2014-05-14 16:20:21 -07:00
Brian Behlendorf
2c33b91275 Handle vdev_lookup_top() failure in dva_get_dsize_sync()
The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail.  However, the NULL dereference at
clearly shows that under certain circumstances it is possible.  Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.

  BUG: unable to handle kernel NULL pointer dereference at 00000570

  crash> struct -o vdev
  struct vdev {
       [0] uint64_t vdev_id;
       ... ...
    [1376] uint64_t vdev_deflate_ratio;

Given that this can happen this patch add the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com>
Closes #1707
Closes #1987
Closes #1891

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-05-06 10:41:48 -07:00
Tim Chase
962d524212 Check the dataset type more rigorously when fetching properties.
When fetching property values of snapshots, a check against the head
dataset type must be performed.  Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".

This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes.  It also did not allow for the display of
"volsize" of a snapshot of a volume.

This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly.  This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2265
2014-05-06 10:41:46 -07:00
Brian Behlendorf
1ce0457348 Fix style
A minor style issue was accidentally introduced by aa7d06a.
This change resolves that style problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-05-06 10:41:17 -07:00
George Wilson
aa7d06a98a Illumos #4101 finer-grained control of metaslab_debug
Today the metaslab_debug logic performs two tasks:

- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync

This change provides knobs for each of these independently.

References:
  https://illumos.org/issues/4101
  https://github.com/illumos/illumos-gate/commit/0713e23

Notes:

1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2227
2014-05-06 09:46:04 -07:00
Brian Behlendorf
cc79a5c263 Treat spill block dbufs as meta data
When the system attributes (SAs) for an object exceed what can
can be stored in the bonus area of a dnode a spill block is
allocated.  These spill blocks are currently considered data
blocks.  However, they should be accounted for as meta data
because they are effectively an extension of the dnode.

While this may seem like a minor accounting issue it has broader
implications.  The key thing to be aware of is that each spill
block will hold a reference on its parent dnode.  The dnode in
turn holds a reference on its dbuf in the dnode object.  This
means that a single 512 byte data buffer for a spill block can
pin over 16k of meta data.  This is analogous to the small file
situation described in 2b13331 where a relatively small number
of data buffer can cause the ARC to exceed the meta limit.

However, unlike the small file case a spill block can legitimately
be considered meta data.  By changing the spill block to meta data
they will now be dropped from the cache when the meta limit is
reached.  This then allows the dnodes and dbufs which the spill
block was pinning to be released.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2294
2014-05-05 13:56:59 -07:00
Brian Behlendorf
12f9a6a3f9 dmu_tx_assign() should not return ENOMEM
As described in the comment above dmu_tx_assign() this function must
only fail if the pool is out of space.  If for some other reason the
TX cannot be assigned (such as memory pressure) ERESTART must be
returned.  Alternately, EAGAIN could be returned to inject a delay
but that isn't required because the caller will block on the condition
variable waiting for the next TXG.

/*
 * Assign tx to a transaction group.  txg_how can be one of:
 *
 * (1)  TXG_WAIT.  If the current open txg is full, waits until there's
 *      a new one.  This should be used when you're not holding locks.
 *      It will only fail if we're truly out of space (or over quota).
 * ...
 */

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2287
2014-05-01 12:08:53 -07:00
Richard Yao
9d317793aa Implement File Attribute Support
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().

Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method.  That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.

References:
  https://bugs.gentoo.org/show_bug.cgi?id=483516

Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1691
2014-05-01 10:11:18 -07:00
Chunwei Chen
17584980b9 Add assertion to catch 0-count page
Some network related block device uses tcp_sendpage, which doesn't
behave well when using 0-count page. Add assertion to catch them.

This has a runtime dependency on:
zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2277
2014-04-25 15:41:19 -07:00
Ned Bass
de39ec11b8 Fix LZ4 endianness autodetection
Endianness detection in LZ4 is broken in user-space builds.  This
bug corrupts compressed data and manifests itself in several ztest
failures.  When LZ4 was originally ported to Illumos ZFS, the proper
checks for Linux were stripped out. The Linux port then inherited
the remaining detection code that works on Illumos but not on Linux.

The current LZ4 endianness check misuses the condition
defined(__BIG_ENDIAN) to indicate a big-endian system.  On Linux
__BIG_ENDIAN is defined uncondtionally in the user-space header
/usr/include/endian.h, regardless of the endianness of the system.
The kernel does not use this header, so only user-space builds are
affected.

While we could fix this by restoring the upstream LZ4 endianness
detection code, reliable checks already exist in
libspl/include/sys/isa_defs.h. This change uses the libspl results
to replace the word-size and endianness checks in LZ4, simplifying
the code and reducing duplication.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #1963
Fixes #1964
Fixes #1965
2014-04-20 16:55:42 -07:00
Brian Behlendorf
4fd762f8ad Fix zfsdev_ioctl() kmem leak warning
Due to an asymmetry in the kmem accounting a memory leak was being
reported when it was only an accounting issue.  All memory allocated
with kmem_alloc() must be released with kmem_free() or it will not
be properly accounted for.

In this case the code used strfree() to release the memory allocated
by kmem_alloc().  Presumably this was done because the size of the
memory region wasn't available when the memory needed to be freed.

To resolve this issue the code has been updated to use strdup() instead
of kmem_alloc() to allocate the memory.  Like strfree(), strdup() is
not integrated with the memory accounting.  This means we can use
strfree() to release it like Illumos.

  SPL: kmem leaked 10/4368729 bytes
  address          size  data             func:line
  ffff880067e9aa40 10    ZZZZZZZZZZ       zfsdev_ioctl:5655

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #2262
2014-04-18 13:30:15 -07:00
DHE
2dbedf5484 Uninitialized variable spa_autoreplace used
Caught by ztest and valgrind.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2259
2014-04-16 10:59:24 -07:00
Chunwei Chen
0b75bdb369 Use ddi_time_after and friends to compare time
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2142
2014-04-14 13:27:56 -07:00
Chunwei Chen
b761912b34 Linux 3.14 compat: rq_for_each_segment in dmu_req_copy
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:51 -07:00
Chunwei Chen
22760eebef Revert "Fix zvol+btrfs hang"
After the dmu_req_copy change, bi_io_vecs are not touched, so this is no
longer needed.

This reverts commit e26ade5101.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:47 -07:00
Chunwei Chen
215b4634c7 Refactor dmu_req_copy for immutable biovec changes
Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it
can continue in subsequent passes. However, after the immutable biovec changes
in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how
many bytes are already copied and it will skip to the right spot accordingly.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:43 -07:00
Chunwei Chen
d4541210f3 Linux 3.14 compat: Immutable biovec changes in vdev_disk.c
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:38 -07:00
Chunwei Chen
408ec0d2e1 Linux 3.14 compat: posix_acl_{create,chmod}
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:27:03 -07:00
Richard Yao
f3ad9cd67a Fix locking order in zfs_zget()
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-04 09:12:47 -07:00
Richard Yao
6f9548c487 Fix deadlock in zfs_zget()
zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternately, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180
2014-04-04 09:11:54 -07:00
Brian Behlendorf
8ac67298b1 Revert "Fixed a use-after-free bug in zfs_zget()."
This reverts commit 36df284366.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-03 16:23:28 -07:00
Brian Behlendorf
904ea2763e Add automatic hot spare functionality
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.

To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev.  This covers both io and checksum zevents but
may be used but other scripts.

In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:

  vdev_spare_paths  - List of vdev paths for all hot spares.
  vdev_spare_guids  - List of vdev guids for all hot spares.
  vdev_read_errors  - Read errors for the problematic vdev
  vdev_write_errors - Write errors for the problematic vdev
  vdev_cksum_errors - Checksum errors for the problematic vdev.

By default the required hot spare scripts are installed but this
functionality is disabled.  To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.

These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system.  It also requires that the
autoexpand policy be ported from Illumos, see:

  https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c

Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
2014-04-02 13:10:08 -07:00
Brian Behlendorf
9b101a7320 Clarify zpool_events_next() comment
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location.  When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor.  Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor.  When the file descriptor is
closed the kernel state tracking the cursor is destroyed.

This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-03-31 16:11:08 -07:00
Brian Behlendorf
75e3ff58fe Add zpool_events_seek() functionality
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID.  When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-03-31 16:10:57 -07:00
Brian Behlendorf
a2f1945ee3 Add a unique "eid" value to all zevents
Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents.  By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-03-31 16:10:41 -07:00
Boris Protopopov
0ed212dc0e Illumos #4089 NULL pointer dereference in arc_read()
4089 NULL pointer dereference in arc_read()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4089
  illumos/illumos-gate@57815f6b95

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2171
Issue #2165
Closes #2198
2014-03-24 11:06:57 -07:00
Richard Yao
26b42f3f9d Implement -t option to zpool import for temporary pool names
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2189
2014-03-20 12:05:30 -07:00
Andrey Vesnovaty
00fcdee1f8 Fix regression introduced in port of Illumos #3744
Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66, which
ported the Illumos 3744 changes.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2156
2014-03-20 11:00:48 -07:00
Boris Protopopov
47fe91b54c Illumos #4088 use after free in arc_release()
4088 use after free in arc_release()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4088
  illumos/illumos-gate@ccc22e1304

From the illumos issue:

A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2162
2014-03-10 09:11:15 -07:00
Tim Chase
a45fc6a677 Use KM_PUSHPAGE in spa_add() for spa_label_features.
The spa_label_features nvlist is used in the sync context during pool
version upgrade.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2168
2014-03-10 09:09:30 -07:00
Brian Behlendorf
e74400155f Export symbols dsl_sync_task{_nowait}
These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context.  Both a blocking and non-blocking
version are available to callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-03-07 10:01:36 -08:00
Ned Bass
a77c4c8332 Improve reporting of tx assignment wait times
Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign().  This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.

One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded.  However, this case should not often arise in practice,
because most callers currently use one of these forms:

  dmu_tx_assign(tx, TXG_WAIT);
  dmu_tx_assign(tx, waited ?  TXG_WAITED : TXG_NOWAIT);

The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed.  Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.

Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram.  Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1915
2014-03-04 12:22:24 -08:00
Ned Bass
3ccab25205 replace nreserved with ndirty in txgs kstat
The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007.

Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead.  To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io().  This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io().  To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().

Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696
2014-03-04 12:22:24 -08:00
Ned Bass
3d920a1567 dmu_tx kstat cleanup
A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.

- The sync task restructuring commit
  13fe019870 removed the code
  that bumpted dmu_tx_quota. The counter is now bumped in two
  cases, instead of just the one case as before (after the result
  of dsl_dataset_check_quota call). The second case is where
  we check the requested reservation against the actual pool size,
  as this is an implicit quota of sorts.

- The write throttle restructuring commit
  e8b96c6007 makes dmu_tx_how and
  dmu_tx_inflight obsolete, so they are removed.

Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1914
2014-03-04 12:22:24 -08:00
Richard Yao
cecb7487fc Invalidate Linux buffer cache on vdevs upon each flush
Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.

A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.

This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.

Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2150
2014-03-04 12:22:03 -08:00
Alexander Stetsenko
36f92e93e5 Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel
4574 get_clones_stat does not call zap_count in non-debug kernel

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/4574
  illumos/illumos-gate@03d1795fa6

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2147
2014-03-04 11:50:13 -08:00
Tim Chase
13a7ba1c2c Fix zap_lookup() in feature_is_supported().
The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2141
2014-03-04 11:44:44 -08:00
Andrew Barnes
1ba1615925 Remove recursion from dsl_dir_willuse_space()
Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.

Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2069
2014-03-04 11:22:27 -08:00
Prakash Surya
2b13331d62 Set "arc_meta_limit" to 3/4 arc_c_max by default
Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
 N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
cc7f677c16 Split "data_size" into "meta" and "data"
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
da8ccd0ee0 Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
77765b540b Remove "arc_meta_used" from arc_adjust calculation
Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
94520ca462 Prune metadata from ghost lists in arc_adjust_meta
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
1e3cb67b53 Revert "Return -1 from arc_shrinker_func()"
This reverts commit c11a12bc3b.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
624227854e Disable arc_p adapt dampener by default
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
f521ce1b9c Allow "arc_p" to drop to zero or grow to "arc_c"
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:27 -08:00
Prakash Surya
89c8cac493 Disable aggressive arc_p growth by default
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 14:53:28 -08:00
Prakash Surya
39e055c44b Adjust arc_p based on "bytes" in arc_shrink
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753f16
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.

This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 14:53:08 -08:00
Brian Behlendorf
9141582592 Set zfs_arc_min to 4MB
Decrease the mimimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over a 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the mimimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110
2014-02-21 14:52:02 -08:00