Commit Graph

1146 Commits

Author SHA1 Message Date
Brian Behlendorf
8e70975f90 Wait interruptibly in prefetch thread
The Linux kernel watchdog will automatically dump a backtrace for
any process while sleeps for over 120s in an uninterruptible state.

The solution is for the prefetch thread to sleep in an interruptible
state.  The way the existing code was written this is safe because
when woken it will always reevaluate its conditional.  As a general
rule it is preferable to sleep in an interruptible when possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3450
Closes #3402
2015-06-16 16:18:11 -07:00
Brian Behlendorf
b64ccd6c52 Rename cv_wait_interruptible() to cv_wait_sig()
This is the counterpart to zfsonlinux/spl@2345368 which replaces the
cv_wait_interruptible() function with cv_wait_sig().  There is no
functional change to patch merely brings the function names in to
sync to maximize portability.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3450
Issue #3402
2015-06-11 10:50:47 -07:00
Tim Chase
121b3cae74 Increase arc_c_min to allow safe operation of arc_adapt()
ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny
systems such as the raspberry pi, however, as of addition of large block
support, the arc_adapt() function depends on arc_c being >= 32MiB (2 *
SPA_MAXBLOCKSIZE).

This patch raises the minimum ARC size to 32MiB and adds a VERIFY test
to arc_adapt() for future-proofing.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Brian Behlendorf
f604673836 Make arc_prune() asynchronous
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock.  This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous.  Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.

To address this the arc_do_user_prune() function is has been reworked
in to two new functions as follows:

* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq.  This makes it
suitable to use in the context of the arc_adapt_thread().

* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete.  This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.

This patch additionally adds the zfs_arc_meta_strategy module option
while allows the meta reclaim strategy to be configured.  It defaults
to a balanced strategy which has been proved to work well under Linux
but the illumos meta-only strategy can be enabled.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Brian Behlendorf
c5528b9ba6 Use taskq_wait_outstanding() function
Replace taskq_wait() with taskq_wait_oustanding().  This way callers
will only block until previously submitted tasks have been completed.
This was the previous behavior of task_wait() prior to the introduction
of taskq_wait_outstanding() so this isn't really a functionalty change
for these callers.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Prakash Surya
ca0bf58d65 Illumos 5497 - lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

Porting notes and other significant code changes:

The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.

Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.

The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.

Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:

    302f753 - Integrate ARC more tightly with Linux
    39e055c - Adjust arc_p based on "bytes" in arc_shrink
    f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
    77765b5 - Remove "arc_meta_used" from arc_adjust calculation
    94520ca - Prune metadata from ghost lists in arc_adjust_meta

Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:

    fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
    fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }

The following arcstats have been removed:

    recycle_miss - Used by arcstat.py and arc_summary.py, both of which
    have been updated appropriately.

    l2_writes_hdr_miss

The following arcstats have been added:

    evict_not_enough - Number of times arc_evict_state() was unable to
    evict enough buffers to reach its target amount.

    evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
    because it was being written to the l2arc.

    l2_writes_lock_retry - Replaces l2_writes_hdr_miss.  Number of times
    l2arc_write_done() failed to acquire hash_lock (and re-tries).

    arc_meta_min - Shows the value of the zfs_arc_meta_min module
    parameter (see below).

The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme.  Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.

The following module parameters have been added:

    zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
    before moving on to the next sub-list.

    zfs_arc_meta_min - Enforce a floor on the amount of metadata in
    the ARC.

    zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
    ARC state.

    zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
    the target size to be considered "overflowing".

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
2015-06-11 10:27:25 -07:00
Chris Williamson
b9541d6b7d Illumos 5408 - managing ZFS cache devices requires lots of RAM
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Porting notes:

Due to the restructuring of the ARC-related structures, this
patch conflicts with at least the following existing ZoL commits:

    6e1d7276c9
    Fix inaccurate arcstat_l2_hdr_size calculations

        The ARC_SPACE_HDRS constant no longer exists and has been
        somewhat equivalently replaced by HDR_L2ONLY_SIZE.

    e0b0ca983d
    Add visibility in to cached dbufs

        The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr
        struct requires additional structure member names to be used
        when referencing the inner items.  Also, the presence of L1 or L2
        inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR
        macros.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
George Wilson
2a4324141f Illumos 5369 - arc flags should be an enum
5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Porting notes:

ZoL has moved some ARC definitions into arc_impl.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Tim Chase <tim@chase2k.com>
2015-06-11 10:27:25 -07:00
Tim Chase
ad4af89561 Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches"
This reverts only the l2arc_hdr part of commit
ecf3d9b8e6 in preparation for the illumos
5497 "lock contention on arcs_mtx" patch which does the same thing
but uses the newer two-level ARC structure following the Illumos 5408
"managing ZFS cache devices requires lots of RAM" patch.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Tim Chase
97639d0a52 Revert "Allow arc_evict_ghost() to only evict meta data"
Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates
the need for this.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Tim Chase
f6b3b1f5d6 Revert "fix l2arc compression buffers leak"
This reverts commit 037763e44e in
preparation for the illumos 5497 "lock contention on arcs_mtx" patch
which includes a fix for this very problem.

ZoL had picked up a subset of the illumos 5497 patch to deal with the
l2arc compression buffer leak.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:24 -07:00
Tim Chase
7807028ccd Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc"
This reverts commit 16fcdea363 in preparation
for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates
"marker" within the ARC code.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:24 -07:00
Brian Behlendorf
44de2f02d6 Remove unused variable in vdev_add_child()
Commit c3520e7 restructured vdev_add_child() in such a way that
the spa variable was unused during non-debug builds.  This is
consistent with the upstream illumos code but because ZoL, unlike
illumos, is built with all compiler warnings enabled this causes
a legitimate warning.  Revert this hunk of the patch to keep the
build clean.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3432
2015-06-11 10:22:38 -07:00
Matthew Ahrens
c3520e7f1f Illumos 5818 - zfs {ref}compressratio is incorrect with 4k sector size
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Albert Lee <trisk@omniti.com>

References:
  https://www.illumos.org/issues/5818
  https://github.com/illumos/illumos-gate/commit/81cd5c5

Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3432
2015-06-10 16:24:01 -07:00
Arne Jansen
9c43027b3f Illumos 5269 - zpool import slow
5269 zpool import slow
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5269
  https://github.com/illumos/illumos-gate/commit/12380e1e

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3396
2015-06-09 13:48:02 -07:00
Ned Bass
5f8e1e8505 dmu_objset_userquota_get_ids uses dn_bonus unsafely
The function dmu_objset_userquota_get_ids() checks and uses dn->dn_bonus
outside of dn_struct_rwlock. If the dnode is being freed then the bonus
dbuf may be in the process of getting evicted. In this case there is a
race that may cause dmu_objset_userquota_get_ids() to access the dbuf
after it has been destroyed. To prevent this, ensure that when we are
using the bonus dbuf we are either holding a reference on it or have
taken dn_struct_rwlock.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3443
2015-06-05 12:41:17 -07:00
Ned Bass
d617648c7f dbuf_try_add_ref minor bug fixes
- Don't check db->bb_blkid, but use the blkid argument instead.
  Checking db->db_blkid may be unsafe since we doesn't yet have a
  hold on the dbuf so its validity is unknown.

- Call mutex_exit() on found_db, not db, since it's not certain that
  they point to the same dbuf, and the mutex was taken on found_db.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3443
2015-06-05 12:40:38 -07:00
Matthew Ahrens
e5fd1dd682 Illumos 5243 - zdb -b could be much faster
5243 zdb -b could be much faster
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5243
  https://github.com/illumos/illumos-gate/commit/f7950bf

Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3414
2015-05-15 11:14:54 -07:00
Jan Sanislo
79065ed5a4 Return -ESTALE to force lookup for missing NFS file handles
There seems to be a annoying problem using NFSv4 to access ZFS
file systems under certain circumstances.  It's easily reproduced:

    nfs_client1:  mount server:/export /mnt
    nfs_client1:  cd /mnt
    nfs_client1:  echo foo >junk
    nfs_client1:  cat junk
    foo

Now on a different NFSv4 client:

    nfs_client2:  mount server:/export /mnt
    nfs_client2:  cd /mnt
    nfs_client2:  vi junk
    # Make some changes to /mnt/junk and save
    # This change the inode associated with /mnt/junk

Now back to the original client:

    nfs_client1:  cat junk
    cat: junk: No such file or directory

Admittedly NFSv4 is not advertised as a cluster file system that
maintains a completely coherent view of data across multiple nodes.
But it does have some mechanisms built in that try to deal with
situations like the above.  Namely, it employs specialized file
handle lookup routines that return ESTALE when a file handle contains
a non-existant inode value.  The ESTALE return triggers a return
full file path lookup from the client to determine if the file has
actually gone away or if the cached file handle is no longer valid.
ZFS behavior can be brought into line with other file systems
(e.g., ext4) by applying the following patch:

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3224
2015-05-14 11:16:52 -07:00
Antonio Russo
7290cd3c4e Relax restriction on zfs_ioc_next_obj() iteration
Per the documentation for dnode_next_offset in dnode.c, the "txg"
parameter specifies a lower bound on which transaction the dnode can
be found in. We are interested in all dnodes that are removed between
the first and last transaction in the snapshot. It doesn't need to be
created in that snapshot to correspond to a removed file.

In fact, the behavior of zfs diff in the test case exactly matches
this: the transaction that created the data that was deleted in snapshot
"2" was produced before, in snapshot "1", definitely predating the first
transaction in snapshot "2".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <Tim Chase <tim@onlight.com>
Closes #2081
2015-05-14 11:16:08 -07:00
Brian Behlendorf
fd0fd6467b Remove unused 'dsl_pool_t *dp' variable
When ASSERTs are compiled out by using the --disable-debug configure
option.  Then the local variable 'dsl_pool_t *dp' will be unused and
generate a compiler warning.  Since this variable is only used once
in the ASSERT replace it with 'ds->ds_dir->dd_pool'.

This has the additional advantage of potentially saving a few bytes
on the stack depending on how gcc decides to compile the function.

This issue was not noticed immediately because the automated builders
use --enable-debug to make the testing as rigorous as possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes #3410
2015-05-14 11:09:47 -07:00
Max Grossman
5dc8b7365f Illumos 5765 - add support for estimating send stream size with lzc_send_space when source is a bookmark
5765 add support for estimating send stream size with lzc_send_space when source is a bookmark
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>

References:
  https://www.illumos.org/issues/5765
  https://github.com/illumos/illumos-gate/commit/643da460

Porting notes:
* Unused variable 'recordsize' in dmu_send_estimate() dropped

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3397
2015-05-13 09:03:59 -07:00
Justin T. Gibbs
19b3b1d2a2 Illumos 5393 - spurious failures from dsl_dataset_hold_obj()
5393 spurious failures from dsl_dataset_hold_obj()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5393
  https://github.com/illumos/illumos-gate/commit/e1f3c20

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3403
2015-05-13 08:56:04 -07:00
Justin T. Gibbs
63b33e878a Illumos 5562 - ZFS sa_handle's violate kmem invariants, debug kernels panic on boot
5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5562
  https://github.com/illumos/illumos-gate/commit/0fda3cc5

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3388
2015-05-11 15:10:57 -07:00
Matthew Ahrens
252e1a54ab Illumos 5810 - zdb should print details of bpobj
5810 zdb should print details of bpobj
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5810
  https://github.com/illumos/illumos-gate/commit/732885fc

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3387
2015-05-11 15:10:24 -07:00
Matthew Ahrens
10400bfeac Illumos 5351, 5352 - scrub pauses
5351 scrub goes for an extra second each txg
5352 scrub should pause when there is some dirty data

Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5351
  https://github.com/illumos/illumos-gate/commit/6f6a76a

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3383
2015-05-11 15:09:56 -07:00
Matthew Ahrens
08dc1b2ddd Illumos 5350 - clean up code in dnode_sync()
Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5350
  https://github.com/illumos/illumos-gate/commit/e651831

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3382
2015-05-11 15:09:51 -07:00
Alex Reece
7224c67fea Illumos 5422 - preserve AVL invariants in dn_dbufs
Author: Alex Reece <alex@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5422
  https://github.com/illumos/illumos-gate/commit/a846f19

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3381
2015-05-11 15:09:29 -07:00
David Lamparter
214806c7e9 Safely handle security / ACL failures
The security and ACL operations should all be performed atomically.
To accomplish this there would need to significant invasive changes
made to the common code base.  For the moment it's desirable for
compatibility reasons to avoid this.  Therefore the code has been
updated to attempt to unwind the operation in case of failure
rather than panic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2445
2015-05-11 14:35:14 -07:00
Matthew Ahrens
f1512ee61e Illumos 5027 - zfs large block support
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  https://github.com/illumos/illumos-gate/commit/b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this one the ABD patches are merged.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #354
2015-05-11 12:23:16 -07:00
Brian Behlendorf
f9cab37291 Remove metaslab_min_alloc_size module option
The metaslab_min_alloc_size option is no longer used in the  code.
This functionality was removed by commit f3a7f66 and the module
options should have been dropped at that time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-11 09:51:57 -07:00
Chris Dunlop
16fcdea363 arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc
With debugging enabled and depending on your kernel config, the size of
arc_buf_hdr_t can blow out the stack of arc_evict() and arc_evict_ghost()
to greater than 1024 bytes. Let's avoid this.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3377
2015-05-08 14:14:35 -07:00
Matthew Ahrens
63e3a8616b Illumos 5349 - verify that block pointer is plausible before reading
5349 verify that block pointer is plausible before reading
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Xin Li <delphij@FreeBSD.org>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5349
  https://github.com/illumos/illumos-gate/commit/f63ab3d5

Porting notes:
* Several variable declarations were moved due to C style needs

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3373
2015-05-08 14:09:15 -07:00
Chris Dunlop
f0da4d1508 Wait for all znodes to be released before tearing down the superblock
By the time we're tearing down our superblock the VFS has started releasing
all our inodes/znodes. Some of this work may have been handed off to our
iput taskq so we need to wait for that work to complete. However the iput
from the taskq can itself result in additional work being added to the
taskq:

dsl_pool_iput_taskq
 iput
  iput_final
   evict
    destroy_inode
     zpl_inode_destroy
      zfs_inode_destroy
       zfs_iput_async(ZTOI(zp->z_xattr_parent))
        taskq_dispatch(dsl_pool_iput_taskq..., iput, ...)

Let's wait until all our znodes have been released.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3281
2015-05-06 14:13:19 -07:00
Matthew Ahrens
7a3066ffdd Illumos 5348 - zio_checksum_error() only fills in info if ECKSUM
5348 zio_checksum_error() only fills in info if ECKSUM
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5348
  https://github.com/illumos/illumos-gate/commit/373dc1cf

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3372
2015-05-05 11:05:37 -07:00
Matthew Ahrens
f3c517d814 Illumos 5820 - verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)
5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5820
  https://github.com/illumos/illumos-gate/commit/34e8acef00

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3364
2015-05-04 10:49:49 -07:00
Matthew Ahrens
36c6ffb6b6 Illumos 5808 - spa_check_logs is not necessary on readonly pools
5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5808
  https://github.com/illumos/illumos-gate/commit/23367a2f

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3369
2015-05-04 10:45:42 -07:00
Will Andrews
50f9ea0149 Illumos 5814 - bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
5814 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5814
  https://github.com/illumos/illumos-gate/commit/b67dde11

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3368
2015-05-04 10:28:18 -07:00
Christopher Siden
0c60cc326b Illumos 4951 - ZFS administrative commands (fix)
4951 ZFS administrative commands should use reserved space, not fail with ENOSPC
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/4951
  https://github.com/illumos/illumos-gate/commit/c39f2c8

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:10 -07:00
Matthew Ahrens
3d45fdd6c0 Illumos 4951 - ZFS administrative commands should use reserved space
4951 ZFS administrative commands should use reserved space, not with ENOSPC
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4373
  https://github.com/illumos/illumos-gate/commit/7d46dc6

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:10 -07:00
Jerry Jelinek
a0c9a17aef Illumos 4901 - zfs filesystem/snapshot limit leaks
4901 zfs filesystem/snapshot limit leaks
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4901
  https://github.com/illumos/illumos-gate/commit/adf3407

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:09 -07:00
Matthew Ahrens
83017311e4 Illumos 3654,3656
3654 zdb should print number of ganged blocks
3656 remove unused function zap_cursor_move_to_key()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3654
  https://www.illumos.org/issues/3656
  https://github.com/illumos/illumos-gate/commit/d5ee8a1

Porting Notes:

3655 and 3657 were part of this commit but those hunks were dropped
since they apply to mdb.

Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-04 09:41:09 -07:00
tuxoko
6102d0376e Add cond_resched to zfs_zget to prevent infinite loop
It's been reported that threads would loop infinitely inside zfs_zget. The
speculated cause for this is that if an inode is marked for evict, zfs_zget
would see that and loop. However, if the looping thread doesn't yield, the
inode may not have a chance to finish evict, thus causing a infinite loop.

This patch solve this issue by add cond_resched to zfs_zget, making the
looping thread to yield when needed.

Tested-by: jlavoy <jalavoy@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3349
2015-05-04 09:12:41 -07:00
Jason Zaman
c9520ecc0f dmu: fix integer overflows
The params to the functions are uint64_t, but the offsets to memcpy
/ bcopy are calculated using 32bit ints. This patch changes them to
also be uint64_t so there isnt an overflow. PaX's Size Overflow
caught this when formatting a zvol.

Gentoo bug: #546490

PAX: offset: 1ffffb000 db->db_offset: 1ffffa000 db->db_size: 2000 size: 5000
PAX: size overflow detected in function dmu_read /var/tmp/portage/sys-fs/zfs-kmod-0.6.3-r1/work/zfs-zfs-0.6.3/module/zfs/../../module/zfs/dmu.c:781 cicus.366_146 max, count: 15
CPU: 1 PID: 2236 Comm: zvol/10 Tainted: P           O   3.17.7-hardened-r1 #1
Call Trace:
 [<ffffffffa0382ee8>] ? dsl_dataset_get_holds+0x9d58/0x343ce [zfs]
 [<ffffffff81a59c88>] dump_stack+0x4e/0x7a
 [<ffffffffa0393c2a>] ? dsl_dataset_get_holds+0x1aa9a/0x343ce [zfs]
 [<ffffffff81206696>] report_size_overflow+0x36/0x40
 [<ffffffffa02dba2b>] dmu_read+0x52b/0x920 [zfs]
 [<ffffffffa0373ad1>] zrl_is_locked+0x7d1/0x1ce0 [zfs]
 [<ffffffffa0364cd2>] zil_clean+0x9d2/0xc00 [zfs]
 [<ffffffffa0364f21>] zil_commit+0x21/0x30 [zfs]
 [<ffffffffa0373fe1>] zrl_is_locked+0xce1/0x1ce0 [zfs]
 [<ffffffff81a5e2c7>] ? __schedule+0x547/0xbc0
 [<ffffffffa01582e6>] taskq_cancel_id+0x2a6/0x5b0 [spl]
 [<ffffffff81103eb0>] ? wake_up_state+0x20/0x20
 [<ffffffffa0158150>] ? taskq_cancel_id+0x110/0x5b0 [spl]
 [<ffffffff810f7ff4>] kthread+0xc4/0xe0
 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170
 [<ffffffff81a62fa4>] ret_from_fork+0x74/0xa0
 [<ffffffff810f7f30>] ? kthread_create_on_node+0x170/0x170

Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3333
2015-05-04 09:12:00 -07:00
George Wilson
98b254188a Illumos #5244 - zio pipeline callers should explicitly invoke next stage
5244 zio pipeline callers should explicitly invoke next stage
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5244
  https://github.com/illumos/illumos-gate/commit/738f37b

Porting Notes:

1. The unported "2932 support crash dumps to raidz, etc. pools"
   caused a merge conflict due to a copyright difference in
   module/zfs/vdev_raidz.c.
2. The unported "4128 disks in zpools never go away when pulled"
   and additional Linux-specific changes caused merge conflicts in
   module/zfs/vdev_disk.c.

Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2828
2015-04-30 15:07:47 -07:00
Matthew Ahrens
8dd86a10cf Illumos 5812 - assertion failed in zrl_tryenter(): zr_owner==NULL
5812 assertion failed in zrl_tryenter(): zr_owner==NULL
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5812
  https://github.com/illumos/illumos-gate/commit/8df1730

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3357
2015-04-30 14:43:40 -07:00
Justin T. Gibbs
6186e29753 Illumos 5592 - NULL pointer dereference in dsl_prop_notify_all_cb()
5592 NULL pointer dereference in dsl_prop_notify_all_cb()
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5592
  https://github.com/illumos/illumos-gate/commit/9d47dec

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:58 -07:00
Justin T. Gibbs
6ebebaceb1 Illumos 5531 - NULL pointer dereference in dsl_prop_get_ds()
5531 NULL pointer dereference in dsl_prop_get_ds()
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5531
  https://github.com/illumos/illumos-gate/commit/e57a022

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:44 -07:00
Justin T. Gibbs
0c66c32d1d Illumos 5056 - ZFS deadlock on db_mtx and dn_holds
5056 ZFS deadlock on db_mtx and dn_holds
Author: Justin Gibbs <justing@spectralogic.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5056
  https://github.com/illumos/illumos-gate/commit/bc9014e

Porting Notes:

sa_handle_get_from_db():
  - the original patch includes an otherwise unmentioned fix for a
    possible usage of an uninitialised variable

dmu_objset_open_impl():
  - Under Illumos list_link_init() is the same as filling a list_node_t
    with NULLs, so they don't notice if they miss doing list_link_init()
    on a zero'd containing structure (e.g. allocated with kmem_zalloc as
    here). Under Linux, not so much: an uninitialised list_node_t goes
    "Boom!" some time later when it's used or destroyed.

dmu_objset_evict_dbufs():
  - reduce stack usage using kmem_alloc()

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:34 -07:00
Justin T. Gibbs
d683ddbb72 Illumos 5314 - Remove "dbuf phys" db->db_data pointer aliases in ZFS
5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5314
  https://github.com/illumos/illumos-gate/commit/c137962

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:20 -07:00
Justin T. Gibbs
945dd93525 Illumos 5310 - Remove always true tests for non-NULL ds->ds_phys
5310 Remove always true tests for non-NULL ds->ds_phys
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5310
  https://github.com/illumos/illumos-gate/commit/d808a4f

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:25:08 -07:00
Alex Reece
9925c28cde Illumos 5095 - panic when adding a duplicate dbuf to dn_dbufs
5095 panic when adding a duplicate dbuf to dn_dbufs
Author: Alex Reece <alex@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5095
  https://github.com/illumos/illumos-gate/commit/86bb58a

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:24:49 -07:00
Justin T. Gibbs
5aea3644d6 Illumos 5038 - Remove "old-style" flexible array usage in ZFS.
5038 Remove "old-style" flexible array usage in ZFS.
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5038
  https://github.com/illumos/illumos-gate/commit/7f18da4

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:24:24 -07:00
Alex Reece
8951cb8dfb Illumos 4873 - zvol unmap calls can take a very long time for larger datasets
4873 zvol unmap calls can take a very long time for larger datasets
Author: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4873
  https://github.com/illumos/illumos-gate/commit/0f6d88a

Porting Notes:

dbuf_free_range():
  - reduce stack usage using kmem_alloc()
  - the sorted AVL tree will handle the spill block case correctly
    without all the special handling in the for() loop

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:24:03 -07:00
Jorgen Lundman
58c4aa00c6 Illumos 4975 - missing mutex_destroy() calls in zfs
4975 missing mutex_destroy() calls in zfs
Author: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Seth Nimbosa <darth.Serious@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4975
  https://github.com/illumos/illumos-gate/commit/d2b3cbb

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:23:38 -07:00
Alex Reece
ca227e54a8 Illumos 3897 - zfs filesystem and snapshot limits (fix leak)
3897 zfs filesystem and snapshot limits (fix leak)
Author: Alex Reece <alex.reece@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3897
  https://github.com/illumos/illumos-gate/commit/fb7001f

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:23:14 -07:00
Jerry Jelinek
788eb90c4c Illumos 3897 - zfs filesystem and snapshot limits
3897 zfs filesystem and snapshot limits
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3897
  https://github.com/illumos/illumos-gate/commit/a2afb61

Porting Notes:

dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc().

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-28 16:22:51 -07:00
tuxoko
ecfb0b5f42 Fix misuse of input argument in traverse_visitbp
In traverse_visitbp(), the input argument dnp is modified in the middle to
point to a temporary buffer. Originally this doesn't matter, because no user
of TRAVERSE_POST dereferences it. However, in fbeddd6 a piece of code is added
dereferencing dnp after the modification, creating a possible bug.

We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case,
so we don't modify the input argument. Also we introduce different local
variables in the DMU_OT_OBJSET case to prevent confusion between the input
argument.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2060
2015-04-28 09:43:50 -07:00
Isaac Huang
0336f3d001 Remove useless variable spa_active_count
This isn't required for the Linux port because the kernel tracks
if a module is busy.  The prototype for spa_busy() is also removed
since its definition was already removed.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3262
2015-04-27 09:22:05 -07:00
Justin T. Gibbs
ec8501ee12 5313 Allow I/Os to be aggregated across ZIO priority classes
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5313
  https://github.com/illumos/illumos-gate/commit/fe319232

Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3280
2015-04-24 15:16:56 -07:00
Ned Bass
4eb30c6864 Serialize access to spa->spa_feat_stats nvlist
The function spa_add_feature_stats() manipulates the shared nvlist
spa->spa_feat_stats in an unsafe concurrent manner. Add a mutex to
protect the list.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3335
2015-04-24 15:04:43 -07:00
Chunwei Chen
07012da668 Fix kernel panic due to tsd_exit in ZFS_EXIT(zsb)
The following panic would occur under certain heavy load:
[ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held
[ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P           OE  3.18.10 #7

The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which
would purge all tsd data for the thread. However, ZFS_ENTER is designed to be
reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data.

Instead, we rely on the new behavior of tsd_set. When NULL is passed as the
new value to tsd_set, it will automatically remove the tsd entry specified the
the key for the current thread.

rrw_tsd_key and zfs_allow_log_key already calls tsd_set(key, NULL) when
they're done. The zfs_fsyncer_key relied on ZFS_EXIT(zsb) to call tsd_exit() to
do clean up. Now we explicitly call tsd_set(key, NULL) on them.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3247
2015-04-24 14:57:54 -07:00
Brian Behlendorf
a438ff0e85 Extend PF_FSTRANS critical regions
Additional testing has shown that the region covered by PF_FSTRANS
needs to be extended to cover the  zpl_xattr_security_init() and
init_acl() functions.  The zpl_mark_dirty() function can also recurse
and therefore must always be protected.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3331
2015-04-24 09:54:22 -07:00
Brian Behlendorf
7fad6290eb Mark additional functions as PF_FSTRANS
Prevent deadlocks by disabling direct reclaim during all NFS, xattr,
ctldir, and super function calls.  This is related to 40d06e3.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3225
2015-04-17 09:35:24 -07:00
Tim Chase
5074bfe8ad Allocate zfs_znode_cache on the Linux slab
The Linux slab, in general, performs better than the SPl slab in cases
where a lot of objects are allocated and fragmentation is likely present.

This patch fixes pathologically bad behavior in cases where the ARC is
filled with mostly metadata and a user program needs to allocate and
dirty enough memory which would require an insignificant amount of the
ARC to be reclaimed.

If zfs_znode_cache is on the SPL slab, the system may spin for a very
long time trying to reclaim sufficient memory.  If it is on the Linux
slab, the behavior has been observed to be much more predictible; the
memory is reclaimed more efficiently.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3283
2015-04-14 12:19:22 -07:00
Brian Behlendorf
f42d7f4111 Use vmem_alloc() in spa_config_write()
The packed nvlist allocated in spa_config_write() may exceed the
warning threshold for large configurations.  Use the vmem interfaces
for this short lived allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3251
2015-04-07 15:10:19 -07:00
Tim Chase
40d06e3c78 Mark all ZPL and ioctl functions as PF_FSTRANS
Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl
calls as well as the l2arc and adapt ARC threads.

This obviates the need for MUTEX_FSTRANS so its previous uses and
definition have been eliminated.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3225
2015-04-03 11:38:59 -07:00
Matthew Ahrens
0f7d2a4b3d Illumus 5693 - ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override
5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5693
  https://github.com/illumos/illumos-gate/commit/7f7ace3

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3231
2015-03-27 15:02:56 -07:00
George Wilson
b738bc5a0f Illumos 5694 - traverse_prefetcher does not prefetch enough
5694 traverse_prefetcher does not prefetch enough
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5694
  https://github.com/illumos/illumos-gate/commit/34d7ce05

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3230
2015-03-27 15:02:50 -07:00
Chris Dunlop
ee2f17aa2a Align code with Illumos
Align code in traverse_visitbp() with that in Illumos in preparation for
applying Illumos-5694.

No functional change: use a temporary variable pd to replace multiple
occurrences of td->td_pfd.  This increases our stack use slightly more
then normal because the function is called recursively.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3230
2015-03-27 14:52:45 -07:00
Prakash Surya
a4069eef2e Illumos 5695 - dmu_sync'ed holes do not retain birth time
5695 dmu_sync'ed holes do not retain birth time
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5695
  https://github.com/illumos/illumos-gate/commit/70163ac

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3229
2015-03-27 14:51:34 -07:00
Ned Bass
58806b4cdc dbuf_free_range() overzealously frees dbufs
When called to free a spill block from a dnode, dbuf_free_range() has a
bug that results in all dbufs for the dnode getting freed.  A variety of
problems may result from this bug, but a common one was a zap lookup
tripping an ASSERT because the zap buffers had been zeroed out.  This
could happen on a dataset with xattr=sa set when extended attributes are
written and removed on a directory concurrently with I/O to files in
that directory.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #3195
Fixes #3204
Fixes #3222
2015-03-25 14:48:22 -07:00
Tim Chase
ded576e28f Set the maximum ZVOL transfer size correctly
ZoL had been setting max_sectors to UINT_MAX, but until Linux 3.19, it
the kernel artifically capped it at 1024 (BLK_DEF_MAX_SECTORS).
This cap was removed in torvalds/linux@34b48db.  This patch changes
it to DMU_MAX_ACCESS (in sectors) and also changes the ASSERT in
dmu_tx_hold_write() to allow the maximum transfer size.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3212
2015-03-25 11:58:42 -07:00
Isaac Huang
e89bd69775 zio_injection_enabled should not be a module option
The zio_inject.c keeps zio_injection_enabled as a counter of
fault handlers, so it should not be exported to user space as
a module option.

Several EXPORT_SYMBOLs are moved from zio.c to zio_inject.c,
where the symbols are defined.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3199
2015-03-24 13:22:03 -07:00
Chris Dunlop
d07b7c7f21 Reduce size of zfs_sb_t: allocate z_hold_mtx separately
zfs_sb_t has grown to the point where using kmem_zalloc() for allocations
is triggering the 32k warning threshold.

We can't safely convert this entire allocation to use vmem_alloc() instead
of kmem_alloc() because the backing_dev_info structure is embedded here.
It depends on the bit_waitqueue() function which won't behave properly
when given a virtual address.

Instead, use vmem_alloc() to allocate the z_hold_mtx array separately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #3178
2015-03-24 13:17:44 -07:00
Brian Behlendorf
bc88866657 Fix arc_adjust_meta() behavior
The goal of this function is to evict enough meta data buffers from the
ARC in order to enforce the arc_meta_limit.  Achieving this is slightly
more complicated than it appears because it is common for data buffers
to have holds on meta data buffers.  In addition, dnode meta data buffers
will be held by the dnodes in the block preventing them from being freed.
This means we can't simply traverse the ARC and expect to always find
enough unheld meta data buffer to release.

Therefore, this function has been updated to make alternating passes
over the ARC releasing data buffers and then newly unheld meta data
buffers.  This ensures forward progress is maintained and arc_meta_used
will decrease.  Normally this is sufficient, but if required the ARC
will call the registered prune callbacks causing dentry and inodes to
be dropped from the VFS cache.  This will make dnode meta data buffers
available for reclaim.  The number of total restarts in limited by
zfs_arc_meta_adjust_restarts to prevent spinning in the rare case
where all meta data is pinned.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Brian Behlendorf
2cbb06b561 Restructure per-filesystem reclaim
Originally when the ARC prune callback was introduced the idea was
to register a single callback for the ZPL.  The ARC could invoke this
call back if it needed the ZPL to drop dentries, inodes, or other
cache objects which might be pinning buffers in the ARC.  The ZPL
would iterate over all ZFS super blocks and perform the reclaim.

For the most part this design has worked well but due to limitations
in 2.6.35 and earlier kernels there were some problems.  This patch
is designed to address those issues.

1) iterate_supers_type() is not provided by all kernels which makes
it impossible to safely iterate over all zpl_fs_type filesystems in
a single callback.  The most straight forward and portable way to
resolve this is to register a callback per-filesystem during mount.
The arc_*_prune_callback() functions have always supported multiple
callbacks so this is functionally a very small change.

2) Commit 050d22b removed the non-portable shrink_dcache_memory()
and shrink_icache_memory() functions and didn't replace them with
equivalent functionality.  This meant that for Linux 3.1 and older
kernels the ARC had no mechanism to drop dentries and inodes from
the caches if needed.  This patch adds that missing functionality
by calling shrink_dcache_parent() to release dentries which may be
pinning inodes.  This will result in all unused cache entries being
dropped which is a bit heavy handed but it's the only interface
available for old kernels.

3) A zpl_drop_inode() callback is registered for kernels older than
2.6.35 which do not support the .evict_inode callback.  This ensures
that when the last reference on an inode is dropped it is immediately
removed from the cache.  If this isn't done than inode can end up on
the global unused LRU with no mechanism available to ZFS to drop them.
Since the ARC buffers are not dropped the hottest inodes can still
be recreated without performing disk IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Brian Behlendorf
596a8935a1 Fix arc_meta_max accounting
The arc_meta_max value should be increased when space it consumed not when
it is returned.  This ensure's that arc_meta_max is always up to date.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
2015-03-20 10:35:20 -07:00
Chunwei Chen
40749aa7a6 Use MUTEX_FSTRANS on l2arc_buflist_mtx
Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock
scenario:
1. arc_release() -> hash_lock -> l2arc_buflist_mtx
2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) ->
   arc_buf_remove_ref() -> hash_lock

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3160
2015-03-18 09:29:38 -07:00
Justin T. Gibbs
4c7b7eedcd Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption
5630 stale bonus buffer in recycled dnode_t leads to data corruption
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5630
  https://github.com/illumos/illumos-gate/commit/cd485b4

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
2015-03-12 15:40:39 -07:00
Josef 'Jeff' Sipek
73ad4a9f3c Illumos 5047 - don't use atomic_*_nv if you discard the return value
5047 don't use atomic_*_nv if you discard the return value
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5047
  https://github.com/illumos/illumos-gate/commit/640c167

Porting Notes:

Several hunks from the original patch where not specific to ZFS
and thus were dropped.

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
2015-03-12 15:40:33 -07:00
Brian Behlendorf
7f3e466283 Mark zfs_inactive() with PF_FSTRANS
Allowing direct reclaim to re-enter the VFS in the zfs_inactive()
call path has historically been problematic for ZoL.  Therefore,
in order to avoid an entire class of current and future issues
caused by this PF_FSTRANS is set for all zfs_inactive() callers.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3163
2015-03-10 09:21:48 -07:00
Ned Bass
417104bdd3 Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa.  It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082
2015-03-05 14:11:10 -08:00
Brian Behlendorf
989fd514b1 Change ASSERT(!"...") to cmn_err(CE_PANIC, ...)
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible.  This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1445
2015-03-03 13:22:21 -08:00
Brian Behlendorf
8c45def24a Linux 4.0 compat: bdi_setup_and_register()
The 'capabilities' argument which was passed to bdi_setup_and_register()
has been removed.  File systems should no longer pass BDI_CAP_MAP_COPY.
For our purposes this means there are now three different interfaces
which must be handled.  A zpl_bdi_setup_and_register() wrapper function
has been introduced to provide a single interface to the ZPL code.

* 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported.
* 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments.
* 4.0 - x.y, bdi_setup_and_register() takes 2 arguments.

I've also taken this opportunity to remove HAVE_BDI because kernels
older then 2.6.32 are no longer supported.  All kernels newer than
this will have one of the above interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #3128
2015-03-03 10:49:45 -08:00
Brian Behlendorf
4ec15b8dcf Use MUTEX_FSTRANS mutex type
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have an future change introduce a
   new one.

Both of these concerns can be addressed by using a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch
'kmem-rework'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3050
Closes #3055
Closes #3062
Closes #3132
Closes #3142
Closes #2983
2015-03-03 10:46:40 -08:00
Isaac Huang
d14cfd83da Fix deadlock between zpool export and zfs list
Pool reference count is NOT checked in spa_export_common()
if the pool has been imported readonly=on, i.e. spa->spa_sync_on
is FALSE. Then zpool export and zfs list may deadlock:

1. Pool A is imported readonly.
2. zpool export A and zfs list are run concurrently.
3. zfs command gets reference on the spa, which holds a dbuf on
   on the MOS meta dnode.
4. zpool command grabs spa_namespace_lock, and tries to evict dbufs
   of the MOS meta dnode. The dbuf held by zfs command can't be
   evicted as its reference count is not 0.
5. zpool command blocks in dnode_special_close() waiting for the
   MOS meta dnode reference count to drop to 0, with
   spa_namespace_lock held.
6. zfs command tries to get the spa_namespace_lock with a reference
   on the spa held, which holds a dbuf on the MOS meta dnode.
7. Now zpool command and zfs command deadlock each other.

Also any further zfs/zpool command will block on spa_namespace_lock
forever.

The fix is to always check pool reference count in spa_export_common(),
no matter whether the pool was imported readonly or not.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2034
2015-03-02 11:50:06 -08:00
Brian Behlendorf
87a63dd702 Prevent "zpool destroy|export" when suspended
Cleanly destroying or exporting a pool requires that the pool
not be suspended.  Therefore, set the POOL_CHECK_SUSPENDED flag
for these ioctls so the utilities will output a descriptive
error message rather than block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2878
2015-03-02 11:50:06 -08:00
Brian Behlendorf
b4f3666a16 Retire spl_module_init()/spl_module_fini()
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup.  This was done to abstract
away any compatibility code which might be needed for the SPL.

As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2985
2015-02-24 11:37:44 -08:00
Brian Behlendorf
1efdc45ea8 Fix O_APPEND open(2) flag
As described in flags section of open(2):

  O_APPEND:
    The  file  is  opened in append mode.  Before each write(2), the
    file offset is positioned at the end of the  file,  as  if  with
    lseek(2).   O_APPEND may lead to corrupted files on NFS filesys-
    tems if more than one process appends data to a  file  at  once.
    This is because NFS does not support appending to a file, so the
    client kernel has to simulate it, which can't be done without  a
    race condition.

This issue was originally overlooked because normally the generic
VFS code handles this for a filesystem.  However, because ZFS explictly
registers a zpl_write() function it's responsible for the seek.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3124
2015-02-24 11:21:54 -08:00
Dan Swartzendruber
1611bb7b4f Set zfs_autoimport_disable default value to 1
When loading the ZFS kernel modules they should not populate the
spa namespace using the cache file.  This behavior isn't consistent
with other Linux kernel modules and we need to move away from it.
Removing this makes the whole startup process predictable with four
basic steps which are driven by the init system.

1) modprobe
2) zpool import
3) zfs mount
4) zfs share

This change also helps lay the groundwork for eventually removing
the kobj_* compatibility code on the kernel side.  It may need to
be preserved in userspace because libzfs_init() depends on it.
This is why the conditional must be wrapped with an #ifdef _KERNEL.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2820
2015-02-17 16:09:41 -08:00
Brian Behlendorf
7d2868d5fc Skip bad DVAs during free by setting zfs_recover=1
When a bad DVA is encountered in metaslab_free_dva() the system
should treat it as fatal.  This indicates that somehow a damaged
DVA was written to disk and that should be impossible.

However, we have seen a handful of reports over the years of pools
somehow being damaged in this way.  Since this damage can render
otherwise intact pools unimportable, and the consequence of skipping
the bad DVA is only leaked free space, it makes sense to provide
a mechanism to ignore the bad DVA.  Setting the zfs_recover=1 module
option will cause the DVA to be ignored which may allow the pool to
be imported.

Since zfs_recover=0 by default any pool attempting to free a bad DVA
will treat it as a fatal error preserving the current behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3099
Issue #3090
Issue #2720
2015-02-13 16:02:04 -08:00
Andrey Vesnovaty
5f15fa2216 Fix readdir for .zfs/snapshot directory
dmu_snapshot_list_next stores the index of the next snapshot entry to the offp
argument, which zpl_snapdir_iterate then uses for the dir_emit. This
result in an off-by-one error. Therefore a temporary variable should be
used.

This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2930
2015-02-10 16:34:30 -08:00
Brian Behlendorf
3941503c0a Retire zio_cons()/zio_dest()
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue #2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:09:49 -08:00
Brian Behlendorf
6442f3cfe3 Retire zio_bulk_flags
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches.  Today
this code works well and there's no compelling reason to keep
this functionality.  In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:08:49 -08:00
Jörg Thalheim
534759fad3 Linux 3.19 compat: file_inode was added
struct access f->f_dentry->d_inode was replaced by accessor function
file_inode(f)

Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3084
2015-02-10 11:24:51 -08:00
Brian Behlendorf
77aef6f60e Use vmem_alloc() for nvlists
Several of the nvlist functions may perform allocations larger than
the 32k warning threshold.  Convert them to use vmem_alloc() so the
best allocator is used.

Commit efcd79a retired KM_NODEBUG which was used to suppress large
allocation warnings.  Concurrently the large allocation warning threshold
was increased from 8k to 32k.  The goal was to identify the remaining
locations, such as this one, where the allocation can be larger than
32k.  This patch is expected fine tuning resulting for the kmem-rework
changes, see commit 6e9710f.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3057
Closes #3079
Closes #3081
2015-02-10 11:00:08 -08:00
Brian Behlendorf
afe373260e Revert "Don't read space maps during import for readonly pools"
This reverts commit 7fc8c33ede which
accidentally introduced a ztest failure.

ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2
child exited with code 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-02-09 16:56:59 -08:00
Brian Behlendorf
7fc8c33ede Don't read space maps during import for readonly pools
Normally when importing a pool the space maps for all top level
vdevs are read from disk.  The space maps will be required latter
when an allocation is performed and free blocks need to be located.

However, if the pool is imported readonly then we are guaranteed
that no allocations can occur.  In this case the space maps need
not be loaded..   A similar argument can be made for the DTLs
(dirty time logs).

Because a pool import will fail if the space maps cannot be read.
The ability to safely ignore them makes it more likely that a
damaged pool can be imported readonly to recover its contents.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2831
2015-02-09 16:43:03 -08:00
Justin T. Gibbs
33b4de513e Illumos 5311 - traverse_dnode may report success when it should not
5311 traverse_dnode may report success when it should not
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/2a89c2c
  https://www.illumos.org/issues/5311

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2970
2015-02-06 12:07:15 -08:00
Ned Bass
a62d1b02e3 Fix SA header size accounting
The functions sa_find_sizes() and sa_build_layout() fail to account
for the additional 2 bytes of SA header space when calculating whether
a variable size attribute might spill over. They may consequently
determine that an attribute will fit in the bonus buffer along with a
spill block pointer, when in reality the attribute would be partially
overwritten by the spill block pointer if spill over occurs. This also
causes an inconsistency between the SA header size and the number of
variable size attributes in the layout, tripping an assertion when
debugging is on. The following reproducer demonstrates the problem.

  ln -s $(perl -e 'print "z" x 20') file
  setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file

Even though sa_find_sizes() computes the index of the attribute where
spill-over will occur, sa_build_layouts() discards the result and
recomputes it itself. As it turns out, both functions get it wrong.
Since this computation is awkward and, as history has shown, easy to
screw up, let's just do it in one place. This patch fixes the bug in
sa_find_sizes() and updates sa_build_layout() to use the result
computed there.

Also improve the comments in sa_find_sizes().

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3070
2015-02-06 09:26:46 -08:00
Brian Behlendorf
e2c4acde55 Skip evicting dbufs when walking the dbuf hash
When a dbuf is in the DB_EVICTING state it may no longer be on the
dn_dbufs list.  In which case it's unsafe to call DB_DNODE_ENTER.
Therefore, any dbuf which is found in this safe must be skipped.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2553
Closes #2495
2015-02-06 09:24:28 -08:00
Tim Chase
aa2ef419e4 Spurious ENOMEM returns when reading dbufs kstat
Commit 7b2d78a046 fixed some improper uses
of snprintf(), however, in __dbuf_stats_hash_table_data() the return
value of snprintf is propagated to the caller.  This caused spurious
ENOMEM errors when reading the dbufs kstat.

This commit causes the actual number of characters written to be returned.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3072
2015-02-04 16:35:26 -08:00
avg
037763e44e fix l2arc compression buffers leak
Commit log from FreeBSD:

    We have observed that arc_release() can be called concurrently with a
    l2arc in-flight write.  Also, we have observed that arc_hdr_destroy()
    can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE
    flag in similar circumstances.

    Previously the l2arc headers would be freed while leaking their
    associated compression buffers.  Now the buffers are placed on
    l2arc_free_on_write list for delayed freeing.  This is similar to
    what was already done to arc buffers that were supposed to be freed
    concurrently with in-flight writes of those buffers.

    In addition to fixing the discovered leaks this change also adds
    some protective code to assert that a compression buffer associated
    with a l2arc header is never leaked.

    A new kstat l2_cdata_free_on_write is added.  It keeps a count
    of delayed compression buffer frees which previously would have
    been leaks.

    Tested by:  Vitalij Satanivskij <satan@ukr.net> et al
    Requested by:       many
    MFC after:  2 weeks
    Sponsored by:       HybridCluster / ClusterHQ

References:
    https://illumos.org/issues/5222
    https://github.com/freebsd/freebsd/commit/b98f85d
    http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781
    http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html
    http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3029
2015-02-03 16:54:16 -08:00
Brian Behlendorf
19ea3d25df Use zio buffers in zil_itx_create()
The zil_itx_create() function uses the vmem_alloc() allocator for
its buffers because when logging a write that buffer may be as large
as 64K.  This is non-optimal because we may need to allocate many of
of these buffers and this interface has the potential to be slow.
Instead, use zio_data_buf_alloc() which is specifically designed to
be able to efficiently allocate a wide range of buffer sizes.

In addition, do some cleanup and use the zil_itx_destroy() function
to always free an itx structure.  This way we're always sure the
right allocation functions are used.  Notice that in the current
code kmem_free() and vmem_free() were both used.  This happened to
work because these wrappers map to the same internal SPL function.

This was identified as a potential problem when a low-end memory
constrained system began logging the following warnings.  There
was no deadlock here just repeated allocation failures resulting
in increased latency.

Possible memory allocation deadlock: size=65792 lflags=0x42d0
Pid: 20118, comm: kvm Tainted: P           O 3.2.0-0.bpo.4-amd64
Call Trace:
 [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl]
 [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl]
 [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs]
 [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs]
 [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs]
 [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs]
 [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde
 [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125
 [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a
 [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3059
2015-02-02 11:20:41 -08:00
Brian Behlendorf
0365064a97 Handle closing an unopened ZVOL
Thank to commit a4430fce69 we're
now correctly returning EROFS when opening a zvol on a read-only
pool.  Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().

In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:14 -08:00
Brian Behlendorf
a127e841de Add zvol_open() error handling for readonly property
Rather than ASSERT when for some reason the readonly property of
a zvol can't be read cleanly handle the failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:06 -08:00
Tim Chase
b0cf0676c0 Fix removal of SA in sa_modify_attrs()
The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR().  The write index, "j" is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch remove the increment for SA_REMOVE operations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3028
2015-01-21 16:35:14 -08:00
Richard Yao
841c9d43c7 Use kmem_vasprintf() in log_internal()
An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_asprintf(). It is not clear that switching to
kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_asprintf() is cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.

It also brings this function almost back in sync with Illumos.  This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2791
Issue #2781
2015-01-21 15:30:24 -08:00
Tim Chase
3c832b8cc1 Linux 3.12 compat: split shrinker has s_shrink
The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.

This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2975
2015-01-20 14:07:59 -08:00
Brian Behlendorf
81971b137a Revert "SA spill block cache"
The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations.  Instead a small dedicated
cache of preallocated SA buffers was kept.

This solution was viable while the maximum block size was limited
to 128K.  But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc().  However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.

Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.

This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf
285b29d959 Revert "Pre-allocate vdev I/O buffers"
Commit 86dd0fd added preallocated I/O buffers.  This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos.  A
deadlock in this situation is no longer possible.

However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky.  Either case is acceptable
here because we can safely abort the aggregation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf
79c76d5b65 Change KM_PUSHPAGE -> KM_SLEEP
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory.  This is again the
same as upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:26 -08:00
Brian Behlendorf
efcd79a883 Retire KM_NODEBUG
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate.  The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.

A careful reader will notice that not all callers have been changed
to vmem_alloc().  Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k.  This is desirable because it minimizes the need
for Linux specific code changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:40:32 -08:00
Richard Yao
71f8548ea4 Use is_vmalloc_addr() in vdev_disk.c
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf
92119cc259 Mark IO pipeline with PF_FSTRANS
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim.  This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction.  For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf
d958324f97 Fix zfs_putpage() lock inversion (again)
This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range().  Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read().  The page must be unlocked before taking
the range lock.  This patch corrects that issue.

In addition, because the locking rules here are subtle a block comment
has been added clearly explaining why the ordering here is critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976
2015-01-08 16:09:41 -08:00
Ned Bass
33b6dbbc51 Document zfs_flags module parameter
Add a table describing the debugging flags that can be set in the zfs_flags
module parameter.  Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.

zfs_flags (int)
            Set  additional debugging flags. The following flags may be
            bitwise-or'd together.

            +-------------------------------------------------------+
            |Value   Symbolic Name                                  |
            |        Description                                    |
            +-------------------------------------------------------+
            |    1   ZFS_DEBUG_DPRINTF                              |
            |        Enable dprintf entries in the debug log.       |
            +-------------------------------------------------------+
            |    2   ZFS_DEBUG_DBUF_VERIFY *                        |
            |        Enable extra dbuf verifications.               |
            +-------------------------------------------------------+
            |    4   ZFS_DEBUG_DNODE_VERIFY *                       |
            |        Enable extra dnode verifications.              |
            +-------------------------------------------------------+
            |    8   ZFS_DEBUG_SNAPNAMES                            |
            |        Enable snapshot name verification.             |
            +-------------------------------------------------------+
            |   16   ZFS_DEBUG_MODIFY                               |
            |        Check for illegally modified ARC buffers.      |
            +-------------------------------------------------------+
            |   32   ZFS_DEBUG_SPA                                  |
            |        Enable spa_dbgmsg entries in the debug log.    |
            +-------------------------------------------------------+
            |   64   ZFS_DEBUG_ZIO_FREE                             |
            |        Enable verification of block frees.            |
            +-------------------------------------------------------+
            |  128   ZFS_DEBUG_HISTOGRAM_VERIFY                     |
            |        Enable extra spacemap histogram verifications. |
            +-------------------------------------------------------+
            * Requires debug build.

            Default value: 0.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2988
2015-01-07 15:50:49 -08:00
Ned Bass
49ee64e5e6 Remove duplicate typedefs from trace.h
Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate
typedef declarations with the same type. The trace.h header contains
some typedefs to avoid 'unknown type' errors for C files that haven't
declared the type in question. But this causes build failures for C
files that have already declared the type. Newer versions of GCC (e.g.
v4.6) allow duplicate typedefs with the same type unless pedantic error
checking is in force. To support the older versions we need to remove
the duplicate typedefs.

Removal of the typedefs means we can't built tracepoints code using
those types unless the required headers have been included. To
facilitate this, all tracepoint event declarations have been moved out
of trace.h into separate headers. Each new header is explicitly included
from the C file that uses the events defined therein. The trace.h header
is still indirectly included form zfs_context.h and provides the
implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces.
This makes those interfaces readily available throughout the code base.
The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also
still provided by trace.h, so it is a prerequisite for the other
trace_*.h headers.

These new Linux implementation-specific headers do introduce a small
divergence from upstream ZFS in several core C files, but this should
not present a significant maintenance burden.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
2015-01-06 16:53:24 -08:00
Brian Behlendorf
74328ee18f Fix zfs_putpage() lock inversion
There exists a lock inversions involving the zfs range lock and the
individual page writeback bits which can result in a deadlock.  To
prevent this we must always manipulate the writeback bit while
holding the range lock.  The exact deadlock is as follows:

------ Process A ------        ------ Process B ------
zpl_writepages                 zpl_fallocate
write_cache_pages              zpl_fallocate_common
zpl_putpage                    zfs_space
zfs_putpage (set bit)          zfs_freesp
zfs_range_lock (wait on lock)  zfs_free_range (take lock)
[has not yet initiated I/O,    truncate_inode_pages_range
the bit will not be cleared]   wait_on_page_writeback (wait on bit)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976
2014-12-22 09:31:56 -08:00
Boris Protopopov
9063f65476 Correct error returns to unify cross-pool operation error handling
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2911
2014-12-19 10:51:24 -08:00
Brian Behlendorf
c944be5d7e Fix snapshots with dirty inodes
Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty and inode.  This
will result in a write which will correctly cause a kernel panic
because these filesystem are (and must be) immutable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2812
2014-11-20 10:38:16 -08:00
Isaac Huang
29b763cd2c bio_alloc() with __GFP_WAIT never returns NULL
Mark the error handling branch as unlikely() because the current
kernel interface can never return NULL.  However, we want to keep
the error handling in case this behavior changes in the futre.

Plus fix a small style issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #2703
2014-11-19 12:50:49 -05:00
Ned Bass
aaed7c408c Explicitly include SPL compat headers
Inclusion of SPL compatibility headers was moved out of the public
header sys/types.h to avoid conflicts with external packages.  Include a
few compatiblity headers explicitly to cope with that change.  Also,
sort some linux-specific inclusions alphabetically.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2898
2014-11-19 12:30:39 -05:00
Ned Bass
7b2d78a046 Fix improper null-byte termination handling
Fix a few cases where null-byte termination of strings was done
unnecessarily or incorrectly.

- The snprintf() function always produces a null-byte terminated string
  for non-negative return values, so it is not necessary to write out a
  null-byte as a separate step.

- Also, it is unsafe to use the return value of snprintf() as an offset
  for placing a null-byte, because if the output was truncated the return
  value is the number of bytes that _would_ have been written had enough
  space been available. Therefore the return value may index beyond the
  array boundaries.

- Finally, snprintf() accounts for the null-byte when limiting its output
  size, so there is no need to pass it a size parameter that is one less
  than the buffer size.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2875
2014-11-17 15:28:59 -08:00
smh
89b1cd6581 Prevent ZFS leaking pool free space
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.

In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query
the pool status during the hung import would not return as the import
holds the pool lock.

The only way to import such a pool would be to specify -o readonly=on
to the zpool import.

zdb -bb <pool> can be used to check for "deferred free" size which is
where this lost space will be counted.

References:
  https://github.com/freebsd/freebsd/commit/48431b7
  http://svnweb.freebsd.org/base?view=revision&revision=273158
  https://reviews.csiden.org/r/132/

Porting notes:

This issue was filed as illumos 5347 and a more comprehensive fix is
under review.  Once that change is finalized it will be integrated, in
the meanwhile the FreeBSD fix has been merged to prevent the issue.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2896
2014-11-17 11:35:38 -08:00
Tim Chase
4254acb057 Undirty freed spill blocks.
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

	setfattr -n user.test -v very_very_very..._long_value  <file>
	setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884
2014-11-17 11:25:48 -08:00
Prakash Surya
0b39b9f96f Swap DTRACE_PROBE* with Linux tracepoints
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.

The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:

    * A number of external tools can make use of our tracepoints
      "automatically" (e.g. perf, systemtap)
    * Tracepoints are designed to be extremely cheap when disabled
    * It's one of the "accepted" ways to export this kind of
      information; many other kernel subsystems use tracepoints too.

Unfortunately, though, there are a few caveats as well:

    * Linux tracepoints appear to only be available to GPL licensed
      modules due to the way certain kernel functions are exported.
      Thus, to actually make use of the tracepoints introduced by this
      patch, one might have to patch and re-compile the kernel;
      exporting the necessary functions to non-GPL modules.

    * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
      tracepoints are not available for unsigned kernel modules
      (tracepoints will get disabled due to the module's 'F' taint).
      Thus, one either has to sign the zfs kernel module prior to
      loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
      newer.

Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):

    # list all zfs tracepoints available

    $ ls /sys/kernel/debug/tracing/events/zfs
    enable              filter              zfs_arc__delete
    zfs_arc__evict      zfs_arc__hit        zfs_arc__miss
    zfs_l2arc__evict    zfs_l2arc__hit      zfs_l2arc__iodone
    zfs_l2arc__miss     zfs_l2arc__read     zfs_l2arc__write
    zfs_new_state__mfu  zfs_new_state__mru

    # enable all zfs tracepoints, clear the tracepoint ring buffer

    $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
    $ echo 0 > /sys/kernel/debug/tracing/trace

    # import zpool called 'tank', inspect tracepoint data (each line was
    # truncated, they're too long for a commit message otherwise)

    $ zpool import tank
    $ cat /sys/kernel/debug/tracing/trace | head -n35
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 1219/1219   #P:8
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
            lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
          z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
          z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
          z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
            lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
          z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
          z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
            lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
          z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
          z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
            lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
            lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...

To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:

    lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
        hdr {
            dva 0x1:0x40082
            birth 15491
            cksum0 0x163edbff3a
            flags 0x640
            datacnt 1
            type 1
            size 2048
            spa 3133524293419867460
            state_type 0
            access 0
            mru_hits 0
            mru_ghost_hits 0
            mfu_hits 0
            mfu_ghost_hits 0
            l2_hits 0
            refcount 1
        } bp {
            dva0 0x1:0x40082
            dva1 0x1:0x3000e5
            dva2 0x1:0x5a006e
            cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
            lsize 2048
        } zb {
            objset 0
            object 0
            level -1
            blkid 0
        }

For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.

For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:

    * http://lwn.net/Articles/379903/
    * http://lwn.net/Articles/381064/
    * http://lwn.net/Articles/383362/

I should also node the more "boring" aspects of this patch:

    * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
       support a sixth paramter. This parameter is used to populate the
       contents of the new conftest.h file. If no sixth parameter is
       provided, conftest.h will be empty.

    * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
      This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
      except it has support for a fifth option that is then passed as
      the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.

These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).

    * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
      is to determine if we can safely enable the Linux tracepoint
      functionality. We need to selectively disable the tracepoint code
      due to the kernel exporting certain functions as GPL only. Without
      this check, the build process will fail at link time.

In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:55 -08:00
Ned Bass
29e57d15c8 Fix dprintf format specifiers
Fix a few dprintf format specifiers that disagreed with their argument
types.  These came to light as compiler errors when converting dprintf
to use the Linux trace buffer.  Previously this wasn't a problem,
presumably because the SPL debug logging uses vsnprintf which must
perform automatic type conversion.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:45 -08:00
Ned Bass
59ec819a0c Move a few internal ARC strucutres to arc_impl.h
Add a new file named arc_impl.h and move a few internal
ARC structure definitions into this file. This is
needed in order to allow the Linux tracepoint functions to grub
around in the internals of these structures.

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-17 11:13:27 -08:00
Prakash Surya
fb42a49328 Illumos 5213 - panic in metaslab_init due to space_map_open returning ENXIO
5213 panic in metaslab_init due to space_map_open returning ENXIO
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com

References:
  https://www.illumos.org/issues/5213
  https://reviews.csiden.org/r/110

Porting notes:

For the Linux port, KM_SLEEP was replaced with KM_PUSHPAGE.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2745
2014-11-14 15:37:45 -08:00
Chris Wedgwood
b31d8ea77c Reduce buf/dbuf mutex contention
Due to evidence of contention both the buf_hash_table and the
dbuf_hash_table sizes have been increased from 256 to 8192.

This increase in hash table size adds approximating 0.5M to
our fixed memory footprint.  This relatively small increase
is not expected to cause problems even on low memory machines.
This footprint will also become dynamic when the persistent
L2ARC support is finalized.  In the meanwhile, this small
change significantly reduces contention for certain workloads.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #1291
2014-11-14 14:59:21 -08:00
Alex Zhuravlev
0f69910833 Export symbols for ZIL interface
These symbols are needed by consumers (i.e. Lustre) who wish to
integrate with the ZIL.  In addition the zil_rollback_destroy()
prototype was removed because the implementation of this function
was removed long ago.

Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2892
2014-11-14 14:39:43 -08:00
Tim Chase
ed6e9cc235 Linux 3.12 compat: shrinker semantics
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects.  It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.

This patch adds support for the new @scan_objects return value semantics.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2837
2014-10-28 09:34:51 -07:00
Matthew Ahrens
9635861742 Illumos 5164-5165 - space map fixes
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
     space map_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5164
  https://www.illumos.org/issues/5165
  https://github.com/illumos/illumos-gate/commit/b1be289

Porting Notes:

The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.

The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz.  The upstream commit incorrectly
used space_map_blksize.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697
2014-10-23 15:30:32 -07:00
Alex Reece
b02fe35d37 Illumos 4958 zdb trips assert on pools with ashift >= 0xe
4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4958
  https://github.com/illumos/illumos-gate/commit/2a104a5

Porting notes:

Keep the ZIO_FLAG_FASTWRITE define.  This is for a feature present
in Linux but not yet in *BSD.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2697
2014-10-23 15:30:32 -07:00
Brian Behlendorf
5f6d0b6f5a Handle block pointers with a corrupt logical size
The general strategy used by ZFS to verify that blocks are valid is
to checksum everything.  This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block.  If a blocks checksum is valid then its contents are
trusted by the higher layers.

This system works exceptionally well as long as bad data is never
written with a valid checksum.  If this does somehow occur due to
a software bug or a memory bit-flip on a non-ECC system it may
result in kernel panic.

One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size causing a system panic.

To prevent this from happening the arc_read() function has been
updated to detect this specific case.  If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2678
2014-10-23 09:20:52 -07:00
Ned Bass
bc151f7b31 Remove checks for mandatory locks
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp().  Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose.  Rather than emulating those interfaces we remove the
redundant checks.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2804
2014-10-22 11:06:53 -07:00
Matthew Ahrens
88904bb3e3 Illumos 5162 - zfs recv should use loaned arc buffer to avoid copy
5162 zfs recv should use loaned arc buffer to avoid copy
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5162
  https://github.com/illumos/illumos-gate/commit/8a90470

Porting notes:
  Fix spelling error 's/arena/area/' in dmu.c.
  In restore_write() declare bonus and abuf at the top of the function.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2696
2014-10-21 16:32:11 -07:00
Matthew Ahrens
4b20a6f509 Illumos 5150 - zfs clone of a defer_destroy snapshot causes strangeness
When a clone is created of a snapshot that has been marked for
deferred destroy (with "zfs destroy -d"), the clone "inherits" the
defer_destroy flag from the origin, and any snapshots of the clone
"inherit" the defer_destroy flag from the clone. This causes a strange
situation where the clone's snapshots are marked for defer_destroy but
they have no holds or clones. If the clone's snapshot gets a hold or
clone, which is then deleted, we will honor the incorrectly-set
defer_destroy flag and delete the snapshot!

Steps to reproduce:

  * zpool create test c1t1d0
  * zfs create test/fs
  * zfs snapshot test/fs@a
  * zfs clone test/fs@a test/clone
  * zfs destroy -d test/fs@a
  * zfs clone test/fs@a test/clone2
  * zfs snapshot test/clone2@a
  * zfs hold hld test/clone2@a
  * zfs release hld test/clone2@a
  * zfs list -r -t all test

  <test/clone2@a has been destroyed>

We noticed that this causes dcenter to get very confused, because it
treats snapshots that are marked defer_destroy as not existing. So it
won't see any snapshots of the clone that's marked defer_destroy.

5150 - zfs clone of a defer_destroy snapshot causes strangeness
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/projects/illumos-gate//issues/5150
  https://github.com/illumos/illumos-gate/commit/42fcb65

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2690
2014-10-21 15:26:58 -07:00
Matthew Ahrens
6c59307a3c Illumos 3693 - restore_object uses at least two transactions to restore an object
Restore_object should not use two transactions to restore an object:
  * one transaction is used for dmu_object_claim
  * another transaction is used to set compression, checksum and most
    importantly bonus data
  * furthermore dmu_object_reclaim internally uses multiple transactions
  * dmu_free_long_range frees chunks in separate transactions
  * dnode_reallocate is executed in a distinct transaction

The fact the dnode_allocate/dnode_reallocate are executed in one
transaction and bonus (re-)population is executed in a different
transaction may lead to violation of ZFS consistency assertions if the
transactions are assigned to different transaction groups.  Also, if
the first transaction group is successfully written to a permanent
storage, but the second transaction is lost, then an invalid dnode may
be created on the stable storage.

3693 restore_object uses at least two transactions to restore an object
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Original authors: Matthew Ahrens and Andriy Gapon

References:
  https://www.illumos.org/issues/3693
  https://github.com/illumos/illumos-gate/commit/e77d42e

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2689
2014-10-21 15:26:50 -07:00
Tim Chase
356d9ed4c8 Don't perform ACL-to-mode translation on empty ACL
In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to
create a traditional mode value based on an ACL.  If no ACL exists, this
processing shouldn't be done.  Problems caused by this were most evident
on version 4 filesystems which not only don't have system attributes,
and also frequently have empty ACLs. On such filesystems, performing a
chown() operation could have the effect of dirtying the mode bits in
memory but not on the file system as follows:

	# create a file with typical mode of 664
	echo test > test
	chown anyuser test
	ls -l test

and the mode will show up as all zeroes.  Unmounting/mounting and/or
exporting/importing the filesystem will reveal the proper mode again.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1264
2014-10-21 09:23:27 -07:00
Daniil Lunev
62bdd5eb7a Illumos 4924 - LZ4 Compression for metadata
Reviewed by Matthew Ahrens <mahrens@delphix.com>
Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://github.com/illumos/illumos-gate/commit/b8289d2
  https://www.illumos.org/issues/3756

Porting notes:

The static function zfs_prop_activate_feature() was removed because
this change removes the only caller.  The function was not removed
from Illumos but instead left as dead code.  However, to keep gcc
happy it was removed from Linux and may be easily restored if needed.

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1540
2014-10-20 16:17:49 -07:00
Brian Behlendorf
ba232d8aea Suppress AIO kmem warnings
The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc()
to allocate enough memory to hold the vectorized IO.  While this
allocation will be small it's been observed in practice to sometimes
slightly exceed the 8K warning threshold by a few kilobytes.
Therefore, the KM_NODEBUG flag has been added to suppress warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2774
2014-10-20 16:10:25 -07:00
Brian Behlendorf
33074f2254 Handle NULL mirror child vdev
When selecting a mirror child it's possible that map allocated by
vdev_mirror_map_allc() contains a NULL for the child vdev.  In
this case the child should be skipped and the read issues to
another member of the mirror.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1744
2014-10-17 14:59:05 -07:00
Brian Behlendorf
f0e324f25d Update utsname support
Modify the code to use the utsname() kernel function rather than
a global variable.  This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2).  This means that it
will behave consistently in both contexts.

This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions.  And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.

Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
2014-10-17 14:58:57 -07:00
Brian Behlendorf
050d22b068 Remove shrink_dcache_memory() and shrink_icache_memory()
This functionality is optional and until Linux 3.0, which
provided per-filesystem shinkers, they was never a reasonable
interface.  Therefore, this functionality is being dropped
for earlier kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
2014-10-17 14:58:50 -07:00
Brian Behlendorf
e6659763c6 Improve VERIFY() error in dmu_write()
This is a debug patch designed to ensure an error code is logged
to the console when this VERIFY() is hit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #1440
2014-10-08 09:18:14 -07:00
Brian Behlendorf
8878261fc9 Fix CPU_SEQID use in preemptible context
Commit e022864 introduced a regression for kernels which are built
with CONFIG_DEBUG_PREEMPT.  The use of CPU_SEQID in a preemptible
context causes zio_nowait() to trigger the BUG.  Since CPU_SEQID
is simply being used as a random index the usage here is safe. To
resolve the issue preempt is disable while calling CPU_SEQID.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2769
2014-10-07 16:40:29 -07:00
Matthew Ahrens
e022864d19 Illumos 5176 - lock contention on godfather zio
5176 lock contention on godfather zio
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5176
  https://github.com/illumos/illumos-gate/commit/6f834bc

Porting notes:

Under Linux max_ncpus is defined as num_possible_cpus().  This is
largest number of cpu ids which might be available during the life
time of the system boot.  This value can be larger than the number
of present cpus if CONFIG_HOTPLUG_CPU is defined.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2711
2014-10-07 11:24:24 -07:00
Richard Yao
83e9986f6e Implement -t option to zpool create for temporary pool names
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.

26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.

This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
2014-09-30 10:46:59 -07:00
Tim Chase
cb08f06307 Perform whole-page page truncation for hole-punching under a range lock
As an attempt to perform the page truncation more optimally, the
hole-punching support added in 223df0161f
truncated performed the operation in two steps: first, sub-page "stubs"
were zeroed under the range lock in zfs_free_range() using the new
zfs_zero_partial_page() function and then the whole pages were truncated
within zfs_freesp().  This left a window of opportunity during which
the full pages could be touched.

This patch closes the window by moving the whole-page truncation into
zfs_free_range() under the range lock.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2733
2014-09-29 09:22:03 -07:00
Max Grossman
36283ca233 Illumos 5138 - add tunable for maximum number of blocks freed in one txg
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5138
  https://github.com/illumos/illumos-gate/commit/af3465d

Porting notes:

Because support for exposing a uint64_t parameter wasn't added
until v3.17-rc1 the zfs_free_max_blocks variable has been declared
as a unsigned long.  This is already far larger than required and
it allows us to avoid additional autoconf compatibility code.

The default value has been set to 100,000 on Linux instead of
ULONG_MAX which is used on Illumos.  This was done to limit the
number of outstanding IOs in the system when snapshots are destroyed.
This helps ensure individual TXG sync times are kept reasonable and
memory isn't wasted managing a huge backlog of outstanding IOs.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2675
Closes #2581
2014-09-23 14:26:34 -07:00
Alex Reece
acbad6ff67 Illumos 4753 - increase number of outstanding async writes when sync task is waiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
    https://www.illumos.org/issues/4753
    https://github.com/illumos/illumos-gate/commit/73527f4

Comments by Matt Ahrens from the issue tracker:
    When a sync task is waiting for a txg to complete, we should hurry
    it along by increasing the number of outstanding async writes
    (i.e. make vdev_queue_max_async_writes() return a larger number).
    Initially we might just have a tunable for "minimum async writes
    while a synctask is waiting" and set it to 3.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2716
2014-09-23 13:50:55 -07:00
Matthew Ahrens
d97aa48f7c Illumos 5139 - SEEK_HOLE failed to report a hole at end of file
5139 SEEK_HOLE failed to report a hole at end of file
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5139
  https://github.com/illumos/illumos-gate/commit/0fbc0cd

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2714
2014-09-23 10:38:45 -07:00
Richard Yao
485c581c41 Fix function call with uninitialized value in vdev_inuse
LLVM's static analyzer reported that we could pass an uninitialized
pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct.
An attempt to repurpose a spare or L2ARC drive from an exported pool
will cause the pool_guid passed to spa_by_guid() to be unintialized
information from the stack. This will cause non-deterministic behavior.
Since there is no reason why we cannot repurpose such disks, we modify
vdev_inuse() to avoid calling spa_by_guid() when they are detected.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
2014-09-23 10:32:45 -07:00
Matthew Ahrens
b8bcca18f7 Illumos 5161 - add tunable for number of metaslabs per vdev
5161 add tunable for number of metaslabs per vdev
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5161
  https://github.com/illumos/illumos-gate/commit/bf3e216

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2698
2014-09-23 10:00:02 -07:00
Matthew Ahrens
ebcf49365a Illumos 5177 - remove dead code from dsl_scan.c
5177 remove dead code from dsl_scan.c
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5177
  https://github.com/illumos/illumos-gate/commit/5f37736

Porting notes:

The local variable 'buf' was removed from dsl_scan_visitbp().
This wasn't part of the original patch but it should have been.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2712
2014-09-22 15:52:58 -07:00
Adam Leventhal
64dbba3679 Illumos 5174 - add sdt probe for blocked read in dbuf_read()
5174 add sdt probe for blocked read in dbuf_read()
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5174
  https://github.com/illumos/illumos-gate/commit/f6164ad

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2710
2014-09-22 14:20:25 -07:00
Matthew Ahrens
6d9036f350 Illumos 5140 - message about "%recv could not be opened" is printed when booting after crash
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/projects/illumos-gate//issues/5140
  https://github.com/illumos/illumos-gate/commit/2243853

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2676
2014-09-18 15:04:59 -07:00
Brian Behlendorf
2d50158343 Fix z_teardown_inactive_lock deadlock
When rolling back a mounted filesystem zfs_suspend() is called
which acquires the z_teardown_inactive_lock.  This lock can not
be dropped until the filesystem has been rolled back and resumed
in zfs_resume_fs().

Therefore, we must not call iput() under this lock because it
may result in the inode->evict() handler being called which also
takes this lock.  Instead use zfs_iput_async() to ensure dropping
the last reference is deferred and runs in a safe context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2670
2014-09-11 09:11:45 -07:00
Tim Chase
223df0161f Implement fallocate FALLOC_FL_PUNCH_HOLE
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2619
2014-09-08 13:52:25 -07:00
George Wilson
4f68d7878f Illumos 5117 - spacemap reallocation can cause corruption
5117 space map reallocation can cause corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/projects/illumos-gate/issues/5117
  https://github.com/illumos/illumos-gate/commit/e503a68

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2662
2014-09-08 09:42:39 -07:00
Brian Behlendorf
ceb49b0acd Add object type checking to zap_lockdir()
If a non-ZAP object is passed to zap_lockdir() it will be treated
as a valid ZAP object.  This can result in zap_lockdir() attempting
to read what it believes are leaf blocks from invalid disk locations.
The SCSI layer will eventually generate errors for these bogus IOs
but the caller will hang in zap_get_leaf_byblk().

The good news is that is a situation which can not occur unless the
pool has been damaged.  The bad news is that there are reports from
both FreeBSD and Solaris of damaged pools.  Specifically, there are
normal files in the filesystem which reference another normal file
as their parent.

Since pools like this are known to exist the zap_lockdir() function
has been updated to verify the type of the object.  If a non-ZAP
object has been passed it EINVAL will be returned immediately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2597
Issue #2602
2014-09-08 09:15:38 -07:00
Richard Yao
cd3939c5f0 Linux AIO Support
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is non sequitur.

Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility.  We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #223
Closes #2373
2014-09-05 15:11:43 -07:00
Alex Reece
f38dfec3fd Illumos 5049 - panic when removing log device
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Approved by: Rich Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5049
  https://github.com/illumos/illumos-gate/commit/2986efa

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2636
2014-09-05 09:06:27 -07:00
Stanislav Seletskiy
2078f21015 Fix invalid locking order in rename operation
This commit should prevent a deadlock on dp_config_rwlock when
running `zfs rename` by ensuring zvol_rename_minors() is not
called under this lock.

Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2652.
Closes #2525.
2014-09-04 09:50:46 -07:00
Alexey Smirnoff
0dfc732416 Change the default 'zfs_dedup_prefetch' value to '0'
This gives a huge performance improvement in operations with deduped
datasets especially when the bottleneck is the amount of ram
available for zfs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2639
2014-09-04 09:50:45 -07:00
Dan Swartzendruber
287be44f53 Improve handling of filesystem versions
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.

Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
2014-09-03 09:17:14 -07:00
Matthew Ahrens
dea377c0d9 Illumos 4970-4974 - extreme rewind enhancements
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4970
  https://www.illumos.org/issues/4971
  https://www.illumos.org/issues/4972
  https://www.illumos.org/issues/4973
  https://www.illumos.org/issues/4974
  https://github.com/illumos/illumos-gate/commit/e42d205

Notes:
    This set of patches adds a set of tunable parameters for the
    "extreme rewind" mode of pool import which allows control over
    the traversal performed during such an import.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2598
2014-08-26 16:29:57 -07:00
Matthew Ahrens
49ddb31506 Illumos 5034 - ARC's buf_hash_table is too small
5034 ARC's buf_hash_table is too small

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5034
  https://github.com/illumos/illumos-gate/commit/63e911b

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2615
2014-08-26 16:14:49 -07:00
Isaac Huang
0426c16804 Fixed memory leaks in zevent handling
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2158
2014-08-20 10:45:16 -07:00
Matthew Ahrens
bd089c5477 Illumos 4631 - zvol_get_stats triggering too many reads
4631 zvol_get_stats triggering too many reads

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4631
  https://github.com/illumos/illumos-gate/commit/bbfa8ea

Ported-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2612
Closes #2480
2014-08-20 09:17:00 -07:00
Tim Chase
8b0a0840b4 Don't upgrade a metaslab when the pool is not writable
Illumos 4982 added code to metaslab_fragmentation() to proactively update
space maps when the spacemap_histogram feature is enabled.  This should
only happen when the pool is writeable.

References:
  https://www.illumos.org/issues/4982
  https://github.com/illumos/illumos-gate/commit/2e4c998

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:47:19 -07:00
George Wilson
f3a7f6610f Illumos 4976-4984 - metaslab improvements
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4976
  https://www.illumos.org/issues/4978
  https://www.illumos.org/issues/4979
  https://www.illumos.org/issues/4980
  https://www.illumos.org/issues/4981
  https://www.illumos.org/issues/4982
  https://www.illumos.org/issues/4983
  https://www.illumos.org/issues/4984
  https://github.com/illumos/illumos-gate/commit/2e4c998

Notes:
    The "zdb -M" option has been re-tasked to display the new metaslab
    fragmentation metric and the new "zdb -I" option is used to control
    the maximum number of in-flight I/Os.

    The new fragmentation metric is derived from the space map histogram
    which has been rolled up to the vdev and pool level and is presented
    to the user via "zpool list".

    Add a number of module parameters related to the new metaslab weighting
    logic.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:40:49 -07:00
Brian Behlendorf
0d5c500d6c Revert "Revert "Revert "Fix unlink/xattr deadlock"""
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations.  This
was possible due to the following related changes.

e89260a Directory xattr znodes hold a reference on their parent
6f9548c Fix deadlock in zfs_zget()
0a50679 Add zfs_iput_async() interface
4dd1893 Avoid 128K kmem allocations in mzap_upgrade()

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
Closes #2058
Closes #2128
Closes #2240
2014-08-11 16:12:36 -07:00
Brian Behlendorf
0a50679ce9 Add zfs_iput_async() interface
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks.  When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
2014-08-11 16:11:43 -07:00
Brian Behlendorf
4dd18932ba Avoid 128K kmem allocations in mzap_upgrade()
As originally implemented the mzap_upgrade() function will
perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc().
These large allocations can potentially block indefinitely
if contiguous memory is not available.  Since this allocation
is done under the zap->zap_rwlock it can appear as if there is
a deadlock in zap_lockdir().  This is shown below.

The optimal fix for this would be to rework mzap_upgrade()
such that no large allocations are required.  This could be
done but it would result in us diverging further from the other
implementations.  Therefore I've opted against doing this
unless it becomes absolutely necessary.

Instead mzap_upgrade() has been updated to use zio_buf_alloc()
which can reliably provide buffers of up to SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Close #2580
2014-08-11 16:10:32 -07:00
Brian Behlendorf
50b25b2187 Avoid dynamic allocation of 'search zio'
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage.  Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.

To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure.  All
access must be serialized through the vq_lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2572
2014-08-11 08:44:54 -07:00
Brian Behlendorf
ab6f407faa Use KM_PUSHPAGE in dsl_dataset_rollback_check()
The dsl_dataset_rollback_check() function is executed in the
txg_sync context.  To prevent a potential deadlock due to direct
memory reclaim it must use KM_PUSHPAGE.  This was introduced by
the recent 'zfs bookmark' features, commit da53684.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Dillmann <eric@jave.fr>
Closes #2569
2014-08-06 16:09:28 -07:00
Matthew Ahrens
5dbd68a352 Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_t
4914 zfs on-disk bookmark structure should be named *_phys_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4914
  https://github.com/illumos/illumos-gate/commit/7802d7b

Porting notes:

There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated.  This should reduce the likelihood of further
problems like issue #2094 from occurring.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2558
2014-08-06 14:48:41 -07:00
Matthew Ahrens
1fa8f795d5 Illumos 4881 - zfs send performance regression with embedded data
4881 zfs send performance degradation when embedded block pointers
     are encountered

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4881
  https://github.com/illumos/illumos-gate/commit/06315b7

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2547
2014-08-06 14:44:10 -07:00
Saso Kiselkov
3bec585e6c Illumos 4897 - Space accounting mismatch in L2ARC/zpool
4897 Space accounting mismatch in L2ARC/zpool

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

From the illumos issue tracker:

	L2ARC vdev space usage statistics are calculated as the delta
	between the maximum and minimum vdev offset ever written to
	by the L2ARC fill thread, but do not inform the user of how
	much space in between these two offsets is actually taken up by
	cached buffers. This fix changes that so that vdev space usage
	stats on L2ARC devices accurately track the volume of buffers
	stored on them, allowing users to see the exact L2ARC usage in
	"zpool iostat -v".

References:
  https://www.illumos.org/issues/4897
  https://github.com/illumos/illumos-gate/commit/3038a2b

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2555
2014-08-06 13:44:10 -07:00
Matthew Ahrens
fbeddd60b7 Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4390
  https://github.com/illumos/illumos-gate/commit/7fd05ac

Porting notes:

Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code.  This patch should reduce its
stack footprint a bit more.

The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.

The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global.  Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2545
2014-08-04 11:50:52 -07:00
Matthew Ahrens
9b67f60560 Illumos 4757, 4913
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4757
  https://www.illumos.org/issues/4913
  https://github.com/illumos/illumos-gate/commit/5d7b4d4

Porting notes:

For compatibility with the fastpath code the zio_done() function
needed to be updated.  Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2544
2014-08-01 14:28:05 -07:00
Matthew Ahrens
faf0f58c69 Illumos 3835 zfs need not store 2 copies of all metadata
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Description from Matt Ahrens's bug report at Delphix:

    Add a new zfs property, "redundant_metadata" which can have values
    "all" or "most".  The default will be "all", which is the current
    behavior.  Setting to "most" will cause us to only store 1 copy of
    level-1 indirect blocks of user data files.

Additional notes:

    The new man page section for this property states

        "The exact behavior of which metadata blocks
         are stored redundantly may change in future releases."

    and:

        "When set to most, ZFS stores an extra copy of most types of
         metadata. This can improve performance of random writes,
         because less metadata must be written."

    The current implementation is as described above in Matt's blog.
    It is controlled by a new global integer
    "zfs_redundant_metadata_most_ditto_level", currently initialized
    to 2. When "redundant_metadata" is set to "most", only indirect
    blocks of the specified level and higher will have additional ditto
    blocks created.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2542
2014-07-31 09:49:34 -07:00
George Wilson
672692c7b7 Illumos 4754, 4755
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4754
  https://www.illumos.org/issues/4755
  https://github.com/illumos/illumos-gate/commit/b6240e8

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2533
2014-07-30 10:30:05 -07:00
Matthew Ahrens
9bd274ddd8 Illumos #4374
4374 dn_free_ranges should use range_tree_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4374
  https://github.com/illumos/illumos-gate/commit/bf16b11

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2531
2014-07-30 09:20:35 -07:00
Matthew Ahrens
da536844d5 Illumos 4368, 4369.
4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4369
  https://www.illumos.org/issues/4368
  https://github.com/illumos/illumos-gate/commit/78f1710

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2530
2014-07-29 10:55:29 -07:00
Max Grossman
b0bc7a84d9 Illumos 4370, 4371
4370 avoid transmitting holes during zfs send
4371 DMU code clean up

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4370
  https://www.illumos.org/issues/4371
  https://github.com/illumos/illumos-gate/commit/43466aa

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2529
2014-07-28 14:29:58 -07:00
Matthew Ahrens
fa86b5dbb6 Illumos 4171, 4172
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features

Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4171
  https://www.illumos.org/issues/4172
  https://github.com/illumos/illumos-gate/commit/2acef22

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2528
2014-07-25 16:40:07 -07:00
Brian Behlendorf
7a8f0e80ea zfs_trunc() should use dmu_tx_assign(tx, TXG_WAIT)
As part of the write throttle & i/o schedule performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2488
2014-07-22 09:41:38 -07:00
George Wilson
080b310015 Illumos #4756 Fix metaslab_group_preload deadlock
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.

References:
  https://www.illumos.org/issues/4756
  https://github.com/illumos/illumos-gate/commit/30beaff

Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:41:32 -07:00
George Wilson
3c51c5cb1f Illumos #4730 destroy metaslab group taskq
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()

Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4730
  https://github.com/illumos/illumos-gate/commit/be08211

Porting notes:

Under ZFSonlinux, one of the effects of not destroying the taskq is that
zdb would never exit (due to the SPL taskq implementation).

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:41:06 -07:00
George Wilson
93cf20764a Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.

This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.

The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram

In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:

    * 4K sector devices will not see any compression benefit
    * large space_maps require more metadata on-disk
    * large space_maps require more time to load (typically random reads)

Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.

A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.

References:
  https://www.illumos.org/issues/4101
  https://www.illumos.org/issues/4102
  https://www.illumos.org/issues/4103
  https://www.illumos.org/issues/4105
  https://www.illumos.org/issues/4106
  https://github.com/illumos/illumos-gate/commit/0713e23

Porting notes:

A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:39:16 -07:00
Prakash Surya
1be627f5c2 Move metaslab_group_alloc_update() call
This changes moves the called to metaslab_group_alloc_update() to the
metaslab_sync_reassess() function. The original placement of the call
within metaslab_sync_done() appears to have been a simple mistake,
introduced by ac72fac3ea.

This aligns us more closely to the upstream illumos code base.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-07-22 09:38:16 -07:00
Brian Behlendorf
1e8db77102 Fix zil_commit() NULL dereference
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot.  If they do somehow get
dirtied an attempt will make made to write them to disk.  In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2405
2014-07-17 15:15:07 -07:00
George Wilson
2fbc542ebd Illumos 4168, 4169, 4170: ztest, zdb and zhack fixes
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
    https://www.illumos.org/issues/4168
    https://www.illumos.org/issues/4169
    https://www.illumos.org/issues/4170
    https://github.com/illumos/illumos-gate/commit/7fdd916

Porting notes:

Of particular interest when troubleshooting corrupted pools, the
commonly-used "zdb -e" operation may perform verbatim imports and
furthermore, it will soon have direct support for verbatim imports via
a new "-V" option.  The 4169 fix eliminates a common segfault case in
which spa_history_log_version() tries to access an un-opened dsl_pool_t.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2451
Closes #2283
Closes #2467
2014-07-17 11:37:57 -07:00
Tim Chase
f4a4046bd6 Convert zfs_mg_noalloc_threshold to a module parameter and document
The parameter was added as illumos issue 4081 which was committed to
zfsonlinux in ac72fac3ea.  This patch
documents the parameter and allows for it to be set as a module parameter.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2483
2014-07-16 16:49:25 -07:00
Andrew Barnes
61e99a73bc Preserve asize when last mirror child promoted to top-level vdev
If the smaller of 2 different sized child vdev's of a mirrored vdev is
detached, and the pool has the autoexpand property set to off, as the
remaining larger vdev is promoted to a top level vdev it fails to retain
the asize of the original top level mirror vdev and therefore partially
autoexpands.

This partially autoexpanded state leaves the new vdev too large to
re-mirror by adding the smaller vdev back in, and the pool fails to
utilize the space until next imported.

If the autoexpand property is set to on, the child vdev grows
in size after it has been promoted to a top level vdev as expected.

This commit causes the remaining child mirror to retain the asize of its
old parent mirror vdev if the autoexpand property is set to off,
this allows the smaller vdev to be re-added if required the vdev
can then be told to expand if required by the usual using zpool online -e.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #1208
2014-07-02 14:04:29 -07:00
Dan McDonald
ee4712284c Illumos #4936 fix potential overflow in lz4
4936 lz4 could theoretically overflow a pointer with a certain input
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported by: Tim Chase <tim@chase2k.com>

References:
  https://illumos.org/issues/4936
  https://github.com/illumos/illumos-gate/commit/58d0718

Porting notes:

This fixes the widely-reported "20-year-old vulnerability" in
LZO/LZ4 implementations which inherited said bug from the reference
implementation.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2429
2014-07-01 14:10:47 -07:00
Tim Chase
4240dc332d Comment the lack of real_LZ4_uncompress()
Added several comments regarding the removal of real_LZ4_uncompress()
which exists in the reference implementation but has been removed here
since it's not used.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-07-01 14:09:56 -07:00
Brian Behlendorf
7f6884f419 Revert "Fix __zio_execute() asynchronous dispatch"
This reverts commit 91579709fc which
limited the asynchronous dispatch to kernel space.  We want to do
this for two reasons:

1) While we have slightly more headroom in user space excessively
   deep stacks have been observed while running ztest, see #2293.

2) Removing this conditional makes the pipeline behave consistently
   regardless of if it's executing in kernel space or user space.
   This way we're more likely to uncover subtle issues with ztest.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2384
2014-06-11 16:32:57 -07:00
Brian Behlendorf
6795a698f4 Use default slab types
We should not override the default memory type of the kmem cache.  This
was done previously to force certain objects which were slightly over
object size limit cut off in to KMC_KMEM caches for better performance.

The zfsonlinux/spl#356 patch slightly increases the default cut off
from 511 bytes 1024 bytes for x86_64.  This means there is long longer
a need to override the default for the caches.  And since the default
values are now being used the new spl_kmem_cache_slab_limit and
spl_kmem_cache_kmem_limit tunables will apply to all kmem caches.

The following is a list of caches that will be impacted:

                  | object size   | forced type   | default type
----------------- | ------------- | ------------- | --------------
dnode_t           | 936 bytes     | KMC_KMEM      | KMC_KMEM
zio_cache         | 1104 bytes    | *KMC_KMEM     | *KMC_VMEM
zio_link_cache    | 48 bytes      | KMC_KMEM      | KMC_KMEM
zio_vdev_cache    | 131088 bytes  | KMC_VMEM      | KMC_VMEM
zio_buf_512       | 512 bytes     | KMC_KMEM      | KMC_KMEM
zio_data_buf_512  | 512 bytes     | KMC_KMEM      | KMC_KMEM
zio_buf_1024      | 1024 bytes    | KMC_KMEM      | KMC_KMEM
zio_data_buf_1024 | 1024 bytes    | +KMC_VMEM     | +KMC_KMEM

* Cache memory type will change from KMC_KMEM to KMC_VMEM.
+ Cache memory type will change from KMC_VMEM to KMC_KMEM.

This patch removes another slight point of divergence between ZoL
and Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2337
2014-05-22 10:39:52 -07:00
HC
f9a1ac4d59 Honor zfs_nocacheflush for file vdevs
For consistency with disk vdevs honor the zfs_nocacheflush tunable.
This setting is available primarily for debugging and performance
analysis.

Signed-off-by: HC <mmttdebbcc@yahoo.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2336
2014-05-19 13:30:48 -07:00
Tim Chase
83021b47c2 Calculate header size correctly in sa_find_sizes()
In the case where a variable-sized SA overlaps the spill block pointer and
a new variable-sized SA is being added, the header size was improperly
calculated to include the to-be-moved SA.  This problem could be
reproduced when xattr=sa enabled as follows:

	ln -s $(perl -e 'print "x" x 120') blah
	setfattr -n security.selinux -v blahblah -h blah

The symlink is large enough to interfere with the spill block pointer and
has a typical SA registration as follows (shown in modified "zdb -dddd"
<SA attr layout obj> format):

	[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ]

Adding the SA xattr will attempt to extend the registration to:

	[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ]

but since the ZPL_SYMLINK SA interferes with the spill block pointer, it
must also be moved to the spill block which will have a registration of:

	[ ZPL_SYMLINK ZPL_DXATTR ]

This commit updates extra_hdrsize when this condition occurs, allowing
hdrsize to be subsequently decreased appropriately.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2214
Issue #2228
Issue #2316
Issue #2343
2014-05-19 11:55:50 -07:00
Tim Chase
3937ab20f3 Allow for lock-free reading zfsdev_state_list.
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed.  It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.

This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.

The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.

Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2301
2014-05-19 11:45:11 -07:00
Chunwei Chen
bc25c9325b Use a dedicated taskq for vdev_file
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.

We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2270
2014-05-14 16:20:21 -07:00
Brian Behlendorf
2c33b91275 Handle vdev_lookup_top() failure in dva_get_dsize_sync()
The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail.  However, the NULL dereference at
clearly shows that under certain circumstances it is possible.  Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.

  BUG: unable to handle kernel NULL pointer dereference at 00000570

  crash> struct -o vdev
  struct vdev {
       [0] uint64_t vdev_id;
       ... ...
    [1376] uint64_t vdev_deflate_ratio;

Given that this can happen this patch add the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com>
Closes #1707
Closes #1987
Closes #1891

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-05-06 10:41:48 -07:00
Tim Chase
962d524212 Check the dataset type more rigorously when fetching properties.
When fetching property values of snapshots, a check against the head
dataset type must be performed.  Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".

This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes.  It also did not allow for the display of
"volsize" of a snapshot of a volume.

This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly.  This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2265
2014-05-06 10:41:46 -07:00
Brian Behlendorf
1ce0457348 Fix style
A minor style issue was accidentally introduced by aa7d06a.
This change resolves that style problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-05-06 10:41:17 -07:00
George Wilson
aa7d06a98a Illumos #4101 finer-grained control of metaslab_debug
Today the metaslab_debug logic performs two tasks:

- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync

This change provides knobs for each of these independently.

References:
  https://illumos.org/issues/4101
  https://github.com/illumos/illumos-gate/commit/0713e23

Notes:

1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2227
2014-05-06 09:46:04 -07:00
Brian Behlendorf
cc79a5c263 Treat spill block dbufs as meta data
When the system attributes (SAs) for an object exceed what can
can be stored in the bonus area of a dnode a spill block is
allocated.  These spill blocks are currently considered data
blocks.  However, they should be accounted for as meta data
because they are effectively an extension of the dnode.

While this may seem like a minor accounting issue it has broader
implications.  The key thing to be aware of is that each spill
block will hold a reference on its parent dnode.  The dnode in
turn holds a reference on its dbuf in the dnode object.  This
means that a single 512 byte data buffer for a spill block can
pin over 16k of meta data.  This is analogous to the small file
situation described in 2b13331 where a relatively small number
of data buffer can cause the ARC to exceed the meta limit.

However, unlike the small file case a spill block can legitimately
be considered meta data.  By changing the spill block to meta data
they will now be dropped from the cache when the meta limit is
reached.  This then allows the dnodes and dbufs which the spill
block was pinning to be released.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #2294
2014-05-05 13:56:59 -07:00
Brian Behlendorf
12f9a6a3f9 dmu_tx_assign() should not return ENOMEM
As described in the comment above dmu_tx_assign() this function must
only fail if the pool is out of space.  If for some other reason the
TX cannot be assigned (such as memory pressure) ERESTART must be
returned.  Alternately, EAGAIN could be returned to inject a delay
but that isn't required because the caller will block on the condition
variable waiting for the next TXG.

/*
 * Assign tx to a transaction group.  txg_how can be one of:
 *
 * (1)  TXG_WAIT.  If the current open txg is full, waits until there's
 *      a new one.  This should be used when you're not holding locks.
 *      It will only fail if we're truly out of space (or over quota).
 * ...
 */

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2287
2014-05-01 12:08:53 -07:00
Richard Yao
9d317793aa Implement File Attribute Support
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().

Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method.  That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.

References:
  https://bugs.gentoo.org/show_bug.cgi?id=483516

Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1691
2014-05-01 10:11:18 -07:00
Chunwei Chen
17584980b9 Add assertion to catch 0-count page
Some network related block device uses tcp_sendpage, which doesn't
behave well when using 0-count page. Add assertion to catch them.

This has a runtime dependency on:
zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2277
2014-04-25 15:41:19 -07:00
Ned Bass
de39ec11b8 Fix LZ4 endianness autodetection
Endianness detection in LZ4 is broken in user-space builds.  This
bug corrupts compressed data and manifests itself in several ztest
failures.  When LZ4 was originally ported to Illumos ZFS, the proper
checks for Linux were stripped out. The Linux port then inherited
the remaining detection code that works on Illumos but not on Linux.

The current LZ4 endianness check misuses the condition
defined(__BIG_ENDIAN) to indicate a big-endian system.  On Linux
__BIG_ENDIAN is defined uncondtionally in the user-space header
/usr/include/endian.h, regardless of the endianness of the system.
The kernel does not use this header, so only user-space builds are
affected.

While we could fix this by restoring the upstream LZ4 endianness
detection code, reliable checks already exist in
libspl/include/sys/isa_defs.h. This change uses the libspl results
to replace the word-size and endianness checks in LZ4, simplifying
the code and reducing duplication.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #1963
Fixes #1964
Fixes #1965
2014-04-20 16:55:42 -07:00
Brian Behlendorf
4fd762f8ad Fix zfsdev_ioctl() kmem leak warning
Due to an asymmetry in the kmem accounting a memory leak was being
reported when it was only an accounting issue.  All memory allocated
with kmem_alloc() must be released with kmem_free() or it will not
be properly accounted for.

In this case the code used strfree() to release the memory allocated
by kmem_alloc().  Presumably this was done because the size of the
memory region wasn't available when the memory needed to be freed.

To resolve this issue the code has been updated to use strdup() instead
of kmem_alloc() to allocate the memory.  Like strfree(), strdup() is
not integrated with the memory accounting.  This means we can use
strfree() to release it like Illumos.

  SPL: kmem leaked 10/4368729 bytes
  address          size  data             func:line
  ffff880067e9aa40 10    ZZZZZZZZZZ       zfsdev_ioctl:5655

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #2262
2014-04-18 13:30:15 -07:00
DHE
2dbedf5484 Uninitialized variable spa_autoreplace used
Caught by ztest and valgrind.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2259
2014-04-16 10:59:24 -07:00
Chunwei Chen
0b75bdb369 Use ddi_time_after and friends to compare time
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2142
2014-04-14 13:27:56 -07:00
Chunwei Chen
b761912b34 Linux 3.14 compat: rq_for_each_segment in dmu_req_copy
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:51 -07:00
Chunwei Chen
22760eebef Revert "Fix zvol+btrfs hang"
After the dmu_req_copy change, bi_io_vecs are not touched, so this is no
longer needed.

This reverts commit e26ade5101.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:47 -07:00
Chunwei Chen
215b4634c7 Refactor dmu_req_copy for immutable biovec changes
Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it
can continue in subsequent passes. However, after the immutable biovec changes
in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how
many bytes are already copied and it will skip to the right spot accordingly.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:43 -07:00
Chunwei Chen
d4541210f3 Linux 3.14 compat: Immutable biovec changes in vdev_disk.c
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:38 -07:00
Chunwei Chen
408ec0d2e1 Linux 3.14 compat: posix_acl_{create,chmod}
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:27:03 -07:00
Richard Yao
f3ad9cd67a Fix locking order in zfs_zget()
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-04 09:12:47 -07:00
Richard Yao
6f9548c487 Fix deadlock in zfs_zget()
zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternately, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180
2014-04-04 09:11:54 -07:00
Brian Behlendorf
8ac67298b1 Revert "Fixed a use-after-free bug in zfs_zget()."
This reverts commit 36df284366.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-03 16:23:28 -07:00
Brian Behlendorf
904ea2763e Add automatic hot spare functionality
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.

To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev.  This covers both io and checksum zevents but
may be used but other scripts.

In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:

  vdev_spare_paths  - List of vdev paths for all hot spares.
  vdev_spare_guids  - List of vdev guids for all hot spares.
  vdev_read_errors  - Read errors for the problematic vdev
  vdev_write_errors - Write errors for the problematic vdev
  vdev_cksum_errors - Checksum errors for the problematic vdev.

By default the required hot spare scripts are installed but this
functionality is disabled.  To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.

These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system.  It also requires that the
autoexpand policy be ported from Illumos, see:

  https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c

Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
2014-04-02 13:10:08 -07:00
Brian Behlendorf
9b101a7320 Clarify zpool_events_next() comment
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location.  When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor.  Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor.  When the file descriptor is
closed the kernel state tracking the cursor is destroyed.

This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-03-31 16:11:08 -07:00
Brian Behlendorf
75e3ff58fe Add zpool_events_seek() functionality
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID.  When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-03-31 16:10:57 -07:00
Brian Behlendorf
a2f1945ee3 Add a unique "eid" value to all zevents
Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents.  By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
2014-03-31 16:10:41 -07:00
Boris Protopopov
0ed212dc0e Illumos #4089 NULL pointer dereference in arc_read()
4089 NULL pointer dereference in arc_read()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4089
  illumos/illumos-gate@57815f6b95

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2171
Issue #2165
Closes #2198
2014-03-24 11:06:57 -07:00
Richard Yao
26b42f3f9d Implement -t option to zpool import for temporary pool names
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2189
2014-03-20 12:05:30 -07:00
Andrey Vesnovaty
00fcdee1f8 Fix regression introduced in port of Illumos #3744
Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66, which
ported the Illumos 3744 changes.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2156
2014-03-20 11:00:48 -07:00
Boris Protopopov
47fe91b54c Illumos #4088 use after free in arc_release()
4088 use after free in arc_release()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4088
  illumos/illumos-gate@ccc22e1304

From the illumos issue:

A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #2162
2014-03-10 09:11:15 -07:00
Tim Chase
a45fc6a677 Use KM_PUSHPAGE in spa_add() for spa_label_features.
The spa_label_features nvlist is used in the sync context during pool
version upgrade.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2168
2014-03-10 09:09:30 -07:00
Brian Behlendorf
e74400155f Export symbols dsl_sync_task{_nowait}
These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context.  Both a blocking and non-blocking
version are available to callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-03-07 10:01:36 -08:00
Ned Bass
a77c4c8332 Improve reporting of tx assignment wait times
Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign().  This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.

One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded.  However, this case should not often arise in practice,
because most callers currently use one of these forms:

  dmu_tx_assign(tx, TXG_WAIT);
  dmu_tx_assign(tx, waited ?  TXG_WAITED : TXG_NOWAIT);

The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed.  Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.

Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram.  Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1915
2014-03-04 12:22:24 -08:00
Ned Bass
3ccab25205 replace nreserved with ndirty in txgs kstat
The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007.

Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead.  To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io().  This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io().  To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().

Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696
2014-03-04 12:22:24 -08:00
Ned Bass
3d920a1567 dmu_tx kstat cleanup
A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.

- The sync task restructuring commit
  13fe019870 removed the code
  that bumpted dmu_tx_quota. The counter is now bumped in two
  cases, instead of just the one case as before (after the result
  of dsl_dataset_check_quota call). The second case is where
  we check the requested reservation against the actual pool size,
  as this is an implicit quota of sorts.

- The write throttle restructuring commit
  e8b96c6007 makes dmu_tx_how and
  dmu_tx_inflight obsolete, so they are removed.

Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1914
2014-03-04 12:22:24 -08:00
Richard Yao
cecb7487fc Invalidate Linux buffer cache on vdevs upon each flush
Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.

A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.

This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.

Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2150
2014-03-04 12:22:03 -08:00
Alexander Stetsenko
36f92e93e5 Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel
4574 get_clones_stat does not call zap_count in non-debug kernel

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/4574
  illumos/illumos-gate@03d1795fa6

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2147
2014-03-04 11:50:13 -08:00
Tim Chase
13a7ba1c2c Fix zap_lookup() in feature_is_supported().
The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2141
2014-03-04 11:44:44 -08:00
Andrew Barnes
1ba1615925 Remove recursion from dsl_dir_willuse_space()
Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.

Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2069
2014-03-04 11:22:27 -08:00
Prakash Surya
2b13331d62 Set "arc_meta_limit" to 3/4 arc_c_max by default
Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..

Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).

This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.

If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.

So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.

To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).

Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:

    - 16K for the dnode object's block - [        16384 bytes]
    - N * sizeof(dnode_t) -------------- [      N * 928 bytes]
    - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) *  72 bytes]
    - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
    - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]

To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:

    Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
    Pinned "arc_meta_used" bytes = 17000 + N * 1544

This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.

Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:

    Total = 17000 + N * 1544 + S * N
            ^^^^^^^^^^^^^^^^   ^^^^^
                metadata        data

So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).

                                  File Sizes (S)
       |    512   |   1024   |   2048   |   4096   |   8192   |   16384  |
    ---+----------+----------+----------+----------+----------+----------+
     1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
     2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
 N   4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
     8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
    16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
    32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |

As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.

This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
cc7f677c16 Split "data_size" into "meta" and "data"
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.

To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
da8ccd0ee0 Prioritize "metadata" in arc_get_data_buf
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
77765b540b Remove "arc_meta_used" from arc_adjust calculation
Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
94520ca462 Prune metadata from ghost lists in arc_adjust_meta
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.

This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".

Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.

In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
1e3cb67b53 Revert "Return -1 from arc_shrinker_func()"
This reverts commit c11a12bc3b.

Out of memory events were fixed by reverting this patch.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
624227854e Disable arc_p adapt dampener by default
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:49 -08:00
Prakash Surya
f521ce1b9c Allow "arc_p" to drop to zero or grow to "arc_c"
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 16:10:27 -08:00
Prakash Surya
89c8cac493 Disable aggressive arc_p growth by default
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 14:53:28 -08:00
Prakash Surya
39e055c44b Adjust arc_p based on "bytes" in arc_shrink
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753f16
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.

This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
2014-02-21 14:53:08 -08:00
Brian Behlendorf
9141582592 Set zfs_arc_min to 4MB
Decrease the mimimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over a 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the mimimum
   requirement is overkill.

2) Tiny systems like the raspberry pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110
2014-02-21 14:52:02 -08:00
Richard Yao
4f2dcb3eee Add erratum for issue #2094
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit.  Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.

The additional field has been removed by the previous commit.  However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied.  This will return the pools to a state in which they
may be imported by other implementations.

The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted.  A message similar to one of the following means your
pool must be scrubbed to restore compatibility.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
 status: Errata #1 detected.
 action: The pool can be imported using its name or numeric identifier,
         however there is a compatibility issue which should be corrected
         by running 'zpool scrub'
    see: http://zfsonlinux.org/msg/ZFS-8000-ER
 config:
 ...

$ zpool status
  pool: zol-0.6.2-173
 state: ONLINE
  scan: pool compatibility issue detected.
   see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...

If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported.  Further advice on how to proceed will be
provided by the error message as follows.

$ zpool import
   pool: zol-0.6.2-173
     id: 1165955789558693437
  state: ONLINE
 status: Errata #2 detected.
 action: The pool can not be imported with this version of ZFS due to an
         active asynchronous destroy.  Revert to an earlier version and
         allow the destroy to complete before updating.
         see: http://zfsonlinux.org/msg/ZFS-8000-ER
 config:
 ...

Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:

zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w

Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.

The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys.  Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.

This patch does have one limitation that should be mentioned.  It will not
detect errata #2 on a pool unless errata #1 is also present.  It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.

End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL.  The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root.  This will display a non-zero
value for any pool with an active asynchronous destroy.

Lastly, it is expected that no user data has been lost as a result of
this erratum.

Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
2014-02-21 12:10:40 -08:00
Brian Behlendorf
ffe9d38275 Add generic errata infrastructure
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool.  These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate.  The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation.  Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.

To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected.  This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.

To add a new errata the following changes must be made:

* A new errata identifier must be assigned by adding a new enum value
  to the zpool_errata_t type.  New enums must be added to the end to
  preserve the existing ordering.

* Code must be added to detect the issue.  This does not strictly
  need to be done at pool import time but doing so will make the
  errata visible in 'zpool import' as well as 'zpool status'.  Once
  detected the spa->spa_errata member should be set to the new enum.

* If possible code should be added to clear the spa->spa_errata member
  once the errata has been resolved.

* The show_import() and status_callback() functions must be updated
  to include an informational message describing the errata.  This
  should include an action message describing what an administrator
  should do to address the errata.

* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
  updated to describe the errata.  This space can be used to provide
  as much additional information as needed to fully describe the errata.
  A link to this documentation will be automatically generated in the
  output of 'zpool import' and 'zpool status'.

Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
2014-02-21 12:10:40 -08:00
Richard Yao
ed9e8368d3 Revert changes to zbookmark_t
Commit 1421c89142 added a field to
zbookmark_t that unintentinoally caused a disk format change. This
negatively affected backward compatibility and platform portability.
Therefore, this field is being removed.

The function that field permitted is left unimplemented until a later
patch that will reimplement the field in a way that does not affect the
disk format.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
2014-02-21 12:10:39 -08:00
Tim Chase
98fad86293 Propagate errors when registering "relatime" property callback.
Various errors can occur when registering property callbacks.  As the
author's comments indicate, the code is very paranoid about preserving
the first-seen error when registering callbacks.  This patch causes an
error seen while registering the "relatime" callback to not clobber a
previously-seen error.

Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2117
2014-02-12 09:38:28 -08:00
Brian Behlendorf
c5cb66addc Fix corrupted l2_asize in arcstats
Commit e0b0ca9 accidentally corrupted the l2_asize displayed in
arcstats.  This was caused by changing the l2arc_buf_hdr.b_asize
member from an int to uint32_t type.  There are places in the
code where this field is cast to a uint64_t resulting in the
b_hits member being treated as part of b_asize.

To resolve the issue the type has been changed to a uint64_t,
and the b_hits member is placed after the enum to prevent the
size of the structure from increasing.

This is a good example of exactly why it's a bad idea to use
ambiguous types (int) in these structures.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1990
2014-02-05 12:24:53 -08:00
Matthew Ahrens
2e7b7657cd 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Refences:
  https://www.illumos.org/issues/4188
  illumos/illumos-gate@bb411a08b0

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2091
2014-01-31 10:49:34 -08:00
Matthew Ahrens
8b4646494c Illumos 4504 traverse_visitbp: visit group before user
4504 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>

References:
  https://illumos.org/issues/4504
  http://code.delphix.com/illumos-4504
  http://svnweb.freebsd.org/base?view=revision&revision=260812

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2079
2014-01-29 15:50:49 -08:00
Tim Chase
6d111134c0 Implement relatime.
Add the "relatime" property.  When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access.  This behavior
is compatible with the Linux "relatime" mount option.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2064
Closes #1917
2014-01-29 15:50:44 -08:00
Cyril Plisko
01b738f457 Call gethrtime() only once per new txg creation
When transitioning current open TXG into QUIESCE state and opening
a new one txg_quiesce() calls gethrtime():
  - to mark the birth time of the new TXG
  - to record the SPA txg history kstat
  - implicitely inside spa_txg_history_add()

These timestamps are practically the same, so that the first one
can be used instead of the other two.  The only visible difference
is that inside spa_txg_history_add() the time spent in kmem_zalloc()
will be counted towards the opened TXG.

Since at this point the new TXG already exists (tx->tx_open_txg
has been already incremented) it is actually a correct accounting.

In any case this extra work is only happening when spa_txg_history
kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect
the normal processing in any way.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Issue #2075
2014-01-23 13:31:51 -08:00
Igor Lvovsky
478d64fdae Add additional state TXG_STATE_WAIT_FOR_SYNC for txg.
In several cases when digging into kstats we can found two txgs
in SYNC state, e.g.

txg     birth            state  nreserved  nread      nwritten ...
985452  258127184872561  C      0          373948416  2376272384 ...
985453  258129016180616  C      0          378173440  28793344 ...
985454  258129016271523  S      0          0          0 ...
985455  258130864245986  S      0          0          0 ...
985456  258130867458851  O      0          0          0 ...

However only first txg (985454) is really syncing at this moment.
The other one (985455) marked as SYNCED is actually in a post-QUIESCED
state and waiting to start sync.   So, the new TXG_STATE_WAIT_FOR_SYNC
state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to
reveal this situation.

txg     birth            state  nreserved  nread      nwritten ...
1086896 235261068743969  C      0          163577856  8437248 ...
1086897 235262870830801  C      0          280625152  822594048 ...
1086898 235264172219064  S      0          0          0 ...
1086899 235264936134407  W      0          0          0 ...
1086900 235264936296156  O      0          0          0 ...

Signed-off-by: Igor Lvovsky <ilvovsky@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2075
2014-01-23 13:31:51 -08:00
Shen Yan
93292b3081 Use enum type(zfetch_dirn_t) instead
Fix code with zfetch_dirn_t, which is more readable and clear.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2068
2014-01-23 12:56:33 -08:00
Tim Chase
4461aa6118 Allow chown/chgrp when no ACL SAs exist.
From the comment in the commit:

Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES
SA in which case ENOENT is returned from zfs_acl_node_read() when the
SA can't be located.  Allow chown/chgrp to succeed in these cases rather
than returning an error that makes no sense in the context of the caller.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfs-osx/zfs#86
Closes #1911
Closes #2029
2014-01-23 11:07:29 -08:00
Ned Bass
04aa2de8f7 vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE)
The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on.  In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock.  To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1928
2014-01-23 09:58:07 -08:00
Brian Behlendorf
35d3e32274 Use long holds in zvol_set_volsize()
Under Linux the zvol_set_volsize() function was originally written
to use dmu_objset_hold()/dmu_objset_rele().  Subsequently, the
dmu_objset_own()/dmu_objset_disown() interfaces were added but
the ZVOL code wasn't updated to take advantage of them.

This was never an issue but after the dsl_pool_config changes
the code now takes the config lock twice.  The cleanest solution
is to shift to using dmu_objset_own() which takes a long hold
on the dataset and does not hold the dsl pool lock.

This patch also slightly restructures the existing code such
that it more closely resembles the upstream Illumos code.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2039
2014-01-14 14:46:12 -08:00
Brian Behlendorf
fd23720ae1 Drain iput taskq outside z_teardown_lock
It's unsafe to drain the iput taskq while holding the z_teardown_lock
as a writer.  This is because when the last reference on an inode is
dropped it may still have pages which need to be written to disk.
This will be done through zpl_writepages which will acquire the
z_teardown_lock as a reader in ZFS_ENTER.  Therefore, if we're
holding the lock as a writer in zfs_sb_teardown the unmount will
deadlock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #1988
2014-01-09 15:54:08 -08:00
Brian Behlendorf
4fcc43790c Force LZ4_FORCE_SW_BITCOUNT for Sparc
This change was proposed for Sparc but it's not clear to me
why it's required.  Proper support exists in the lz4 code to
detect the endianness and the required builtins are available
for gcc.  Still I'm including the patch because it will only
impact Sparc and it may resolve a case which hasn't occured
to me.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
2014-01-09 15:54:03 -08:00
Brian Behlendorf
b585bc4afa Fix zfs_getattr_fast types
On Sparc sp->blksize will be a 64-bit value which is then cast
incorrectly to a 32-bit value.  For big endian systems this
results in an incorrect value for sp->blksize.  To resolve the
problem local variables of the correct size are used and then
assigned to sp->blksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
2014-01-09 15:50:23 -08:00
Brian Behlendorf
7f89ae6ba0 Use local variable to read zp->z_mode
When accessing the zp->z_mode through the SA bulk interface we
expect that 64-bits are available to hold the result.  However,
on 32-bit platforms mode_t will only be 32-bits so we cannot
pass it to SA_ADD_BULK_ATTR().  Instead a local uint64_t variable
must be used and the result assigned to zp->z_mode.

This went unnoticed on 32-bit little endian platforms because
the bytes happen to end up in the correct 32-bits.  But on big
endian platforms like Sparc the zp->z_mode will always end up
set to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
2014-01-09 15:50:11 -08:00
John Layman
ecf3d9b8e6 Add ddt, ddt_entry, and l2arc_hdr caches
Back the allocations for ddt tables+entries and l2arc headers with
kmem caches.  This will reduce the cost of allocating these commonly
used structures and allow for greater visibility of them through the
/proc/spl/kmem/slab interface.

Signed-off-by: John Layman <jlayman@sagecloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1893
2014-01-07 10:33:11 -08:00
Tim Chase
fb8e608d9d Fix the creation of ZPOOL_HIST_CMD pool history entries.
Move the libzfs_fini() after the zpool_log_history() call so the
ZPOOL_HIST_CMD entry can get written.

Fix the handling of saved_poolname in zfsdev_ioctl()
which was broken as part of the stack-reduction work in
a168788053.

Since ZoL destroys the TSD data in which the previously successful
ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY
ioctl has a very important restriction: it can only successfully write
a long entry following a successful ioctl() if no intervening vops have
been performed.  Some of zfs subcommands do perform intervening vops and
to do the logging themselves. At the moment, the "create" and "clone"
subcommands have been modified appropriately.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1998
2014-01-07 09:00:26 -08:00
Tim Chase
5d862cb0d9 Properly handle updates of variably-sized SA entries.
During the update process in sa_modify_attrs(), the sizes of existing
variably-sized SA entries are obtained from sa_lengths[]. The case where
a variably-sized SA was being replaced neglected to increment the index
into sa_lengths[], so subsequent variable-length SAs would be rewritten
with the wrong length. This patch adds the missing increment operation
so all variably-sized SA entries are stored with their correct lengths.

Previously, a size-changing update of a variably-sized SA that occurred
when there were other variably-sized SAs in the bonus buffer would cause
the subsequent SAs to be corrupted.  The most common case in which this
would occur is when a mode change caused the ZPL_DACL_ACES entry to
change size when a ZPL_DXATTR (SA xattr) entry already existed.

The following sequence would have caused a failure when xattr=sa was in
force and would corrupt the bonus buffer:

	open(filename, O_WRONLY | O_CREAT, 0600);
	...
	lsetxattr(filename, ...);	/* create xattr SA */
	chmod(filename, 0650);		/* enlarges the ACL */

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1978
2013-12-20 13:52:33 -08:00
Michael Kjorling
d1d7e2689d cstyle: Resolve C style issues
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written.  Others were
caused when the common code was slightly adjusted for Linux.

This patch contains no functional changes.  It only refreshes
the code to conform to style guide.

Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request.  The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1821
2013-12-18 16:46:35 -08:00
Turbo Fredriksson
fd8febbd1e Add zfs_send_corrupt_data module option
Tuning setting to ignore read/checksum errors when sending data.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1982
Issue #1897
2013-12-18 16:46:35 -08:00
Chunwei Chen
7dc71949f2 Fix z_sync_cnt decrement in zfs_close
The comment in zfs_close states that "Under Linux the zfs_close() hook
is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close
is associated with every successful struct file creation/deletion, which
should always be balanced.

Here is an example of what's wrong:

Process A		B
	open(O_SYNC)
	z_sync_cnt = 1
			open(O_SYNC)
			z_sync_cnt = 2
	close()
	z_sync_cnt = 0

So z_sync_cnt is 0 even if B still has the file with O_SYNC.

Also moves the generic_file_open call before zfs_open to ensure that in
the case generic_file_open fails z_sync_cnt is not incremented.  This
is safe because generic_file_open has no side effects.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1962
2013-12-17 10:28:27 -08:00
Brian Behlendorf
ce37ebd2eb cstyle: zvol.c
Update zvol.c to conform to the style guidelines, verified by
running cstyle.pl on the source file.  This patch contains
no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #1821
2013-12-16 09:41:45 -08:00
Brian Behlendorf
2e0358cbca Sync /dev/zfs ioctl ordering
In order to minimize any future disruption caused by the addition
and removal /dev/zfs ioctls this patch makes the following changes.

1) Sync ZoL's ioctl ordering such that it matches Illumos.  For
   historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID
   ioctls were out of order.

2) Move Linux and FreeBSD specific ioctls in to their own reserved
   ranges.  This allows us to preserve the existing ordering when
   new ioctls are added by either Illumos or FreeBSD.  When an
   ioctl is no longer needed it should be retired in place.

This change alters the ZFS user/kernel ABI so make sure you rebuild
both your user and kernel modules.  However, it should allow for a
much stabler interface going forward.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1973
2013-12-16 09:41:39 -08:00
Brian Behlendorf
ba6a24026c Remove ZFC_IOC_*_MINOR ioctl()s
Early versions of ZFS coordinated the creation and destruction
of device minors from userspace.  This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel.  This significantly simplified the code.

However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux.  This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel.  Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes.  This
will minimize, but not eliminate, the disruption to end users.

Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code.  While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.

NOTES:

1) This patch introduces one subtle change in behavior which
   could not be easily avoided.  Prior to this change callers
   of 'zfs create -V ...' were guaranteed that upon exit the
   /dev/zvol/ block device link would be created or an error
   returned.  That's no longer the case.  The utilities will no
   longer block waiting for the symlink to be created.  Callers
   are now responsible for blocking, this is why a 'udev_wait'
   call was added to the 'label' function in scripts/common.sh.

2) The read-only behavior of a ZVOL now solely depends on if
   the ZVOL_RDONLY bit is set in zv->zv_flags.  The redundant
   policy setting in the gendisk structure was removed.  This
   both simplifies the code and allows us to safely leverage
   set_disk_ro() to issue a KOBJ_CHANGE uevent.  See the
   comment in the code for futher details on this.

3) Because __zvol_create_minor() and zvol_alloc() may now be
   called in a sync task they must use KM_PUSHPAGE.

References:
  illumos/illumos-gate@681d9761e8

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1969
2013-12-16 09:15:57 -08:00
George Wilson
dda12da9f1 Illumos #4121 vdev_label_init read only
4121 vdev_label_init should treat request as succeeded when pool
     is read only
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/4121
  illumos/illumos-gate@973c78e94b

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1863
2013-12-12 10:24:01 -08:00
Tim Chase
84b0aac5fd Fix atime handling.
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime.  However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.

This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.

It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.

The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949
2013-12-12 10:23:58 -08:00
david.chen
be5db977ea Remove MAX when initializing arc_c_max
The MAX when initializing arc_c_max doesn't make any sense because
it hasn't been set anywhere before. Though, arc_c_max should be
implicitly set to zero when initializing arc_stats, so the MAX
doesn't make any difference.

The MAX was mistakenly left if place when the Illumos default
values were changed for Linux.

Signed-off-by: david.chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1941
2013-12-10 10:05:40 -08:00
Ned Bass
b6e335bfc4 Revert "Use directory xattrs for symlinks"
This reverts commit 6a7c0ccca4.

A proper fix for Issue #1648 was landed under Issue #1890, so this is no
longer needed.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1648
2013-12-10 09:48:30 -08:00
James Pan
472e7c6085 sa_find_sizes() may compute wrong SA header size
Under the right conditions sa_find_sizes() will compute an incorrect
size of the system attribute (SA) header.  This causes a failed assertion
when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead
to corruption of SA data.

The bug presents itself when there are more than two variable-length SAs
of just the right size to fit in the bonus buffer of a dnode.  The
existing logic fails to account for the SA header space needed to store
the sizes of all the variable-length SAs.

A reproducer was possible on Linux by setting the xattr=sa dataset
property and storing xattrs on symbolic links (Issue #1648).  Note the
corrupt link target name:

$ zfs set xattr=sa tank/fish
$ cd /tank/fish
$ ln -fs 12345678901234567 link
$ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
$ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
$ ls -l link
lrwxrwxrwx 1 root root 17 Dec  6 15:40 link -> 90123456701234567

Commit 6a7c0ccca4 worked around this bug
by forcing xattr's on symlinks to be stored in directory format.  This
change implements a proper fix, so the workaround can now be reverted.

The reference link below contains a reproducer for FreeBSD.

References:
  http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1890
2013-12-10 09:48:15 -08:00
Brian Behlendorf
90ee9ed32f Fix 'zfs diff' shares error
When creating a dataset with ZoL a zsb->z_shares_dir ZAP object
will not be created because shares are unimplemented.  Instead ZoL
just sets zsb->z_shares_dir to zero to indicate there are no shares.

However, if you import a pool which was created with a different
ZFS implementation then the shares ZAP object may exist.  Code was
added to handle this case but it clearly wasn't sufficiently tested
with other ZFS pools.

There was a bug in the zpl_shares_getattr() function which passed
the wrong inode to zfs_getattr_fast() for the case where are shares
ZAP object does exist.  This causes an EIO to be returned to stat64()
which in turn causes 'zfs diff' to fail.

This fix is the pass the correct inode after a sucessful zfs_zget().
Additionally, only put away the references if we were able to get one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Booker <https://github.com/gbooker>
Signed-off-by: timemaster67 <https://github.com/timemaster67>
Closes #1426
Closes #481
2013-12-06 09:42:39 -08:00
Brian Behlendorf
99e349db92 Add module versioning
Use the standard Linux MODULE_VERSION macro to expose the installed
zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions.
This will also automatically add a checksum of the .c files and
headers in "srcversion".  See:

  /sys/module/zavl/version
  /sys/module/zavl/srcversion
  /sys/module/znvpair/version
  /sys/module/znvpair/srcversion
  /sys/module/zunicode/version
  /sys/module/zunicode/srcversion
  /sys/module/zcommon/version
  /sys/module/zcommon/srcversion
  /sys/module/zfs/version
  /sys/module/zfs/srcversion
  /sys/module/zpios/version
  /sys/module/zpios/srcversion

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1923
2013-12-06 09:34:41 -08:00
Matthew Ahrens
e8b96c6007 Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work

1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver.  The scheduler
issues a number of concurrent i/os from each class to the device.  Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes).  The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is.  See the block comment in vdev_queue.c (reproduced
below) for more details.

2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load.  The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system.  When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount.  This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens.  One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync().  Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes.  See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.

This diff has several other effects, including:

 * the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.

 * the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently.  There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.

 * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc.  This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).

--matt

APPENDIX: problems with the current i/o scheduler

The current ZFS i/o scheduler (vdev_queue.c) is deadline based.  The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.

For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due".  One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).

If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os.  This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future.  If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due.  Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).

Notes on porting to ZFS on Linux:

- zio_t gained new members io_physdone and io_phys_children.  Because
  object caches in the Linux port call the constructor only once at
  allocation time, objects may contain residual data when retrieved
  from the cache. Therefore zio_create() was updated to zero out the two
  new fields.

- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
  (vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
  This tree has been replaced by vq->vq_active_tree which is now used
  for the same purpose.

- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
  the number of vdev I/O buffers to pre-allocate.  That global no longer
  exists, so we instead use the sum of the *_max_active values for each of
  the five I/O classes described above.

- The Illumos implementation of dmu_tx_delay() delays a transaction by
  sleeping in condition variable embedded in the thread
  (curthread->t_delay_cv).  We do not have an equivalent CV to use in
  Linux, so this change replaced the delay logic with a wrapper called
  zfs_sleep_until(). This wrapper could be adopted upstream and in other
  downstream ports to abstract away operating system-specific delay logic.

- These tunables are added as module parameters, and descriptions added
  to the zfs-module-parameters.5 man page.

  spa_asize_inflation
  zfs_deadman_synctime_ms
  zfs_vdev_max_active
  zfs_vdev_async_write_active_min_dirty_percent
  zfs_vdev_async_write_active_max_dirty_percent
  zfs_vdev_async_read_max_active
  zfs_vdev_async_read_min_active
  zfs_vdev_async_write_max_active
  zfs_vdev_async_write_min_active
  zfs_vdev_scrub_max_active
  zfs_vdev_scrub_min_active
  zfs_vdev_sync_read_max_active
  zfs_vdev_sync_read_min_active
  zfs_vdev_sync_write_max_active
  zfs_vdev_sync_write_min_active
  zfs_dirty_data_max_percent
  zfs_delay_min_dirty_percent
  zfs_dirty_data_max_max_percent
  zfs_dirty_data_max
  zfs_dirty_data_max_max
  zfs_dirty_data_sync
  zfs_delay_scale

  The latter four have type unsigned long, whereas they are uint64_t in
  Illumos.  This accommodates Linux's module_param() supported types, but
  means they may overflow on 32-bit architectures.

  The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
  likely to overflow on 32-bit systems, since they express physical RAM
  sizes in bytes.  In fact, Illumos initializes zfs_dirty_data_max_max to
  2^32 which does overflow. To resolve that, this port instead initializes
  it in arc_init() to 25% of physical RAM, and adds the tunable
  zfs_dirty_data_max_max_percent to override that percentage.  While this
  solution doesn't completely avoid the overflow issue, it should be a
  reasonable default for most systems, and the minority of affected
  systems can work around the issue by overriding the defaults.

- Fixed reversed logic in comment above zfs_delay_scale declaration.

- Clarified comments in vdev_queue.c regarding when per-queue minimums take
  effect.

- Replaced dmu_tx_write_limit in the dmu_tx kstat file
  with dmu_tx_dirty_delay and dmu_tx_dirty_over_max.  The first counts
  how many times a transaction has been delayed because the pool dirty
  data has exceeded zfs_delay_min_dirty_percent.  The latter counts how
  many times the pool dirty data has exceeded zfs_dirty_data_max (which
  we expect to never happen).

- The original patch would have regressed the bug fixed in
  zfsonlinux/zfs@c418410, which prevented users from setting the
  zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
  A similar fix is added to vdev_queue_aggregate().

- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
  heap instead of the stack.  In Linux we can't afford such large
  structures on the stack.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  http://www.illumos.org/issues/4045
  illumos/illumos-gate@69962b5647

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-12-06 09:32:43 -08:00
Matthew Ahrens
384f8a09f8 Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT)
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.

Porting Notes:

zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version.  This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4347
  illumos/illumos-gate@e722410c49

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-12-06 09:30:51 -08:00
Brian Behlendorf
2e40f09410 Remove incorrect ASSERT in zfs_sb_teardown()
As part of zfs_sb_teardown() there is an assertion that all inodes
which are part of the zsb->z_all_znodes list have at least one
reference on them.  This is always true for the standard unmount
case but there are two other cases where it is not strictly true.

* zfs_ioc_rollback() - This is the most common case and it results
  from the fact that we aren't unmounting the filesystem.  During a
  normal unmount the MS_ACTIVE flag will be cleared on the super block
  causing iput_final() to evict the inode when its reference count
  drops to zero.  However, during a rollback MS_ACTIVE remains set
  since we're rolling back a live filesystem and need to preserve the
  existing super block.  This allows inodes with a zero reference count
  to stay in the cache thereby violating the assertion.

* destroy_inode() / zfs_sb_teardown() - There exists a small race
  between dropping the last reference on an inode and removing it from
  the zsb->z_all_znodes list.  This is unlikely to occur but could also
  trigger the assertion which is incorrect.  The inode may safely have
  a zero reference count in this case.

Since allowing a zero reference count on the inode is expected and
safe for both of these cases the simplest thing to do is remove the
ASSERT.  This code is only enabled for default builds so removing
this entirely is a very safe change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1417
Closes #1536
2013-12-02 15:58:58 -08:00
Tim Chase
f707635fa5 Some nvlist allocations in hold processing need to use KM_PUSHPAGE.
This should hopefully catch the rest of the allocations in the
user hold/release processing that were missed by commit
65c67ea86e.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1852
Closes #1855
2013-12-02 14:02:46 -08:00
Etienne Dechamps
119a394ab0 Only commit the ZIL once in zpl_writepages() (msync() case).
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1849
Closes #907
2013-11-23 15:08:29 -08:00
Brian Behlendorf
e3dc14b861 Add I/O Read/Write Accounting
Because ZFS bypasses the page cache we don't inherit per-task I/O
accounting for free.  However, the Linux kernel does provide helper
functions allow us to perform our own accounting.  These are most
commonly used for direct IO which also bypasses the page cache, but
they can be used for the common read/write call paths as well.

Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #313
Closes #1275
2013-11-21 08:56:24 -08:00
Steven Hartland
e5bacf2109 Illumos #4322
4322 ZFS deadlock on dp_config_rwlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4322
  illumos/illumos-gate@c50d56f667

Ported by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1886
2013-11-20 15:27:32 -08:00
Brian Behlendorf
64ad2b26e2 Remove the slog restriction on bootfs pools
Under Linux this restriction does not apply because we have access
to all the required devices.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1631
2013-11-14 14:28:35 -08:00
Matthew Thode
227bc96951 Fixes (extends) support for selinux xattrs to more inode types
Properly initialize SELinux xattrs for all inode types.  The
initial implementation accidentally only did this for files.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1832
2013-11-14 14:28:35 -08:00
Brian Behlendorf
a168788053 Reduce stack for traverse_visitbp() recursion
During pool import stack overflows may still occur due to the
potentially deep recursion of traverse_visitbp().  This is most
likely to occur when additional layers are added to the block
device stack such as DM multipath.  To minimize the stack usage
for this call path the following changes were made:

1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions
   to prevent them from being inlined by gcc.  This reduced the
   stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and
   vdev_mirror_io_start() from 144 to 128 bytes.

2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved
   from the stack to the heap.  This reduced the stack usage of
   zfsdev_ioctl() from 368 to 112 bytes.

3) The major saving came from slimming down traverse_visitbp() from
   from 224 to 144 bytes.  Since this function is called recursively
   the 80 bytes saved per invokation adds up.  The following changes
   were made:

  a) The 'hard' local variable was replaced by a TD_HARD() macro.

  b) The 'pd' local variable was replaced by 'td->td_pfd' references.

  c) The zbookmark_t was moved to the heap.  This does cost us an
     additional memory allocation per recursion by that cost should
     still be minimal.  The cost could be further reduced by adding
     a dedicated zbookmark_t slab cache.

  d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were
     restructured to use the minimum amount of stack.  This includes
     removing the 'cbp' local variable.

Overall for the offending use case roughly 1584 of total stack space
has been saved.  This is enough to avoid overflowing the stack on
stock kernels with 8k stacks.  See #1778 for additional details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1778
2013-11-14 14:28:12 -08:00