Commit Graph

570 Commits

Author SHA1 Message Date
Brian Behlendorf
dba1d70566 Fix arc_adapt() spinning in iterate_supers_type()
The iterate_supers_type() function which was introduced in the
3.0 kernel was supposed to provide a safe way to call an arbitrary
function on all super blocks of a specific type.  Unfortunately,
because a list_head was used a bug was introduced which made it
possible for iterate_supers_type() to get stuck spinning on a
super block which was just deactivated.

This can occur because when the list head is removed from the
fs_supers list it is reinitialized to point to itself.  If the
iterate_supers_type() function happened to be processing the
removed list_head it will get stuck spinning on that list_head.

The bug was fixed in the 3.3 kernel by converting the list_head
to an hlist_node.  However, to resolve the issue for existing
3.0 - 3.2 kernels we detect when a list_head is used.  Then to
prevent the spinning from occurring the .next pointer is set to
the fs_supers list_head which ensures the iterate_supers_type()
function will always terminate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1045
Closes #861
Closes #790
2013-07-17 09:28:06 -07:00
Brian Behlendorf
c9ada6d5a0 Fix read-only pool hang on unmount
During mount a filesystem dataset would have the MS_RDONLY bit
incorrectly cleared even if the entire pool was read-only.
There is existing to code to handle this case but it was being run
before the property callbacks were registered.  To resolve the
issue we move this read-only code after the callback registration.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1338
2013-07-17 09:22:23 -07:00
Brian Behlendorf
76351672c2 Fix zfsctl_expire_snapshot() deadlock
It is possible for an automounted snapshot which is expiring to
deadlock with a manual unmount of the snapshot.  This can occur
because taskq_cancel_id() will block if the task is currently
executing until it completes.  But it will never complete because
zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which
zfsctl_expire_snapshot() must acquire.

---------------------- z_unmount/0:2153 ---------------------
  mutex_lock                <blocking on zsb->z_ctldir_lock>
  zfsctl_unmount_snapshot
  zfsctl_expire_snapshot
  taskq_thread

------------------------- zfs:10690 -------------------------
  taskq_wait_id             <waiting for z_unmount to exit>
  taskq_cancel_id
  __zfsctl_unmount_snapshot
  zfsctl_unmount_snapshot   <takes zsb->z_ctldir_lock>
  zfs_unmount_snap
  zfs_ioc_destroy_snaps_nvl
  zfsdev_ioctl
  do_vfs_ioctl

We resolve the deadlock by dropping the zsb->z_ctldir_lock before
calling __zfsctl_unmount_snapshot().  The lock is only there to
prevent concurrent modification to the zsb->z_ctldir_snaps AVL
tree.  Moreover, we're careful to remove the zfs_snapentry_t from
the AVL tree before dropping the lock which ensures no other tasks
can find it.  On failure it's added back to the tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes #1527
2013-07-12 10:06:53 -07:00
Brian Behlendorf
556011dbec Improve N-way mirror performance
The read bandwidth of an N-way mirror can by increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.

The existing algorthm selects a perferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror.  It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them.  In practice this results in the
leaf vdevs being under utilized.

Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO.  This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror.  Faster vdevs will be sent more work and
the mirror performance will not be limitted by the slowest drive.

In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching stratagy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T microseconds.  Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.

The testing results show a significant performance improvement
for all four workloads tested.  The workloads were generated
using the fio benchmark and are as follows.

1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).

               | Pristine              |  With 1461             |
               | Sequential  Random    |  Sequential  Random    |
               | 1MB  4KB    1MB  4KB  |  1MB  4KB    1MB  4KB  |
               | MB/s MB/s   IO/s IO/s |  MB/s MB/s   IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped      | 226  243     11  304  |  222  255     11  299  |
2 2-Way Mirror | 302  324     16  534  |  433  448     23  571  |
2 3-Way Mirror | 429  458     24  714  |  648  648     41  808  |
2 4-Way Mirror | 562  601     36  849  |  816  828     82  926  |

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1461
2013-07-11 13:53:50 -07:00
Prakash Surya
92334b14ec Add new kstat for monitoring time in dmu_tx_assign
This change adds a new kstat to gain some visibility into the amount of
time spent in each call to dmu_tx_assign. A histogram is exported via
a new dmu_tx_assign_histogram-$POOLNAME file. The information contained
in this histogram is the frequency dmu_tx_assign took to complete given
an interval range. For example, given the below histogram file:

    $ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank
    12 1 0x01 32 1536 19792068076691 20516481514522
    name                            type data
    1 us                            4    859
    2 us                            4    252
    4 us                            4    171
    8 us                            4    2
    16 us                           4    0
    32 us                           4    2
    64 us                           4    0
    128 us                          4    0
    256 us                          4    0
    512 us                          4    0
    1024 us                         4    0
    2048 us                         4    0
    4096 us                         4    0
    8192 us                         4    0
    16384 us                        4    0
    32768 us                        4    1
    65536 us                        4    1
    131072 us                       4    1
    262144 us                       4    4
    524288 us                       4    0
    1048576 us                      4    0
    2097152 us                      4    0
    4194304 us                      4    0
    8388608 us                      4    0
    16777216 us                     4    0
    33554432 us                     4    0
    67108864 us                     4    0
    134217728 us                    4    0
    268435456 us                    4    0
    536870912 us                    4    0
    1073741824 us                   4    0
    2147483648 us                   4    0

one can see most calls to dmu_tx_assign completed in 32us or less, but a
few outliers did not. Specifically, 4 of the calls took between 262144us
and 131072us. This information is difficult, if not impossible, to gather
without this change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1584
2013-07-11 13:53:44 -07:00
Brian Behlendorf
bf89c19914 Log pool suspension warnings to the console
In the event that a pool gets suspended log this information to
the console.  This is critical information and we want to make
sure it gets logged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1555
2013-07-10 15:15:52 -07:00
Brian Behlendorf
abc41ac7c7 Use GFP_NOIO in vdev_disk_io_flush()
To avoid a potential deadlock when using a zvol as a swap
device prevent vdev_disk_io_flush() from performing IO during
the bio_alloc().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1508
2013-07-10 14:12:21 -07:00
Ying Zhu
b4f7f10527 Improve code in arc_buf_remove_ref
When we remove references of arc bufs in the arc_anon state we
needn't take its header's hash_lock, so postpone it to where we
really need it to avoid unnecessary invocations of function buf_hash.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1557
2013-07-09 11:53:28 -07:00
Shen Yan
8e07b99b2f Update zio.c
The cv_wait_io is used to account io time instead of cv_wait.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1566
2013-07-09 10:41:46 -07:00
Brian Behlendorf
31455ab130 Add zfs_autoimport_disable tunable
There are times when it is desirable for zfs to not automatically
populate the spa namespace at module load time using the pools
in the /etc/zfs/zpool.cache file.  The zfs_autoimport_disable
module option has been added to control this behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #330
2013-07-09 10:11:19 -07:00
Chris Dunlop
a1d9543a39 3.10 API change: block_device_operations->release() returns void
Linux kernel commit torvalds/linux@db2a144 changed the return type
of block_device_operations->release() to void.  Detect the expected
prototype and defined our callout accordingly.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1494
2013-07-08 15:41:57 -07:00
Brian Behlendorf
91604b298c Open pools asynchronously after module load
One of the side effects of calling zvol_create_minors() in
zvol_init() is that all pools listed in the cache file will
be opened.  Depending on the state and contents of your pool
this operation can take a considerable length of time.

Doing this at load time is undesirable because the kernel
is holding a global module lock.  This prevents other modules
from loading and can serialize an otherwise parallel boot
process.  Doing this after module inititialization also
reduces the chances of accidentally introducing a race
during module init.

To ensure that /dev/zvol/<pool>/<dataset> devices are
still automatically created after the module load completes
a udev rules has been added.  When udev notices that the
/dev/zfs device has been create the 'zpool list' command
will be run.  This then will cause all the pools listed
in the zpool.cache file to be opened.

Because this process in now driven asynchronously by udev
there is the risk of problems in downstream distributions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #756
Issue #1020
Issue #1234
2013-07-03 09:24:38 -07:00
Richard Yao
2a3871d4bc Cleanup zvol initialization code
The following error will occur on some (possibly all) kernels
because blk_init_queue() will try to take the spinlock before
we initialize it.

  BUG: spinlock bad magic on CPU#0, zpool/4054
   lock: 0xffff88021a73de60, .magic: 00000000,
   .owner: <none>/-1, .owner_cpu: 0
  Pid: 4054, comm: zpool Not tainted 3.9.3 #11
  Call Trace:
   [<ffffffff81478ef8>] spin_dump+0x8c/0x91
   [<ffffffff81478f1e>] spin_bug+0x21/0x26
   [<ffffffff812da097>] do_raw_spin_lock+0x127/0x130
   [<ffffffff8147d851>] _raw_spin_lock_irq+0x21/0x30
   [<ffffffff812c2c1e>] cfq_init_queue+0x1fe/0x350
   [<ffffffff812aacb8>] elevator_init+0x78/0x140
   [<ffffffff812b2677>] blk_init_allocated_queue+0x87/0xb0
   [<ffffffff812b26d5>] blk_init_queue_node+0x35/0x70
   [<ffffffff812b271e>] blk_init_queue+0xe/0x10
   [<ffffffff8125211b>] __zvol_create_minor+0x24b/0x620
   [<ffffffff81253264>] zvol_create_minors_cb+0x24/0x30
   [<ffffffff811bd9ca>] dmu_objset_find_spa+0xea/0x510
   [<ffffffff811bda71>] dmu_objset_find_spa+0x191/0x510
   [<ffffffff81253ea2>] zvol_create_minors+0x92/0x180
   [<ffffffff811f8d80>] spa_open_common+0x250/0x380
   [<ffffffff811f8ece>] spa_open+0xe/0x10
   [<ffffffff8122817e>] pool_status_check.part.22+0x1e/0x80
   [<ffffffff81228a55>] zfsdev_ioctl+0x155/0x190
   [<ffffffff8116a695>] do_vfs_ioctl+0x325/0x5a0
   [<ffffffff8116a950>] sys_ioctl+0x40/0x80
   [<ffffffff814812c9>] ? do_page_fault+0x9/0x10
   [<ffffffff81483929>] system_call_fastpath+0x16/0x1b
   zd0: unknown partition table

We fix this by calling spin_lock_init before blk_init_queue.

The manner in which zvol_init() initializes structures is
suspectible to a race between initialization and a probe on
a zvol. We reorganize zvol_init() to prevent that.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-03 09:23:35 -07:00
Pawel Jakub Dawidek
526af78550 Call zvol_create_minors() in spa_open_common() when initializing pool
There is an extremely odd bug that causes zvols to fail to appear on
some systems, but not others. Recently, I was able to consistently
reproduce this issue over a period of 1 month. The issue disappeared
after I applied this change from FreeBSD.

This is from FreeBSD's pool version 28 import, which occurred in
revision 219089.

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #441
Issue #599
2013-07-03 09:22:44 -07:00
George Wilson
294f68063b Illumos #3498 panic in arc_read()
3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@1b912ec710
  https://www.illumos.org/issues/3498

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1249
2013-07-02 13:34:31 -07:00
Matthew Ahrens
96b89346c0 Illumos #3122 zfs destroy filesystem should prefetch blocks
3122 zfs destroy filesystem should prefetch blocks
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@b4709335aa
  https://www.illumos.org/issues/3122

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1565
2013-07-02 13:34:02 -07:00
Cyril Plisko
29dee3ee9a Add zfs_sync_pass_* tunable parameters
Commit 55d85d5a8c (backport of
the upstream changes) replaced three hardcoded constants:

    #define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass */
    #define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
    #define SYNC_PASS_REWRITE       1 /* rewrite new bps after this pass */

with a tunable parameters:

    int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */
    int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
    int zfs_sync_pass_rewrite = 2;       /* rewrite new bps starting in this pass */

This commit makes these tunables available as module parameters
in Linux.  They should only be used for performance analysis
because changing them can result in subtle and pathological
performance problems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1562
2013-07-02 09:34:18 -07:00
Li Dongyang
802e7b5feb Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.

Tested with xfstests 285 and 286.  All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.

Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.

An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1384
2013-07-02 09:24:43 -07:00
Matthew Ahrens
cf91b2b6b2 Readd zfs_holey() from OpenSolaris
This patch restores the zfs_holey() function from OpenSolaris.
This was removed by commit 3558fd7 because it wasn't clear we
had a use for it in ZoL.  However, this functionality is a
prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #1384
2013-07-02 09:24:18 -07:00
shenyan1
0a6bef26ec kmem_zalloc(..., KM_SLEEP) will never fail
By definitition these allocations will never fail.  For
consistency with the rest of the code remove this dead error
handling code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1558
2013-07-01 14:51:48 -07:00
Tim Chase
ab68b6e5db Fix zfs_sb_teardown/zfs_resume_fs NULL dereference
Fix a pair of conditions in which a concurrent umount can cause
NULL pointer dereferences:

* zfs_sb_teardown - prevent a NULL dereference by not calling
                    dmu_objset_pool with a null z_os.

* zfs_resume_fs - don't try to unmount with a null z_os.  This
                  change makes the ZoL code more consistent
                  with both Illumos and FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1543
2013-07-01 14:51:45 -07:00
Ying Zhu
c12936b141 Fix module probe failure on 32-bit systems
Previous commit 7ef5e54e2e caused
module probe failure on 32-bit systems, dmesg showed

  Unknown symbol __moddi3

This was caused by the modulo operation 'gethrtime() % tqs->stqs_count'
in the committed code.  Instead of implementing __moddi3 for all 32-bit
systems, Behlendorf advised we can just cast the return value of
gethrtime() into a uint64_t, since gethrtime does not return negative
value on all circumstances we need not care about the potential overflow.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1551
2013-06-27 10:01:25 -07:00
Brian Behlendorf
88c283952f Return -EOPNOTSUPP for ZFS_IOC_{GET|SET}FLAGS
Until these hooks are fully implemented return the expected
-EOPNOTSUPP error to indicate they are not functional.  This
allows test suites such as xfstests to cleanly skip testing
this functionality until it's implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #229
2013-06-26 15:20:13 -07:00
Brian Behlendorf
81eaf15107 Register correct handlers in nvlist_alloc()
The non-blocking allocation handlers in nvlist_alloc() would be
mistakenly assigned if any flags other than KM_SLEEP were passed.
This meant that nvlists allocated with KM_PUSHPUSH or other KM_*
debug flags were effectively always using atomic allocations.

While these failures were unlikely it could lead to assertions
because KM_PUSHPAGE allocations in particular are guaranteed to
succeed or block.  They must never fail.

Since the existing API does not allow us to pass allocation
flags to the private allocators the cleanest thing to do is to
add a KM_PUSHPAGE allocator.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#249
2013-06-20 09:58:15 -07:00
Matthew Ahrens
df4474f92d Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@6e6d5868f5
  https://www.illumos.org/issues/3805

ZFS should proactively evict freed blocks from the cache.

On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk.  We were wasting about half the
system's RAM (252GB) on blocks that have been freed.

Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:

1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.

2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)

The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself).  The secondary control is to
evict data before evicting metadata.

Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks.  Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).

On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB.  However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless.  Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.

The problem of cache segmentation is a larger problem that needs more
investigation.  However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-20 09:55:52 -07:00
George Wilson
e51be06697 Illumos #3552, #3564
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@16a4a80742
  https://www.illumos.org/issues/3552
  https://www.illumos.org/issues/3564

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1513
2013-06-19 16:22:39 -07:00
Madhav Suresh
c99c90015e Illumos #3006
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
     argument is zero

Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@fb09f5aad4
  https://illumos.org/issues/3006

Requires:
  zfsonlinux/spl@1c6d149feb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1509
2013-06-19 15:14:10 -07:00
Brian Behlendorf
0377189b88 Only check directory xattr on ENOENT
When SA xattrs are enabled only fallback to checking the directory
xattrs when the name is not found as a SA xattr.  Otherwise, the SA
error which should be returned to the caller is overwritten by the
directory xattr errors.  Positive return values indicating success
will also be immediately returned.

In the case of #1437 the ERANGE error was being correctly returned
by zpl_xattr_get_sa() only to be overridden with ENOENT which was
returned by the subsequent unnessisary call to zpl_xattr_get_dir().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1437
2013-05-10 12:24:56 -07:00
Cyril Plisko
4f34b3bdf4 zfs_scrub_limit tunable is not used anywhere
As a part of scrub/resilver tuning zfs_scrub_limit fell out of use,
but the definition of the variable remained in place.
Moreover various guides still (misleadingly) mention it as a way
to influence resilver/scrub behavior.
This commit removes its finally.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1444
2013-05-06 14:14:06 -07:00
Ying Zhu
ee664d4631 Fix incorrect assertions in ddt_phys_decref and ddt_sync_entry
The assertions in ddt_phys_decref and ddt_sync_entry cast ddp->ddp_refcnt
from uint64_t to int64_t, with a reference count bigger than 2^63, e.g. the
reference count of zero blocks commonly available in spare files, we may
mistakenly hit these assertations, so drop the type conversions here.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1436
2013-05-06 14:10:55 -07:00
Brian Behlendorf
044baf009a Use taskq for dump_bytes()
The vn_rdwr() function performs I/O by calling the vfs_write() or
vfs_read() functions.  These functions reside just below the system
call layer and the expectation is they have almost the entire 8k of
stack space to work with.  In fact, certain layered configurations
such as ext+lvm+md+multipath require the majority of this stack to
avoid stack overflows.

To avoid this posibility the vn_rdwr() call in dump_bytes() has been
moved to the ZIO_TYPE_FREE, taskq.  This ensures that all I/O will be
performed with the majority of the stack space available.  This ends
up being very similiar to as if the I/O were issued via sys_write()
or sys_read().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1399
Closes #1423
2013-05-06 14:05:42 -07:00
Adam Leventhal
7ef5e54e2e Illumos #3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock contention
3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@ec94d32
  https://illumos.org/issues/3581

Notes for Linux port:

Earlier commit 08d08eb reduced contention on this taskq lock by simply
reducing the number of z_fr_iss threads from 100 to one-per-CPU.  We
also optimized the taskq implementation in zfsonlinux/spl@3c6ed54.
These changes significantly improved unlink performance to acceptable
levels.

This patch further reduces time spent spinning on this lock by
randomly dispatching the work items over multiple independent task
queues.  The Illumos ZFS developers stated that this lock contention
only arose after "3329 spa_sync() spends 10-20% of its time in
spa_free_sync_cb()" was landed.  It's not clear if 3329 affects the
Linux port or not.  I didn't see spa_free_sync_cb() show up in
oprofile sessions while unlinking large files, but I may just not
have used the right test case.

I tested unlinking a 1 TB of data with and without the patch and
didn't observe a meaningful difference in elapsed time.  However,
oprofile showed that the percent time spent in taskq_thread() was
reduced from about 16% to about 5%.  Aside from a possible slight
performance benefit this may be worth landing if only for the sake of
maintaining consistency with upstream.

Ported-by: Ned Bass <bass6@llnl.gov>
Closes #1327
2013-05-06 14:05:37 -07:00
George Wilson
55d85d5a8c Illumos #3329, #3330, #3331, #3335
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@01f55e48fb
  https://www.illumos.org/issues/3329
  https://www.illumos.org/issues/3330
  https://www.illumos.org/issues/3331
  https://www.illumos.org/issues/3335

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-06 12:39:34 -07:00
George Wilson
5853fe790d Illumos #3306, #3321
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man
     page and help

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@31d7e8fa33
  https://www.illumos.org/issues/3306
  https://www.illumos.org/issues/3321

The vdev_file.c implementation in this patch diverges significantly
from the upstream version.  For consistenty with the vdev_disk.c
code the upstream version leverages the Illumos bio interfaces.
This makes sense for Illumos but not for ZoL for two reasons.

1) The vdev_disk.c code in ZoL has been rewritten to use the
   Linux block device interfaces which differ significantly
   from those in Illumos.  Therefore, updating the vdev_file.c
   to use the Illumos interfaces doesn't get you consistency
   with vdev_disk.c.

2) Using the upstream patch as is would requiring implementing
   compatibility code for those Solaris block device interfaces
   in user and kernel space.  That additional complexity could
   lead to confusion and doesn't buy us anything.

For these reasons I've opted to simply move the existing vn_rdwr()
as is in to the taskq function.  This has the advantage of being
low risk and easy to understand.  Moving the vn_rdwr() function
in to its own taskq thread also neatly avoids the possibility of
a stack overflow.

Finally, because of the additional work which is being handled by
the free taskq the number of threads has been increased.  The
thread count under Illumos defaults to 100 but was decreased to 2
in commit 08d08e due to contention.  We increase it to 8 until
the contention can be address by porting Illumos #3581.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1354
2013-05-03 16:53:52 -07:00
George.Wilson
cc92e9d0c3 3246 ZFS I/O deadman thread
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation

*) Usage of the cyclic interface was replaced by the delayed taskq
   interface.  This avoids the need to implement new compatibility
   code and allows us to rely on the existing taskq implementation.

*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
   because declaring externs in source files as was done in the
   original patch is just plain wrong.

*) Instead of panicing the system when the deadman triggers a
   zevent describing the blocked vdev and the first pending I/O
   is posted.  If the panic behavior is desired Linux provides
   other generic methods to panic the system when threads are
   observed to hang.

*) For reference, to delay zios by 30 seconds for testing you can
   use zinject as follows: 'zinject -d <vdev> -D30 <pool>'

References:
  illumos/illumos-gate@283b84606b
  https://www.illumos.org/issues/3246

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1396
2013-05-01 17:05:52 -07:00
Brian Behlendorf
57f5a2008e Fix txg_quiesce thread deadlock
A deadlock was accidentally introduced by commit e95853a which
can occur when the system is under memory pressure.  What happens
is that while the txg_quiesce thread is holding the tx->tx_cpu
locks it enters memory reclaim.  In the context of this memory
reclaim it then issues synchronous I/O to a ZVOL swap device.
Because the txg_quiesce thread is holding the tx->tx_cpu locks
a new txg cannot be opened to handle the I/O.  Deadlock.

The fix is straight forward.  Move the memory allocation outside
the critical region where the tx->tx_cpu locks are held.  And for
good measure change the offending allocation to KM_PUSHPAGE to
ensure it never attempts to issue I/O during reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1274
2013-04-26 14:42:36 -07:00
Brian Behlendorf
f706421173 Correctly return ERANGE in getxattr(2)
According to the getxattr(2) man page the ERANGE errno should be
returned when the size of the value buffer is to small to hold the
result.  Prior to this patch the implementation would just truncate
the value to size bytes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1408
2013-04-24 12:35:04 -07:00
Chris Dunlop
254255f735 Trivial spelling fix
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1411
2013-04-19 15:43:16 -07:00
Caleb James DeLisle
8f1e11b610 Remove .readdir from zpl_file_operations table
The zpl_readdir() function shouldn't be registered as part of
the zpl_file_operations table, it must only be part of the
zpl_dir_file_operations table.  By removing this callback
the VFS will now correctly return ENOTDIR when calling
getdents() on a file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1404
2013-04-19 15:36:47 -07:00
Martin Matuska
b28e57cb82 Allow setting a lower ashift with -o ashift
Previous patches have allowed you to set an increased ashift to
avoid doing 512b IO with 4k sector devices.  However, it was not
possible to set the ashift lower than the reported physical sector
size even when a smaller logical size was supported.  In practice,
there are several cases where settong a lower ashift is useful:

* Most modern drives now correctly report their physical sector
  size as 4k.  This causes zfs to correctly default to using a 4k
  sector size (ashift=12).  However, for some usage models this
  new default ashift value causes an unacceptable increase in
  space usage.  Filesystems with many small files may see the
  total available space reduced to 30-40% which is unacceptable.

* When replacing a drive in an existing pool which was created
  with ashift=9 a modern 4k sector drive cannot be used.  The
  'zpool replace' command will issue an error that the new drive
  has an 'incompatible sector alignment'.  However, by allowing
  the ashift to be manual specified as smaller, non-optimal,
  value the device may still be safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1381
Closes #1328
Issue #967
Issue #548
2013-04-12 10:50:46 -07:00
George Wilson
295304bed6 Illumos #3422, #3425
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@bda8819455
  https://www.illumos.org/issues/3422
  https://www.illumos.org/issues/3425

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1390
2013-04-12 09:01:36 -07:00
Jan Engelhardt
ea0fcfc875 gitignore: anchor entries at their respective directory
.ko is specific to module, .m4 to config, etc.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 10:50:17 -07:00
Jan Engelhardt
4e95cc99b0 build: resolve orthographic and other grammatical errors
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 10:44:52 -07:00
Brian Behlendorf
5dc6af0eec Add zio_ddt_free()+ddt_phys_decref() error handling
The assumption in zio_ddt_free() is that ddt_phys_select() must
always find a match.  However, if that fails due to a damaged
DDT or some other reason the code will NULL dereference in
ddt_phys_decref().

While this should never happen it has been observed on various
platforms.  The result is that unless your willing to patch the
ZFS code the pool is inaccessible.  Therefore, we're choosing
to more gracefully handle this case rather than leave it fatal.

http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1308
2013-03-19 13:01:01 -07:00
Brian Behlendorf
30b92c1de6 Add metaslab_debug option
Enabling metaslab debugging will prevent space maps from being
automatically unloaded.  This can significantly increase the
memory footprint but being able to dynamically control this is
helpful for debugging and certain performance testing.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-18 16:47:43 -07:00
Richard Yao
1c24b699b0 Linux 3.9 compat: Undefine GCC_VERSION
The mainline kernel started defining GCC_VERSION with commit
torvalds/linux@3f3f8d2f48.
Unfortunately, LZ4 also defines this macro, but the two
defintions are incompatible. We undefine GCC_VERSION in lz4.c
to handle this.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1339
2013-03-06 15:48:48 -08:00
Ned Bass
92db59ca3b Refresh links to web site
A few files still refer to @behlendorf's private fork on
github.  Use the primary web site URL instead.  Two typos
are also corrected.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-06 15:46:41 -08:00
Brian Behlendorf
d09f98a9a6 Add KMODDIR to install target
Provide a mechanism to control the directory name the modules
are installed in.  The kernel privdes INSTALL_MOD_DIR for
this but it was hardcoded to be 'addon/zfs'.

Add a KMODDIR variable which can be passed to 'make install'
to override the default directory name.  While we're here
change the default from 'addon/zfs' to 'extra' which is the
kernel.org default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-06 15:46:40 -08:00
Eric Dillmann
0b4d1b5853 Add snapdev=[hidden|visible] dataset property
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices.  By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/.  When set to
'visible' all zvol snapshots for the dataset will be visible.

This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created.  When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions.  This leads
to a variety of issues:

1) The zvol partition tables will be read in the context of
   the `modprobe zfs` for automatically imported pools.  This
   is undesirable and should be done asynchronously, but for
   now reducing the number of visible devices helps.

2) Udev expects to be able to complete its work for a new
   block devices fairly quickly.  When many zvol devices are
   added at the same time this is no longer be true.  It can
   lead to udev timeouts and missing /dev/zvol links.

3) Simply having lots of devices in /dev/ can be aukward from
   a management standpoint.  Hidding the devices your unlikely
   to ever use helps with this.  Any snapshot device which is
   needed can be made visible by changing the snapdev property.

NOTE: This patch changes the default behavior for zvols which
      was effectively 'snapdev=visible'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1235
Closes #945
Issue #956
Issue #756
2013-03-05 12:37:54 -08:00
George Wilson
a4430fce69 Merge zvol.c changes from PSARC 2010/306 Read-only ZFS pools
The changes to zvol.c were never merged from the last onnv_147
bulk update.  This was because zvol.c was largely rewritten
for Linux making it fairly easy to miss these sorts of changes.

This causes a regression when importing a zpool with zvols
read-only.  This does not impact pool which only contain
filesystem datasets.

References:
  illumos/illumos-gate@f9af39b

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1332
Closes #1333
2013-03-04 09:56:13 -08:00
Richard Yao
b01615d5ac Constify structures containing function pointers
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.

Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:

vdev_opts_t             52
zil_replay_func_t       24
zio_compress_info_t     24
zio_checksum_info_t     9
space_map_ops_t         7
arc_byteswap_func_t     5

The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1300
2013-03-04 08:49:32 -08:00
Brian Behlendorf
8128bd89fb Fix hot spares
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL).  On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.

This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable.  It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.

To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.

1) Functions which open the device but only read had the
   O_EXCL flag removed and were updated to use O_RDONLY.

2) A common holder was added to the vdev disk code.  This
   allow the ZFS code to internally open the device multiple
   times but non-ZFS callers may not.

3) An exception was added to make_disks() for hot spare when
   creating partition tables.  For hot spare devices which
   are already opened exclusively we skip creating the partition
   table because this must already have been done when the disk
   was originally added as a hot spare.

Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix.  And is_spare() was moved
above make_disks() to avoid adding a forward reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #250
2013-03-01 13:31:02 -08:00
Brian Behlendorf
bd99a7584a Remove wholedisk check from vdev_disk_open()
As described by the comment and enforced the by assertion the
v->vdev_wholedisk will never be -1.  The wholedisk handling
is performed by the user space utilities.  To prevent confusion
this dead code is being removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-28 12:02:59 -08:00
Brian Behlendorf
0d8103d956 Leaf vdevs should not be reopened
When vdev_disk.c was implemented for Linux we failed to handle the
reopen case.  According to the vdev_reopen() comment leaf vdevs should
not be closed or opened when v->vdev_reopening is set.  Under Linux
we would always close and open the device.

This issue was only noticed when a 'zpool scrub' command was run while
the leaf vdev device names in /dev/disk/by-vdev were missing.  The
scrub command calls vdev_reopen() which caused the vdevs to be closed
but they couldn't be reopened due to the missing links.  The result
was that all the vdevs were marked unavailable and the pool was
halted due to failmode=wait.

This patch adds the missing functionality in a similiar fashion to
to the Illumos code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-28 12:02:59 -08:00
Etienne Dechamps
d9b0ebbe82 Remove the bio_empty_barrier() check.
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.

Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.

As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.

This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.

Thanks to Richard Kojedzinszky for catching this nasty bug.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1318
2013-02-24 10:22:34 -08:00
Brian Behlendorf
546c978bbd Enable zfs_arc_memory_throttle_disable by default
The zfs_arc_memory_throttle_disable module option was introduced
by commit 0c5493d470 to resolve a
memory miscalculation which could result in the txg_sync thread
spinning.

When this was first introduced the default behavior was left
unchanged until enough real world usage confirmed there were no
unexpected issues.  We've now reached that point.  Linux's
direct reclaim is working as expected so we're enabling this
behavior by default.

This helps pave the way to retire the spl_kmem_availrmem()
functionality in the SPL layer.  This was the only caller.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #938
2013-02-21 13:38:24 -08:00
Richard Yao
8dca0a9a38 Make spa.c assertions catch unsupported pre-feature flag pool versions
A couple of assertions in spa.c were designed to prevent the use of
invalid pool versions. They were written under the assumption
that all valid pools are less than SPA_VERSION. Since feature flags
jumped from 28 to 5000, any numbers in the range 28 to 5000
non-inclusive will fail to trigger them.  We switch to the new
SPA_VERSION_IS_SUPPORTED macro to correct this.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1282
2013-02-12 10:27:44 -08:00
Brian Behlendorf
9878a89d7a Add explicit MAXNAMELEN check
It turns out that the Linux VFS doesn't strictly handle all cases
where a component path name exceeds MAXNAMELEN.  It does however
appear to correctly handle MAXPATHLEN for us.

The right way to handle this appears to be to add an explicit
check to the zpl_lookup() function.  Several in-tree filesystems
handle this case the same way.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1279
2013-02-12 10:27:39 -08:00
Ned Bass
ed2e157605 Switch KM_SLEEP to KM_PUSHPAGE
Two more locations where KM_SLEEP was used in a call which must
use KM_PUSHPAGE were found while using the zpool upgrade command.
See commit b8d06fc for additional details.

Also make a small correction to the comment block above
dsl_dir_open_spa().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1268
2013-02-06 11:19:58 -08:00
Brian Behlendorf
dd26aa535b Cast 'zfs bad bloc' to ULL for x86
Explicitly case this value to an unsigned long long for 32-bit
systems to inform the compiler that a long type should not be
used.  Otherwise we get the following compiler error:

  dmu_send.c:376: error: integer constant is too large for
  ‘long’ type

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-04 16:39:08 -08:00
Brian Behlendorf
0c5493d470 Add zfs_arc_memory_throttle_disable module option
The way in which virtual box ab(uses) memory can throw off the
free memory calculation in arc_memory_throttle().  The result is
the txg_sync thread will effectively spin waiting for memory to
be released even though there's lots of memory on the system.

To handle this case I'm adding a zfs_arc_memory_throttle_disable
module option largely for virtual box users.  Setting this option
disables free memory checks which allows the txg_sync thread to
make progress.

By default this option is disabled to preserve the current
behavior.  However, because Linux supports direct memory reclaim
it's doubtful throttling due to perceived memory pressure is ever
a good idea.  We should enable this option by default once we've
done enough real world testing to convince ourselve there aren't
any unexpected side effects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #938
2013-02-01 11:17:14 -08:00
Brian Behlendorf
1f7c30df8f Add zfs_disable_dup_eviction module option
Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable.
It should have been made available as a module option in the
original patch but was overlooked.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-01 09:57:57 -08:00
Ned Bass
36f86f73f6 Fix mismatch between SA header size and layout
When a system attribute layout is created an inconsistency may occur
between the system attribute header (sa_hdr_phys_t) size and the
variable-sized attribute count stored in the layout.  The inconsistency
results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT
returns false:

SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab())
ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr,
tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) &&
hdr->sa_layout_info == 0)) failed

The bug originates in this snippet from sa_find_sizes().

    if (is_var_sz && var_size > 1) {
            if (P2ROUNDUP(hdrsize + sizeof (uint16_t),
                *total < full_space) {
                    hdrsize += sizeof (uint16_t);

This assumes that the current variable-sized attribute will be stored in
the current buffer and accounts for the space needed to store its size
in the sa_hdr_phys_t. However if the next attribute spills over we need
to store a blkptr_t at the end of the bonus buffer to point to the spill
block. If the current attribute is in the way of the blkptr_t then it
too will be relocated into the spill block. But since we've already
accounted for it in the header size we get the inconsistency described
above.

To avoid this, record the index of the last variable-sized attribute
that prompted a hdrsize increase, and reverse the increase if we later
determine that that attribute will be relocated to the spill block.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1250
2013-01-31 10:31:19 -08:00
Ned Bass
67629d0f08 Fix rounding discrepancy in sa_find_sizes()
A rounding discrepancy exists between how sa_build_layouts() and
sa_find_sizes() calculate when the spill block needs to be kicked in.
This results in a narrow size range where sa_build_layouts() believes
there must be a spill block allocated but due to the discrepancy there
isn't.  A panic then occurs when the hdl->sa_spill NULL pointer is
dereferenced.

The following reproducer for this bug was isolated:

    truncate -s 128m /tmp/tank
    zpool create tank /tmp/tank
    zfs create -o xattr=sa tank/fish
    ln -s `perl -e 'print "z" x 41'` /tank/fish/z
    setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z

This test results in roughly the following system attribute (SA)
layout:

  176 bytes - "standard" SA's
   41 bytes - name of symbolic link target
  100 bytes - XDR encoded nvlist for xattr
  ---
  317 bytes - total

Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes()
decides no spill block is needed. But sa_build_layouts() rounds 41 up
to 48 when computing the space requirements so it tries to switch to
the spill block.

Note that we were only able to reproduce this bug using a combination
of symbolic links and the Linux-specific xattr=sa dataset property.
So while this issue is not technically Linux-specific, it may be
difficult or impossible to hit the narrow size range needed to
reproduce it on other platforms.

To fix the discrepancy, round the running total in sa_find_sizes() up
to an 8-byte boundary before accounting for each SA, since this is how
they will be stored in the bonus and (possibly) spill buffers.

To make the intent of the code more clear, explicitly assert key
assumptions about expected alignment of data and whether spill-over
will occur.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1240
2013-01-31 10:31:13 -08:00
Adam H. Leventhal
89103a2643 Illumos #3447 improve the comment in txg.c
3447 improve the comment in txg.c

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@adbbcfface
  https://www.illumos.org/issues/3447

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-30 08:55:20 -08:00
Eric Dillmann
9759c60f1a Illumos #3035 LZ4 compression support in ZFS and GRUB
3035 LZ4 compression support in ZFS and GRUB

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>

References:
  illumos/illumos-gate@a6f561b4ae
  https://www.illumos.org/issues/3035
  http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS

This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux.  Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.

Support for GRUB has been dropped from this patch.  That code
is available but those changes will need to made to the upstream
GRUB package.

Lastly, several hunks of dead code were dropped for clarity.  They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.

Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1217
2013-01-29 09:28:20 -08:00
Chris Wedgwood
ddc07fa57a Avoid gcc -Werror=maybe-uninitialized warnings
Explicitly set acl details to zero to silence gcc (zfs_acl_node_read
can't be sure zfs_acl_znode_info will set acl_count and aclsize).
Normally suppressing these warnings by setting this to zero at
declaration time is a bad idea but in this instance it's hard to
avoid and should be fairly safe.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1244
2013-01-28 09:10:29 -08:00
Brian Behlendorf
6772fb679a Use dsl_dataset_snap_lookup()
Retire the dmu_snapshot_id() function which was introduced in the
initial .zfs control directory implementation.  There is already
an existing dsl_dataset_snap_lookup() which does exactly what we
need, and the dmu_snapshot_id() function as implemented is racy.

https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1238
2013-01-25 15:07:40 -08:00
Brian Behlendorf
bf01b5e616 Add d_clear_d_op() compatibility
Added d_clear_d_op() helper function which clears some flags and the
registered dentry->d_op table.  This is required because d_set_d_op()
issues a warning when the dentry operations table is already set.
For the .zfs control directory to work properly we must be able to
override the default operations table and register custom .d_automount
and .d_revalidate callbacks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1230
2013-01-23 16:33:29 -08:00
Ned Bass
1305d33a4b fzap_cursor_move_to_key() should drop l_rwlock
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock
since that function returns with the lock held on success.  All other
callers drop the lock correctly but it seems fzap_cursor_move_to_key()
does not.  This may block writers or cause VERIFY failures when the
lock is freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1215
Closes zfsonlinux/spl#143
Closes zfsonlinux/spl#97
2013-01-23 16:31:16 -08:00
Brian Behlendorf
09a661e960 Fix zpl_revalidate() NULL deref
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter.  In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.

Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1226
2013-01-22 09:38:17 -08:00
Brian Behlendorf
ee93035378 Use sb->s_d_op default dentry operations
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1223
2013-01-18 15:04:23 -08:00
Massimo Maggi
babf3f9b6d Fix zpool on zvol deadlock
Commit 65d56083b4 fixes the lock
inversion between spa_namespace_lock and bdev->bd_mutex but only
for the first user of spa_namespace_lock: dmu_objset_own().
Later spa_namespace_lock gets acquired by dsl_prop_get_integer()
though dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()->
spa_open()->spa_open_common() without this "protection".  By
moving the mutex release after this second use, even this
acquisition of the lock is "protected" by the ERESTARTSYS trick.

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1220
2013-01-18 09:44:55 -08:00
Brian Behlendorf
7973e464de Revert "Revert "Fix unlink/xattr deadlock""
This reverts commit 53c7411919
effectively reinstating the asynchronous xattr cleanup code.

These Linux changes were reverted because after testing
and careful contemplation I was convinced that due to the
89260a1c8851ce05ea04b23606ba438b271d890 commit they were no
longer required.

Unfortunately, the deadlock described in #1176  was a case
which wasn't considered.  At mount zfs_unlinked_drain() can
occur which will unlink a list of znodes in effectively a
random order which isn't safe.  The only reason it was safe
to originally revert this change was the we could guarantee
that the VFS would always prune the xattr leaves before the
parents.

Therefore, until we can cleanly resolve this deadlock for
all cases we need to keep this change in spite of the xattr
unlink performance penalty associated with it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1176
Issue #457
2013-01-17 11:24:20 -08:00
Brian Behlendorf
7b3e34ba5a Fix 'zfs rollback' on mounted file systems
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL.  The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.

Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back.  Then a new inode with the same inode number would
be create and collide with the existing cached inode.  Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.

Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.

The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly.  If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.

This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled.  The inode
will then only be accessible from the cached dentries.  Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated.  Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.

Special care is taken for negative dentries since they do not
reference any inode.  These dentires will be invalidate based
on when they were added to the dentry cache.  Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.

Two nice side effects of this fix are:

* Removes the dependency on spl_invalidate_inodes(), it can now
  be safely removed from the SPL when we choose to do so.

* zfs_znode_alloc() no longer requires a dentry to be passed.
  This effectively reverts this portition of the code to its
  upstream counterpart.  The dentry is not instantiated more
  correctly in the Linux ZPL layer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #795
2013-01-17 09:51:20 -08:00
Ned Bass
f1a05fa114 Fix false ENOENT on snapshot control dentries
Lookups in the snapshot control directory for an existing snapshot
fail with ENOENT if an earlier lookup failed before the snapshot was
created.  This is because the earlier lookup causes a negative dentry
to be cached which is never invalidated.

The bug can be reproduced as follows (the second ls should succeed):

 $ ls /tank/.zfs/snapshot/s
 ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
 $ zfs snap tank@s
 $ ls /tank/.zfs/snapshot/s
 ls: cannot access /tank/.zfs/snapshot/s: No such file or directory

To remedy this, always invalidate cached dentries in the snapshot
control directory.  Since these entries never exist on disk there is
no significant performance penalty for the extra lookups.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1192
2013-01-16 16:28:54 -08:00
Ned Bass
94a9bb4709 Fix quoting error in unmount command
A misplaced single quote caused the umount command to fail with a
syntax error when unmounting snapshots under the .zfs/snapshot
control directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1210
2013-01-16 15:30:47 -08:00
Christopher Siden
b077fd4c4e Illumos #3189 kernel panic in test hotspare_onoffline_004_neg
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@8f0b538d1d
  changeset: 13818:e9ad0a945d45
  https://www.illumos.org/issues/3189

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-14 10:34:53 -08:00
Arne Jansen
ff80d9b142 Illumos #1862 incremental zfs receive fails for sparse file > 8PB
1862 incremental zfs receive fails for sparse file > 8PB

Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Simon Klinkert <klinkert@webgods.de>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@31495a1e56
  illumos changeset: 13789:f0c17d471b7a
  https://www.illumos.org/issues/1862

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-14 10:34:41 -08:00
Matthew Ahrens
a94addd974 Illumos #3208 cross-endian incorrect user/group accounting
3208 moving zpool cross-endian results in incorrect user/group
accounting

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@e828a46d29
  illumos changeset: 13835:eea81edc4f14
  https://www.illumos.org/issues/3208

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #627
Closes #1136
2013-01-14 09:32:22 -08:00
Bart Coddens
5c83989071 Illumos #2618 arc.c mistypes in the comments
2618 arc.c mistypes in the comments

Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@fc98fea58e
  illumos changeset: 13721:5b51a16a186f
  https://www.illumos.org/issues/2618

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-11 09:16:59 -08:00
Ned Bass
761394b3af call_usermodehelper() should wait for process
As of Linux 3.4 the UMH_WAIT_* constants were renumbered.  In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process).  A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with this
change.

One visible consequence of this change was that processes accessing
automounted snapshots received an ELOOP error because they failed to
wait for zfs.mount to complete.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
2013-01-09 16:54:52 -08:00
Brian Behlendorf
1c50c992ba Revert "Avoid ELOOP on auto-mounted snapshots"
This reverts commit 7afcf5b1da which
accidentally introduced a regression with the .zfs snapshot directory.
While the updated code still does correctly mount the requested
snapshot.  It updates the vfsmount such that it references the
original dataset vfsmount.  The result is that the snapshot itself
isn't visible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #816
2013-01-09 11:24:47 -08:00
Brian Behlendorf
4cec9b2dc7 Only reduce __zio_execute() stack usage in kernel space
Related to 91579709fc we need to
be very careful about not overrunning the stack in kernel space.
However, in user space we're already allowing slightly larger
stacks so this stack usage optimization is not required there.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-09 10:34:35 -08:00
George Wilson
1eb5bfa3dc Illumos #3145, #3212
3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos-gate/commit/9253d63df408bb48584e0b1abfcc24ef2472382e
  illumos changeset: 13840:97fd5cdf328a
  https://www.illumos.org/issues/3145
  https://www.illumos.org/issues/3212

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #989
Closes #1137
2013-01-08 10:35:44 -08:00
Matthew Ahrens
753c38392d Illumos #3104: eliminate empty bpobjs
3104 eliminate empty bpobjs
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@f174573681
  illumos changeset: 13782:8f78aae28a63
  https://www.illumos.org/issues/3104

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Brian Behlendorf
91579709fc Fix __zio_execute() asynchronous dispatch
To save valuable stack all zio's were made asynchronous when in the
tgx_sync_thread context or during pool initialization.  See commit
2fac4c2 for the original patch and motivation.

Unfortuantely, the changes to dsl_pool_sync_context() made by the
feature flags broke this logic causing in __zio_execute() to dispatch
itself infinitely when called during pool initialization.  This
commit refines the existing logic to specificly target only the two
cases we care about.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson
ea0b2538cd Illumos #3349: zpool upgrade -V bumps the on disk version number
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@25345e4666
  https://www.illumos.org/issues/3349

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Matthew Ahrens
29809a6cba Illumos #3086: unnecessarily setting DS_FLAG_INCONSISTENT on async
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async
destroyed datasets
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@ce636f8b38
  illumos changeset: 13776:cd512c80fd75
  https://www.illumos.org/issues/3086

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
Christopher Siden
b9b24bb4ca Illumos #2762: zpool command should have better support for feature flags
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@57221772c3
  https://www.illumos.org/issues/2762

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:43 -08:00
George Wilson
3bc7e0fb0f Illumos #3090 and #3102
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@dfbb943217
  illumos changeset: 13777:b1e53580146d
  https://www.illumos.org/issues/3090
  https://www.illumos.org/issues/3102

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #939
2013-01-08 10:35:42 -08:00
Christopher Siden
9ae529ec5d Illumos #2619 and #2747
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  illumos/illumos-gate@53089ab7c8
  illumos/illumos-gate@ad135b5d64
  illumos changeset: 13700:2889e2596bd6
  https://www.illumos.org/issues/2619
  https://www.illumos.org/issues/2747

NOTE: The grub specific changes were not ported.  This change
must be made to the Linux grub packages.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 10:35:35 -08:00
Ned Bass
37f000c5aa Fix gcc array subscript above bounds warning
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c

    } else {
            int i;
            ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
            /* the blkptrs we are losing better be unallocated */
            for (i = dn->dn_next_nblkptr[txgoff];
                i < dnp->dn_nblkptr; i++)
                    ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));

This usage is in fact safe, since the ASSERT ensures the index does
not exceed to maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning.  To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body.  Since a dnode contains
at most 3 block pointers this overhead should be negligible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #950
2013-01-07 11:21:52 -08:00
Matt Johnston
72938d6905 Use cv_wait_io() which will will account for iowait
Update zio_wait() to use cv_wait_io() to ensure the iowait time
is properly accounted for.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:52:52 -08:00
Matt Johnston
72f53c5694 Revert part of "Log I/Os longer than zio_delay_max (30s default)"
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several 100 seconds.

The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device.  This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.

This change isn't in any way harmful but it does conflict with
a required change to properly account from I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely  on io_schedule() via cv_wait_io().

The additional debugging information which was added to the
delay event has been intentionally left in place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-07 10:51:04 -08:00
Brian Behlendorf
65d56083b4 Fix zpool on zvol lock inversion deadlock
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock.  But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.

To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock().  Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.

When it is contended we risk a lock inversion if we were to
block waiting for the lock.  Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again.  This process can be repeated safely until both
locks are acquired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #612
2012-12-20 09:57:39 -08:00
Brian Behlendorf
d5446cfc52 Revert "Remove TSD zfs_fsyncer_key"
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-20 09:56:28 -08:00
Brian Behlendorf
31f2b5abdf Remove TSD zfs_fsyncer_key
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit().  This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.

However, under Linux I'm not seeing the same contention that
was originally described.  Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD.  If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.

This just leaves one small ZFS TSD consumer.  If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#174
2012-12-19 09:08:01 -08:00
Prakash Surya
84daaddedb Set elevator for DM devices despite vdev_wholedisk
The current state of udev and devicer-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:

    $ ls -d /dev/dm-*
    /dev/dm-0
    /dev/dm-1
    /dev/dm-2
    /dev/dm-3

it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:

    $ ls -d /dev/sd*
    /dev/sda
    /dev/sda1
    /dev/sdb
    /dev/sdb1

Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.

As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.

This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1149
2012-12-18 15:12:40 -08:00
Jorgen Lundman
6c2856726f Fix using zvol as slog device
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented.  This prevented
a zpool from use a zvol for its slog device.

This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c.  Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system.  If
they match we know the device is a zvol.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1131
2012-12-18 11:02:28 -08:00