Commit Graph

1461 Commits

Author SHA1 Message Date
Matthew Ahrens e022864d19 Illumos 5176 - lock contention on godfather zio
5176 lock contention on godfather zio
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5176
  https://github.com/illumos/illumos-gate/commit/6f834bc

Porting notes:

Under Linux max_ncpus is defined as num_possible_cpus().  This is
largest number of cpu ids which might be available during the life
time of the system boot.  This value can be larger than the number
of present cpus if CONFIG_HOTPLUG_CPU is defined.

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2711
2014-10-07 11:24:24 -07:00
Richard Yao 83e9986f6e Implement -t option to zpool create for temporary pool names
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.

26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.

This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
2014-09-30 10:46:59 -07:00
Brian Behlendorf aa0ac7caa4 Make user stack limit configurable
To aid in detecting and debugging stack overflow issues make the
user space stack limit configurable via a new ZFS_STACK_SIZE
environment variable.  The value assigned to ZFS_STACK_SIZE will
be used as the default stack size in bytes.

Because this is mainly useful as a debugging aid in conjunction
with ztest the stack limit is disabled by default.  See the ztest(1)
man page for additional details on using the ZFS_STACK_SIZE
environment variable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2743
Issue #2293
2014-09-30 10:46:55 -07:00
Alex Reece acbad6ff67 Illumos 4753 - increase number of outstanding async writes when sync task is waiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
    https://www.illumos.org/issues/4753
    https://github.com/illumos/illumos-gate/commit/73527f4

Comments by Matt Ahrens from the issue tracker:
    When a sync task is waiting for a txg to complete, we should hurry
    it along by increasing the number of outstanding async writes
    (i.e. make vdev_queue_max_async_writes() return a larger number).
    Initially we might just have a tunable for "minimum async writes
    while a synctask is waiting" and set it to 3.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2716
2014-09-23 13:50:55 -07:00
Tim Chase 223df0161f Implement fallocate FALLOC_FL_PUNCH_HOLE
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2619
2014-09-08 13:52:25 -07:00
Richard Yao cd3939c5f0 Linux AIO Support
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is non sequitur.

Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility.  We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #223
Closes #2373
2014-09-05 15:11:43 -07:00
Isaac Huang 0426c16804 Fixed memory leaks in zevent handling
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2158
2014-08-20 10:45:16 -07:00
Matthew Ahrens bd089c5477 Illumos 4631 - zvol_get_stats triggering too many reads
4631 zvol_get_stats triggering too many reads

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4631
  https://github.com/illumos/illumos-gate/commit/bbfa8ea

Ported-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2612
Closes #2480
2014-08-20 09:17:00 -07:00
stf f9bde4f74b Avoid PAGESIZE redefinition
Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.

Signed-off-by: stf <s@ctrlc.hu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #382
2014-08-18 08:55:41 -07:00
George Wilson f3a7f6610f Illumos 4976-4984 - metaslab improvements
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4976
  https://www.illumos.org/issues/4978
  https://www.illumos.org/issues/4979
  https://www.illumos.org/issues/4980
  https://www.illumos.org/issues/4981
  https://www.illumos.org/issues/4982
  https://www.illumos.org/issues/4983
  https://www.illumos.org/issues/4984
  https://github.com/illumos/illumos-gate/commit/2e4c998

Notes:
    The "zdb -M" option has been re-tasked to display the new metaslab
    fragmentation metric and the new "zdb -I" option is used to control
    the maximum number of in-flight I/Os.

    The new fragmentation metric is derived from the space map histogram
    which has been rolled up to the vdev and pool level and is presented
    to the user via "zpool list".

    Add a number of module parameters related to the new metaslab weighting
    logic.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:40:49 -07:00
Turbo Fredriksson f67d709080 Create an 'overlay' property
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.

Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
2014-08-15 13:39:19 -07:00
Richard Yao 194e56234a Include sys/taskq.h in linux/vfs_compat.h
We should have included sys/taskq.h directly because we use the taskq
code here, but we instead had files that included sys/taskq.h also
include sys/kmem.h, which happened to include sys/taskq.h. sys/kmem.h no
longer does this, so we must define the include as we should
have done in the first place.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2411
2014-08-14 12:38:17 -07:00
Richard Yao ec18fe3ce8 Cleanup vn_rename() and vn_remove()
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #370
2014-08-13 16:25:44 -07:00
Alec Salazar 22a11a5b5a Replace __va_list with va_list
Most of the code base already uses va_list, which is specified by
iso-c. gcc/glibc provides 'typedef __gnuc_va_list va_list'. and
when not using gcc/glibc we can't expect to find __gnuc_va_list.

Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2588
2014-08-13 10:35:00 -07:00
Brian Behlendorf 0a50679ce9 Add zfs_iput_async() interface
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks.  When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
2014-08-11 16:11:43 -07:00
Ned Bass 2fc44f66ec Linux 3.17 compat: remove wait_on_bit action function
Linux kernel 3.17 removes the action function argument from
wait_on_bit().  Add autoconf test and compatibility macro to support
the new interface.

The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule().  This API change
was made to consolidate all of those redundant wait functions.

References: torvalds/linux@7431620

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #378
2014-08-11 14:17:00 -07:00
Brian Behlendorf 50b25b2187 Avoid dynamic allocation of 'search zio'
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage.  Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.

To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure.  All
access must be serialized through the vq_lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2572
2014-08-11 08:44:54 -07:00
Matthew Ahrens 5dbd68a352 Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_t
4914 zfs on-disk bookmark structure should be named *_phys_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4914
  https://github.com/illumos/illumos-gate/commit/7802d7b

Porting notes:

There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated.  This should reduce the likelihood of further
problems like issue #2094 from occurring.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2558
2014-08-06 14:48:41 -07:00
Matthew Ahrens fbeddd60b7 Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4390
  https://github.com/illumos/illumos-gate/commit/7fd05ac

Porting notes:

Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code.  This patch should reduce its
stack footprint a bit more.

The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.

The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global.  Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2545
2014-08-04 11:50:52 -07:00
Matthew Ahrens 9b67f60560 Illumos 4757, 4913
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4757
  https://www.illumos.org/issues/4913
  https://github.com/illumos/illumos-gate/commit/5d7b4d4

Porting notes:

For compatibility with the fastpath code the zio_done() function
needed to be updated.  Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2544
2014-08-01 14:28:05 -07:00
Matthew Ahrens faf0f58c69 Illumos 3835 zfs need not store 2 copies of all metadata
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Description from Matt Ahrens's bug report at Delphix:

    Add a new zfs property, "redundant_metadata" which can have values
    "all" or "most".  The default will be "all", which is the current
    behavior.  Setting to "most" will cause us to only store 1 copy of
    level-1 indirect blocks of user data files.

Additional notes:

    The new man page section for this property states

        "The exact behavior of which metadata blocks
         are stored redundantly may change in future releases."

    and:

        "When set to most, ZFS stores an extra copy of most types of
         metadata. This can improve performance of random writes,
         because less metadata must be written."

    The current implementation is as described above in Matt's blog.
    It is controlled by a new global integer
    "zfs_redundant_metadata_most_ditto_level", currently initialized
    to 2. When "redundant_metadata" is set to "most", only indirect
    blocks of the specified level and higher will have additional ditto
    blocks created.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2542
2014-07-31 09:49:34 -07:00
George Wilson 672692c7b7 Illumos 4754, 4755
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4754
  https://www.illumos.org/issues/4755
  https://github.com/illumos/illumos-gate/commit/b6240e8

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2533
2014-07-30 10:30:05 -07:00
Matthew Ahrens 9bd274ddd8 Illumos #4374
4374 dn_free_ranges should use range_tree_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4374
  https://github.com/illumos/illumos-gate/commit/bf16b11

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2531
2014-07-30 09:20:35 -07:00
Matthew Ahrens da536844d5 Illumos 4368, 4369.
4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4369
  https://www.illumos.org/issues/4368
  https://github.com/illumos/illumos-gate/commit/78f1710

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2530
2014-07-29 10:55:29 -07:00
Max Grossman b0bc7a84d9 Illumos 4370, 4371
4370 avoid transmitting holes during zfs send
4371 DMU code clean up

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4370
  https://www.illumos.org/issues/4371
  https://github.com/illumos/illumos-gate/commit/43466aa

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2529
2014-07-28 14:29:58 -07:00
Tim Chase 2bf35fb754 Add atomic_swap_32() and atomic_swap_64()
The atomic_swap_32() function maps to atomic_xchg(), and
the atomic_swap_64() function maps to atomic64_xchg().

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #377
2014-07-28 14:19:24 -07:00
Matthew Ahrens fa86b5dbb6 Illumos 4171, 4172
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features

Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4171
  https://www.illumos.org/issues/4172
  https://github.com/illumos/illumos-gate/commit/2acef22

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2528
2014-07-25 16:40:07 -07:00
Jan Engelhardt aca19e063b Do not attempt access beyond the declared end of the dn_blkptr array
This loop in dmu_objset_write_ready():

	for (i = 0; i < dnp->dn_nblkptr; i++)
		bp->blk_fill += dnp->dn_blkptr[i].blk_fill;

invokes _undefined behavior_ for the (common) case of dn_nblkptr=3,
therefore, the compiler is free to do whatever it wants (such as
optimizing it away, or otherwise messing up your expections).

The fix is to be honest about the array size.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2511
Closes #2010
2014-07-22 09:55:37 -07:00
Tim Chase 7f23e00109 Add functions and macros as used upstream.
Added highbit64() and howmany() which are used in recent upstream
code.  Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #363
2014-07-22 09:47:48 -07:00
George Wilson 93cf20764a Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.

This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.

The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram

In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:

    * 4K sector devices will not see any compression benefit
    * large space_maps require more metadata on-disk
    * large space_maps require more time to load (typically random reads)

Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.

A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.

References:
  https://www.illumos.org/issues/4101
  https://www.illumos.org/issues/4102
  https://www.illumos.org/issues/4103
  https://www.illumos.org/issues/4105
  https://www.illumos.org/issues/4106
  https://github.com/illumos/illumos-gate/commit/0713e23

Porting notes:

A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2014-07-22 09:39:16 -07:00
Brian Behlendorf 1e8db77102 Fix zil_commit() NULL dereference
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot.  If they do somehow get
dirtied an attempt will make made to write them to disk.  In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2405
2014-07-17 15:15:07 -07:00
Tim Chase f6a869614e Safer debugging and assertion macros.
Spl's debugging and assertion macros macro used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.

The following macros have been converted to not use do/while(0):
	PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL

PANIC has been converted to a wrapper around the new spl_PANIC() function.

The other macros have been converted to use the "&&" operator for the
branch-predicition conditional and also to use spl_PANIC().

The __ASSERT() macro was not touched.  It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2014-07-01 15:14:43 -07:00
Chris Wedgwood 62a05896e8 Allow building without ACLs
Some kernel definitions were buried inside the #if... #endif logic for
ACLs.  When ACLs are not available these definitions get lost causing
the build to fail.

Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2349
2014-05-30 12:01:57 -07:00
Brian Behlendorf 376dc35e22 Add spl_kmem_cache_reclaim module option
The correct behavior for all registered shrinkers is to return the
number of objects in their cache.  In theory this allows the Linux
VM to balance memory reclaim across all registered caches.

In commit b9b3715 this behavior was disabled in favor of returning
-1 which notifies the VM that no additional objects are available
for reclaim.  This was done as a workaround to resolve thrashing
in shrink_slabs() which could occur when memory was low and numerous
core where in reclaim.  Unfortunately, this has been observed to
increase the likelihood of OOM events when SPL slab consumers are
responsible for consuming the majority of memory.

Therefore, this patch makes this behavior tunable.  Setting the
spl_kmem_cache_reclaim module option to 0x1 will result in the
shrinker only being called once.  This is the default behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #358
2014-05-22 10:30:12 -07:00
Brian Behlendorf a073aeb060 Add KMC_SLAB cache type
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL.  These include:

1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.

Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator.  This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on.  However, there are some things we need to be careful of:

1) The Linux slab allocator was never designed to work well with
   large objects.  Because the SPL slab must still handle this use
   case a cut off limit was added to transition from Linux slab
   backed objects to kmem or vmem backed slabs.

   spl_kmem_cache_slab_limit - Objects less than or equal to this
   size in bytes will be backed by the Linux slab.  By default
   this value is zero which disables the Linux slab functionality.
   Reasonable values for this cut off limit are in the range of
   4096-16386 bytes.

   spl_kmem_cache_kmem_limit - Objects less than or equal to this
   size in bytes will be backed by a kmem slab.  Objects over this
   size will be vmem backed instead.  This value defaults to
   1/8 a page, or 512 bytes on an x86_64 architecture.

2) Be aware that using the Linux slab may inadvertently introduce
   new deadlocks.  Care has been taken previously to ensure that
   all allocations which occur in the write path use GFP_NOIO.
   However, there may be internal allocations performed in the
   Linux slab which do not honor these flags.  If this is the case
   a deadlock may occur.

The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us.  And ideally
need to move completely away from using the SPLs slab for large
memory allocations.  This patch is a first step.

NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
   backed caches but it is not supported for kmem/vmem backed caches.

2) Regardless of the spl_kmem_cache_*_limit settings a cache may
   be explicitly set to a given type by passed the KMC_KMEM,
   KMC_VMEM, or KMC_SLAB flags during cache creation.

3) The constructors, destructors, and reclaim callbacks are all
   functional and will be called regardless of the cache type.

4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
   the issues involved in presenting correct object accounting.
   Instead they will appear in /proc/slabinfo under the same names.

5) Several kmem SPLAT tests needed to be fixed because they relied
   incorrectly on internal kmem slab accounting.  With the updated
   test cases all the SPLAT tests pass as expected.

6) An autoconf test was added to ensure that the __GFP_COMP flag
   was correctly added to the default flags used when allocating
   a slab.  This is required to ensure all pages in higher order
   slabs are properly refcounted, see ae16ed9.

7) When using the SLUB allocator there is no need to attempt to
   set the __GFP_COMP flag.  This has been the default behavior
   for the SLUB since Linux 2.6.25.

8) When using the SLUB it may be desirable to set the slub_nomerge
   kernel parameter to prevent caches from being merged.

Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #356
2014-05-22 10:28:01 -07:00
Tim Chase 3937ab20f3 Allow for lock-free reading zfsdev_state_list.
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed.  It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.

This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.

The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.

Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2301
2014-05-19 11:45:11 -07:00
Chunwei Chen bc25c9325b Use a dedicated taskq for vdev_file
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.

We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2270
2014-05-14 16:20:21 -07:00
Chunwei Chen 1538f4b6e3 Linux 3.15 compat: NICE_TO_PRIO and PRIO_TO_NICE
These macro's were exposed to make them available to other
parts of the kernel and modules.

References:
  torvalds/linux@6b6350f

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
2014-05-07 13:38:03 -07:00
Tim Chase 962d524212 Check the dataset type more rigorously when fetching properties.
When fetching property values of snapshots, a check against the head
dataset type must be performed.  Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".

This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes.  It also did not allow for the display of
"volsize" of a snapshot of a volume.

This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly.  This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2265
2014-05-06 10:41:46 -07:00
Richard Yao 3af3df905f libspl: Implement LWP rwlock interface
This implements a subset of the LWP rwlock interface by wrapping the
equivalent POSIX thread interface. It is a superset of the features
needed by ztest.

The missing bits are {,_}rw_read_held() and {,_}rw_write_held().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1970
2014-05-01 15:53:52 -07:00
Richard Yao 9d317793aa Implement File Attribute Support
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().

Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method.  That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.

References:
  https://bugs.gentoo.org/show_bug.cgi?id=483516

Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1691
2014-05-01 10:11:18 -07:00
Jorgen Lundman d6e6e4a98e Add support for aarch64 (ARMv8)
Using the ARM reference simulation (fast model foundation v8) I
cross compiled spl and zfs, to confirm it works on ARMv8 (64 bit
arm architecture, called aarch64 in Linux).

As it is based on previous ARM porting, the resulting patch is
disappointingly small, there was very little to do. The code fixes
the compile issues and has light testing done.

Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #351
2014-04-25 15:25:32 -07:00
Chunwei Chen 0b75bdb369 Use ddi_time_after and friends to compare time
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2142
2014-04-14 13:27:56 -07:00
Chunwei Chen 545e9ac00a Add ddi_time_after and friends
When comparing times gotten from ddi_get_lbolt, we have to take account of
wrap around of jiffies. Therefore, we cannot use 't1 < t2'. Instead we should
use 't1 - t2 < 0'.

This patch add ddi_time_after and friends to address this issue. They have
strict type restriction, clock_t for vanilla and int64_t for 64 version, to
prevent type conversion from screwing things.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #335
2014-04-14 09:32:01 -07:00
Yuxuan Shui 6c48cd8ac2 This patch add a CTASSERT macro for compile time assertion.
This macro makes the compile to spit "mixed definition and code"
warning, I can't find a way to avoid it.

This patch lays some groundwork for the persistent l2arc feature.
See https://www.illumos.org/issues/3525.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #303
2014-04-14 09:28:53 -07:00
Richard Yao acf0ade362 Simplify hostid logic
There is plenty of compatibility code for a hw_hostid
that isn't used by anything. At the same time, there are apparently
issues with the current hostid logic. coredumb in #zfsonlinux on
freenode reported that Fedora 17 changes its hostid on every boot, which
required force importing his pool. A suggestion by wca was to adopt
FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does
not exist

Adopting FreeBSD's behavior permits us to eliminate plenty of code,
including a userland helper that invokes the system's hostid as a
fallback.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224
2014-04-14 09:04:41 -07:00
Chunwei Chen b761912b34 Linux 3.14 compat: rq_for_each_segment in dmu_req_copy
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:51 -07:00
Chunwei Chen d4541210f3 Linux 3.14 compat: Immutable biovec changes in vdev_disk.c
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:28:38 -07:00
Chunwei Chen 408ec0d2e1 Linux 3.14 compat: posix_acl_{create,chmod}
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
2014-04-10 14:27:03 -07:00
Tim Chase ed650dee76 De-inline spl_kthread_create().
The function was defined as a static inline with variable arguments
which causes gcc to generate errors on some distros.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #346
2014-04-09 19:17:12 -07:00