Commit Graph

3467 Commits

Author SHA1 Message Date
Justin T. Gibbs 4c7b7eedcd Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption
5630 stale bonus buffer in recycled dnode_t leads to data corruption
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5630
  https://github.com/illumos/illumos-gate/commit/cd485b4

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
2015-03-12 15:40:39 -07:00
Josef 'Jeff' Sipek 73ad4a9f3c Illumos 5047 - don't use atomic_*_nv if you discard the return value
5047 don't use atomic_*_nv if you discard the return value
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5047
  https://github.com/illumos/illumos-gate/commit/640c167

Porting Notes:

Several hunks from the original patch where not specific to ZFS
and thus were dropped.

Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
2015-03-12 15:40:33 -07:00
Brian Behlendorf 7f3e466283 Mark zfs_inactive() with PF_FSTRANS
Allowing direct reclaim to re-enter the VFS in the zfs_inactive()
call path has historically been problematic for ZoL.  Therefore,
in order to avoid an entire class of current and future issues
caused by this PF_FSTRANS is set for all zfs_inactive() callers.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3163
2015-03-10 09:21:48 -07:00
Ned Bass 417104bdd3 Use cached feature info in spa_add_feature_stats()
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa.  It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3082
2015-03-05 14:11:10 -08:00
Brian Behlendorf 989fd514b1 Change ASSERT(!"...") to cmn_err(CE_PANIC, ...)
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible.  This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1445
2015-03-03 13:22:21 -08:00
Brian Behlendorf 8c45def24a Linux 4.0 compat: bdi_setup_and_register()
The 'capabilities' argument which was passed to bdi_setup_and_register()
has been removed.  File systems should no longer pass BDI_CAP_MAP_COPY.
For our purposes this means there are now three different interfaces
which must be handled.  A zpl_bdi_setup_and_register() wrapper function
has been introduced to provide a single interface to the ZPL code.

* 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported.
* 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments.
* 4.0 - x.y, bdi_setup_and_register() takes 2 arguments.

I've also taken this opportunity to remove HAVE_BDI because kernels
older then 2.6.32 are no longer supported.  All kernels newer than
this will have one of the above interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #3128
2015-03-03 10:49:45 -08:00
Brian Behlendorf 4ec15b8dcf Use MUTEX_FSTRANS mutex type
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have an future change introduce a
   new one.

Both of these concerns can be addressed by using a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch
'kmem-rework'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3050
Closes #3055
Closes #3062
Closes #3132
Closes #3142
Closes #2983
2015-03-03 10:46:40 -08:00
Brian Behlendorf 6ab08667a4 Reduce splat_taskq_test2_impl() stack frame size
Slightly increasing the size of a kmutex_t has caused us to exceed
the stack frame warning size in splat_taskq_test2_impl().  To address
this the tq_args have been moved to the heap.

  cc1: warnings being treated as errors
  spl-0.6.3/module/splat/splat-taskq.c:358:
  error: the frame size of 1040 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
2015-03-03 10:18:31 -08:00
Brian Behlendorf d0d5dd7144 Add MUTEX_FSTRANS mutex type
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have an future change introduce a
   new one.

Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit 9099312 - Merge branch 'kmem-rework'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
2015-03-03 10:18:24 -08:00
Isaac Huang d14cfd83da Fix deadlock between zpool export and zfs list
Pool reference count is NOT checked in spa_export_common()
if the pool has been imported readonly=on, i.e. spa->spa_sync_on
is FALSE. Then zpool export and zfs list may deadlock:

1. Pool A is imported readonly.
2. zpool export A and zfs list are run concurrently.
3. zfs command gets reference on the spa, which holds a dbuf on
   on the MOS meta dnode.
4. zpool command grabs spa_namespace_lock, and tries to evict dbufs
   of the MOS meta dnode. The dbuf held by zfs command can't be
   evicted as its reference count is not 0.
5. zpool command blocks in dnode_special_close() waiting for the
   MOS meta dnode reference count to drop to 0, with
   spa_namespace_lock held.
6. zfs command tries to get the spa_namespace_lock with a reference
   on the spa held, which holds a dbuf on the MOS meta dnode.
7. Now zpool command and zfs command deadlock each other.

Also any further zfs/zpool command will block on spa_namespace_lock
forever.

The fix is to always check pool reference count in spa_export_common(),
no matter whether the pool was imported readonly or not.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2034
2015-03-02 11:50:06 -08:00
Brian Behlendorf 87a63dd702 Prevent "zpool destroy|export" when suspended
Cleanly destroying or exporting a pool requires that the pool
not be suspended.  Therefore, set the POOL_CHECK_SUSPENDED flag
for these ioctls so the utilities will output a descriptive
error message rather than block.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2878
2015-03-02 11:50:06 -08:00
Brian Behlendorf c1bc8e610b Retire spl_module_init()/spl_module_fini()
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup.  This was done to abstract
away any compatibility code which might be needed for the SPL.

As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#2985
2015-02-27 13:43:39 -08:00
Brian Behlendorf b4f3666a16 Retire spl_module_init()/spl_module_fini()
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup.  This was done to abstract
away any compatibility code which might be needed for the SPL.

As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2985
2015-02-24 11:37:44 -08:00
Brian Behlendorf 1efdc45ea8 Fix O_APPEND open(2) flag
As described in flags section of open(2):

  O_APPEND:
    The  file  is  opened in append mode.  Before each write(2), the
    file offset is positioned at the end of the  file,  as  if  with
    lseek(2).   O_APPEND may lead to corrupted files on NFS filesys-
    tems if more than one process appends data to a  file  at  once.
    This is because NFS does not support appending to a file, so the
    client kernel has to simulate it, which can't be done without  a
    race condition.

This issue was originally overlooked because normally the generic
VFS code handles this for a filesystem.  However, because ZFS explictly
registers a zpl_write() function it's responsible for the seek.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3124
2015-02-24 11:21:54 -08:00
Dan Swartzendruber 1611bb7b4f Set zfs_autoimport_disable default value to 1
When loading the ZFS kernel modules they should not populate the
spa namespace using the cache file.  This behavior isn't consistent
with other Linux kernel modules and we need to move away from it.
Removing this makes the whole startup process predictable with four
basic steps which are driven by the init system.

1) modprobe
2) zpool import
3) zfs mount
4) zfs share

This change also helps lay the groundwork for eventually removing
the kobj_* compatibility code on the kernel side.  It may need to
be preserved in userspace because libzfs_init() depends on it.
This is why the conditional must be wrapped with an #ifdef _KERNEL.

Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2820
2015-02-17 16:09:41 -08:00
Brian Behlendorf 7d2868d5fc Skip bad DVAs during free by setting zfs_recover=1
When a bad DVA is encountered in metaslab_free_dva() the system
should treat it as fatal.  This indicates that somehow a damaged
DVA was written to disk and that should be impossible.

However, we have seen a handful of reports over the years of pools
somehow being damaged in this way.  Since this damage can render
otherwise intact pools unimportable, and the consequence of skipping
the bad DVA is only leaked free space, it makes sense to provide
a mechanism to ignore the bad DVA.  Setting the zfs_recover=1 module
option will cause the DVA to be ignored which may allow the pool to
be imported.

Since zfs_recover=0 by default any pool attempting to free a bad DVA
will treat it as a fatal error preserving the current behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3099
Issue #3090
Issue #2720
2015-02-13 16:02:04 -08:00
Andrey Vesnovaty 5f15fa2216 Fix readdir for .zfs/snapshot directory
dmu_snapshot_list_next stores the index of the next snapshot entry to the offp
argument, which zpl_snapdir_iterate then uses for the dir_emit. This
result in an off-by-one error. Therefore a temporary variable should be
used.

This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2930
2015-02-10 16:34:30 -08:00
Brian Behlendorf 3941503c0a Retire zio_cons()/zio_dest()
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue #2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:09:49 -08:00
Brian Behlendorf 6442f3cfe3 Retire zio_bulk_flags
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches.  Today
this code works well and there's no compelling reason to keep
this functionality.  In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:08:49 -08:00
Jörg Thalheim 534759fad3 Linux 3.19 compat: file_inode was added
struct access f->f_dentry->d_inode was replaced by accessor function
file_inode(f)

Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3084
2015-02-10 11:24:51 -08:00
Brian Behlendorf 77aef6f60e Use vmem_alloc() for nvlists
Several of the nvlist functions may perform allocations larger than
the 32k warning threshold.  Convert them to use vmem_alloc() so the
best allocator is used.

Commit efcd79a retired KM_NODEBUG which was used to suppress large
allocation warnings.  Concurrently the large allocation warning threshold
was increased from 8k to 32k.  The goal was to identify the remaining
locations, such as this one, where the allocation can be larger than
32k.  This patch is expected fine tuning resulting for the kmem-rework
changes, see commit 6e9710f.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3057
Closes #3079
Closes #3081
2015-02-10 11:00:08 -08:00
Brian Behlendorf afe373260e Revert "Don't read space maps during import for readonly pools"
This reverts commit 7fc8c33ede which
accidentally introduced a ztest failure.

ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2
child exited with code 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-02-09 16:56:59 -08:00
Brian Behlendorf 7fc8c33ede Don't read space maps during import for readonly pools
Normally when importing a pool the space maps for all top level
vdevs are read from disk.  The space maps will be required latter
when an allocation is performed and free blocks need to be located.

However, if the pool is imported readonly then we are guaranteed
that no allocations can occur.  In this case the space maps need
not be loaded..   A similar argument can be made for the DTLs
(dirty time logs).

Because a pool import will fail if the space maps cannot be read.
The ability to safely ignore them makes it more likely that a
damaged pool can be imported readonly to recover its contents.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2831
2015-02-09 16:43:03 -08:00
Justin T. Gibbs 33b4de513e Illumos 5311 - traverse_dnode may report success when it should not
5311 traverse_dnode may report success when it should not
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/2a89c2c
  https://www.illumos.org/issues/5311

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2970
2015-02-06 12:07:15 -08:00
Ned Bass a62d1b02e3 Fix SA header size accounting
The functions sa_find_sizes() and sa_build_layout() fail to account
for the additional 2 bytes of SA header space when calculating whether
a variable size attribute might spill over. They may consequently
determine that an attribute will fit in the bonus buffer along with a
spill block pointer, when in reality the attribute would be partially
overwritten by the spill block pointer if spill over occurs. This also
causes an inconsistency between the SA header size and the number of
variable size attributes in the layout, tripping an assertion when
debugging is on. The following reproducer demonstrates the problem.

  ln -s $(perl -e 'print "z" x 20') file
  setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file

Even though sa_find_sizes() computes the index of the attribute where
spill-over will occur, sa_build_layouts() discards the result and
recomputes it itself. As it turns out, both functions get it wrong.
Since this computation is awkward and, as history has shown, easy to
screw up, let's just do it in one place. This patch fixes the bug in
sa_find_sizes() and updates sa_build_layout() to use the result
computed there.

Also improve the comments in sa_find_sizes().

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3070
2015-02-06 09:26:46 -08:00
Brian Behlendorf e2c4acde55 Skip evicting dbufs when walking the dbuf hash
When a dbuf is in the DB_EVICTING state it may no longer be on the
dn_dbufs list.  In which case it's unsafe to call DB_DNODE_ENTER.
Therefore, any dbuf which is found in this safe must be skipped.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2553
Closes #2495
2015-02-06 09:24:28 -08:00
Chunwei Chen 086476f920 Fix spl_hostid module parameter
Currently, spl_hostid module parameter doesn't do anything, because it will
always be overwritten when calling into hostid_read().
Instead, we should only call into hostid_read() when spl_hostid is not zero,
just as the comment describes.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #427
2015-02-04 16:42:25 -08:00
Tim Chase aa2ef419e4 Spurious ENOMEM returns when reading dbufs kstat
Commit 7b2d78a046 fixed some improper uses
of snprintf(), however, in __dbuf_stats_hash_table_data() the return
value of snprintf is propagated to the caller.  This caused spurious
ENOMEM errors when reading the dbufs kstat.

This commit causes the actual number of characters written to be returned.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3072
2015-02-04 16:35:26 -08:00
avg 037763e44e fix l2arc compression buffers leak
Commit log from FreeBSD:

    We have observed that arc_release() can be called concurrently with a
    l2arc in-flight write.  Also, we have observed that arc_hdr_destroy()
    can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE
    flag in similar circumstances.

    Previously the l2arc headers would be freed while leaking their
    associated compression buffers.  Now the buffers are placed on
    l2arc_free_on_write list for delayed freeing.  This is similar to
    what was already done to arc buffers that were supposed to be freed
    concurrently with in-flight writes of those buffers.

    In addition to fixing the discovered leaks this change also adds
    some protective code to assert that a compression buffer associated
    with a l2arc header is never leaked.

    A new kstat l2_cdata_free_on_write is added.  It keeps a count
    of delayed compression buffer frees which previously would have
    been leaks.

    Tested by:  Vitalij Satanivskij <satan@ukr.net> et al
    Requested by:       many
    MFC after:  2 weeks
    Sponsored by:       HybridCluster / ClusterHQ

References:
    https://illumos.org/issues/5222
    https://github.com/freebsd/freebsd/commit/b98f85d
    http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781
    http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html
    http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3029
2015-02-03 16:54:16 -08:00
Brian Behlendorf 19ea3d25df Use zio buffers in zil_itx_create()
The zil_itx_create() function uses the vmem_alloc() allocator for
its buffers because when logging a write that buffer may be as large
as 64K.  This is non-optimal because we may need to allocate many of
of these buffers and this interface has the potential to be slow.
Instead, use zio_data_buf_alloc() which is specifically designed to
be able to efficiently allocate a wide range of buffer sizes.

In addition, do some cleanup and use the zil_itx_destroy() function
to always free an itx structure.  This way we're always sure the
right allocation functions are used.  Notice that in the current
code kmem_free() and vmem_free() were both used.  This happened to
work because these wrappers map to the same internal SPL function.

This was identified as a potential problem when a low-end memory
constrained system began logging the following warnings.  There
was no deadlock here just repeated allocation failures resulting
in increased latency.

Possible memory allocation deadlock: size=65792 lflags=0x42d0
Pid: 20118, comm: kvm Tainted: P           O 3.2.0-0.bpo.4-amd64
Call Trace:
 [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl]
 [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl]
 [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs]
 [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs]
 [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs]
 [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs]
 [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde
 [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125
 [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a
 [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3059
2015-02-02 11:20:41 -08:00
Brian Behlendorf c7db36a3c4 Optimize vmem_alloc() retry path
For performance reasons the reworked kmem code maps vmem_alloc() to
kmalloc_node() for allocations less than spa_kmem_alloc_max.  This
allows for more concurrency in the system and less contention of
the virtual address space.  Generally, this is a good thing.

However, in the case when the kmalloc_node() fails it makes little
sense to retry it using kmalloc_node() again.  It will likely fail
in exactly the same way.  A smarter strategy is to abandon this
optimization and retry using spl_vmalloc() which is very likely
to succeed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #428
2015-02-02 10:57:56 -08:00
Brian Behlendorf 0365064a97 Handle closing an unopened ZVOL
Thank to commit a4430fce69 we're
now correctly returning EROFS when opening a zvol on a read-only
pool.  Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().

In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:14 -08:00
Brian Behlendorf a127e841de Add zvol_open() error handling for readonly property
Rather than ASSERT when for some reason the readonly property of
a zvol can't be read cleanly handle the failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:06 -08:00
Tim Chase b0cf0676c0 Fix removal of SA in sa_modify_attrs()
The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR().  The write index, "j" is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch remove the increment for SA_REMOVE operations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3028
2015-01-21 16:35:14 -08:00
Richard Yao 841c9d43c7 Use kmem_vasprintf() in log_internal()
An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_asprintf(). It is not clear that switching to
kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_asprintf() is cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.

It also brings this function almost back in sync with Illumos.  This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2791
Issue #2781
2015-01-21 15:30:24 -08:00
Brian Behlendorf 54cccfc2e3 Fix GFP_KERNEL allocations flags
The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and vn_openat()
functions should all use the kmem_flags_convert() function to generate
the GFP_* flags.  This ensures that they can be safely called in any
context and the correct flags will be used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426
2015-01-21 15:25:19 -08:00
Tim Chase 3c832b8cc1 Linux 3.12 compat: split shrinker has s_shrink
The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.

This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2975
2015-01-20 14:07:59 -08:00
Brian Behlendorf 81971b137a Revert "SA spill block cache"
The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations.  Instead a small dedicated
cache of preallocated SA buffers was kept.

This solution was viable while the maximum block size was limited
to 128K.  But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc().  However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.

Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.

This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf 285b29d959 Revert "Pre-allocate vdev I/O buffers"
Commit 86dd0fd added preallocated I/O buffers.  This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos.  A
deadlock in this situation is no longer possible.

However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky.  Either case is acceptable
here because we can safely abort the aggregation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf 79c76d5b65 Change KM_PUSHPAGE -> KM_SLEEP
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory.  This is again the
same as upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:26 -08:00
Brian Behlendorf efcd79a883 Retire KM_NODEBUG
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate.  The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.

A careful reader will notice that not all callers have been changed
to vmem_alloc().  Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k.  This is desirable because it minimizes the need
for Linux specific code changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:40:32 -08:00
Richard Yao 71f8548ea4 Use is_vmalloc_addr() in vdev_disk.c
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf 92119cc259 Mark IO pipeline with PF_FSTRANS
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim.  This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction.  For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf ee33517452 Use __get_free_pages() for emergency objects
The __get_free_pages() function must be used in place of kmalloc()
to ensure the __GFP_COMP is strictly honored.  This is due to
kmalloc() being layered on the generic Linux slab caches.  It
wasn't until recently that all caches were created using __GFP_COMP.
This means that it is possible for a kmalloc() which passed the
__GFP_COMP flag to be returned a non-compound allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:58:11 -08:00
Brian Behlendorf 436ad60faa Fix kmem cache deadlock logic
The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation.  This is
done because large slabs must be allocated using vmalloc().  It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored.  This can result in a deadlock.

Therefore, a deadlock detection strategy was implemented to deal with
this case.  When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab.  Then all callers
attempting to allocate from the cache fall back to using kmalloc()
which does honor all passed flags.

This logic was correct but an optimization in the code allowed for a
deadlock.  Because only slabs backed by vmalloc() can deadlock in the
way described above.  An optimization was made to only invoke this
deadlock detection code for vmalloc() backed caches.  This had the
advantage of making it easy to distinguish these objects when they
were freed.

But this isn't strictly safe.  If all the spl_kmem_cache threads end
up deadlocked than we can't grow any of the other caches either.  This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.

The fix here is to remove the optimization which limits this fall back
allocation stratagy to vmalloc() backed caches.  Doing this means we
may need to take the cache lock in spl_kmem_cache_free() call path.
But this small cost can be mitigated by ignoring objects with virtual
addresses.

For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4, and made tunable.  This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked.  However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread doesn't block allocations from
other caches until the timeout is reached.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf 3018bffa9b Refine slab cache sizing
This change is designed to improve the memory utilization of
slabs by more carefully setting their size.  The way the code
currently works is problematic for slabs which contain large
objects (>1MB).  This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.

The reason the existing code rounds up every slab is because it
assumes it will backed by the buddy allocator.  Since the buddy
allocator can only performs power of two allocations this is
desirable because it avoids wasting any space.  However, this
logic breaks down if slab is backed by vmalloc() which operates
at a page level granularity.  In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).

Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.

This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches.  But depending on the workload it may result in slabs
being allocated and freed more frequently.  In practice, this
has been shown to be a better default for most workloads.

Also the maximum slab size has been reduced to 4MB on 32-bit
systems.  Due to the limited virtual address space it's critical
the we be as frugal as possible.  A limit of 4M still lets us
reasonably comfortably allocate a limited number of 1MB objects.

Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments.  Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf e50e6cc958 Reduce kmem cache deadlock threshold
Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10.  The reduced value is still several orders of
magnitude large enough to avoid being triggered incorrectly.  By
reducing it we allow the system to resolve the issue more quickly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf 1a20496834 Make slab reclaim more aggressive
Many people have noticed that the kmem cache implementation is slow
to release its memory.  This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty.  Unused
objects which are cached in the magazines will still prevent a slab
from being freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Richard Yao a988a35a93 Enforce architecture-specific barriers around clear_bit()
The comment above the Linux 3.16 kernel's clear_bit() states:

/**
 * clear_bit - Clears a bit in memory
 * @nr: Bit to clear
 * @addr: Address to start counting from
 *
 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
 * in order to ensure changes are visible on other processors.
 */

This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Richard Yao c2fa09454e Add hooks for disabling direct reclaim
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions.  When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback.  Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.

Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on
on any KM_PUSHPAGE or KM_NOSLEEP allocation.  This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00