Commit Graph

2134 Commits

Author SHA1 Message Date
Matthew Ahrens
87c25d567f abd_alloc should use scatter for >1K allocations
abd_alloc() normally does scatter allocations, thus solving the problem
that ABD originally set out to: the bulk of ZFS's allocations are single
pages, which are faster to allocate and free, and don't suffer from
internal fragmentation (and the inability to reclaim memory because some
buffers in the slab are still allocated).

However, the current code does linear allocations for 4KB and smaller
allocations, defeating the purpose of ABD.

Scatter ABD's use at least one page each, so sub-page allocations waste
some space when allocated as scatter (e.g. 2KB scatter allocation wastes
half of each page).  Using linear ABD's for small allocations means that
they will be put on slabs which contain many allocations.  This can
improve memory efficiency, but it also makes it much harder for ARC
evictions to actually free pages, because all the buffers on one slab
need to be freed in order for the slab (and underlying pages) to be
freed.  Typically, 512B and 1KB kmem caches have 16 buffers per slab, so
it's possible for them to actually waste more memory than scatter (one
page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting
15/16th).

Spill blocks are typically 512B and are heavily used on systems running
selinux with the default dnode size and the `xattr=sa` property set.

By default we will use linear allocations for 512B and 1KB, and scatter
allocations for larger (1.5KB and up).

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by:  DHE <git@dehacked.net>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8455
2019-02-28 17:52:55 -08:00
Brian Behlendorf
6af7ba417e
Fix overly broad spa config lock
The spa_txg_history_init_io() and spa_txg_history_fini_io() were
mistakenly taking SCL_ALL when only SCL_CONFIG is required to
access the vdev stats.  This could result in a deadlock which
was observed when running ztest.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8445
2019-02-27 10:49:22 -08:00
Serapheim Dimitropoulos
8eef997679 Error path in metaslab_load_impl() forgets to drop ms_sync_lock
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8444
2019-02-25 11:08:52 -08:00
loli10K
c44a3ec059 zvol: allow rename of in use ZVOL dataset
While ZFS allow renaming of in use ZVOLs at the DSL level without issues
the ZVOL layer does not correctly update the renamed dataset if the
device node is open (zv->zv_open_count > 0): trying to access the stale
dataset name, for instance during a zfs receive, will cause the
following failure:

VERIFY3(zv->zv_objset->os_dsl_dataset->ds_owner == zv) failed ((null) == ffff8800dbb6fc00)
PANIC at zvol.c:1255:zvol_resume()
Showing stack for process 1390
CPU: 0 PID: 1390 Comm: zfs Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 ffffffff8151ea00 ffffffffa0758a80 ffff88028aefba30
 ffffffffa0417219 ffff880037179220 ffffffff00000030 ffff88028aefba40
 ffff88028aefb9e0 2833594649524556 6f5f767a3e2d767a 6f3e2d7465736a62
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? rrw_exit+0xc8/0x2e0 [zfs]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? dmu_objset_from_ds+0x9a/0x250 [zfs]
 [<0>] ? dmu_objset_hold_flags+0x71/0xc0 [zfs]
 [<0>] ? zvol_resume+0x178/0x280 [zfs]
 [<0>] ? zfs_ioc_recv_impl+0x88b/0xf80 [zfs]
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? zfs_ioc_recv+0x1c2/0x2a0 [zfs]
 [<0>] ? dmu_buf_get_user+0x13/0x20 [zfs]
 [<0>] ? __alloc_pages_nodemask+0x166/0xb50
 [<0>] ? zfsdev_ioctl+0x896/0x9c0 [zfs]
 [<0>] ? handle_mm_fault+0x464/0x1140
 [<0>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<0>] ? __do_page_fault+0x177/0x410
 [<0>] ? SyS_ioctl+0x81/0xa0
 [<0>] ? async_page_fault+0x28/0x30
 [<0>] ? system_call_fast_compare_end+0x10/0x15

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6263 
Closes #8371
2019-02-22 15:38:42 -08:00
loli10K
0c637f3100 zpool reports 16E expandsize on disks with oddball number of sectors
The issue is caused by a small discrepancy in how userland creates the
partition layout and the kernel estimates available space:

 * zpool command: subtract 9M from the usable device size, then align
   to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M
   EFI "reserved" partition.

 * kernel module: subtract 10M from the device size. 10M is the sum of
   1M "start" partition alignment + 1m "end" partition alignment + 8M
   EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the
alignment size the zpool command will create a partition layout which
reserves less than 1M after the 8M EFI "reserved" partition:

  Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

  Device       Start     End Sectors  Size Type
  /dev/sda1     2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
  /dev/sda9  2078720 2095103   16384    8M Solaris reserved 1

When the kernel module vdev_open() the device its max_asize ends up
being slightly smaller than asize: this results in a huge number (16E)
reported by metaslab_class_expandable_space().

This change prevents bdev_max_capacity() from returing a size smaller
than bdev_capacity().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1468 
Closes #8391
2019-02-22 15:36:34 -08:00
lidongyang
8d9e51c084 Fix dnode_hold_impl() soft lockup
Soft lockups could happen when multiple threads trying
to get zrl on the same dnode handle in order to allocate
and initialize the dnode marked as DN_SLOT_ALLOCATED.

Don't loop from beginning when we can't get zrl, otherwise
we would increase the zrl refcount and nobody can actually
lock it.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Closes #8433
2019-02-22 09:48:37 -08:00
Tomohiro Kusumi
9abbee4912 Don't enter zvol's rangelock for read bio with size 0
The SCST driver (SCSI target driver implementation) and possibly
others may issue read bio's with a length of zero bytes. Although
this is unusual, such bio's issued under certain condition can cause
kernel oops, due to how rangelock is implemented.

rangelock_add_reader() is not made to handle overlap of two (or more)
ranges from read bio's with the same offset when one of them has size
of 0, even though they conceptually overlap. Allowing them to enter
rangelock results in kernel oops by dereferencing invalid pointer,
or assertion failure on AVL tree manipulation with debug enabled
kernel module.

For example, this happens when read bio whose (offset, size) is
(0, 0) enters rangelock followed by another read bio with (0, 4096)
when (0, 0) rangelock is still locked, when there are no pending
write bio's. It can also happen with reverse order, which is (0, N)
followed by (0, 0) when (0, N) is still locked. More details
mentioned in #8379.

Kernel Oops on ->make_request_fn() of ZFS volume
https://github.com/zfsonlinux/zfs/issues/8379

Prevent this by returning bio with size 0 as success without entering
rangelock. This has been done for write bio after checking flusher
bio case (though not for the same reason), but not for read bio.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8379 
Closes #8401
2019-02-20 10:14:36 -08:00
Serapheim Dimitropoulos
928e8ad47d Introduce auxiliary metaslab histograms
This patch introduces 3 new histograms per metaslab. These
histograms track segments that have made it to the metaslab's
space map histogram (and are part of the spacemap) but have
not yet reached the ms_allocatable tree on loaded metaslab's
because these metaslab's are currently syncing and haven't
gone through metaslab_sync_done() yet.

The histograms help when we decide whether to load an unloaded
metaslab in-order to allocate from it. When calculating the
weight of an unloaded metaslab traditionally, we look at the
highest bucket of its spacemap's histogram.  The problem is
that we are not guaranteed to be able to allocated that
segment when we load the metaslab because it may still be at
the freeing, freed, or defer trees. The new histograms are
used when we try to calculate an unloaded metaslab's weight
to deal with this issue by removing segments that have would
not be in the allocatable tree at runtime. Note, that this
method of dealing with this is not completely accurate as
adjacent segments are not always consolidated in the space
map histogram of a metaslab.

In addition and to make things deterministic, we always reset
the weight of unloaded metaslabs based on their space map
weight (instead of doing that on a need basis). Thus, every
time a metaslab is loaded and its weight is reset again (from
the weight based on its space map to the one based on its
allocatable range tree) we expect (and assert) that this
change in weight can only get better if it doesn't stay the
same.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8358
2019-02-20 09:59:56 -08:00
loli10K
bb1be77a35 Prevent user accounting on readonly pool
Trying to mount a dataset from a readonly pool could inadvertently start
the user accounting upgrade task, leading to the following failure:

VERIFY3(tx->tx_threads == 2) failed (0 == 2)
PANIC at txg.c:680:txg_wait_synced()
Showing stack for process 2541
CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs]
 [<0>] ? dmu_object_next+0x77/0x130 [zfs]
 [<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs]
 [<0>] ? txg_wait_synced+0x91/0x220 [zfs]
 [<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs]
 [<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs]
 [<0>] ? taskq_thread+0x2cc/0x5d0 [spl]
 [<0>] ? wake_up_state+0x10/0x10
 [<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl]
 [<0>] ? kthread+0xbd/0xe0
 [<0>] ? kthread_create_on_node+0x180/0x180
 [<0>] ? ret_from_fork+0x58/0x90
 [<0>] ? kthread_create_on_node+0x180/0x180

This patch updates both functions responsible for checking if we can
perform user accounting to verify the pool is not readonly.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8424
2019-02-19 18:41:18 -08:00
Sara Hartse
f545b6ae00 Delay injection can cause indefinitely hung zios
If we hit the (NSEC_TO_TICK(diff) == 0) condition in
zio_delay_interrupt, zio_interrupt is never called and the
zio does not progress.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8404
2019-02-15 14:44:56 -08:00
Tim Chase
638dd5f44e zio_deadman_impl() fix and enhancement
Add the zio_deadman_log_all tunable to print all zios in
zio_deadman_impl().  Also, in all cases, display the depth of the
zio relative to the original parent zio.  This is meant to be used by
developers to gain diagnostic information for hangs which don't involve
fully set-up zio trees or are otherwise stuck or hung in an early stage.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8362
2019-02-15 12:44:24 -08:00
Paul Zuchowski
9c5e88b1de zfs should optionally send holds
Add -h switch to zfs send command to send dataset holds. If
holds are present in the stream, zfs receive will create them
on the target dataset, unless the zfs receive -h option is used
to skip receive of holds.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7513
2019-02-15 12:41:38 -08:00
Tomohiro Kusumi
2d76ab9e42 Fix obsolete comment on rangelock
5d43cc9a59 renamed it to rangelock_enter().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8408
2019-02-14 15:13:58 -08:00
Alek P
65282ee9e0 Freeing throttle should account for holes
Deletion throttle currently does not account for holes in a file.
This means that it can activate when it shouldn't.
To fix it we switch the throttle to be based on the number of
L1 blocks we will have to dirty when freeing

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7725 
Closes #7888
2019-02-12 12:01:08 -08:00
Alek P
dcec0a12c8 port async unlinked drain from illumos-nexenta
This patch is an async implementation of the existing sync
zfs_unlinked_drain() function. This function is called at mount time and
is responsible for freeing znodes that we didn't get to freeing before.
We don't have to hold mounting of the dataset until the unlinked list is
fully drained as is done now. Since we can process the unlinked set
asynchronously this results in a better user experience when mounting a
dataset with entries in the unlinked set.

Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8142
2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos
425d3237ee Get rid of space_map_update() for ms_synced_length
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).

This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.

As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8328
2019-02-12 10:38:11 -08:00
loli10K
d8d418ff0c ZVOLs should not be allowed to have children
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
2019-02-08 15:44:15 -08:00
loli10K
4417096956 Pool allocation classes misplacing small file blocks
Due to an off-by-one condition in spa_preferred_class() we are picking
the "normal" allocation class instead of the "special" one for file
blocks with size equal to the special_small_blocks property value.

This change fix the small code issue, update the ZFS Test Suite and the
zfs(8) man page.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8351
Closes #8361
2019-02-08 12:32:12 -08:00
Tim Chase
0902c4577f Fix ARC stats for embedded blkptrs
Re-factor arc_read() to better account for embedded data blkptrs.
Previously, reading the payload from an embedded blkptr would cause
arcstats such as demand_metadata_misses to be bumped when there was
actually no cache "miss" because the data are already available in
the blkptr.

The following test procedure was used to demonstrate the problem:

   zpool create tank ...
   zfs create -o compression=lz4 tank/fs
   echo blah > /tank/fs/blah
   stat /tank/fs/blah
   grep 'meta.*mis' /proc/spl/kstat/zfs/arcstats

and repeating the last two steps to watch the metadata miss counter
increment.  This can also be demonstrated via the  zfs_arc_miss DTRACE4
probe in arc_read().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8319
2019-02-04 09:33:30 -08:00
Serapheim Dimitropoulos
6c926f426a Simplify log vdev removal code
Get rid of the majority metaslab metadata when removing log vdevs
in spa_vdev_remove_log() with a call to metaslab_fini() instead
of duplicating a lot of that in vdev_remove_empty_log().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8347
2019-01-31 09:17:52 -08:00
Serapheim Dimitropoulos
7558997d2f vs_alloc can underflow in L2ARC vdevs
The current L2 ARC device code consistently uses psize to
increment vs_alloc but varies between psize and lsize when
decrementing it. The result of this behavior is that
vs_alloc can be decremented more that it is incremented
and underflow. This patch changes the code so asize is
used anywhere.

In addition, it ensures that vs_alloc gets incremented by
the L2 ARC device code as buffers are written and not at
the end of the l2arc_write_buffers() routine. The latter
(and old) way would temporarily underflow vs_alloc as
buffers that were just written, would be destroyed while
l2arc_write_buffers() was still looping.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8298
2019-01-31 09:16:39 -08:00
Sara Hartse
2747f599ff Don't acquire zthr_request_lock in zthr_wakeup
Address a deadlock caused by simultaneous wakeup and cancel on a zthr
by remove the hold of zthr_request_lock from zthr_wakeup. This
allows thr_wakeup to not block a thread that is in the process of
being cancelled.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8333
2019-01-30 12:31:16 -08:00
Brian Behlendorf
26a856594f Linux 5.0 compat: Fix bio_set_dev()
The Linux 5.0 kernel updated the bio_set_dev() macro so it calls the
GPL-only bio_associate_blkg() symbol thus inadvertently converting
the entire macro.  Provide a minimal version which always assigns the
request queue's root_blkg to the bio.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8287
2019-01-28 10:11:45 -08:00
Tony Hutter
05805494dd Linux 5.0 compat: Convert MS_* macros to SB_*
In the 5.0 kernel, only the mount namespace code should use the MS_*
macos. Filesystems should use the SB_* ones.

https://patchwork.kernel.org/patch/10552493/

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8264
2019-01-28 10:11:39 -08:00
Tony Hutter
031cea17a3 Linux 5.0 compat: Use totalram_pages()
totalram_pages() was converted to an atomic variable in 5.0:

https://patchwork.kernel.org/patch/10652795/

Its value should now be read though the totalram_pages() helper
function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8263
2019-01-28 10:11:14 -08:00
Serapheim Dimitropoulos
c853f382db Change target size of metaslabs from 256GB to 16GB
= Old behavior

For vdev sizes 100GB to 50TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 256GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.

= New Behavior

For vdev sizes 100GB to 3TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 16GB.
For vdev's bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.

= Reasoning

The old behavior makes metaslabs grow in size when
the vdev range is between 3TB (ms_size 16GB) and
32PB (ms_size 256GB). Even though keeping the number
of metaslabs is good in terms of potential number of
I/Os per TXG, these bigger metaslabs take longer
to be loaded and after they are loaded they can
take up a lot of memory because of their range trees.

This change tries to put a boundary in memory and
loading time for the specific range of vdev sizes.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8324
2019-01-25 16:38:27 -08:00
Serapheim Dimitropoulos
df72b8bebe Rename range_tree_verify to range_tree_verify_not_present
The range_tree_verify function looks for a segment in a
range tree and panics if the segment is present on the
tree. This patch gives the function a more descriptive
name.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8327
2019-01-25 09:51:24 -08:00
Tim Chase
107dd2b174 Use proper tag for spa config refcounts in mmp_write_uberblock()
This allows the spa config refcounts to use tracking in debug builds
without triggering the "No such hold %p on refcount" panic.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8326
2019-01-25 09:50:06 -08:00
Tom Caputi
b5d693581d Fix bad kmem_free() in zvol_rename_minors_impl()
Currently, zvol_rename_minors_impl() calls kmem_asprintf()
to allocate and initialize a string. This function is a thin
wrapper around the kernel's kvasprintf() and does not call
into the SPL's kmem tracking code when it is enabled. However,
this function frees the string with the tracked kmem_free()
instead of the untracked strfree(), which causes the SPL
kmem tracking code to believe that the function is attempting
to free memory it never allocated, triggering an ASSERT. This
patch simply corrects this issue.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8307
2019-01-23 11:38:05 -08:00
loli10K
0a10863194 ztest: creates partially initialized root dataset
Since d8fdfc2 was integrated dsl_pool_create() does not call
dmu_objset_create_impl() for the root dataset when running in
userland (ztest): this creates a pool with a partially initialized
root dataset. Trying to import and use this pool results in both
zpool and zfs executables dumping core.

Fix this by adopting an alternative change suggested in OpenZFS 8607
code review.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Original-patch-by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8277
2019-01-18 11:14:01 -08:00
Brian Behlendorf
ad63507135
Remove zfs_sync() panicking kernel check
This check provides no real additional protection and unnecessarily
introduces a dependency on the "oops_in_progress" kernel symbol.
Remove the check, it there are special circumstances on other
platforms which make this a requirement it can be reintroduced
for all relevant call paths in a more portable comprehensive manor.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8297
2019-01-18 11:11:47 -08:00
Serapheim Dimitropoulos
b194fab0fb Factor metaslab_load_wait() in metaslab_load()
Most callers that need to operate on a loaded metaslab, always
call metaslab_load_wait() before loading the metaslab just in
case someone else is already doing the work.

Factoring metaslab_load_wait() within metaslab_load() makes the
later more robust, as callers won't have to do the load-wait
check explicitly every time they need to load a metaslab.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8290
2019-01-18 11:10:32 -08:00
Tom Caputi
960347d3a6 Fix 0 byte memory leak in zfs receive
Currently, when a DRR_OBJECT record is read into memory in
receive_read_record(), memory is allocated for the bonus buffer.
However, if the object doesn't have a bonus buffer the code will
still "allocate" the zero bytes, but the memory will not be passed
to the processing thread for cleanup later. This causes the spl
kmem tracking code to report a leak. This patch simply changes the
code so that it only allocates this memory if it has a non-zero
length.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8266
2019-01-18 11:06:48 -08:00
loli10K
60b0a963f5 Off-by-one in zap_leaf_array_create()
Trying to set user properties with their length 1 byte shorter than the
maximum size triggers an assertion failure in zap_leaf_array_create():

  panic[cpu0]/thread=ffffff000a092c40:
  assertion failed: num_integers * integer_size < (8<<10) (0x2000 < 0x2000), file: ../../common/fs/zfs/zap_leaf.c, line: 233

  ffffff000a092500 genunix:process_type+167c35 ()
  ffffff000a0925a0 zfs:zap_leaf_array_create+1d2 ()
  ffffff000a092650 zfs:zap_entry_create+1be ()
  ffffff000a092720 zfs:fzap_update+ed ()
  ffffff000a0927d0 zfs:zap_update+1a5 ()
  ffffff000a0928d0 zfs:dsl_prop_set_sync_impl+5c6 ()
  ffffff000a092970 zfs:dsl_props_set_sync_impl+fc ()
  ffffff000a0929b0 zfs:dsl_props_set_sync+79 ()
  ffffff000a0929f0 zfs:dsl_sync_task_sync+10a ()
  ffffff000a092a80 zfs:dsl_pool_sync+3a3 ()
  ffffff000a092b50 zfs:spa_sync+4e6 ()
  ffffff000a092c20 zfs:txg_sync_thread+297 ()
  ffffff000a092c30 unix:thread_start+8 ()

This patch simply corrects the assertion.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8278
2019-01-18 09:58:46 -08:00
Serapheim Dimitropoulos
8dc2197b7b Simplify spa_sync by breaking it up to smaller functions
The point of this refactoring is to break the high-level conceptual 
steps of spa_sync() to their own helper functions. In general large 
functions can enhance readability if structured well, but in this
case the amount of conceptual steps taken could use the help of 
helper functions.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8293
2019-01-18 09:50:16 -08:00
Tom Caputi
305781da4b Fix error handling incallers of dbuf_hold_level()
Currently, the functions dbuf_prefetch_indirect_done() and
dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot
fail. In the event of an error the former will cause a NULL pointer
dereference and the later will trigger a VERIFY. This patch adds
error handling to these functions and their callers where necessary.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8291
2019-01-17 15:47:08 -08:00
Serapheim Dimitropoulos
75058f3303 Remove unused vdev_t fields
The following fields from the vdev_t struct are not used anywhere.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8285
2019-01-17 15:41:12 -08:00
Brian Behlendorf
52b684236d
ztest: scrub ddt repair
The ztest_ddt_repair() test is designed inflict damage to the
ddt which can be repairable by a scrub.  Unfortunately, this
repair logic was broken at some point and it went undetected.
This issue is not specific to ztest, but thankfully this extra
redundancy is rarely enabled and even more rarely needed.

The root cause was identified to be the ddt_bp_create()
function called by dsl_scan_ddt_entry() which did not set the
dedup bit of the generated block pointer.

The consequence of this was that the ZIO_DDT_READ_PIPELINE was
never enabled for the block pointer during the scrub, and the
dedup ditto repair logic was never run.  Note that for demand
reads which don't rely on ddt_bp_create() the required pipeline
stages would be enabled and the repair performed.

This was resolved by unconditionally setting the dedup bit in
ddt_bp_create().  This way all codes paths which may need to
perform a repair from a block pointer generated from the dtt
entry will be able too.  The only exception is that the dedup
bit is cleared in ddt_phys_free() which is required to avoid
leaking space.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8270
2019-01-17 15:25:00 -08:00
Serapheim Dimitropoulos
419ba59145 Update vdev_is_spacemap_addressable() for new spacemap encoding
Since the new spacemap encoding was ported to ZoL that's no longer 
a limitation. This patch updates vdev_is_spacemap_addressable() 
that was performing that check.

It also updates the appropriate test to ensure that the same 
functionality is tested.  The test does so by creating pools that 
don't have the new spacemap encoding enabled - just the checkpoint
feature. This patch also reorganizes that same tests in order to 
cut in half its memory consumption.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8286
2019-01-16 15:06:20 -08:00
Brian Behlendorf
64bdf63f5c
ztest: split block reconstruction
Increase the default allowed number of reconstruction attempts.
There's not an exact right number for this setting.  It needs
to be set large enough to cover any realistic failure scenarios
and small enough to avoid stalling the IO pipeline and invoking
the dead man detection.

The current value of 256 was empirically determined to be too
low based on multi-day runs of ztest.  The fault injection code
would inject more damage than could be reconstructed given the
relatively small number of attempts.  However, in all observed
cases the block could be reconstructed using a slightly higher
limit.

Based on local testing increasing the default value to 4096 was
determined to strike the best balance.  Checking all combinations
takes less than 10s in the worst case, and has so far eliminated
the vast majority of false positives detected by ztest.  This
delay is roughly on par with how long retries may be performed
to a misbehaving HDD and was deemed to be reasonable.  Better to
err on the side of a brief delay rather than fail to reconstruct
the data.

Lastly, the -Y flag has been added to zdb to make it easy to try all
possible combinations when performing split block reconstruction.
For badly damaged blocks with 18 splits, they can be fully enumerated
within a few minutes.  This has been done to ensure permanent errors
are never incorrectly reported when ztest verifies the pool with zdb.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8271
2019-01-16 14:10:02 -08:00
Tom Caputi
5e7f3ace58 Fix zio leak in dbuf_read()
Currently, dbuf_read() may decide to create a zio_root which is
used as a parent for any child zios created in dbuf_read_impl().
However, if there is an error in dbuf_read_impl(), this zio is
never executed and ends up leaked. This patch simply ensures
that we always execute the root zio, even i it has no real work
to do.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8267
2019-01-15 12:23:40 -08:00
Brian Behlendorf
d611989fdc
Minor spelling corrections
Some minor spelling mistakes and typos.  No functional changes.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8272
2019-01-13 10:11:52 -08:00
Serapheim Dimitropoulos
61c3391acc Serialize ZTHR operations to eliminate races
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8229
2019-01-13 10:09:46 -08:00
Paul Zuchowski
83c796c5e9 zfs filesystem skipped by df -h
On full pool when pool root filesystem references very few bytes,
the f_blocks returned to statvfs is 0 but should be at least 1.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #8253 
Closes #8254
2019-01-13 10:06:13 -08:00
Brian Behlendorf
6955b40138
Provide more flexible object allocation interface
Object allocation performance can be improved for complex operations
by providing an interface which returns the newly allocated dnode.
This allows the caller to immediately use the dnode without incurring
the expense of looking up the dnode by object number.

The functions dmu_object_alloc_hold(), zap_create_hold(), and
dmu_bonus_hold_by_dnode() were added for this purpose.

The zap_create_* functions have been updated to take advantage of
this new functionality.  The dmu_bonus_hold_impl() function should
really have never been included in sys/dmu.h and was removed.
It's sole caller was converted to use dmu_bonus_hold_by_dnode().

The new symbols have been exported for use by Lustre.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8015
2019-01-10 14:37:43 -08:00
Tom Caputi
58769a4ebd Don't allow dnode allocation if dn_holds != 0
This patch simply fixes a small bug where dnode_hold_impl() could
attempt to allocate a dnode that was in the process of being freed,
but which still had active references. This patch simply adds the
required check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8249
2019-01-10 14:36:23 -08:00
loli10K
0f5f23869a zfs receive and rollback can skew filesystem_count
This commit fixes a small issue which causes both zfs receive and
rollback operations to incorrectly increase the "filesystem_count"
property value.

This change also adds a new test group "limits" to the ZFS Test Suite
to exercise both filesystem_count/limit and snapshot_count/limit
functionality.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8232
2019-01-08 10:17:46 -08:00
asomers
f384c045d8 OpenZFS 8473 - scrub does not detect errors on active spares
Scrubbing is supposed to detect and repair all errors in the pool.
However, it wrongly ignores active spare devices. The problem can
easily be reproduced in OpenZFS at git rev 0ef125d with these
commands:

    truncate -s 64m /tmp/a /tmp/b /tmp/c
    sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
    sudo zpool replace testpool /tmp/a /tmp/c
    /bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
    sync
    sudo zpool scrub testpool
    zpool status testpool # Will show 0 errors, which is wrong
    sudo zpool offline testpool /tmp/a
    sudo zpool scrub testpool
    zpool status testpool # Will show errors on /tmp/c,
                          # which should've already been fixed

FreeBSD head is partially affected: the first scrub will detect
some errors, but the second scrub will detect more.  This same
test was run on Linux before applying the fix and the FreeBSD
head behavior was observed.

Authored by: asomers <asomers@FreeBSD.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored by: Spectra Logic Corp

OpenZFS-issue: https://www.illumos.org/issues/8473
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/e20ec8879
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/554675ee
Closes #8251
2019-01-08 09:51:30 -08:00
George Wilson
c10d37dd9f zfs initialize performance enhancements
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the intialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.

This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist.  The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s.  This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.

Updated the innvl and outnvl comments to match the formatting used
for all other new sytle ioctls.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8230
2019-01-07 11:03:08 -08:00
George Wilson
619f097693 OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========

The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.

SOLUTION
=========

This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
        - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
                - start, suspend, or cancel initialization
        - Creates new open-context thread for each vdev
        - Thread iterates through all metaslabs in this vdev
        - Each metaslab:
                - select a metaslab
                - load the metaslab
                - mark the metaslab as being zeroed
                - walk all free ranges within that metaslab and translate
                  them to ranges on the leaf vdev
                - issue a "zeroing" I/O on the leaf vdev that corresponds to
                  a free range on the metaslab we're working on
                - continue until all free ranges for this metaslab have been
                  "zeroed"
                - reset/unmark the metaslab being zeroed
                - if more metaslabs exist, then repeat above tasks.
                - if no more metaslabs, then we're done.

        - progress for the initialization is stored on-disk in the vdev’s
          leaf zap object. The following information is stored:
                - the last offset that has been initialized
                - the state of the initialization process (i.e. active,
                  suspended, or canceled)
                - the start time for the initialization

        - progress is reported via the zpool status command and shows
          information for each of the vdevs that are initializing

Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
  written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2019-01-07 10:37:26 -08:00
Brian Behlendorf
65ca2c1eb9
Fix 'zpool remap' freeing race
The dmu_objset_remap_indirects_impl() logic depends on dnode_hold()
returning ENOENT for dnodes which will be freed and should be skipped.

This behavior can only be relied upon when taking a new hold and
while the caller has an open transaction.  This ensures that the
open txg cannot advance and that a concurrent free will end up
in the same txg (which is critical).  Relying on an existing hold
will not prevent dnode_free() from succeeding.

The solution is to take an additional dnode_hold() after assigning
the transaction.  This ensures the remap will never dirty the dnode
if it was freed while we were waiting in dmu_tx_assign(, TXG_WAIT).

Randomly set zfs_object_remap_one_indirect_delay_ms in ztest.  This
increases the likelihood of an operation racing with the remap.
Converted from ticks to milliseconds.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8215
2019-01-02 11:46:04 -08:00
Brad Lewis
3ec34e5527 OpenZFS 9284 - arc_reclaim_thread has 2 jobs
Following the fix for 9018 (Replace kmem_cache_reap_now() with
kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks
while reaping.  However, the code is still confusing and error-prone,
because this thread has two responsibilities.  We should instead
separate this into two threads each with their own responsibility:

 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
    improves `arc_is_overflowing()`

 2. keep enough free memory in the system, by calling
    `arc_kmem_reap_now()` plus `arc_shrink()`, which improves
    `arc_available_memory()`.

Furthermore, we can use the zthr infrastructure to separate the
"should we do something" from "do it" parts of the logic, and
normalize the start up / shut down of the threads.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by:  Brad Lewis <brad.lewis@delphix.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9284
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9
Closes #8165
2018-12-26 13:22:28 -08:00
Tom Caputi
00f198de6b Fix zfs_dirty_data_sync_percent documentation
In dfbe2675 zfs_dirty_data_sync was changed to a new tunable named
zfs_dirty_data_sync_percent. Unfortunately, the module parameter
documentation is the code was not updated accordingly. This patch
simply corrects that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8212
2018-12-18 14:47:33 -08:00
Tom Caputi
2a6078450d Fix zap_update() ASSERT from ztest
This patch simply removes an invalid assert from the zap_update()
function. The ASSERT is invalid because it does not hold the zap
lock from the time it fetches the old value to the time it confirms
that it is what it should be.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8209
2018-12-14 10:04:11 -08:00
Andriy Gapon
dc1c630b8a OpenZFS 9630 - add lzc_rename and lzc_destroy to libzfs_core
Porting Notes:
* Additional changes to recv_rename_impl() were required due to
  encryption code not being merged in OpenZFS yet.
* libzfs_core python bindings (pyzfs) were updated to fully support
  both lzc_rename() and lzc_destroy()

Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: loli10K <ezomori.nozomu@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/9630
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/049ba63
Closes #8207
2018-12-14 09:49:45 -08:00
Tom Caputi
5aa95ba0d3 Fix resilver writes in vdev_indirect_io_start
This patch addresses an issue found in ztest where resilver
write zios that were passed to an indirect vdev would end up
being handled as though they were resilver read zios. This
caused issues where the zio->io_abd would be both read to
and written from at the same time, causing asserts to fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8193
2018-12-13 14:18:48 -08:00
Olaf Faaland
fa61e72340 Rename macro ZFS_MINOR due to Lustre conflict
Macro ZFS_MINOR, introduced in commit a6cc9756 to record the chosen
static minor number for /dev/zfs, conflicts with an existing macro
in Lustre.  The lustre macro (along with _MAJOR, _PATCH, _FIX) is
used to record the zfsonlinux version Lustre is being built against.

Since the Lustre macro came first, and is used in past versions of
lustre at least going back to 2.10, it makes sense to rename the
macro in ZFS instead of doing so in Lustre which would require
backporting the patch.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8195
2018-12-10 16:52:49 -08:00
Prakash Surya
900d09b285 OpenZFS 9962 - zil_commit should omit cache thrash
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.

Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).

Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intenion is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.

The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).

For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.

Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to acheive
the best possible performance out of the pool's underlying storage.

This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).

Thus, in our prior example with 10 lwbs, it's only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).

Porting Notes:
* Reconciled conflicts with the fastwrite feature.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6
Closes #8188
2018-12-07 11:09:42 -08:00
Prakash Surya
53b1f5eac6 OpenZFS 9963 - Separate tunable for disabling ZIL vdev flush
Porting Notes:
* Add options to zfs-module-parameters(5) man page.
* zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since
  the latter doesn't get built for user space.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9963
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125
Closes #8186
2018-12-07 11:06:29 -08:00
George Wilson
18b14b17c8 OpenZFS 9993 - zil writes can get delayed in zio pipeline
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9993
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2258ad0b
Closes #8185
2018-12-07 11:05:35 -08:00
Tom Caputi
d6496040d9 Ensure dsl scan prefetch queue is emptied
This patch simply ensures that scn->scn_prefetch_queue is emptied
before the kernel module is unloaded and when scanning completes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8178
2018-12-06 09:47:23 -08:00
Brian Behlendorf
78e2139467
Fix dnode_hold() freeing dnode behavior
Commit 4c5b89f59 refactored dnode_hold() and in the process
accidentally introduced a slight change in behavior which was
not intended.  The required behavior is that once the ZPL,
or other consumer, declares its intent to free a dnode then
dnode_hold() should immediately start failing.  This updated
code wouldn't return the failure until after it was freed.

When DNODE_MUST_BE_ALLOCATED is set it must return ENOENT, and
when DNODE_MUST_BE_FREE is set it must return EEXIST;

This issue was uncovered by ztest_remap() which attempted
to remap a freeing object which should have been skipped as
described by the comment in dmu_objset_remap_indirects_impl().

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8172
2018-12-05 09:29:33 -08:00
Tom Caputi
fedef6dd59 Fix ztest deadlock in spa_vdev_remove()
This patch corrects an issue where spa_vdev_remove() would
call spa_history_log_internal() while holding the spa config
lock. This function may decide to block until the next txg if
the current one seems too full. However, since the thread is
holding the config log, the txg sync thread cannot progress
and the system ends up deadlocked. This patch simply moves
all calls to spa_history_log_internal() outside of the config
lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8162
2018-12-04 09:54:05 -08:00
Brian Behlendorf
7c9a42921e
Detect IO errors during device removal
* Detect IO errors during device removal

While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs.  Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool.  Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4'.  Which
in this case will result in nothing being copied since one vdev
in each of those groups in unavailable.  However, device removal
will complete successfully since all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs.  Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.

Situation 2: individual hard IO errors

During removal if an unexpected hard IO error is encountered when
either reading or writing the child vdev the entire removal
operation is cancelled.  While it may be possible to reconstruct
the data after removal that cannot be guaranteed.  The only
strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried.  But that work
is left for another time, hard IO errors during the removal process
are expected to be exceptionally rare.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900
Closes #8161
2018-12-04 09:37:37 -08:00
Tom Caputi
c40a1124e1 Fix consistency of ztest_device_removal_active
ztest currently uses the boolean flag ztest_device_removal_active
to protect some tests that may not run successfully if they occur
at the same time as ztest_device_removal(). Unfortunately, in the
event that ztest is in the middle of a device removal when it
decides to issue a SIGKILL, the device removal will be
automatically restarted (without setting the flag) when the pool
is re-imported on the next run. This patch corrects this by
ensuring that any in-progress removals are completed before running
further tests after the re-import.

This patch also makes a few small changes to prevent race conditions
involving the creation and destruction of spa->spa_vdev_removal,
since this field is not protected by any locks. Some checks that
may run concurrently with setting / unsetting this field have been
updated to check spa->spa_removing_phys.sr_state instead. The most
significant change here is that spa_removal_get_stats() no longer
accounts for in-flight work done, since that could result in a NULL
pointer dereference.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8105
2018-11-28 20:47:09 -08:00
LOLi
c71c8c715b zfs_dbgmsg() is not safe from every context
This commit reverts to using printk() instead of zfs_dbgmsg() to log
messages in vdev_disk_error(): this is necessary because the latter can
be called from interrupt context where we are not allowed to sleep.
Unfortunately zfs_dbgmsg() performs its allocations calling kmalloc()
with the KM_SLEEP flag which may result in the following oops:

   BUG: scheduling while atomic: swapper/4/0/0x10000100
	Call Trace:
	<IRQ>  [<0>] dump_stack+0x19/0x1b
	...
	[<0>] spl_kmem_alloc+0xdf/0x140 [spl] <-- kmem_alloc(size, KM_SLEEP)
	[<0>] __dprintf+0x69/0x150 [zfs]
	[<0>] ? kmem_cache_free+0x1e2/0x200
	[<0>] vdev_disk_error.part.15+0x5f/0x70 [zfs]
	[<0>] vdev_disk_io_flush_completion+0x48/0x70 [zfs]
	[<0>] bio_endio+0x67/0xb0
	[<0>] blk_update_request+0x90/0x360
	...
	[<0>] scsi_finish_command+0xdc/0x140
	[<0>] scsi_softirq_done+0x132/0x160
	[<0>] blk_done_softirq+0x96/0xc0
	[<0>] __do_softirq+0xf5/0x280
	[<0>] call_softirq+0x1c/0x30
	[<0>] do_softirq+0x65/0xa0
	[<0>] irq_exit+0x105/0x110
	[<0>] do_IRQ+0x56/0xf0
	[<0>] common_interrupt+0x162/0x162
	<EOI>  [<0>] ? cpuidle_enter_state+0x54/0xd0
	[<0>] cpuidle_idle_call+0xde/0x230
	[<0>] arch_cpu_idle+0xe/0xb0
	[<0>] cpu_startup_entry+0x14a/0x1e0
	[<0>] start_secondary+0x1f7/0x270
	[<0>] start_cpu+0x5/0x14

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8137 
Closes #8150
2018-11-28 11:29:57 -08:00
Tom Caputi
cef48f14da Remove races from scrub / resilver tests
Currently, several tests in the ZFS Test Suite that attempt to
test scrub and resilver behavior occasionally fail. A big reason
for this is that these tests use a combination of zinject and
zfs_scan_vdev_limit to attempt to slow these operations enough
to verify their test commands. This method works most of the time,
but provides no guarantees and leads to flaky behavior. This patch
adds a new tunable, zfs_scan_suspend_progress, that ensures that
scans make no progress, guaranteeing that tests can be run without
racing.

This patch also changes zfs_remove_max_bytes_pause to match this
new tunable. This provides some consistency between these two
similar tunables and ensures that the tunable will not misbehave
on 32-bit systems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8111
2018-11-28 10:12:08 -08:00
LOLi
c8fd652ce7 Fix coverity defects: CID 184285
CID 184285: Read from pointer after free (USE_AFTER_FREE)

This patch fixes an use-after-free in vdev_config_generate_stats()
moving the kmem_free() call at the end of the function.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8120
2018-11-11 18:09:00 -08:00
loli10K
d48091de81 zed: detect and offline physically removed devices
This commit adds a new test case to the ZFS Test Suite to verify ZED
can detect when a device is physically removed from a running system:
the device will be offlined if a spare is not available in the pool.

We implement this by using the existing libudev functionality and
without relying solely on the FM kernel module capabilities which have
been observed to be unreliable with some kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1537
Closes #7926
2018-11-09 11:17:24 -08:00
Tony Hutter
ad796b8a3b Add zpool status -s (slow I/Os) and -p (parseable)
This patch adds a new slow I/Os (-s) column to zpool status to show the
number of VDEV slow I/Os. This is the number of I/Os that didn't
complete in zio_slow_io_ms milliseconds. It also adds a new parsable
(-p) flag to display exact values.

 	NAME         STATE     READ WRITE CKSUM  SLOW
 	testpool     ONLINE       0     0     0     -
	  mirror-0   ONLINE       0     0     0     -
 	    loop0    ONLINE       0     0     0    20
 	    loop1    ONLINE       0     0     0     0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7756
Closes #6885
2018-11-08 16:47:24 -08:00
George Melikov
877d925a9e Update zfs_admin_snapshot value (disabled)
It's disabled by default, update code and tests to reflect
the documentation.

Minor cleanup in delegate_common.kshlib.

Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7835 
Closes #8045
2018-11-08 16:17:12 -08:00
Tom Caputi
20eb30d08e Fix divide by zero during indirect split damage
This patch simply ensures that vdev_indirect_splits_damage()
cannot hit a divide by zero exception if a split has no
children with valid data. The normal reconstruction code
path in vdev_indirect_reconstruct_io_done() already has this
check.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8086
2018-11-07 15:44:56 -08:00
Tom Caputi
fde25c0a87 Fix dirtying vdev config on with RO spa
This patch simply corrects an issue where vdev_dtl_reassess()
could attempt to dirty the vdev config even when the spa was
not elligable for writing.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8085
2018-11-07 15:44:14 -08:00
Tom Caputi
f44ad9297d Replay logs before starting ztest workers
This patch ensures that logs are replayed on all datasets prior
to starting ztest workers. This ensures that the call to
vdev_offline() a log device in ztest_fault_inject() will not fail
due to the log device being required for replay.

This patch also fixes a small issue found during testing where
spa_keystore_load_wkey() does not check that the dataset specified
is an encryption root. This check was present in libzfs, however.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8084
2018-11-07 15:40:24 -08:00
Tom Caputi
ac53e50f79 Fix vdev removal finishing race
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6900 
Closes #8083
2018-11-07 15:38:10 -08:00
Tom Caputi
4021ba4cfa Make vdev_set_deferred_resilver() recursive
vdev_clear() can call vdev_set_deferred_resilver() with a
non-leaf vdev to setup a deferred resilver. However, this
function is currently written to only handle leaf vdevs.
This bug was introduced with deferred resilvers in 80a91e74.
This patch makes this function recursive so that it can find
appropriate vdevs to resilver and set vdev_resilver_deferred
on them.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #7732
Closes #8082
2018-11-07 15:33:17 -08:00
Brian Behlendorf
09b85f2ded
ztest: reduce gangblock creation
In order to validate the gang block code ztest is configured to
artificially force a fraction of large blocks to be written as
gang blocks.  The default setting chosen for this was to
write 25% of all blocks 32k or larger using gang blocks.

The confluence of an unrealistically large number of gang blocks,
the aggressive fault injection done by ztest, and the split
segment reconstruction logic introduced by device removal has
resulted in the following type of failure:

  zdb -bccsv -G -d ... exit code 3

Specifically, zdb was unable to open the pool because it was
unable to reconstruct a damaged block.  Manual investigation
of multiple failures clearly showed that the block could be
reconstructed.  However, due to the large number of damaged
segments (>35) it could not be done in the allotted time.

Furthermore, the large number of gang blocks was determined
to be the reason for the unrealistically large number of
damaged segments.  In order to make this situation less
likely, this change both increases the forced gang block
size to 64k and reduces the frequency to 3% of blocks.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8080
2018-11-05 11:53:49 -08:00
Don Brady
e89f1295d4 Add libzutil for libzfs or libzpool consumers
Adds a libzutil for utility functions that are common to libzfs and
libzpool consumers (most of what was in libzfs_import.c).  This
removes the need for utilities to link against both libzpool and
libzfs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8050
2018-11-05 11:22:33 -08:00
Matthew Ahrens
9553c533a6 bpobj_enqueue_subobj() should copy small subobj's
When we delete a snapshot, we consolidate some bpobj's together because
we no longer need to keep their entries in separate buckets.  This is
done in constant time by including the "sub" bpobj by reference in the
parent bpobj.

After many snapshots have been deleted, we may have many sub-bpobj's.
Usually, most sub-bpobj's don't contain many BP's.  Compared to this
small payload, the sub-bpobj is relatively heavyweight since it is a
object in the MOS.  A common scenario on a long-lived pool is for the
vast majority of MOS objects to be small sub-bpobj's.

To improve this situation, when consolidating bpobj's together,
bpobj_enqueue_subobj() can copy the contents of small bpobj's into the
parent, and then delete the enqueued bpobj, rather than including it by
reference.  Since this copying is limited in size (to one block), the
consolidation is still constant time, though with a larger constant due
to reading in the one block of the enqueued bpobj.

This idea and mechanism are similar to how we handle "sub-subobj's".
When including a sub-bpobj by reference, if the sub-bpobj itself has
less than a block of sub-sub-bpobj's, the list of sub-sub-bpobj's is
copied to the parent bpobj's list of sub-bpobj's.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8053 
Issue #7908
2018-10-31 11:58:17 -05:00
Tom Caputi
8cb119e3dc Fix 2 small bugs with cached dsl_scan_phys_t
This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:41 -07:00
Tom Caputi
7ab96299e5 Fix ENXIO from spa_ld_verify_logs() in ztest
This patch fixes a small issue where the zil_check_log_chain()
code path would hit an EBUSY error. This would occur when
2 threads attempted to call metaslab_activate() at the same time.
In this case, the "loser" would receive an error code which should
have been ignored, but was instead floated to the caller. This
ended up resulting in an ENXIO being returned from from
spa_ld_verify_logs().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:33 -07:00
Tom Caputi
4a7eb69a5a Fix ztest deadman panic with indirect vdev damage
This patch fixes an issue where ztest's deadman thread would
trigger a panic because reconstructing artifically damaged
blocks would take too long to reconstruct. This patch simply
limits how often ztest inflicts split-block damage and how
many segments it can damage when it does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:31 -07:00
Tom Caputi
5e0bd0ae05 Fix issue with scanning dedup blocks as scan ends
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:15 -07:00
Tom Caputi
a783dd9684 Fix lock inversion in txg_sync_thread()
This patch fixes a lock inversion issue in txg_sync_thread() where
the code would attempt hold the spa config lock as a reader while
holding tx->tx_sync_lock. This races with spa_vdev_remove() which
attempts to hold the tx->tx_sync_lock to assign a new tx (via
spa_history_log_internal()) while holding the spa config lock as a
writer.

Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:02 -07:00
Tom Caputi
ab4c009e3d Fix dbgmsg printing in ztest and zdb
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:50 -07:00
Tom Caputi
c04812f964 Fix ASSERT in zil_create() during ztest
This patch corrects an ASSERT in zil_create() that will only be
true if the call to zio_alloc_zil() does not fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:40 -07:00
Tom Caputi
9410257800 Fix random ztest_deadman_thread failures
The zloop test has been failing in buildbot for the last few weeks
with various failures in ztest_deadman_thread(). This is due to the
fact that this thread is not stopped when performing pool import /
export tests as it should be. This patch simply corrects this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:21 -07:00
Paul Dagnelie
ae3d849142 OpenZFS 9688 - aggsum_fini leaks memory
Porting Notes:
- Most of these fixes were applied in the original 37fb3e43
  commit when this change was ported for Linux.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9688
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bf2d68be
Closes #8042
2018-10-19 12:08:03 -07:00
Serapheim Dimitropoulos
9b2266e3d8 OpenZFS 9682 - page fault in dsl_async_clone_destroy() while opening pool
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/9682
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ade2c82828
Closes #8037
2018-10-19 12:06:21 -07:00
Serapheim Dimitropoulos
ee900344f2 OpenZFS 9690 - metaslab of vdev with no space maps was flushed during removal
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9690
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4e75ba6826
Closes #8039
2018-10-19 12:05:03 -07:00
Matthew Ahrens
d637db98e1 OpenZFS 9681 - ztest failure in spa_history_log_internal due to spa_rename()
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9681
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6aee0ad7
Closes #8041
2018-10-19 12:02:28 -07:00
Tom Caputi
80a91e7469 Defer new resilvers until the current one ends
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.

This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee 
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7732
2018-10-18 21:06:18 -07:00
Matthew Ahrens
49394a7708 Linux does not HAVE_SMB_SHARE
Since Linux does not have an in-kernel SMB server, we don't need the
code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8032
2018-10-17 10:31:38 -07:00
Matthew Ahrens
5fbf85c4e2 Linux does not HAVE_DNLC
Since Linux does not have the Directory Name Lookup Cache, we don't need
the code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8031
2018-10-17 10:30:08 -07:00
Paul Dagnelie
d52d80b700 Add types to featureflags in zfs
The boolean featureflags in use thus far in ZFS are extremely useful,
but because they take advantage of the zap layer, more interesting data
than just a true/false value can be stored in a featureflag. In redacted
send/receive, this is used to store the list of redaction snapshots for
a redacted dataset.

This change adds the ability for ZFS to store types other than a boolean
in a featureflag. The only other implemented type is a uint64_t array.
It also modifies the interfaces around dataset features to accomodate
the new capabilities, and adds a few new functions to increase
encapsulation.

This functionality will be used by the Redacted Send/Receive feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7981
2018-10-16 11:15:04 -07:00
ilbsmart
779a6c0bf6 deadlock between mm_sem and tx assign in zfs_write() and page fault
The bug time sequence:
1. thread #1, `zfs_write` assign a txg "n".
2. In a same process, thread #2, mmap page fault (which means the
   `mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed,
   and wait previous txg "n" completed.
3. thread #1 call `uiomove` to write, however page fault is occurred
   in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by
   thread #2, so it stuck and can't complete,  then txg "n" will
   not complete.

So thread #1 and thread #2 are deadlocked.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes #7939
2018-10-16 11:11:24 -07:00
Matthew Ahrens
0aa5916a30
OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects (#7979)
OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects

We're leaking the dd_clones objects in dsl_dir_destroy_sync.  This bug
appears to have been around forever.  Thankfully the amount of space
typically involved is tiny.

In addition this adds a mechanism in ZDB to find objects in the MOS
which are leaked (not referenced anywhere).

Porting notes:
* Added dd_crypto_obj to ZDB MOS object leak tracking

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://illumos.org/issues/9847
Closes #7979
2018-10-12 11:28:26 -07:00
Brian Behlendorf
27f80e85c2 Improved error handling for extreme rewinds
The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and
vdev_obsolete_counts_are_precise() functions assume that the
only way a zap_lookup() can fail is if the requested entry is
missing.  While this is the most common cause, it's not the only
cause.  Attemping to access a damaged ZAP will result in other
errors.

The most likely scenario for accessing a damaged ZAP is during
an extreme rewind pool import.  Under these conditions the pool
is expected to contain damaged objects and the import code was
updated to handle this gracefully.  Getting an ECKSUM error from
these ZAPs after the pool in import a far less likely, therefore
the behavior for call paths was not modified.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7921
2018-10-12 11:24:04 -07:00
Brian Behlendorf
d6c745830f Revert "Allow ECKSUM in vdev_checkpoint_sm_object()"
This reverts commit e927fc8a52.

Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7921
2018-10-12 11:24:04 -07:00
Matt Ahrens
5d43cc9a59 OpenZFS 9689 - zfs range lock code should not be zpl-specific
The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific
data structures, specifically znode_t.  However, it's also used by
the ZVOL code, which uses a "dummy" znode_t to pass to the range
locking code.

We should clean this up so that the range locking code is generic
and can be used equally by ZPL and ZVOL, and also can be used by
future consumers that may need to run in userland (libzpool) as
well as the kernel.

Porting notes:
* Added missing sys/avl.h include to sys/zfs_rlock.h.
* Removed 'dbuf is within the locked range' ASSERTs from dmu_sync().
  This was needed because ztest does not yet use a locked_range_t.
* Removed "Approved by:" tag requirement from OpenZFS commit
  check to prevent needless warnings when integrating changes
  which has not been merged to illumos.
* Reverted free_list range lock changes which were originally
  needed to defer the cv_destroy() which was called immediately
  after cv_broadcast().  With d2733258 this should be safe but
  if not we may need to reintroduce this logic.
* Reverts: The following two commits were reverted and squashed in
  to this change in order to make it easier to apply OpenZFS 9689.
  - d88895a0, which removed the dummy znode from zvol_state
  - e3a07cd0, which updated ztest to use range locks
* Preserved optimized rangelock comparison function.  Preserved the
  rangelock free list.  The cv_destroy() function will block waiting
  for all processes in cv_wait() to be scheduled and drop their
  reference.  This is done to ensure it's safe to free the condition
  variable.  However, blocking while holding the rl->rl_lock mutex
  can result in a deadlock on Linux.  A free list is introduced to
  defer the cv_destroy() and kmem_free() until after the mutex is
  released.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9689
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680
External-issue: DLPX-58662
Closes #7980
2018-10-11 10:19:33 -07:00
Paul Dagnelie
0391690583 Refactor dmu_recv into its own file
This change moves the bottom half of dmu_send.c (where the receive
logic is kept) into a new file, dmu_recv.c, and does similarly
for receive-related changes in header files.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7982
2018-10-09 14:05:13 -07:00
Brian Behlendorf
5e8ff25644 Fix arc_release() refcount
Update arc_release to use arc_buf_size().  This hunk was accidentally
dropped when porting compressed send/recv, 2aa34383b.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8000
2018-10-09 10:10:26 -07:00
Brian Behlendorf
d7e4b30a67 Add zfs_refcount_transfer_ownership_many()
When debugging is enabled and a zfs_refcount_t contains multiple holders
using the same key, but different ref_counts, the wrong reference_t may
be transferred.  Add a zfs_refcount_transfer_ownership_many() function,
like the existing zfs_refcount_*_many() functions, to match and transfer
the correct refcount_t;

This issue may occur when using encryption with refcount debugging
enabled.  An arc_buf_hdr_t can have references for both the
hdr->b_l1hdr.b_pabd and hdr->b_crypt_hdr.b_rabd both of which use
the hdr as the reference holder.  When unsharing the buffer the
p_abd should be transferred.

This issue does not impact production builds because refcount holders
are not tracked.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7219
Closes #8000
2018-10-09 10:05:48 -07:00
Matthew Ahrens
4cbde2ecbf Create /proc/sys/kernel/spl/gitrev with git hash
The existing mechanisms for determining what code is running in the
kernel do not always correctly report the git hash.  The versions
reported there do not reflect changes made since `configure` was run
(i.e. incremental builds do not update the version) and they are
misleading if git tags are not set up properly.  This applies to
`modinfo zfs`, `dmesg`, and `/sys/module/zfs/version`.

There are complicated requirements on how the existing version is
generated.  Therefore we are leaving that alone, and adding a new
mechanism to record and retrieve the git hash:
`cat /proc/sys/kernel/spl/gitrev`

The gitrev is re-generated at compile time, when running `make`
(including for incremental builds).  The value is the output of `git
describe` (or "unknown" if not in a git repo or there are uncommitted
changes).

We're also removing /proc/sys/kernel/spl/version, which was never very
useful.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7931 
Closes #7965
2018-10-08 21:57:02 -07:00
Matthew Ahrens
dfbe267503 OpenZFS 9617 - too-frequent TXG sync causes excessive write inflation
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
  changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba
Closes #7976
2018-10-04 13:13:28 -07:00
Brad Lewis
c955398b52 OpenZFS 9677 - panic from zio_write_gang_block()
Panic from zio_write_gang_block() when creating dump device
on fragmented rpool.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7341a7d
Closes #7975
2018-10-03 09:50:06 -07:00
Tom Caputi
52ce99dd61 Refcounted DSL Crypto Key Mappings
Since native ZFS encryption was merged, we have been fighting
against a series of bugs that come down to the same problem: Key
mappings (which must be present during all I/O operations) are
created and destroyed based on dataset ownership, but I/Os can
have traditionally been allowed to "leak" into the next txg after
the dataset is disowned.

In the past we have attempted to solve this problem by trying to
ensure that datasets are disowned ater all I/O is finished by
calling txg_wait_synced(), but we have repeatedly found edge cases
that need to be squashed and code paths that might incur a high
number of txg syncs. This patch attempts to resolve this issue
differently, by adding a reference to the key mapping for each txg
it is dirtied in. By doing so, we can remove many of the
unnecessary calls to txg_wait_synced() we have added in the past
and ensure we don't need to deal with this problem in the future.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7949
2018-10-03 09:47:11 -07:00
Jerry Jelinek
f65fbee1e7 OpenZFS 9700 - ZFS resilvered mirror does not balance reads
Authored by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/82f63c3c
Closes #7973
2018-10-02 16:18:24 -07:00
Tim Schumacher
424fd7c3e0 Prefix all refcount functions with zfs_
Recent changes in the Linux kernel made it necessary to prefix
the refcount_add() function with zfs_ due to a name collision.

To bring the other functions in line with that and to avoid future
collisions, prefix the other refcount functions as well.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7963
2018-10-01 10:42:05 -07:00
Brian Behlendorf
1258bd778e
Refine split block reconstruction
Due to a flaw in 4589f3ae the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900 
Closes #7934
2018-10-01 10:36:34 -07:00
John Gallagher
d12614521a Fixes for procfs files backed by linked lists
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):

* We don't account for the fact that seq_file sometimes visits a node
  multiple times, which results in missing messages when read through
  procfs.
* We don't keep separate state for each reader of a file, so concurrent
  readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
  the list between read syscalls, so reading from these files in procfs
  can cause the system to crash.

This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes #7819
2018-09-26 11:08:12 -07:00
Tim Schumacher
c13060e478 Linux 4.19-rc3+ compat: Remove refcount_t compat
torvalds/linux@59b57717f ("blkcg: delay blkg destruction until
after writeback has finished") added a refcount_t to the blkcg
structure. Due to the refcount_t compatibility code, zfs_refcount_t
was used by mistake.

Resolve this by removing the compatibility code and replacing the
occurrences of refcount_t with zfs_refcount_t.

Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7885 
Closes #7932
2018-09-26 10:29:26 -07:00
Brian Behlendorf
7a23c81342
Fix small sysfs leak
When zfs_kobj_init() is called with an attr_cnt of 0 only the
kobj->zko_default_attrs is allocated.  It subsequently won't
get freed in zfs_kobj_release since the free is wrapped in
a kobj->zko_attr_count != 0 conditional.

Split the block in zfs_kobj_release() to make sure the
kobj->zko_default_attrs are freed in this case.

Additionally, fix a minor spelling mistake and typo in
zfs_kobj_init() which could also cause a leak but in practice
is almost certain not to fail.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7957
2018-09-26 09:50:58 -07:00
Brian Behlendorf
e897a23eb1
Fix statfs(2) for 32-bit user space
When handling a 32-bit statfs() system call the returned fields,
although 64-bit in the kernel, must be limited to 32-bits or an
EOVERFLOW error will be returned.

This is less of an issue for block counts since the default
reported block size in 128KiB. But since it is possible to
set a smaller block size, these values will be scaled as
needed to fit in a 32-bit unsigned long.

Unlike most other filesystems the total possible file counts
are more likely to overflow because they are calculated based
on the available free space in the pool. In order to prevent
this the reported value must be capped at 2^32-1. This is
only for statfs(2) reporting, there are no changes to the
internal ZFS limits.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7927 
Closes #7122 
Closes #7937
2018-09-24 17:11:25 -07:00
LOLi
dda5500853 vdev_disk_error() prints ASCII SOH to debug log
Currently vdev_disk_error() prepends its messages sent to the internal
ZFS debug log with KERN_WARNING, which is currently defined as follows:

   #define KERN_SOH      "\001"
   #define KERN_WARNING  KERN_SOH "4"

Since "\001" (ASCII Start Of Header) is not printable this results in
weird characters displayed when inspecting the debug log. This commit
simply removes this superfluous prefix passed to zfs_dbgmsg().

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7936
2018-09-21 09:42:42 -07:00
LOLi
7522a26077 Add limits to spa_slop_shift tunable
This change adds limits to the possible spa_slop_shift values set via
the sysfs interface. Accepted values are from a minimum of 1 to a
maximum of 31 (inclusive): these limits are based on the following
values observed on a 128PB file-vdev test pool:

spa_slop_shift=1, spa_get_slop_space=63.5PiB
spa_slop_shift=2, spa_get_slop_space=31.8PiB
spa_slop_shift=3, spa_get_slop_space=15.9PiB
spa_slop_shift=4, spa_get_slop_space=7.9PiB
spa_slop_shift=5, spa_get_slop_space=4PiB
spa_slop_shift=6, spa_get_slop_space=2PiB
...
spa_slop_shift=25, spa_get_slop_space=4GiB
spa_slop_shift=26, spa_get_slop_space=2GiB
spa_slop_shift=27, spa_get_slop_space=1016MiB
spa_slop_shift=28, spa_get_slop_space=508MiB
spa_slop_shift=29, spa_get_slop_space=254MiB
spa_slop_shift=30, spa_get_slop_space=128MiB
spa_slop_shift=31, spa_get_slop_space=128MiB
spa_slop_shift=32, spa_get_slop_space=128MiB

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7876 
Closes #7900
2018-09-20 21:10:12 -07:00
Roman Strashkin
733b5722b4 zpool split can create a corrupted pool
Added vdev_resilver_needed() check to verify VDEVs are fully
synced, so that after split the new pool will not be corrupted.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Closes #7865
Closes #7881
2018-09-12 18:14:42 -07:00
Don Brady
73a5ec30bf Fix in-kernel sysfs entries
The recent sysfs zfs properties feature breaks the in-kernel
builds of zfs (sans module).  When not built as a module add
the sysfs entries under /sys/fs/zfs/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7868 
Closes #7872
2018-09-06 21:44:52 -07:00
Don Brady
e7b677aa5d Fix zfs_sysfs_live test failure
The ZTS zfs_sysfs_live test fails occasionally due to an uninitialized
string on an error path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7869
2018-09-06 17:36:00 -07:00
Don Brady
cc99f275a2 Pool allocation classes
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #5182
2018-09-05 18:33:36 -07:00
Chris Siebenmann
cfa37548eb Correctly handle errors from kern_path
As a regular kernel function, kern_path() returns errors as negative
errnos, such as -ELOOP. zfsctl_snapdir_vget() must convert these into
the positive errnos used throughout the ZFS code when it returns them
to other ZFS functions so that the ZFS code properly sees them as
errors.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Siebenmann <cks.git01@cs.toronto.edu>
Closes #7764
Closes #7864
2018-09-04 22:26:56 -07:00
Rich Ercolani
0405eeea6a Added recalculation of ARC stats mid-eviction
Re-adds a recalculation step for the ARC stats after the MRU
eviction so that we don't pathologically attempt to evict the MFU.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Mark Johnston <markj@freebsd.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #7855
2018-09-04 22:15:14 -07:00
Brian Behlendorf
27ca030fa6 Revert "Update zfs_admin_snapshot default value (disabled)"
This reverts commit a6214a0ae9.
Disabling zfs_admin_snapshot by default results in multiple ZTS
tests failing which depend on this functionality.  Revert this
change until the relevant test cases can be updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7838
2018-09-04 22:00:21 -07:00
George Melikov
a6214a0ae9 Update zfs_admin_snapshot default value (disabled)
It's disabled by default, update code to reflect
the documentation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7835 
Closes #7838
2018-09-04 17:21:24 -07:00
mav
c197a77c3c OpenZFS 9751 - Allocation throttling misplacing ditto blocks
Relax allocation throttling for ditto blocks.  Due to random imbalances
in allocation it tends to push block copies to one vdev, that looks
slightly better at the moment.  Slightly less strict policy allows both
improve data security and surprisingly write performance, since we don't
need to touch extra metaslabs on each vdev to respect the min distance.

Sponsored by:	iXsystems, Inc.

Authored by: mav <mav@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9751
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/8253837ac3
Closes #7857
2018-09-02 12:22:45 -07:00
mav
e38afd34c3 OpenZFS 9738 - Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.

Authored by: mav <mav@FreeBSD.org>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9738
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/63e7138
Closes #7858
2018-09-02 12:21:54 -07:00
Don Brady
b83a0e2dc1 Add basic zfs ioc input nvpair validation
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).

Programmatically document, via a zfs_ioc_key_t, the valid arguments 
for the ioc commands that rely on nvpair input arguments (i.e. non 
legacy commands from libzfs_core). Automatically verify the expected 
pairs before dispatching a command.

This initial phase focuses on the non-legacy ioctls. A follow-on 
change can address the legacy ioctl input from the zfs_cmd_t.

The zfs_ioc_key_t for zfs_keys_channel_program looks like:

static const zfs_ioc_key_t zfs_keys_channel_program[] = {
       {"program",     DATA_TYPE_STRING,               0},
       {"arg",         DATA_TYPE_UNKNOWN,              0},
       {"sync",        DATA_TYPE_BOOLEAN_VALUE,        ZK_OPTIONAL},
       {"instrlimit",  DATA_TYPE_UINT64,               ZK_OPTIONAL},
       {"memlimit",    DATA_TYPE_UINT64,               ZK_OPTIONAL},
};

Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE, 
EBADF, and E2BIG).

ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7780
2018-09-02 12:14:01 -07:00
Don Brady
e8bcb693d6 Add zfs module feature and property info to sysfs
This extends our sysfs '/sys/module/zfs' entry to include feature 
and property attributes. The primary consumer of this information 
is user processes, like the zfs CLI, that need to know what the 
current loaded ZFS module supports. The libzfs binary will consult 
this information when instantiating the zfs and zpool property 
tables and the pool features table.

This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
  features.runtime
  features.pool
  properties.dataset
  properties.pool

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7706
2018-09-02 12:09:53 -07:00
Brian Behlendorf
e927fc8a52
Allow ECKSUM in vdev_checkpoint_sm_object()
The checkpoint space map object may not be accessible from the
vdev's ZAP when it has been damaged.  This may be the case when
performing an extreme rewind when importing the pool.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7853
2018-08-31 14:20:34 -07:00
Matthew Ahrens
adb726eb0e clean up __dbuf_hold_impl
We can simplify the dbuf_hold code by allocating dbuf_hold_arg_t's on
demand, rather than allocating a big array of them up front.  While this
can occasionally increase the number of allocations, typically only one
allocation is needed since the indirect block is already cached.

The performance test suite gets the same results with this change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7841
2018-08-31 10:16:54 -07:00
Tom Caputi
c3bd3fb4ac OpenZFS 9403 - assertion failed in arc_buf_destroy()
Assertion failed in arc_buf_destroy() when concurrently reading
block with checksum error.

Porting notes:
* The ability to zinject decompression errors has been added, but
  this only works at the zio_decompress() level, where we have all
  of the info we need to match against the user's zinject options.
* The decompress_fault test has been added to test the new zinject
  functionality
* We attempted to set zio_decompress_fail_fraction to (1 << 18) in
  ztest for further test coverage. Although this did uncover a few
  low priority issues, this unfortuantely also causes ztest to
  ASSERT in many locations where the code is working correctly since
  it is designed to fail on IO errors. Developers can manually set
  this variable with the '-o' option to find and debug issues.

Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Tom Caputi <tcaputi@datto.com>

OpenZFS-issue: https://illumos.org/issues/9403
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9
Closes #7822
2018-08-29 11:33:33 -07:00
Tom Caputi
47ab01a18f Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
2018-08-27 10:16:28 -07:00
Tom Caputi
8c4fb36a24 Small rework of txg_list code
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7795
2018-08-27 10:16:01 -07:00
Brian Behlendorf
a584ef2605
Direct IO support
Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g.  readahead) while preventing contention for system
memory between the database and kernel caches.

On Illumos, there is a library function called directio(3C) that
allows user space to provide a hint to the file system that Direct IO
is useful, but the file system is free to ignore it. The semantics
are also entirely a file system decision. Those that do not
implement it return ENOTTY.

Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.

    1.  Minimize cache effects of the I/O.

    By design the ARC is already scan-resistant which helps mitigate
    the need for special O_DIRECT handling.  Data which is only
    accessed once will be the first to be evicted from the cache.
    This behavior is in consistent with Illumos and FreeBSD.

    Future performance work may wish to investigate the benefits of
    immediately evicting data from the cache which has been read or
    written with the O_DIRECT flag.  Functionally this behavior is
    very similar to applying the 'primarycache=metadata' property
    per open file.

    2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.

    No additional alignment or length restrictions are imposed.

    3. O_DIRECT _MAY_ perform unbuffered IO operations directly
       between user memory and block device.

    No unbuffered IO operations are currently supported.  In order
    to support features such as transparent compression, encryption,
    and checksumming a copy must be made to transform the data.

    4. O_DIRECT _MAY_ imply O_DSYNC (XFS).

    O_DIRECT does not imply O_DSYNC for ZFS.  Callers must provide
    O_DSYNC to request synchronous semantics.

    5. O_DIRECT _MAY_ disable file locking that serializes IO
       operations.  Applications should avoid mixing O_DIRECT
       and normal IO or mmap(2) IO to the same file.  This is
       particularly true for overlapping regions.

    All I/O in ZFS is locked for correctness and this locking is not
    disabled by O_DIRECT.  However, concurrently mixing O_DIRECT,
    mmap(2), and normal I/O on the same file is not recommended.

This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations.  Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.

References:
  * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
  * https://blogs.oracle.com/roch/entry/zfs_and_directio
  * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
  * https://illumos.org/man/3c/directio

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224 
Closes #7823
2018-08-27 10:04:21 -07:00
LOLi
c434d8806c Stack overflow when destroying deeply nested clones
Destroy operations on deeply nested chains of clones can overflow
the stack:

        Depth    Size   Location    (221 entries)
        -----    ----   --------
  0)    15664      48   mutex_lock+0x5/0x30
  1)    15616       8   mutex_lock+0x5/0x30
...
 26)    13576      72   dsl_dataset_remove_clones_key.isra.4+0x124/0x1e0 [zfs]
 27)    13504      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
 28)    13432      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
...
185)     2128      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
186)     2056      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
187)     1984      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
188)     1912     136   dsl_destroy_snapshot_sync_impl+0x4e0/0x1090 [zfs]
189)     1776      16   dsl_destroy_snapshot_check+0x0/0x90 [zfs]
...
218)      304     128   kthread+0xdf/0x100
219)      176      48   ret_from_fork+0x22/0x40
220)      128     128   kthread+0x0/0x100

Fix this issue by converting dsl_dataset_remove_clones_key() from
recursive to iterative.

Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7279 
Closes #7810
2018-08-22 11:03:31 -07:00
Tim Chase
2711b1d05f s/VERIFY/VERIFY3S in vdev_checkpoint_sm_object
Using VERIFY3S allows to view the unexpected error value in the system
log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #7809 
Closes #7818
2018-08-21 16:08:14 -07:00
Tom Caputi
149ce888bb Fix issues with raw receive_write_byref()
This patch fixes 2 issues with raw, deduplicated send streams. The
first is that datasets who had been completely received earlier in
the stream were not still marked as raw receives. This caused
problems when newly received datasets attempted to fetch raw data
from these datasets without this flag set.

The second problem was that the arc freeze checksum code was not
consistent about which locks needed to be held while performing
its asserts. The proper locking needed to run these asserts is
actually fairly nuanced, since the asserts touch the linked list
of buffers (requiring the header lock), the arc_state (requiring
the b_evict_lock), and the b_freeze_cksum (requiring the
b_freeze_lock). This seems like a large performance sacrifice and
a lot of unneeded complexity to verify that this relatively small
debug feature is working as intended, so this patch simply removes
these asserts instead.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7701
2018-08-20 11:03:56 -07:00
Serapheim Dimitropoulos
a448a2557e Introduce read/write kstats per dataset
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #7705
2018-08-20 09:52:37 -07:00
Brian Behlendorf
4338c5c06f
Fix traverse_impl() kmem leak
The error path must free the memory allocated by this function or
it will be leaked.  In practice, this would leak only a few bytes
of memory under rare circumstances and thus is unlikely to have
caused any real problems.  This issue was caught by the kmemleak.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7791
2018-08-15 09:53:44 -07:00
Tom Caputi
1fff937a4c Check encrypted dataset + embedded recv earlier
This patch fixes a bug where attempting to receive a send stream
with embedded data into an encrypted dataset would not cleanup
that dataset when the error was reached. The check was moved into
dmu_recv_begin_check(), preventing this issue.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:49:19 -07:00
Tom Caputi
d9c460a0b6 Added encryption support for zfs recv -o / -x
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:

* Receiving an unencrypted stream as encrypted without
  needing to create a "dummy" encrypted parent so that
  encryption can be inheritted.

* Allowing users to change their keylocation on receive,
  so long as the receiving dataset is an encryption root.

* Allowing users to explicitly exclude or override the
  encryption property from an unencrypted properties stream,
  allowing it to be received as encrypted.

* Receiving a recursive heirarchy of unencrypted datasets,
  encrypting the top-level one and forcing all children to
  inherit the encryption.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:48:49 -07:00
Tomohiro Kusumi
fe8a7982ca Fix comment on calculating blkid
Fix comment on calculating blkid at level n within dnode's blkptrs.
"(2^(level*(indblkshift - SPA_BLKPTRSHIFT)" is part of divisor
in this division.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7768
2018-08-13 13:33:47 -07:00
LOLi
c8c308362c Allow inherited properties in zfs_check_settable()
This change modifies how 'checksum' and 'dedup' properties are verified
in zfs_check_settable() handling the case where they are explicitly
inherited in the dataset hierarchy when receiving a recursive send
stream.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7755 
Closes #7576 
Closes #7757
2018-08-03 14:56:25 -07:00
Don Brady
fc1ecd16d7 zfs_ioc_unload_key can drop extra spa ref
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7759
2018-08-03 14:50:51 -07:00
Matthew Ahrens
62840030a7 Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching.  We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736
2018-08-02 15:51:45 -07:00
John Gallagher
499b5497cb Add missing checks to zpl_xattr_* functions
Linux specific zpl_* entry points, such as xattrs, must include
the same unmounted and sa handle checks as the common zfs_ entry
points. The additional ZPL_* wrappers are identical to their
ZFS_ counterparts except the errno is negated since they are
expected to be used at the zpl_ layer.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #5866 
Closes #7761
2018-08-02 14:03:56 -07:00
George Wilson
3d503a76e8 Fix OpenZFS 9337 mismerge
This change reintroduces logic required by OpenZFS 9577. When
OpenZFS 9337, zfs get all is slow due to uncached metadata, was
merged in it ended up removing logic required by OpenZFS 9577,
remove zfs_dbuf_evict_key, and inadvertently reintroduced the
bug that 9577 was designed to fix.

This change re-enables the "evicting" flag to dbuf_rele_and_unlock
and dnode_rele_and_unlock and updates all callers to provide the
correct parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #7758
2018-08-02 10:21:48 -07:00
Rohan Puri
fd7265c646 Fix deadlock between zfs umount & snapentry_expire
zfs umount -> zfsctl_destroy() takes the zfs_snapshot_lock as a
writer and calls zfsctl_snapshot_unmount_cancel(), which waits
for snapentry_expire() if present (when snap is automounted).
This snapentry_expire() itself then waits for zfs_snapshot_lock
as a reader, resulting in a deadlock.

The fix is to only hold the zfs_snapshot_lock over the tree
lookup and removal.  After a successful lookup the lock can
be dropped and zfs_snapentry_t will remain valid until the
reference taken by the lookup is released.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
Closes #7751
Closes #7752
2018-08-01 15:00:02 -07:00
Paul Dagnelie
492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
Don Brady
dae3e9ea21 OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system
In the case of one pool being built on another pool, we want
to make sure we don't end up throttling the lower (backing)
pool when the upper pool is the majority contributor to dirty
data. To insure we make forward progress during throttling, we
also check the current pool's net dirty data and only throttle
if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
data in the cache.

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent,
  zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent
  were intentially not added as tunable module parameters.

OpenZFS-issue: https://illumos.org/issues/9465
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef
Closes #7749
2018-07-30 11:30:41 -07:00
Serapheim Dimitropoulos
6b64382b17 OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations
= Motivation

While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.

An example of the above scenario can be seen in the following
flamegraph, where more than have the time of the zfsdev_ioctl() is spent
on constructing nvlists.  Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg

Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.

= What this patch does

In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# number of entries) to
O(1)* (the start is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).

= Performance Analysis

To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.

== Results (before patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        5.3
user        0.4
sys         2.3

[Trial 5]
real        8.2
user        0.4
sys         2.4

[Trial 6]
real        6.0
user        0.5
sys         2.3
```

== Results (after patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        4.9
user        0.0
sys         0.9

[Trial 5]
real        3.8
user        0.0
sys         0.9

[Trial 6]
real        3.6
user        0.0
sys         0.9
```

== Analysis

The results between the trials are consistent so in this sections I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.

From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996
samples counts.  Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfs_dev_ioctl().

From trial-4 we see that the user time dropped to 0 secods. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).

Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
  zpool_do_events_nvprint().

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1
Closes #7748
2018-07-30 11:30:03 -07:00
Matthew Ahrens
1897bc0d48 OpenZFS 9439 - ZFS double-free due to failure to dirty indirect block
Follow up commit for OpenZFS 9438.  See the OpenZFS-issue link below
for a complete analysis.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9439
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/779220d
External-issue: DLPX-46861
Closes #7746
2018-07-30 09:28:09 -07:00
Paul Dagnelie
21d48b5eac OpenZFS 9438 - Holes can lose birth time info if a block has a mix of birth times
As reported by https://github.com/zfsonlinux/zfs/issues/4996, there is
yet another hole birth issue. In this one, if a block is entirely holes,
but the birth times are not all the same, we lose that information by
creating one hole with the current txg as its birth time.

The ZoL PR's fix approach is incorrect. Ultimately, the problem here is
that when you truncate and write a file in the same transaction group,
the dbuf for the indirect block will be zeroed out to deal with the
truncation, and then written for the write. During this process, we will
lose hole birth time information for any holes in the range. In the case
where a dnode is being freed, we need to determine whether the block
should be converted to a higher-level hole in the zio pipeline, and if
so do it when the dnode is being synced out.

Porting Notes:
* The DMU_OBJECT_END change in zfs_znode.c was already applied.
* Added test cases from #5675 provided by @rincebrain for hole_birth
  issues.  These test cases should be pushed upstream to OpenZFS.
* Updated mk_files which is used by several rsend tests so the
  files created are a little more interesting and may contain holes.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9438
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/738e2a3c
External-issue: DLPX-46861
Closes #7746
2018-07-30 09:27:49 -07:00
Toomas Soome
5fadb7fb0c OpenZFS 8906 - uts: illumos rootfs should support salted cksum
Porting notes:
* As of grub-2.02 these checksums are not supported.  However, as
  pointed out in #6501 there are alternatives such as EFISTUB which
  work and have no such restriction.  A warning was added to the
  checksum property section of the zfs.8 man page.

Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/8906
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #6501
Closes #7714
2018-07-27 08:35:28 -07:00
Matthew Ahrens
3a549dc7a1 OpenZFS 9442 - decrease indirect block size of spacemaps
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation.  Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
  and dmu_object_alloc_dnsize() functions to use a common shared
  dmu_object_alloc_impl() function.

OpenZFS-issue: https://www.illumos.org/issues/9442
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408b
Closes #7712
2018-07-25 14:11:35 -07:00
Feng Sun
750e1f88d3 Introduce kstat dmu_tx_dirty_frees_delay
It is helpful to tune zfs_per_txg_dirty_frees_percent for commit
539d33c7(OpenZFS 6569 - large file delete can starve out write ops).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #7718
2018-07-25 09:52:27 -07:00
Matthew Ahrens
802b1a7b3b OpenZFS 9338 - moved dnode has incorrect dn_next_type
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

While investigating a different problem, I noticed that moved dnodes
(those processed by dnode_move_impl() via kmem_move()) have an incorrect
dn_next_type. This could cause the on-disk dn_type to be changed to an
invalid value. The fix to copy the dn_next_type in dnode_move_impl().

Porting notes:
* For the moment this potential issue cannot occur on Linux since
  the SPL does not provide the kmem_move() functionality.

OpenZFS-issue: https://illumos.org/issues/9338
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0717e6f13
Closes #7715
2018-07-24 17:10:42 -07:00
Tom Caputi
b7ddeaef3d Refactor arc_hdr_realloc_crypt()
The arc_hdr_realloc_crypt() function is responsible for converting
a "full" arc header to an extended "crypt" header and visa versa.
This code was originally written with a bcopy() so that any new
members added to arc headers would automatically be included
without requiring a code change. However, in practice this (along
with small differences in kmem_cache implementations between
various platforms) has caused a number of hard-to-find problems in
ports to other operating systems. This patch solves this problem
by making all member copies explicit and adding ASSERTs for fields
that cannot be set during the transfer. It also manually resets the
old header after the reallocation is finished so it can be properly
reallocated and reused.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7711
2018-07-24 12:20:04 -07:00
Steven Noonan
863522b1f9 dsl_scan_scrub_cb: don't double-account non-embedded blocks
We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.

The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.

This was apparently a regression introduced in commit 00c405b4b5,
which was an incorrect port of this OpenZFS commit:

    https://github.com/openzfs/openzfs/commit/d8a447a7

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720 
Closes #7738
2018-07-24 09:33:56 -07:00
Brian Behlendorf
d441e85dd7
Add support for autoexpand property
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure.  Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.

Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev.  The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid.  If
a match is found the ZED reopens the pool vdev.  This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read.  Otherwise, it wouldn't
be possible to report the maximum possible expansion size.

Finally, if the property autoexpand=on a vdev expansion will be
attempted.  After performing some sanity checks on the disk to
verify that it is safe to expand,  the primary partition (-part1)
will be expanded and the partition table updated.  The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.

In order to make all of the above possible the following changes
were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
  These tests now create a pool which is layered on a loopback,
  scsi_debug, and file vdev.  This allows for testing of non-
  partitioned block device (loopback), a partition block device
  (scsi_debug), and a file which does not receive udev change
  events.  This provided for better test coverage, and by removing
  the layering on ZFS volumes there issues surrounding layering
  one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
  This allows for matching by guid rather than path which is a
  more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result
  in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality which can be abandoned
  in favor of kernel provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being
  reopened.  This is important to prevent errors from occurring
  for any configuration related IOs which bypass the SCL_ZIO lock.
  The zpool_reopen_007_pos.ksh test case was added to verify IO
  error are never observed when reopening.  This is not expected
  to impact IO performance.

Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
  ZFS volumes.  This is as good as a unique physical path, while the
  volumes are not used in the test cases anymore for other reasons
  this improvement was included.

Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #120
Closes #2437
Closes #5771
Closes #7366
Closes #7582
Closes #7629
2018-07-23 15:40:15 -07:00
Matthew Ahrens
2e5dc449c1 OpenZFS 9337 - zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:

The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf.  Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.

Performance Testing:

 - Disabled kmem_flags
 - Tuned dbuf_cache_max_bytes very low (128K)
 - Tuned zfs_arc_max very low (64M)

Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.

Results:

    | Test                   | Time (trunk / diff) | I/Os (trunk / diff) |
    +------------------------+---------------------+---------------------+
    | zpool import           |     0:05 / 0:06     |    12.9k / 12.9k    |
    | zfs get all (uncached) |     1:36 / 0:53     |    16.7k / 5.7k     |
    | zfs get all (cached)   |     1:36 / 0:51     |    16.0k / 6.0k     |

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
2018-07-12 10:49:27 -07:00
Don Brady
e4e94ca315 OpenZFS 9426 - metaslab size can exceed offset addressable by spacemap
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9426
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f1c88afb1
Closes #7700
2018-07-11 15:55:48 -07:00
Andriy Gapon
e902ddb0f8 OpenZFS 9479 - fix wrong format specifier for vdev_id
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9479
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/20aa447c
Closes #7699
2018-07-11 15:53:02 -07:00
Brian Behlendorf
ac09630d8b
Fix zpl_mount() deadlock
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock.  This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7598 
Closes #7659 
Closes #7691 
Closes #7693
2018-07-11 15:49:10 -07:00
Brian Behlendorf
33a19e0fd9
Fix kernel unaligned access on sparc64
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time.  Otherwise
fallback to using the sa_copy_data() function.

The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space.  In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7642 
Closes #7684
2018-07-11 13:10:40 -07:00
Matthew Ahrens
2dca37d8dc OpenZFS 9424 - ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')"
Ztest failed with the following crash.

    ::status

    debugging core file of ztest (64-bit) from clone-dc-slave-280-bc7947b1.dcenter
    file: /usr/bin/amd64/ztest
    initial argv: /usr/bin/amd64/ztest
    threading model: raw lwps
    status: process terminated by SIGABRT (Abort), pid=2150 uid=1025 code=-1
    panic message: failure for thread 0xfffffd7fff112a40, thread-id 1: unprotected error in call to Lua API (Invalid
    value type 'function' for key 'error')

    ::stack

    libc.so.1`_lwp_kill+0xa()
    libc.so.1`_assfail+0x182(fffffd7fffdfe8d0, 0, 0)
    libc.so.1`assfail+0x19(fffffd7fffdfe8d0, 0, 0)
    libzpool.so.1`vpanic+0x3d(fffffd7ffaa58c20, fffffd7fffdfeb00)
    0xfffffd7ffaa28146()
    0xfffffd7ffaa0a109()
    libzpool.so.1`luaD_throw+0x86(3011a48, 2)
    0xfffffd7ffa9350d3()
    0xfffffd7ffa93e3f1()
    libzpool.so.1`zcp_lua_to_nvlist+0x33(3011a48, 1, 2686470, fffffd7ffaa2e2c3)
    libzpool.so.1`zcp_convert_return_values+0xa4(3011a48, 2686470, fffffd7ffaa2e2c3, fffffd7fffdfedd0)
    libzpool.so.1`zcp_pool_error+0x59(fffffd7fffdfedd0, 1e0f450)
    libzpool.so.1`zcp_eval+0x6f8(1e0f450, fffffd7ffaa483f8, 1, 0, 6400000, 1d33b30)
    libzpool.so.1`dsl_destroy_snapshots_nvl+0x12c(2786b60, 0, 484750)
    libzpool.so.1`dsl_destroy_snapshot+0x4f(fffffd7fffdfef70, 0)
    ztest_dsl_dataset_cleanup+0xea(fffffd7fffdff4c0, 1)
    ztest_dataset_destroy+0x53(1)
    ztest_run+0x59f(fffffd7fff0e0498)
    main+0x7ff(1, fffffd7fffdffa88)
    _start+0x6c()

The problem is that zcp_convert_return_values() assumes that there's
exactly one value on the stack, but that isn't always true. It ends up
putting the wrong thing on the stack which is then consumed by
zcp_convert_return values, which either adds the wrong message to the
nvlist, or blows up.

The fix is to make sure that callers of zcp_convert_return_values()
clear the stack before pushing their error message, and
zcp_convert_return_values() should VERIFY that the stack is the expected
size.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9424
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb7e57429
Closes #7696
2018-07-10 21:29:23 -07:00
Matthew Ahrens
00c405b4b5 OpenZFS 9454 - ::zfs_blkstats should count embedded blocks
When we do a scrub or resilver, ZFS counts the different types of blocks,
which can be printed by the ::zfs_blkstats mdb dcmd. However, it fails to
count embedded blocks.

Porting notes:
* Commit d4a72f23 moved count_blocks under a BP_IS_EMBEDDED conditional
  as part of the sequential resilver functionality.  Since phys_birth
  would be zero that case should never happen as described above.  This
  is confirmed by the code coverage analysis.  Remove the conditional
  to realign that aspect of this function with OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9454
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d8a447a7
Closes #7697
2018-07-10 10:41:38 -07:00
Prakash Surya
ab11916583 OpenZFS 9456 - ztest failure in zil_commit_waiter_timeout
Problem
=======

Illumos bug 8373 was integrated, which now presents a code path where
"dmu_tx_assign" can fail.  When "dmu_tx_assign" fails, it will not issue
the lwb that was passed in to "zil_lwb_write_issue".  As a result, when
"zil_lwb_write_issue" returns, the lwb will still be in the "opened"
state, just as it was when "zil_lwb_write_issue" was originally called.

Solution
========

As a result of this new call path, the failed assertion needs to be
modified to be aware of this new possibility. Thus, we can only assert
that the lwb is no longer in the "opened" state if the returned lwb is
non-null, since we cannot differentiate between the case of
"dmu_tx_assign" failing or "zio_alloc_zil" failing within the call to
"zil_lwb_write_issue".

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9456
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8b09f4e
Closes #7695
2018-07-10 10:25:14 -07:00
Serapheim Dimitropoulos
a7ed98d8b5 OpenZFS 9330 - stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.

The problem can be seen by attempting to create a dataset using the -p
option with many levels:

    panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff

    fffffffffbc3aa60 unix:die+100 ()
    fffffffffbc3ab70 unix:trap+157d ()
    ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
    ffffff00083d7050 zfs:dbuf_rele+2e ()
    ...
    ffffff00083d7080 zfs:dsl_dir_close+32 ()
    ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
    ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
    ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
    ffffff00083d7130 zfs:dbuf_rele+2e ()
    ... The block above repeats once per directory in the ...
    ... create -p command, working towards the root ...
    ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
    ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
    ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
    ffffff00083dbaa0 zfs:findfunc+23 ()
    ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
    ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
    ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
    ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
    ffffff00083dbd00 genunix:cdev_ioctl+45 ()
    ffffff00083dbd40 specfs:spec_ioctl+5a ()
    ffffff00083dbdc0 genunix:fop_ioctl+7b ()
    ffffff00083dbec0 genunix:ioctl+18e ()
    ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()

Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K.  Enough time has passed
  that 16K can be reasonably assumed to be the default value.  It
  was increased in the 3.15 kernel released in June of 2014.

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>

OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a
Closes #7681
2018-07-09 13:02:50 -07:00
Serapheim Dimitropoulos
4d044c4c1d OpenZFS 9238 - ZFS Spacemap Encoding V2
Motivation
==========

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
    This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
    device removal, zpool checkpoint) we've started imposing limits in the
    vdevs that can be used with them based on the maximum addressable offset
    (currently 64PB for a top-level vdev).

New encoding
============

The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.

Other references:
=================

The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
2018-07-05 12:02:34 -07:00
Tom Caputi
370bbf66ae Fix coverity defects: CID 176037
CID 176037: Uninitialized scalar variable

This patch fixes an uninitialized variable defect caught by
coverity and introduced in 69830602

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7667
2018-07-02 13:37:48 -07:00
Tom Caputi
da2feb42fb Fix 'zfs recv' of non large_dnode send streams
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that that a value of 0 is treated as
DNODE_MIN_SLOTS.

Tested-by:  DHE <git@dehacked.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7617 
Closes #7662
2018-06-28 14:55:11 -07:00
Tom Caputi
69830602de Raw receive fix and encrypted objset security fix
This patch fixes two problems with the encryption code. First, the
current code does not correctly prohibit the DMU from updating
dn_maxblkid during object truncation within a raw receive. This
usually only causes issues when the truncating DRR_FREE record is
aggregated with DRR_FREE records later in the receive, so it is
relatively hard to hit.

Second, this patch fixes a security issue where reading blocks
within an encrypted object did not guarantee that the dnode block
itself had ever been verified against its MAC. Usually the
verification happened anyway when the bonus buffer was read, but
some use cases (notably zvols) might never perform the check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7632
2018-06-28 09:20:34 -07:00
Serapheim Dimitropoulos
d2734cce68 OpenZFS 9166 - zfs storage pool checkpoint
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in big block comment of
spa_checkpoint.c

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames members of "struct metaslab trees to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defauls to 4KB. This is ok for metaslab
  space maps in terms of bandwirdth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_chedkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Porting notes:

* The part of dsl_scan_sync() which handles async destroys has
  been moved into the new dsl_process_async_destroys() function.

* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
  to block device backed pools.

* ZTS:
  * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

  * Don't use large dd block sizes on /dev/urandom under Linux in
    checkpoint_capacity.

  * Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
    SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
    its attempts to fill the pool

  * Create the base and nested pools with sync=disabled to speed up
    the "setup" phase.

  * Clear labels in test pool between checkpoint tests to avoid
    duplicate pool issues.

  * The import_rewind_device_replaced test has been marked as "known
    to fail" for the reasons listed in its DISCLAIMER.

  * New module parameters:

      zfs_spa_discard_memory_limit,
      zfs_remove_max_bytes_pause (not documented - debugging only)
      vdev_max_ms_count (formerly metaslabs_per_vdev)
      vdev_min_ms_count

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8
Closes #7570
2018-06-26 10:07:42 -07:00
Serapheim Dimitropoulos
7637ef8d23 OpenZFS 9591 - ms_shift can be incorrectly changed
ms_shift can be incorrectly changed changed in MOS config for
indirect vdevs that have been historically expanded

According to spa_config_update() we expect new vdevs to have
vdev_ms_array equal to 0 and then we go ahead and set their metaslab
size. The problem is that indirect vdevs also have vdev_ms_array == 0
because their metaslabs are destroyed once their removal is done.

As a result, if a vdev was expanded and then removed may have its
ms_shift changed if another vdev was added after its removal.
Fortunately this behavior does not cause any type of crash or bad
behavior in the kernel but it can confuse zdb and anyone doing any kind
of analysis of the history of the pools.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/pull/651
OpenZFS-issue: https://illumos.org/issues/9591a
External-issue: DLPX-58879
Closes #7644
2018-06-21 09:35:26 -07:00
Matthew Ahrens
af43029484 Remove suffix from zio taskq names
For zio taskq's which have multiple instances (e.g. z_rd_int_0,
z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix).
This makes performance analysis more difficult, because by default,
`perf` includes the thread name (which is the same as the taskq name) in
the stack trace.  This means that we get 8 different stacks, all of
which are doing the same thing, but are executed from different taskq's.

We should remove the suffix of the taskq name, so that all the
read-interrupt threads are named z_rd_int.

Note that we already support multiple taskq's with the same name.  This
happens when there are multiple pools.  In this case the taskq has a
different tq_instance, which shows up in /proc/spl/taskq-all.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7646
2018-06-20 14:07:50 -07:00
Brian Behlendorf
1c38ac61e1
Linux 4.14 compat: blk_queue_stackable()
The blk_queue_stackable() function was replaced in the 4.14 kernel
by queue_is_rq_based(), commit torvalds/linux@5fdee212.  This change
resulted in the default elevator being used which can negatively
impact performance.

Rather than adding additional compatibility code to detect the
new interface unconditionally attempt to set the elevator.  Since
we expect this to fail for block devices without an elevator the
error message has been moved in to zfs_dbgmsg().

Finally, it was observed that the elevator_change() was removed
from the 4.12 kernel, commit torvalds/linux@c033269.  Update the
comment to clearly specify which are expected to export the
elevator_change() symbol.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7645
2018-06-19 21:52:45 -07:00
Brian Behlendorf
6413c95fbd
Linux 4.18 compat: inode timespec -> timespec64
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe.  As part of this change the current_time() function was
also updated to return the timespec64 type.

Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode.  It should
be used when working with inode timestamps to ensure matching types.

The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t.  Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.

Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other.  They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7643
2018-06-19 21:51:18 -07:00
Tom Caputi
cd32e5db8b Add ASSERT to debug encryption key mapping issues
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7637
2018-06-18 14:10:54 -07:00
John Gallagher
917f475fba Add tunables for channel programs
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes #7618
2018-06-15 15:10:42 -07:00
Brian Behlendorf
7b98f0d91f
Linux compat 4.18: check_disk_size_change()
Added support for the bops->check_events() interface which was
added in the 2.6.38 kernel to replace bops->media_changed().
Fully implementing this functionality allows the volume resize
code to rely on revalidate_disk(), which is the preferred
mechanism, and removes the need to use check_disk_size_change().

In order for bops->check_events() to lookup the zvol_state_t
stored in the disk->private_data the zvol_state_lock needs to
be held.  Since the check events interface may poll the mutex
has been converted to a rwlock for better concurrently.  The
rwlock need only be taken as a writer in the zvol_free() path
when disk->private_data is set to NULL.

The configure checks for the block_device_operations structure
were consolidated in a single kernel-block-device-operations.m4
file.

The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks
and assoicated dead code was removed.  This interface was added
to the 2.6.28 kernel which predates the oldest supported 2.6.32
kernel and will therefore always be available.

Updated maximum Linux version in META file.  The 4.17 kernel
was released on 2018-06-03 and ZoL is compatible with the
finalized kernel.

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7611
2018-06-15 15:05:21 -07:00
Matthew Ahrens
1fac63e56f OpenZFS 9577 - remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary -
we can instead pass a flag down in a few places to prevent recursive
dbuf eviction. Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing
   TSD values (by setting to NULL vs non-NULL) is expensive, and
   we do it very often.
3. According to Nexenta, the current semantics can cause a
   deadlock when concurrently calling dmu_objset_evict_dbufs()
   (which is rare today, but they are working on a "parallel
   unmount" change that triggers this more easily):

Porting Notes:
* Minor conflict with OpenZFS 9337 which has not yet been ported.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9577
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645
External-issue: DLPX-58547
Closes #7602
2018-06-13 11:05:06 -07:00
Paul Zuchowski
2ffd89fcb9 Wrong error message when removing log device
In the case where the pool is loaded without the crypto
keys necessary to playback the intent log, and log device
removal is attempted, a generic busy message is received.
Change the message to inform the user that the datasets
must be mounted.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7518
2018-06-07 18:07:29 -07:00
Brian Behlendorf
174bcd581d
Fix preemptible warning in aggsum_add()
In the new aggsum counters the CPU_SEQID macro should be surrounded by
kpreempt_disable)() and kpreempt_enable() calls to prevent a Linux
kernel BUG warning.  The addsum_add() function use the cpuid to
minimize lock contention when selecting a bucket, after selection
the bucket is protected by a mutex and it is safe to reschedule the
process to a different processor at any time.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7609
Closes #7610
2018-06-07 15:55:11 -07:00
Nathaniel Clark
fba33c3819 Don't panic on bad SA_MAGIC in sa_build_index
If sa_build_index() encounters a corrupt buffer, don't panic.
Add info to zfs ring buffer and return EIO.  This allows for a cleaner
error recovery path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Issue #6500 
Closes #7487
2018-06-07 09:51:56 -07:00
Tom Caputi
b405837a6c Update the correct abd in l2arc_read_done()
This patch fixes an issue where l2arc_read_done() would always
write data to b_pabd, even if raw encrypted data was requested.
This only occured in cases where the L2ARC device had a different
ashift than the main pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7586 
Closes #7593
2018-06-06 10:17:50 -07:00
Tom Caputi
e7504d7a18 Raw receive functions must not decrypt data
This patch fixes a small bug found where receive_spill() sometimes
attempted to decrypt spill blocks when doing a raw receive. In
addition, this patch fixes another small issue in arc_buf_fill()'s
error handling where a decryption failure (which could be caused by
the first bug) would attempt to set the arc header's IO_ERROR flag
without holding the header's lock.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7564 
Closes #7584 
Closes #7592
2018-06-06 10:16:41 -07:00
Paul Dagnelie
37fb3e4318 OpenZFS 8484 - Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Paul Dagnelie <pcd@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7028a8b92b7
Issue #3752
Closes #7462
2018-06-06 09:35:59 -07:00
Tony Hutter
f0ed6c7448 Add pool state /proc entry, "SUSPENDED" pools
1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7331 
Closes #7563
2018-06-06 09:33:54 -07:00
Serapheim Dimitropoulos
e48afbc4eb OpenZFS 9464 - txg_kick() fails to see that we are quiescing
txg_kick() fails to see that we are quiescing, forcing transactions to
their next stages without leaving them accumulate changes

Creating a fragmented pool in a DCenter VM and continuously writing to it with
multiple instances of randwritecomp, we get the following output from txg.d:

    0ms   311MB in  4114ms (95% p1)  75MB/s  544MB (76%)  336us   153ms     0ms
    0ms     8MB in    51ms ( 0% p1) 163MB/s  474MB (66%)  129us    34ms     0ms
    0ms   366MB in  4454ms (93% p1)  82MB/s  572MB (79%)  498us    20ms     0ms
    0ms   406MB in  5212ms (95% p1)  77MB/s  591MB (82%)  661us    37ms     0ms
    0ms   340MB in  5110ms (94% p1)  66MB/s  622MB (86%) 1048us    41ms     1ms
    0ms     3MB in    61ms ( 0% p1)  51MB/s  419MB (58%)   33us     0ms     0ms
    0ms   361MB in  3555ms (88% p1) 101MB/s  542MB (75%)  335us    40ms     0ms
    0ms   356MB in  4592ms (92% p1)  77MB/s  561MB (78%)  430us    89ms     1ms
    0ms    11MB in   129ms (13% p1)  90MB/s  507MB (70%)  222us    15ms     0ms
    0ms   281MB in  2520ms (89% p1) 111MB/s  542MB (75%)  334us    42ms     0ms
    0ms   383MB in  3666ms (91% p1) 104MB/s  557MB (77%)  411us   133ms     0ms
    0ms   404MB in  5757ms (94% p1)  70MB/s  635MB (88%) 1274us   123ms     2ms
    4ms   367MB in  4172ms (89% p1)  88MB/s  556MB (77%)  401us    51ms     0ms
    0ms    42MB in   470ms (44% p1)  90MB/s  557MB (77%)  412us    43ms     0ms
    0ms   261MB in  2273ms (88% p1) 114MB/s  556MB (77%)  407us    27ms     0ms
    0ms   394MB in  3646ms (85% p1) 108MB/s  552MB (77%)  393us   304ms     0ms
    0ms   275MB in  2416ms (89% p1) 113MB/s  510MB (71%)  200us    53ms     0ms
    0ms     9MB in    53ms ( 0% p1) 169MB/s  483MB (67%)  140us   100ms     1ms

The TXGs that are getting synced and don't have lots of changes are pushed by
txg_kick() which basically forces the current open txg to get to the quiesced
state:

        if (tx->tx_syncing_txg == 0 &&
        tx->tx_quiesce_txg_waiting <= tx->tx_open_txg &&
        tx->tx_sync_txg_waiting <= tx->tx_synced_txg &&
        tx->tx_quiesced_txg <= tx->tx_synced_txg) {
        tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1;
        cv_broadcast(&tx->tx_quiesce_more_cv);
    }

The problem is that the above code doesn't check if we are currently quiescing
anything (only if a quiesce or a sync has been requested, ..etc) so the
following scenario can happen:

1] We have an open txg A that had enough dirty data (more than
   zfs_dirty_data_sync) and it was pushed to the quiesced state, and opened
   a new txg B. No txg is currently being synced.
2] Immediately after the opening of B, txg_kick() was run by some other write
   (and because of A's dirty data) and saw that we are not currently syncing
   any txg and no one has requested quiescing so it requests one by bumping
   tx_quiesce_txg_waiting and broadcasts the quiesce thread.
3] The quiesce thread just passed txg A to be synced and sees that a quiescing
   request has been sent to it so it immediately grabs B without letting it
   gather enough data, putting it in a quiesced state and opening a new txg C.

In this scenario txg B, is an example of how the entries of interest show up in
the txg.d output.

Ideally we would like txg_kick() to get triggered only when we are sure that
we are not syncing AND not quiescing any txg. This way we can kick an open TXG
to the quiescing state when we are sure that there is nothing going on and we
would benefit from the different states running concurrently.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9464
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1cd7635b
Closes #7587
2018-06-04 14:56:06 -07:00
Pavel Zakharov
8a393be353 OpenZFS 9235 - rename zpool_rewind_policy_t to zpool_load_policy_t
We want to be able to pass various settings during import/open of a
pool, which are not only related to rewind. Instead of adding a new
policy and duplicate a bunch of code, we should just rename
rewind_policy to a more generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we
want those flags not only for the import case, but also for the open
case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb)
which would allow zfs to open a pool when logs are missing.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44
Closes #7532
2018-06-04 14:54:20 -07:00
Matthew Ahrens
1a5b96b8ee OpenZFS 9329 - panic in zap_leaf_lookup() due to concurrent zapification
For the null pointer issue shown below, the solution is to initialize the
contents of the object before changing its type, so that concurrent accessors
will see it as non-zapified until it is ready for access via the ZAP.

    BAD TRAP: type=e (#pf Page fault) rp=ffffff00ff520440 addr=20 occurred
    in module "zfs" due to a NULL pointer dereference

    ffffff00ff520320 unix:die+df ()
    ffffff00ff520430 unix:trap+dc0 ()
    ffffff00ff520440 unix:cmntrap+e6 ()
    ffffff00ff520590 zfs:zap_leaf_lookup+46 ()
    ffffff00ff520640 zfs:fzap_lookup+a9 ()
    ffffff00ff5206e0 zfs:zap_lookup_norm+111 ()
    ffffff00ff520730 zfs:zap_contains+42 ()
    ffffff00ff520760 zfs:dsl_dataset_has_resume_receive_state+47 ()
    ffffff00ff520900 zfs:get_receive_resume_stats+3e ()
    ffffff00ff520a90 zfs:dsl_dataset_stats+262 ()
    ffffff00ff520ac0 zfs:dmu_objset_stats+2b ()
    ffffff00ff520b10 zfs:zfs_ioc_objset_stats_impl+64 ()
    ffffff00ff520b60 zfs:zfs_ioc_objset_stats+33 ()
    ffffff00ff520bd0 zfs:zfs_ioc_dataset_list_next+140 ()
    ffffff00ff520c80 zfs:zfsdev_ioctl+4d7 ()
    ffffff00ff520cc0 genunix:cdev_ioctl+39 ()
    ffffff00ff520d10 specfs:spec_ioctl+60 ()
    ffffff00ff520da0 genunix:fop_ioctl+55 ()
    ffffff00ff520ec0 genunix:ioctl+9b ()
    ffffff00ff520f10 unix:brand_sys_sysenter+1c9 ()

Porting Notes:
* DMU_OT_BYTESWAP conditional in zap_lockdir_impl() kept.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9329
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e8e0f97
Closes #7578
2018-05-31 10:53:49 -07:00
Matthew Ahrens
d2a12f9e2a OpenZFS 9328 - zap code can take advantage of c99
The ZAP code was written before we allowed c99 in the Solaris kernel. We
should change it to take advantage of being able to declare variables where
they are first used. This reduces variable scope and means less scrolling
to find the type of variables.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9328
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/76ead05
Closes #7578
2018-05-31 10:53:11 -07:00
Sara Hartse
74d42600d8 zpool reopen should detect expanded devices
Update bdev_capacity to have wholedisk vdevs query the
size of the underlying block device (correcting for the size
of the efi parition and partition alignment) and therefore detect
expanded space.

Correct vdev_get_stats_ex so that the expandsize is aligned
to metaslab size and new space is only reported if it is large
enough for a new metaslab.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes #7546 
Issue #7582
2018-05-31 10:36:37 -07:00
Tony Hutter
c26cf0966d Fix zio->io_priority failed (7 < 6) assert
This fixes an assert in vdev_queue_change_io_priority():

  VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
  PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7566 
Closes #7542
2018-05-29 18:13:48 -07:00
Brian Behlendorf
93ce2b4ca5 Update build system and packaging
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.

Build system and packaging:
  * Renamed SPL_* autoconf m4 macros to ZFS_*.
  * Removed redundant SPL_* autoconf m4 macros.
  * Updated the RPM spec files to remove SPL package dependency.
  * The zfs package obsoletes the spl package, and the zfs-kmod
    package obsoletes the spl-kmod package.
  * The zfs-kmod-devel* packages were updated to add compatibility
    symlinks under /usr/src/spl-x.y.z until all dependent packages
    can be updated.  They will be removed in a future release.
  * Updated copy-builtin script for in-kernel builds.
  * Updated DKMS package to include the spl.ko.
  * Updated stale AUTHORS file to include all contributors.
  * Updated stale COPYRIGHT and included the SPL as an exception.
  * Renamed README.markdown to README.md
  * Renamed OPENSOLARIS.LICENSE to LICENSE.
  * Renamed DISCLAIMER to NOTICE.

Required code changes:
  * Removed redundant HAVE_SPL macro.
  * Removed _BOOT from nvpairs since it doesn't apply for Linux.
  * Initial header cleanup (removal of empty headers, refactoring).
  * Remove SPL repository clone/build from zimport.sh.
  * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
    to build issues when forcing C99 compilation.
  * Replaced legacy ACCESS_ONCE with READ_ONCE.
  * Include needed headers for `current` and `EXPORT_SYMBOL`.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7556
2018-05-29 16:00:33 -07:00
Matthew Ahrens
0dc2f70c5c OpenZFS 9486 - reduce memory used by device removal on fragmented pools
Device removal allocates a new location for each allocated segment on
the disk that's being removed.  Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location.  When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table.  In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.

We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries.  For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts).  In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use).  We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.

This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible.  But it results in drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed.  Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

Porting notes:

* Add the following as module parameters:
    * zfs_condense_indirect_vdevs_enable
    * zfs_condense_max_obsolete_bytes

* Document the following module parameters:
   * zfs_condense_indirect_vdevs_enable
   * zfs_condense_max_obsolete_bytes
   * zfs_condense_min_mapping_bytes

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c
External-issue: DLPX-57962
Closes #7536
2018-05-24 10:18:07 -07:00
Pavel Zakharov
38a19edd34 OpenZFS 9189 - Add debug to vdev_label_read_config when txg check fails
These changes were added to help debug issue #9187.

Essentially, in the original bug, vdev_validate() seems to fails in
vdev_label_read_config() and prints "failed reading config". This could
happen because either:
1. The labels are actually corrupt and zio_wait() fails for all of them
2. The labels were discarded because they didn't pass the txg check.

Beyond 9187, having debug info when case 2 happens could be useful in
other scenarios, such as zpool import.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by:  Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9189
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f6af1b7
Closes #7533
2018-05-14 14:32:49 -04:00
Pavel Zakharov
db7d07e14b OpenZFS 9191 - dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices
Add vdev_print_tree() in spa_check_for_missing_logs() when some log
devices are missing to ease debugging

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9191
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5c02e5
Closes #7531
2018-05-14 14:30:52 -04:00
Pavel Zakharov
a11c7aaec9 OpenZFS 9187 - racing condition between vdev label and spa_last_synced_txg in vdev_validate
ztest failed with uncorrectable IO error despite having the fix for
7163.  Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also
distinguishes it from that issue.

Definitely seems like a racing condition between the vdev_validate
and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to
   spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.

Solution: do not check txg in vdev_validate unless config lock is held.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9187
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/805fda72
Closes #7529
2018-05-14 14:28:09 -04:00
Olaf Faaland
bc5f51c5de module param callbacks check for initialized spa
Callbacks provided for module parameters are executed both
after the module is loaded, when a user alters it via sysfs, e.g
	echo bar > /sys/modules/zfs/parameters/foo

as well as when the module is loaded with an argument, e.g.
	modprobe zfs foo=bar

In the latter case, the init functions likely have not run yet,
including spa_init() which initializes the namespace lock so it is safe
to use.

Instead of immediately taking the namespace lock and attemping to
iterate over initialized spa structures, check whether spa_mode_global
is nonzero.  This is set by spa_init() after it has initialized the
namespace lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7496 
Closes #7521
2018-05-11 12:46:07 -07:00
Tim Chase
d1043e2f6d Unify behavior of deadman parameters
The zfs_deadman_failmode, zfs_deadman_ziotime_ms and
zfs_deadman_synctime_ms paramaters are stored per-pool.  However,
only the zfs_deadman_failmode updates the per-pool state when it's
change.  This patch gives adds the same behavior to the other two
for consistency.

Also, in all 3 three cases, only update the per-pool parameters
if spa_init() has actually been called in order to avoid panicking
when trying to take a lock on the spa_namespace_lock mutex.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7499
2018-05-08 21:45:47 -07:00
Tim Chase
a0ad7ca54e Clear vdev_faulted
Clear vdev_faulted if ZPOOL_CONFIG_AUX_STATE is not set to "external"

ZoL supports "zpool export -f" (force fault), which can be combined
with "-t" (temporary fault; don't persist across export/import) and
causes a MOS configuration to be set with ZPOOL_CONFIG_FAULTED=1
and without ZFS_CONFIG_AUX_STATE set at all.  In this case, the
previously-offlined vdev should be imported in an on-line state and.
Clearing the "vdev_faulted" flag causes the import to treat the
device as on-line.  Typically, resilver will catch it up based on
its DTL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7459
2018-05-08 21:39:50 -07:00
Pavel Zakharov
6cb8e5306d OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

	7638 Refactor spa_load_impl into several functions
	8961 SPA load/import should tell us why it failed
	7277 zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Porting notes (ZTS):
* Fix 'make dist' target in zpool_import

* The maximum path length allowed by tar is 99 characters.  Several
  of the new test cases exceeded this limit resulting in them not
  being included in the tarball.  Shorten the names slightly.

* Set/get tunables using accessor functions.

* Get last synced txg via the "zfs_txg_history" mechanism.

* Clear zinject handlers in cleanup for import_cache_device_replaced
  and import_rewind_device_replaced in order that the zpool can be
  exported if there is an error.

* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
  ext4 file system to be created on ZFS_DISK2.  Also, there's
  no need to partition ZFS_DISK2 at all.  The partitioning had
  already been disabled for multipath devices.  Among other things,
  the partitioning steals some space from the ext4 file system,
  makes it difficult to accurately calculate the paramters to
  parted and can make some of the tests fail.

* Increase FS_SIZE and FILE_SIZE in the zpool_import test
  configuration now that FILESIZE is larger.

* Write more data in order that device evacuation take lonnger in
  a couple tests.

* Use mkdir -p to avoid errors when the directory already exists.

* Remove use of sudo in import_rewind_config_changed.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2018-05-08 21:35:27 -07:00
Pavel Zakharov
afd2f7b711 OpenZFS 8962 - zdb should work on non-idle pools
Currently `zdb` consistently fails to examine non-idle pools as it
fails during the `spa_load()` process. The main problem seems to be
that `spa_load_verify()` fails as can be seen below:

    $ sudo zdb -d -G dcenter
    zdb: can't open 'dcenter': I/O error

    ZFS_DBGMSG(zdb):
    spa_open_common: opening dcenter
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
    spa_load(dcenter): using uberblock with txg=40824950
    spa_load(dcenter): UNLOADING
    spa_load(dcenter): RELOADING
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
    spa_load(dcenter): using uberblock with txg=40824952
    spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
    spa_load(dcenter): UNLOADING

This change makes `spa_load_verify()` a dryrun when ran from
`zdb`. This is done by creating a global flag in zfs and then setting
it in `zdb`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/180ad792
Closes #7459
2018-05-08 21:32:57 -07:00
Pavel Zakharov
4a0ee12af8 OpenZFS 8961 - SPA load/import should tell us why it failed
Problem
=======

When we fail to open or import a storage pool, we typically don't
get any additional diagnostic information, just "no pool found" or
"can not import".

While there may be no additional user-consumable information, we should
at least make this situation easier to debug/diagnose for developers
and support.  For example, we could start by using `zfs_dbgmsg()`
to log each thing that we try when importing, and which things
failed. E.g. "tried uberblock of txg X from label Y of device Z". Also,
we could log each of the stages that we go through in `spa_load_impl()`.

Solution
========

Following the cleanup to `spa_load_impl()`, debug messages have been
added to every point of failure in that function. Additionally,
debug messages have been added to strategic places, such as
`vdev_disk_open()`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8961
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/418079e0
Closes #7459
2018-05-08 21:30:10 -07:00
Paul Dagnelie
ca0845d59e OpenZFS 9256 - zfs send space estimation off by > 10% on some datasets
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
* Added tuning to man page.
* Test case changes dropped, default behavior unchanged.

OpenZFS-issue: https://www.illumos.org/issues/9256
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/32356b3c56
Closes #7470
2018-05-08 08:59:24 -07:00
LOLi
4ceb8dd6fd Fix 'zpool create -t <tempname>'
Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.

Fix this by specifying the correct pool name when setting the dataset
properties.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7502 
Closes #7509
2018-05-07 21:11:58 -07:00
Brian Behlendorf
c02c1becce
ZTS: Re-enable MMP tests
Commit 7fab6361 inadvertently disabled the MMP test cases by creating
and not removing an /etc/hostid file in the new zpool_split_props test
case.  When the file exists the ZTS skips the entire MMP test group
rather than modify what may be a system which is already configured.
Update the test case to remove the file.

Additionally, because the MMP tests were disabled a regression slipped
in as part of commit 9eb7b46ed0.  Fix it.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7514
2018-05-07 21:08:33 -07:00
Paul Dagnelie
64c1dcefe3 OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
     the deleted queue

It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.

We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.

Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.

We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:

1. When putting an object on the delete queue, change it's parent object
   number to a known constant that means NULL.

2. When displaying objects, first check if it is present on the delete
   queue.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2018-05-04 10:50:24 -07:00
Matthew Ahrens
5e097c67f1 OpenZFS 9443 - panic when scrub a v10 pool
While expanding stored pools, we ran into a panic using an old pool.

Steps to reproduce:

    $ sudo zpool create -o version=2 test c2t1d0
    $ sudo cp /etc/passwd /test/foo
    $ sudo zpool attach test c2t1d0 c2t2d0

We'll get this panic:

    ffffff000fc0e5e0 unix:real_mode_stop_cpu_stage2_end+b27c ()
    ffffff000fc0e6f0 unix:trap+dc8 ()
    ffffff000fc0e700 unix:cmntrap+e6 ()
    ffffff000fc0e860 zfs:dsl_scan_visitds+1ff ()
    ffffff000fc0ea20 zfs:dsl_scan_visit+fe ()
    ffffff000fc0ea80 zfs:dsl_scan_sync+1b3 ()
    ffffff000fc0eb60 zfs:spa_sync+435 ()
    ffffff000fc0ec20 zfs:txg_sync_thread+23f ()
    ffffff000fc0ec30 unix:thread_start+8 ()

The problem is a bad trap accessing a NULL pointer. We're looking for
the dp_origin_snap of a dsl_pool_t, but version 2 didn't have that. The
system will go into a reboot loop at this point, and the dump won't be
accessible except by removing the cache file from within the recovery
environment.

This impacts any sort of scrub or resilver on version <11 pools, e.g.:

    $ zpool create -o version=10 test c2t1d0
    $ zpool scrub test

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9443
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/010eed29
Closes #7501
2018-05-04 10:47:10 -07:00
Tom Caputi
be9a5c355c Add support for decryption faults in zinject
This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.

This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7474
2018-05-02 15:36:20 -07:00
Brian Behlendorf
9464b9591e
RHEL 7.5 compat: FMODE_KABI_ITERATE
As of RHEL 7.5 the mainline fops.iterate() method was added to
the file_operations structure and is correctly detected by the
configure script.

Normally this is what we want, but in order to maintain KABI
compatibility the RHEL change additionally does the following:

* Requires that callers intending to use this extended interface
  set the FMODE_KABI_ITERATE flag on the file structure when
  opening the directory.
* Adds the fops.iterate() method to the end of the structure,
  without removing fops.readdir().

This change updates the configure check to ignore the RHEL 7.5+
variant of fops.iterate() when detected.  Instead fallback to
the fops.readdir() interface which will be available.

Finally, add the 'zpl_' prefix to the directory context wrappers
to avoid colliding with the kernel provided symbols when both
the fops.iterate() and fops.readdir() are provided by the kernel.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7460 
Closes #7463
2018-05-02 15:01:24 -07:00
Brian Behlendorf
bc8a6a60e9
Fix inst_num overflow in qat_crypt.c
This patch fixes the same issue which was previously addressed in
6051.  The variable "inst_num" was of the incorrect type and
"atomic_inc_32_nv()" could cause an overflow damaging its neighbor.

Cast the return value of atomic_inc_32_nv() to Cpa32U.

Fix a few types for num_inst for clarity.

Reviewed-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7468
2018-05-01 20:44:24 -07:00
Tom Caputi
2c24b5b148 Fix issues found with zfs diff
Two deadlocks / ASSERT failures were introduced in a2c2ed1b which
would occur whenever arc_buf_fill() failed to decrypt a block of
data. This occurred because the call to arc_buf_destroy() which
was responsible for cleaning up the newly created buffer would
attempt to take out the hdr lock that it was already holding. This
was resolved by calling the underlying functions directly without
retaking the lock.

In addition, the dmu_diff() code did not properly ensure that keys
were loaded and mapped before begining dataset traversal. It turns
out that this code does not need to look at any encrypted values,
so the code was altered to perform raw IO only.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7354 
Closes #7456
2018-05-01 11:24:20 -07:00
Tomohiro Kusumi
d6133fc500 Silence compile-time warning on unused variable
ASSERT3U() could be NOP which then leads to having unused pointer *spa.

metaslab.c: In function 'metaslab_condense':
metaslab.c:2075:9: warning: unused variable 'spa' [-Wunused-variable]
  spa_t *spa = msp->ms_group->mg_vd->vdev_spa;

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7489
2018-05-01 11:15:54 -07:00
loli10K
85ce3f4fd1 Adopt pyzfs from ClusterHQ
This commit introduces several changes:

 * Update LICENSE and project information

 * Give a good PEP8 talk to existing Python source code

 * Add RPM/DEB packaging for pyzfs

 * Fix some outstanding issues with the existing pyzfs code caused by
   changes in the ABI since the last time the code was updated

 * Integrate pyzfs Python unittest with the ZFS Test Suite

 * Add missing libzfs_core functions: lzc_change_key,
   lzc_channel_program, lzc_channel_program_nosync, lzc_load_key,
   lzc_receive_one, lzc_receive_resumable, lzc_receive_with_cmdprops,
   lzc_receive_with_header, lzc_reopen, lzc_send_resume, lzc_sync,
   lzc_unload_key, lzc_remap

Note: this commit slightly changes zfs_ioc_unload_key() ABI. This allow
to differentiate the case where we tried to unload a key on a
non-existing dataset (ENOENT) from the situation where a dataset has
no key loaded: this is consistent with the "change" case where trying
to zfs_ioc_change_key() from a dataset with no key results in EACCES.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7230
2018-05-01 10:33:35 -07:00
Alexander Motin
20507534d4 OpenZFS 9434 - Speculative prefetch is blocked by device removal code
Device removal code does not set spa_indirect_vdevs_loaded for pools
that never experienced device removal.  At least one visual consequence
of it is completely blocked speculative prefetcher.  This patch sets
the variable in such situations.

Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9434
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/16127b627b
Closes #7480
2018-04-30 13:05:55 -07:00
Matthew Ahrens
964c2d69a9 OpenZFS 9236 - nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and
rare enough to always log. It's possible that the message in
zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Patch Notes:
* Removed ZFS_DEBUG_SPA from zfs-module-parameters.5

OpenZFS-issue: https://www.illumos.org/issues/9236
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668
Closes #7467
2018-04-30 10:19:48 -07:00
Mark Wright
089500e792 Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build
Fix build errors with gcc 7.3.0 on Gentoo with kernel 4.16.3
built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as:

module/zfs/vdev_indirect.c:296:2: error:
positional initialization of field in ‘struct’ declared with
‘designated_init’ attribute [-Werror=designated-init]
  vdev_indirect_map_free,
  ^~~~~~~~~~~~~~~~~~~~~~

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Mark Wright <gienah@gentoo.org>
Closes #7464
2018-04-20 09:53:25 -07:00
Chunwei Chen
599b864813 Fix ENOSPC in "Handle zap_add() failures in ..."
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7401 
Closes #7421
2018-04-18 14:19:50 -07:00
Tom Caputi
b0ee5946aa Fix issues with raw sends of spill blocks
This patch fixes 2 issues in how spill blocks are processed during
raw sends. The first problem is that compressed spill blocks were
using the logical length rather than the physical length to
determine how much data to dump into the send stream. The second
issue is a typo that caused the spill record's object number to be
used where the objset's ID number was required. Both issues have
been corrected, and the payload_size is now printed in zstreamdump
for future debugging.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7378 
Closes #7432
2018-04-17 11:19:03 -07:00
Tom Caputi
e14a32b1c8 Fix object reclaim when using large dnodes
Currently, when the receive_object() code wants to reclaim an
object, it always assumes that the dnode is the legacy 512 bytes,
even when the incoming bonus buffer exceeds this length. This
causes a buffer overflow if --enable-debug is not provided and
triggers an ASSERT if it is. This patch resolves this issue and
adds an ASSERT to ensure this can't happen again.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7097
Closes #7433
2018-04-17 11:13:57 -07:00
Matthew Ahrens
0c03d21ac9 assertion in arc_release() during encrypted receive
In the existing code, when doing a raw (encrypted) zfs receive, 
we call arc_convert_to_raw() from open context. This creates a 
race condition between arc_release()/arc_change_state() and 
writing out the block from syncing context (arc_write_ready/done()).

This change makes it so that when we are doing a raw (encrypted) 
zfs receive, we save the crypt parameters (salt, iv, mac) of dnode 
blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw() 
from syncing context when writing out the block of dnodes.

Additionally, we can eliminate dr_raw and associated setters, and 
instead know that dnode blocks are always raw when doing a zfs 
receive (see the new field os_raw_receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7424 
Closes #7429
2018-04-17 11:06:54 -07:00
Matthew Ahrens
7f96cc23ac OpenZFS 9192 - explicitly pass good_writes to vdev_uberblock/label_sync
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.

OpenZFS-issue: https://www.illumos.org/issues/9192
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f4c0b602d
Closes #7446
2018-04-17 10:45:47 -07:00
Matt Ahrens
d830d4795a OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices
Authored by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9280
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c
Closes #7445
2018-04-17 10:44:50 -07:00
megari
d68ac65eb6 Revert "OpenZFS 9036 - zfs: duplicate 'const' declaration specifier"
This reverts commit cbb8933215.

The original change in OpenZFS 9036 did remove duplicate 'const'
specifiers, but the ZoL port had already done what *should* have been
done in OpenZFS 9036, which is to make the pointers themselves const.
The port of the change to ZoL ended up doing an unnecessary removal
of the constness of the pointers. Undo that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Closes #7444
2018-04-16 12:44:40 -07:00
Pavel Zakharov
9eb7b46ed0 OpenZFS 7638 - Refactor spa_load_impl into several functions
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7638
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fd3785ff6
Closes #7437
2018-04-16 12:24:23 -07:00
Tim Chase
5284f43a1e Avoid Linux hung task message in ZTHR
Use an interruptible to avoid Linux hung task message in
ZTHR and to prevent inflating the load average.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7440 
Closes #7441
2018-04-15 15:12:28 -07:00
Toomas Soome
5e567da987 OpenZFS 9213 - zfs: sytem typo
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The additional instances of this typo addressed in the OpenZFS
  patch were already resolved.

OpenZFS-issue: https://illumos.org/issues/9213
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edc8ef7d92
Closes #7436
2018-04-15 10:59:13 -07:00
Toomas Soome
cbb8933215 OpenZFS 9036 - zfs: duplicate 'const' declaration specifier
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9036
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f02c28e434
Closes #6900
2018-04-14 12:40:52 -07:00
Prakash Surya
eecdd8e884 OpenZFS 9084 - spa_*_ashift must ignore spare devices
It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74,
        ebddef68, 8992dc0, ebe10a00, fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9084
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/18acba7
Closes #6900
2018-04-14 12:40:52 -07:00
Serapheim Dimitropoulos
4bf8108ede OpenZFS 9080 - recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9080
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bdfded42e6
Closes #6900
2018-04-14 12:40:47 -07:00
Serapheim Dimitropoulos
9d5b524597 OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:

[1] Thread A is about to finish condesing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertions fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2018-04-14 12:23:53 -07:00
Brian Behlendorf
4589f3ae4c Optimize possible split block search space
Remove duplicate segment copies to minimize the possible search
space for reconstruction.  Once reduced an accurate assessment can
be made regarding the difficulty in reconstructing the block.

Also, ztest will now run zdb with
zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt
to avoid checksum errors.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6900
2018-04-14 12:22:43 -07:00
Matthew Ahrens
9e052db462 OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-04-14 12:21:39 -07:00
Matthew Ahrens
a1d477c24c OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility, the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refectoring is
  merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2018-04-14 12:16:17 -07:00
Seth Forshee
93b43af10d Allow mounting datasets more than once
Currently mounting an already mounted zfs dataset results in an
error, whereas it is typically allowed with other filesystems.
This causes some bad interactions with mount namespaces. Take
this sequence for example:

- Create a dataset
- Create a snapshot of the dataset
- Create a clone of the snapshot
- Create a new mount namespace
- Rename the original dataset

The rename results in unmounting and remounting the clone in the
original mount namespace, however the remount fails because the
dataset is still mounted in the new mount namespace. (Note that
this means the mount in the new mount namespace is never being
unmounted, so perhaps the unmount/remount of the clone isn't
actually necessary.)

The problem here is a result of the way mounting is implemented
in the kernel module. Since it is not mounting block devices it
uses mount_nodev() instead of the usual mount_bdev(). However,
mount_nodev() is written for filesystems for which each mount is
a new instance (i.e. a new super block), and zfs should be able
to detect when a mount request can be satisfied using an existing
super block.

Change zpl_mount() to call sget() directly with it's own test
callback. Passing the objset_t object as the fs data allows
checking if a superblock already exists for the dataset, and in
that case we just need to return a new reference for the sb's
root dentry.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Closes #5796
Closes #7207
2018-04-13 10:44:05 -07:00
beren12
7403d0743e Fix zfs_arc_max minimum tuning
When setting `zfs_arc_max` its minimum value is allowed
to be 64 MiB.  There was an off-by-1 error which can matter
on tiny systems.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Zubrzycki <github@mid-earth.net>
Closes #7417
2018-04-12 10:47:32 -07:00
Tom Caputi
edc1e713c2 Fix race in dnode_check_slots_free()
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7147 
Closes #7388
2018-04-10 11:15:05 -07:00
Giuseppe Di Natale
10f88c5cd5 Linux compat 4.16: blk_queue_flag_{set,clear}
queue_flag_{set,clear}_unlocked are now private interfaces in
the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14).
Use blk_queue_flag_{set,clear} interfaces which were introduced as
of https://github.com/torvalds/linux/commit/8814ce8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7410
2018-04-10 10:32:14 -07:00
Tony Hutter
4f301661df Revert "Handle zap_add() failures in mixed ... "
This reverts commit cc63068e95.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401 
Cloes #7416
2018-04-09 14:24:46 -07:00
Brian Behlendorf
3b0d99289a
Fix 'zfs send/recv' hang with 16M blocks
When using 16MB blocks the send/recv queue's aren't quite big
enough.  This change leaves the default 16M queue size which a
good value for most pools.  But it additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7365 
Closes #7404
2018-04-08 19:41:15 -07:00
Matthew Ahrens
5c27ec1088 Fixes for SNPRINTF_BLKPTR with encrypted BP's
mdb doesn't have dmu_ot[], so we need a different mechanism for its
SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated.

Additionally, since it already relies on BP_IS_ENCRYPTED (etc),
SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own,
rather than making the caller do so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7390
2018-04-06 13:30:26 -07:00
Olaf Faaland
0ba106e75c Fix divide-by-zero in mmp_delay_update()
vdev_count_leaves() in the denominator may return 0, caught by Coverity.
Introduced by

* 533ea04 Update mmp_delay on sync or skipped, failed write

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7391
2018-04-06 13:29:11 -07:00
Olaf Faaland
533ea0415b Update mmp_delay on sync or skipped, failed write
When an MMP write is skipped, or fails, and time since
mts->mmp_last_write is already greater than mts->mmp_delay, increase
mts->mmp_delay.  The original code only updated mts->mmp_delay when a
write succeeded, but this results in the write(s) after delays and
failed write(s) reporting an ub_mmp_delay which is too low.

Update mmp_last_write and mmp_delay if a txg sync was successful.  At
least one uberblock was written, thus extending the time we can be sure
the pool will not be imported by another host.

Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) /
vdev_count_leaves()) so that a period of frequent successful MMP writes,
e.g. due to frequent txg syncs, does not result in an import activity
check so short it is not reliable based on mmp thread writes alone.

Remove unnecessary local variable, start.  We do not use the start time
of the loop iteration.

Add a debug message in spa_activity_check() to allow verification of the
import_delay value and to prove the activity check occurred.

Alter the tests that import pools and attempt to detect an activity
check.  Calculate the expected duration of spa_activity_check() based on
module parameters at the time the import is performed, rather than a
fixed time set in mmp.cfg.  The fixed time may be wrong.  Also, use the
default zfs_multihost_interval value so the activity check is longer and
easier to recognize.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7330
2018-04-04 16:38:44 -07:00
Tony Hutter
21a4f5cc86 Fedora 28: Fix misc bounds check compiler warnings
Fix a bunch of (mostly) sprintf/snprintf truncation compiler
warnings that show up on Fedora 28 (GCC 8.0.1).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7361 
Closes #7368
2018-04-04 10:16:47 -07:00
LOLi
1724eb62de Fix spa reference leak in zfs_ioc_pool_scan
zfs_ioc_pool_scan leaks a spa reference when zc->zc_flags is not a
valid pool_scrub_cmd_t: this could happen if the userland binaries
and ZFS kernel module differ in version and would prevent the pool from
being exported.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7380
2018-04-03 17:31:30 -07:00
Tim Chase
10adee27ce Remove ASSERT() in l2arc_apply_transforms()
The ASSERT was erroneously copied from the next section of code.
The buffer's size should be expanded from "psize" to "asize"
if necessary.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7375
2018-03-31 15:14:21 -07:00
Tom Caputi
a2c2ed1bd4 Decryption error handling improvements
Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produces and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.

In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7372
2018-03-31 11:12:51 -07:00