Commit Graph

2111 Commits

Author SHA1 Message Date
Andriy Gapon
931bef81c8 zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets
Previously, the permissions were checked on the pool which was obviously
incorrect.

After this change, zfs_check_userprops() only validates the properties
without any permission checks.  The permissions are checked individually
for each snapshotted dataset.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9179
Closes #9180
2019-09-25 11:27:50 -07:00
Tom Caputi
95319fc569 Fix deadlock in 'zfs rollback'
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.

This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9203
2019-09-25 11:27:50 -07:00
Tony Hutter
023ab67a64 Linux 5.3: Fix switch() fall though compiler errors
Fix some switch() fall-though compiler errors:

    abd.c:1504:9: error: this statement may fall through

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #9170
2019-09-25 11:27:50 -07:00
Chunwei Chen
569f5d5d05 Fix out-of-order ZIL txtype lost on hardlinked files
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #8769
Closes #9061
2019-09-25 11:27:50 -07:00
Matthew Ahrens
6c9882d5db Improve performance by using dmu_tx_hold_*_by_dnode()
In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode()
instead of dmu_tx_hold_*(), since we already have a dbuf from the target
dnode in hand.  This eliminates some calls to dnode_hold(), which can be
expensive.  This is especially impactful if several threads are
accessing objects that are in the same block of dnodes, because they
will contend for that dbuf's lock.

We are seeing 10-20% performance wins for the sequential_writes tests in
the performance test suite, when doing >=128K writes to files with
recordsize=8K.

This also removes some unnecessary casts that are in the area.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9081
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi
6c68594675 Implement secpolicy_vnode_setid_retain()
Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.

https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9035
Closes #9043
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi
4f951b183c Don't directly cast unsigned long to void*
Cast to uintptr_t first for portability on integer to/from pointer
conversion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9065
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi
65a0b28b42 Fix module_param() type for zfs_read_chunk_size
zfs_read_chunk_size is unsigned long.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9051
2019-09-25 11:27:50 -07:00
Serapheim Dimitropoulos
1c4b0fc745 Race condition between spa async threads and export
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).

This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9015
Closes #9044
2019-09-25 11:27:50 -07:00
Serapheim Dimitropoulos
bbbe4b0a98 hdr_recl calls zthr_wakeup() on destroyed zthr
There exists a race condition were hdr_recl() calls
zthr_wakeup() on a destroyed zthr. The timeline is the
following:

[1] hdr_recl() runs first and goes intro zthr_wakeup()
    because arc_initialized is set.
[2] arc_fini() is called by another thread, zeroes
    that flag, destroying the zthr, and goes into
    buf_init().
[3] hdr_recl() tries to enter the destroyed mutex
    and we blow up.

This patch ensures that the ARC's zthrs are not offloaded
any new work once arc_initialized is set and then destroys
them after all of the ARC state has been deleted.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9047
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi
3c144b9267 Fix wrong comment on zcr_blksz_{min,max}
These aren't tunable; illumos has this comment fixed in
"3742 zfs comments need cleaner, more consistent style",
so sync with that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9052
2019-09-25 11:27:50 -07:00
jdike
4c98586daf Fix lockdep recursive locking false positive in dbuf_destroy
lockdep reports a possible recursive lock in dbuf_destroy.

It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode.  However,
it is impossible for these to be the same dnode because,
among other things,dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.

This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.

In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.

This is the lockdep report:

[  122.950921] ============================================
[  122.950921] WARNING: possible recursive locking detected
[  122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G           O
[  122.950921] --------------------------------------------
[  122.950921] dbu_evict/1457 is trying to acquire lock:
[  122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]
               but task is already holding lock:
[  122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               other info that might help us debug this:
[  122.950921]  Possible unsafe locking scenario:

[  122.950921]        CPU0
[  122.950921]        ----
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]
                *** DEADLOCK ***

[  122.950921]  May be due to missing lock nesting notation

[  122.950921] 1 lock held by dbu_evict/1457:
[  122.950921]  #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               stack backtrace:
[  122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G           O      4.19.29-4.19.0-debug-d69edad5368c1166 #1
[  122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011  03/13/2009
[  122.950921] Call Trace:
[  122.950921]  dump_stack+0x91/0xeb
[  122.950921]  __lock_acquire+0x2ca7/0x4f10
[  122.950921]  lock_acquire+0x153/0x330
[  122.950921]  dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]  dbuf_evict_one+0x1cc/0x3d0 [zfs]
[  122.950921]  dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[  122.950921]  dnode_evict_dbufs+0x3a6/0x740 [zfs]
[  122.950921]  dmu_objset_evict+0x7a/0x500 [zfs]
[  122.950921]  dsl_dataset_evict_async+0x70/0x480 [zfs]
[  122.950921]  taskq_thread+0x979/0x1480 [spl]
[  122.950921]  kthread+0x2e7/0x3e0
[  122.950921]  ret_from_fork+0x27/0x50

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #8984
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi
2b9f73e5e6 Use zfsctl_snapshot_hold() wrapper
zfs_refcount_*() are to be wrapped by zfsctl_snapshot_*() in this file.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9039
2019-09-25 11:27:49 -07:00
Brian Behlendorf
446d08fba4 Fix get_special_prop() build failure
The cast of the size_t returned by strlcpy() to a uint64_t by the
VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE
is set.  This is due to the additional hardening.  Since the token
is expected to always fit in strval the VERIFY3U has been removed.
If somehow it doesn't, it will still be safely truncated.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8999
Closes #9020
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi
73e50a7d5d Drop redundant POSIX ACL check in zpl_init_acl()
ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(),
so no need to test again on POSIX ACL access.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9009
2019-09-25 11:27:49 -07:00
Brian Behlendorf
d751b12a9d Export dnode symbols
External consumers such as Lustre require access to the dnode
interfaces in order to correctly manipulate dnodes.

Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8994
Closes #9027
2019-09-25 11:27:49 -07:00
Tom Caputi
78831d4290 Ensure dsl_destroy_head() decrypts objsets
This patch corrects a small issue where the dsl_destroy_head()
code that runs when the async_destroy feature is disabled would
not properly decrypt the dataset before beginning processing.
If the dataset is not able to be decrypted, the optimization
code now simply does not run and the dataset is completely
destroyed in the DSL sync task.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9021
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi
0a223246e1 Disable unused pathname::pn_path* (unneeded in Linux)
struct pathname is originally from Solaris VFS, and it has been used
in ZoL to merely call VOP from Linux VFS interface without API change,
therefore pathname::pn_path* are unused and unneeded. Technically,
struct pathname is a wrapper for C string in ZoL.

Saves stack a bit on lookup and unlink.

(#if0'd members instead of removing since comments refer to them.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9025
2019-09-25 11:27:49 -07:00
Nick Mattis
cf966cb19a Fixes: #8934 Large kmem_alloc
Large allocation over the spl_kmem_alloc_warn value was being performed.
Switched to vmem_alloc interface as specified for large allocations.
Changed the subsequent frees to match.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: nmattis <nickm970@gmail.com>
Closes #8934
Closes #9011
2019-09-25 11:27:49 -07:00
Tom Caputi
1f72a18f59 Remove VERIFY from dsl_dataset_crypt_stats()
This patch fixes an issue where dsl_dataset_crypt_stats() would
VERIFY that it was able to hold the encryption root. This function
should instead silently continue without populating the related
field in the nvlist, as is the convention for this code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8976
2019-09-25 11:27:49 -07:00
Paul Zuchowski
14a11bf2f6 Improve "Unable to automount" error message.
Having the mountpoint and dataset name both in the message made it
confusing to read.  Additionally, convert this to a zfs_dbgmsg rather than
sending it to the console.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #8959
2019-09-25 11:27:49 -07:00
Brian Behlendorf
7a03d7c73c Check b_freeze_cksum under ZFS_DEBUG_MODIFY conditional
The b_freeze_cksum field can only have data when ZFS_DEBUG_MODIFY
is set.  Therefore, the EQUIV check must be wrapped accordingly.
For the same reason the ASSERT in arc_buf_fill() in unsafe.
However, since it's largely redundant it has simply been removed.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8979
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi
093bb64461 Don't use d_path() for automount mount point for chroot'd process
Chroot'd process fails to automount snapshots due to realpath(3)
failure in mount.zfs(8).

Construct a mount point path from sb of the ctldir inode and dirent
name, instead of from d_path(), so that chroot'd process doesn't get
affected by its view of fs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8903
Closes #8966
2019-09-25 11:27:48 -07:00
George Wilson
7d2489cfad nopwrites on dmu_sync-ed blocks can result in a panic
After device removal, performing nopwrites on a dmu_sync-ed block
will result in a panic. This panic can show up in two ways:
1. an attempt to issue an IOCTL in vdev_indirect_io_start()
2. a failed comparison of zio->io_bp and zio->io_bp_orig in
   zio_done()
To resolve both of these panics, nopwrites of blocks on indirect
vdevs should be ignored and new allocations should be performed on
concrete vdevs.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #8957
2019-09-25 11:27:48 -07:00
Alexander Motin
04d4df89f4 Avoid extra taskq_dispatch() calls by DMU
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes.  Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.

This change adds check for empty sublists to avoid this.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8909
2019-09-25 11:27:48 -07:00
Tom Caputi
bfe5f029cf Fix error message on promoting encrypted dataset
This patch corrects the error message reported when attempting
to promote a dataset outside of its encryption root.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8905
Closes #8935
2019-09-25 11:27:48 -07:00
Matthew Ahrens
7d64595c25 dn_struct_rwlock can not be held in dmu_tx_try_assign()
The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock
while assigning the tx, because this can lead to deadlock. Specifically,
if this dnode is already assigned to an earlier txg, this thread may
need to wait for that txg to sync (the ERESTART case below).  The other
thread that has assigned this dnode to an earlier txg prevents this txg
from syncing until its tx can complete (calling dmu_tx_commit()), but it
may need to acquire the dn_struct_rwlock to do so (e.g. via
dmu_buf_hold*()).

This commit adds an assertion to dmu_tx_try_assign() to ensure that this
deadlock is not inadvertently introduced.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8929
2019-09-25 11:27:48 -07:00
Paul Dagnelie
1fd28bd8d4 Add SCSI_PASSTHROUGH to zvols to enable UNMAP support
When exporting ZVOLs as SCSI LUNs, by default Windows will not
issue them UNMAP commands. This reduces storage efficiency in
many cases.

We add the SCSI_PASSTHROUGH flag to the zvol's device queue,
which lets the SCSI target logic know that it can handle SCSI
commands.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8933
2019-09-25 11:27:48 -07:00
Tomohiro Kusumi
ab24c9cd4c Prevent pointer to an out-of-scope local variable
`show_str` could be a pointer to a local variable in stack
which is out-of-scope by the time
`return (snprintf(buf, buflen, "%s\n", show_str));`
is called.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8924
Closes #8940
2019-09-25 11:27:48 -07:00
Matthew Ahrens
3c2a42fd25 dedup=verify doesn't clear the blkptr's dedup flag
The logic to handle strong checksum collisions where the data doesn't
match is incorrect. It is not clearing the dedup bit of the blkptr,
which can cause a panic later in zio_ddt_free() due to the dedup table
not matching what is in the blkptr.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-48097
Closes #8936
2019-09-25 11:27:48 -07:00
Igor K
9af524b0ee Update vdev_ops_t from illumos
Align vdev_ops_t from illumos for better compatibility.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8925
2019-09-25 11:27:48 -07:00
Tom Caputi
b96ceeead2 Allow unencrypted children of encrypted datasets
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.

Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8737
Closes #8870
2019-09-25 11:27:48 -07:00
Alexander Motin
ed7b0d357a Minimize aggsum_compare(&arc_size, arc_c) calls.
For busy ARC situation when arc_size close to arc_c is desired.  But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result.  Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.

Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.

I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8901
2019-09-25 11:27:47 -07:00
Matthew Ahrens
6083f40387 panic in removal_remap test on 4K devices
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.

On debug bits this assertion will fail in spa_vdev_copy_segment():
    ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);

On nondebug, the system panics with a stack like:
    metaslab_free_concrete()
    metaslab_free_impl()
    metaslab_free_impl_cb()
    vdev_indirect_remap()
    free_from_removing_vdev()
    metaslab_free_impl()
    metaslab_free_dva()
    metaslab_free()

Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default.  We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.

This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-61342
Closes #8893
2019-09-25 11:27:47 -07:00
Matthew Ahrens
592ee2e6dd compress metadata in later sync passes
Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable
compression (including of metadata).  Ostensibly this helps the sync
passes to converge (i.e. for a sync pass to not need to allocate
anything because it is 100% overwrites).

However, in practice it increases the average number of sync passes,
because when we turn compression off, a lot of block's size will change
and thus we have to re-allocate (not overwrite) them.  It also increases
the number of 128KB allocations (e.g. for indirect blocks and spacemaps)
because these will not be compressed.  The 128K allocations are
especially detrimental to performance on highly fragmented systems,
which may have very few free segments of this size, and may need to load
new metaslabs to satisfy 128K allocations.

We should increase zfs_sync_pass_dont_compress.  In practice on a highly
fragmented system we see a few 5-pass txg's, a tiny number of 6-pass
txg's, and no txg's with more than 6 passes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-63431
Closes #8892
2019-09-25 11:27:47 -07:00
Alexander Motin
cab7d856ea Move write aggregation memory copy out of vq_lock
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.

Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8890
2019-09-25 11:27:47 -07:00
Matthew Ahrens
77e64c6fff ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread()
When running zloop, we occasionally see the following crash:

    dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0)
    ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3]

The error value 0x1c is ENOSPC.

The transaction used by spa_vdev_remove_thread() should not be able to
fail due to being out of space. i.e. we should not call
dmu_tx_hold_space().  This will allow the removal thread to schedule its
work even when the pool is low on space.  The "slop space" will provide
enough free space to sync out the txg.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-37853
Closes #8889
2019-09-25 11:27:47 -07:00
Tomohiro Kusumi
4f809bddc6 Fix lockdep warning on insmod
sysfs_attr_init() is required to make lockdep happy for dynamically
allocated sysfs attributes. This fixed #8868 on Fedora 29 running
kernel-debug.

This requirement was introduced in 2.6.34.
See include/linux/sysfs.h for what it actually does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8868
Closes #8884
2019-09-25 11:27:47 -07:00
Matthew Ahrens
516a08ebb4 fat zap should prefetch when iterating
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.

For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58347
Closes #8862
2019-09-25 11:27:47 -07:00
Matthew Ahrens
812c36fc71 Target ARC size can get reduced to arc_c_min
Sometimes the target ARC size is reduced to arc_c_min, which impacts
performance.  We've seen this happen as part of the random_reads
performance regression test, where the ARC size is reduced before the
reads test starts which impacts how long it takes for system to reach
good IOPS performance.

We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE,
and arc_available_memory() is less than arc_c>>arc_shrink_shift.

However, arc_available_memory() could easily be low, even when arc_c is
low, because we can have tons of unused bufs in the abd kmem cache. This
would be especially true just after the DMU requests a bunch of stuff be
evicted from the ARC (e.g. due to "zpool export").

To fix this, the ARC should reduce arc_c by the requested amount, not
all the way down to arc_size (or arc_c_min), which can be very small.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59431
Closes #8864
2019-09-25 11:27:47 -07:00
bnjf
fe11968bbf Fix typo in vdev_raidz_math.c
Fix typo in vdev_raidz_math.c

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brad Forschinger <github@bnjf.id.au>
Closes #8875
Closes #8880
2019-09-25 11:27:47 -07:00
Paul Dagnelie
6f7bc75825 Allow metaslab to be unloaded even when not freed from
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure,
on heavily fragmented pools they can still consume a significant
amount of memory. This problem is amplified when we fail to unload
metaslabs that we aren't using. Currently, we only unload a metaslab
during metaslab_sync_done; in order for that function to be called
on a given metaslab in a given txg, we have to have dirtied that
metaslab in that txg. If the dirtying was the result of an allocation,
we wouldn't be unloading it (since it wouldn't be 8 txgs since it
was selected), so in effect we only unload a metaslab during txgs
where it's being freed from.

We move the unload logic from sync_done to a new function, and
call that function on all metaslabs in a given vdev during
vdev_sync_done().

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8837
2019-09-25 11:27:47 -07:00
Allan Jude
60cbc18136 l2arc_apply_transforms: Fix typo in comment
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #8822
2019-09-25 11:27:46 -07:00
Serapheim Dimitropoulos
b63ed49c29 Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold
Historically while doing performance testing we've noticed that IOPS
can be significantly reduced when all vdevs in the pool are hitting
the zfs_mg_fragmentation_threshold percentage. Specifically in a
hypothetical pool with two vdevs, what can happen is the following:
Vdev A would go above that threshold and only vdev B would be used.
Then vdev B would pass that threshold but vdev A would go below it
(we've been freeing from A to allocate to B). The allocations would
go back and forth utilizing one vdev at a time with IOPS taking a hit.

Empirically, we've seen that our vdev selection for allocations is
good enough that fragmentation increases uniformly across all vdevs
the majority of the time. Thus we set the threshold percentage high
enough to avoid hitting the speed bump on pools that are being pushed
to the edge. We effectively disable its effect in the majority of the
cases but we don't remove (at least for now) just in case we hit any
weird behavior in the future.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8859
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi
fafe72712a Drop objid argument in zfs_znode_alloc() (sync with OpenZFS)
Since zfs_znode_alloc() already takes dmu_buf_t*, taking another
uint64_t argument for objid is redundant. inode's ->i_ino does and
needs to match znode's ->z_id.

zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument
since vnode doesn't have vnode# in VFS (hence ->z_id exists).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8841
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi
328c95e391 Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod)
Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and
vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading
zfs.ko.

The rest of initialization functions being called here after cwd set
to / don't depend on cwd of the process except for spa_config_load().
spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when
`rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /,
so just unconditionally use the absolute path without "./", so that
`vn_set_pwd("/")` as well as the entire functions can be removed.
This is also what FreeBSD does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8826
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi
e5a877c5d0 Update descriptions for vnops
These descriptions are not uptodate with the code.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8767
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi
4933b0a25b Drop local definition of MOUNT_BUSY
It's accessible via <sys/mntent.h>.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8765
2019-09-25 11:27:46 -07:00
Alexander Motin
72888812b0 Fix comparison signedness in arc_is_overflowing()
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values, that due to unsigned comparison caused delays, waiting
for arc_adjust() to "fix" it by calling aggsum_value(&arc_size).  Use
of signed comparison there fixes the problem.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Closes #8873
2019-06-11 10:45:23 -07:00
Tom Caputi
581c77e725 Fix incorrect error message for raw receive
This patch fixes an incorrect error message that comes up when
doing a non-forcing, raw, incremental receive into a dataset
that has a newer snapshot than the "from" snapshot. In this
case, the current code prints a confusing message about an IVset
guid mismatch.

This functionality is supported by non-raw receives as an
undocumented feature, but was never supported by the raw receive
code. If this is desired in the future, we can probably figure
out a way to make it work.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8758
Closes #8863
2019-06-11 10:45:17 -07:00
Tom Caputi
8dc8bbde6e Reinstate raw receive check when truncating
This patch re-adds a check that was removed in 369aa50. The check
confirms that a raw receive is not occuring before truncating an
object's dn_maxblkid. At the time, it was believed that all cases
that would hit this code path would be handled in other places,
but that was not the case.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8852 
Closes #8857
2019-06-07 12:45:40 -07:00
Tom Caputi
35050ef39e Fix integer overflow of ZTOI(zp)->i_generation
The ZFS on-disk format stores each inode's generation ID as a 64
bit number on disk and in-core. However, the Linux kernel's inode
is only a 32 bit number. In most places, the code handles this
correctly, but the cast is missing in zfs_rezget(). For many pools,
this isn't an issue since the generation ID is computed as the
current txg when the inode is created and many pools don't have
more than 2^32 txgs.

For the pools that have more txgs, this issue causes any inode with
a high enough generation number to report IO errors after a call to
"zfs rollback" while holding the file or directory open. This patch
simply adds the missing cast.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8858
2019-06-07 12:45:40 -07:00
DeHackEd
58b2de6420 Wait in 'S' state when send/recv pipe is blocking
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #8733 
Closes #8752
2019-06-07 12:45:40 -07:00
TulsiJain
11ad06d1d8 Make zfs_async_block_max_blocks handle zero correctly
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes #8829
Closes #8289
2019-06-07 12:45:40 -07:00
Brian Behlendorf
4f8eef29e0 Revert "Report holes when there are only metadata changes"
This reverts commit ec4f9b8f30 which introduced a narrow race which
can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO.  Resolve
the issue by revering this change to restore the previous behavior
which depends solely on checking the dirty list.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8816 
Closes #8834
2019-06-07 12:45:40 -07:00
Brian Behlendorf
a1eaf0dde0 Exclude log device ashift from normal class
When opening a log device during import its allocation bias will
not yet have been set by vdev_load().  This results in the log
device's ashift being incorrectly applied to the maximum ashift
of the vdevs in the normal class.  Which in turn prevents the
removal of any top-level devices due to the ashift check in the
spa_vdev_remove_top_check() function.

This issue is resolved by including vdev_islog in the check since
it will be set correctly during vdev_open().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8735
2019-06-07 12:45:40 -07:00
madz
580256045b Fix integer overflow in get_next_chunk()
dn->dn_datablksz type is uint32_t and need to be casted to uint64_t
to avoid an overflow when the record size is greater than 4 MiB.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olivier Mazouffre <olivier.mazouffre@ims-bordeaux.fr>
Closes #8778 
Closes #8797
2019-06-07 12:45:40 -07:00
loli10K
aaf3b30dcf Double-free of encryption wrapping key due to invalid pool properties
This commits fixes a double-free in zfs_ioc_pool_create() triggered by
specifying an unsupported combination of properties when creating a pool
with encryption enabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8791
2019-06-07 12:45:40 -07:00
Tom Caputi
69ae34076f Fix embedded bp accounting in count_block()
Currently, count_block() does not correctly account for the
possibility that the bp that is passed to it could be embedded.
These blocks shouldn't be counted since the work of scanning
these blocks in already handled when the containing block is
scanned. This patch simply resolves this issue by returning
early in this case.

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Bill Sommerfeld <sommerfeld@alum.mit.edu>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8800 
Closes #8766
2019-06-07 12:39:13 -07:00
Tomohiro Kusumi
2fb37bcadd Linux 5.2 compat: Directly call wait_on_page_bit()
wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd.

Directly call wait_on_page_bit() without using wait_on_page_writeback()
interface, given zfs_putpage() is the only caller for now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8794
2019-06-07 12:39:13 -07:00
loli10K
f91e7e6284 Device removal panics on 32-bit systems
The issue is caused by an incorrect usage of the sizeof() operator
in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue
since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on
32-bit systems pointers are 4 bytes long which is not supported by
zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system
will cause the following failure:

VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22)
PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete()
Showing stack for process 1315
CPU: 6 PID: 1315 Comm: txg_sync Tainted: P           OE   4.4.69+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
 c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0
 d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341
 da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76
Call Trace:
 [<>] dump_stack+0x58/0x7c
 [<>] spl_dumpstack+0x23/0x27 [spl]
 [<>] spl_panic.cold.0+0x5/0x41 [spl]
 [<>] ? dbuf_rele+0x3e/0x90 [zfs]
 [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs]
 [<>] ? zap_lookup+0x57/0x70 [zfs]
 [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs]
 [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs]
 [<>] ? txg_verify+0x1d/0x160 [zfs]
 [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs]
 [<>] vdev_sync+0xbf/0x550 [zfs]
 [<>] ? mutex_lock+0x10/0x30
 [<>] ? txg_list_remove+0x9f/0x1a0 [zfs]
 [<>] ? zap_contains+0x4d/0x70 [zfs]
 [<>] spa_sync+0x9f1/0x1b10 [zfs]
 ...
 [<>] ? kthread_stop+0x110/0x110

This commit simply corrects the "integer_size" parameter used to lookup
the vdev's ZAP object.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8790
2019-06-07 12:39:13 -07:00
loli10K
cc434dcf45 Fix coverity defects: CID 186143
CID 186143: Memory - illegal accesses (USE_AFTER_FREE)

This patch fixes an use-after-free in spa_import_progress_destroy()
moving the kmem_free() call at the end of the function.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8788
2019-06-07 12:39:13 -07:00
Richard Elling
78fac8d925 Fix kstat state update during pool transition
When reading kstats, the health (aka state) of the pool is stored into
/proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name().
However, during import/export there is a case where the spa exists,
but the root vdev does not exist. This fix checks that case and sets
the state to "TRANSITIONING"

Unfortunately, it is not easy to reproduce a test for this. It was
detected randomly during ZTS runs while kstats were also being sampled
regularly. After this change, further testing did not trip on the case
and the TRANSITIONING state was collected at least once by the kstats.

For posterity, the backtrace prior to this fix is:
[Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs]
...
Mon May 13 17:21:00 2019] Call Trace:
[Mon May 13 17:21:00 2019]  spa_state_data+0x1a/0x40 [zfs]
[Mon May 13 17:21:00 2019]  kstat_seq_show+0x117/0x440 [spl]
[Mon May 13 17:21:00 2019]  seq_read+0xe5/0x430
[Mon May 13 17:21:00 2019]  proc_reg_read+0x45/0x70
[Mon May 13 17:21:00 2019]  __vfs_read+0x1b/0x40
[Mon May 13 17:21:00 2019]  vfs_read+0x8e/0x130
[Mon May 13 17:21:00 2019]  SyS_read+0x55/0xc0
[Mon May 13 17:21:00 2019]  ? SyS_fcntl+0x5d/0xb0
[Mon May 13 17:21:00 2019]  do_syscall_64+0x73/0x130
[Mon May 13 17:21:00 2019]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8746
2019-05-23 14:28:53 -07:00
Brian Behlendorf
bff2361aeb
Linux 5.2 compat: rw_tryupgrade()
Commit torvalds/linux@46ad0840b has removed the architecture specific
rwsem source and headers leaving only the generic version.  As part
of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS
macros were moved to the private kernel/locking/rwsem.h header.
This results in a build failure because these macros were required
to implement the rw_tryupgrade() compatibility function.

In practice, this isn't a major problem because there are only a
few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade
should be written to retry using rw_enter(RW_WRITER).

After auditing all of the callers only dmu_zfetch() was determined
not to perform a retry.  It has been updated in this commit to
resolve this issue.

That said, the rw_tryupgrade() functionality should be considered
for possible removal in a future release due to the difficultly
in supporting the interface.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8730
2019-05-23 13:46:33 -07:00
Paul Dagnelie
e61b53475e Fix incorrect assertion in dnode_dirty_l1range
The db_dirtycnt of an EVICTING dbuf is always 0. However, it still 
appears in the dn_dbufs tree. If we call dnode_dirty_l1range on a 
range that contains an EVICTING dbuf, we will attempt to mark it dirty 
(which will fail because it's EVICTING, resulting in a new dbuf being 
created and dirtied). Later, in ZFS_DEBUG mode, we assert that all the 
dbufs in the range are dirty. If the EVICTING dbuf is still present, 
this will trip the assertion erroneously.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8745
2019-05-19 17:30:33 -07:00
Olaf Faaland
ca95f70dff zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time.  The user may
not know why, or be able to tell whether the import is progressing or is
hung.

Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one).  The kstat is /proc/spl/kstat/zfs/import_progress.

The kstat contents are as follows:
pool_guid         load_state multihost_secs  max_txg pool_name
16667015954387398 3          15              0       tank3

load_state: the value of spa_load_state
multihost_secs:  seconds until the end of the multihost activity
                 check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring

This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting.  The
zpool import subcommand could also be modified to report this
information.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 10:08:05 -07:00
Tomohiro Kusumi
b689de85e8 Add missing trailing '\n' in printk() messages
These messages will want '\n' like any other regular printk() messages.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8726
2019-05-08 16:43:55 -07:00
Tomohiro Kusumi
97aa3ba44f Fix link count of root inode when snapdir is visible
Given how zfs_getattr() is implemented, zfs_getattr_fast() (used by
->getattr() of zpl inodes) also needs to consider an additional link
count if "snapdir" property is set to "visible".

Without this, # of directories in root inode of each dataset doesn't
match the link count when snapdir is visible.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8727
2019-05-08 16:40:51 -07:00
Brian Behlendorf
a20f43b51b
Linux 5.0 compat: ASM_BUG macro
The 5.0 kernel defines the macro ASM_BUG.  In order to prevent a
conflict and build failure rename ASM_BUG to ZFS_ASM_BUG.  This
is currently only an issue on aarch64 but all instances of
ASM_BUG we're renamed to avoid any future conflict on x86_64.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8725 
Issue #8545
2019-05-08 10:18:40 -07:00
Brian Behlendorf
515ddf6504
Fix errant EFAULT during writes (#8719)
Commit 98bb45e resolved a deadlock which could occur when
handling a page fault in zfs_write().  This change added
the uio_fault_disable field to the uio structure but failed
to initialize it to B_FALSE.  This uninitialized field would
cause uiomove_iov() to call __copy_from_user_inatomic()
instead of copy_from_user() resulting in unexpected EFAULTs.

Resolve the issue by fully initializing the uio, and clearing
the uio_fault_disable flags after it's used in zfs_write().

Additionally, reorder the uio_t field assignments to match
the order the fields are declared in the  structure.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8640 
Closes #8719
2019-05-08 10:04:04 -07:00
DeHackEd
1f02ecc5a5 Make zfs_special_class_metadata_reserve_pct into a parameter
Exported and documented a new module parameter.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #8706
2019-05-07 15:34:42 -07:00
Brian Behlendorf
caf9dd209f
Fix send/recv lost spill block
When receiving a DRR_OBJECT record the receive_object() function
needs to determine how to handle a spill block associated with the
object.  It may need to be removed or kept depending on how the
object was modified at the source.

This determination is currently accomplished using a heuristic which
takes in to account the DRR_OBJECT record and the existing object
properties.  This is a problem because there isn't quite enough
information available to do the right thing under all circumstances.
For example, when only the block size changes the spill block is
removed when it should be kept.

What's needed to resolve this is an additional flag in the DRR_OBJECT
which indicates if the object being received references a spill block.
The DRR_OBJECT_SPILL flag was added for this purpose.  When set then
the object references a spill block and it must be kept.  Either
it is update to date, or it will be replaced by a subsequent DRR_SPILL
record.  Conversely, if the object being received doesn't reference
a spill block then any existing spill block should always be removed.

Since previous versions of ZFS do not understand this new flag
additional DRR_SPILL records will be inserted in to the stream.
This has the advantage of being fully backward compatible.  Existing
ZFS systems receiving this stream will recreate the spill block if
it was incorrectly removed.  Updated ZFS versions will correctly
ignore the additional spill blocks which can be identified by
checking for the DRR_SPILL_UNMODIFIED flag.

The small downside to this approach is that is may increase the size
of the stream and of the received snapshot on previous versions of
ZFS.  Additionally, when receiving streams generated by previous
unpatched versions of ZFS spill blocks may still be lost.

OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8668
2019-05-07 15:18:44 -07:00
Tomohiro Kusumi
9c53e51616 Fix zfs set atime|relatime=off|on behavior on inherited datasets
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from parent, until
a dataset is once unmounted and mounted again.

(The properties start to work properly if a dataset is once unmounted
and mounted again. The difference comes from regular mount process,
e.g. via zpool import, uses mount options based on properties read
from ondisk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts a specified dataset.)

--
 # zpool create p1 <device>
 # zfs create p1/f1
 # zfs set atime=off p1
 # echo test > /p1/f1/test
 # sync
 # zfs list
 NAME    USED  AVAIL     REFER  MOUNTPOINT
 p1      176K  18.9G     25.5K  /p1
 p1/f1    26K  18.9G       26K  /p1/f1
 # zfs get atime
 NAME   PROPERTY  VALUE  SOURCE
 p1     atime     off    local
 p1/f1  atime     off    inherited from p1
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:33.741205192 +0900
 # cat /p1/f1/test
 test
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:50.173231861 +0900
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--

The problem is that zfsvfs::z_atime which was probably intended to keep
incore atime state just gets updated by a callback function of "atime"
property change, atime_changed_cb(), and never used for anything else.

Since now that all file read and atime update use a common function
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(),
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true on `zfs set atime=off`.

Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super
block depending on a new atime value, so that atime_needs_update() works
as expected after property change.

The same problem applies to "relatime" except that a self contained
relatime test is needed. This is because relatime_need_update() is based
on a mount option flag MNT_RELATIME, which doesn't exist in datasets
with inherited "relatime" property via `zfs set relatime=...`, hence it
needs its own relatime test zfs_relatime_need_update().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8674 
Closes #8675
2019-05-07 10:06:30 -07:00
Tomohiro Kusumi
a762893269 Fix typo/etc in module/zfs/zfs_ctldir.c
Drop duplicated phrases in comments.

Also drop an obsolete comment "Perform a mount of the associated...",
as all it does now is get objid from DMU and lookup incore inode.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8707
2019-05-05 15:57:23 -07:00
Tomohiro Kusumi
de3e0b914b Linux 5.0 compat: Use totalhigh_pages()
Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839
"mm: convert totalram_pages and totalhigh_pages variables to atomic"

replaced `totalhigh_pages` with an inline function `totalhigh_pages()`.
This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages`
on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8677
Closes #8701
2019-05-04 16:40:48 -07:00
John Gallagher
1eacf2b3b0 Improve rate at which new zvols are processed
The kernel function which adds new zvols as disks to the system,
add_disk(), briefly opens and closes the zvol as part of its work.
Closing a zvol involves waiting for two txgs to sync. This, combined
with the fact that the taskq processing new zvols is single threaded,
makes this processing new zvols slow.

Waiting for these txgs to sync is only necessary if the zvol has been
written to, which is not the case during add_disk(). This change adds
tracking of whether a zvol has been written to so that we can skip the
txg_wait_synced() calls when they are unnecessary.

This change also fixes the flags passed to blkdev_get_by_path() by
vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of
just FMODE_EXCL. The flags were being incorrectly calculated because
we were using the wrong version of vdev_bdev_mode().

Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #8526 
Closes #8615
2019-05-04 16:39:10 -07:00
Matthew Ahrens
8d9f616511 Reword comment in lz4_compress_zfs
The comment in lz4_compress_zfs could be more clear and specific.  It
also contains needlessly strong language.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes: #8702
Closes: #8703
2019-05-02 16:46:04 -07:00
Tom Caputi
fa24166074 Add feature check for 'zpool resilver' command
The 'zpool resilver' command requires that the resilver_defer
feature is active on the pool. Unfortunately, the check for
this was left out of the original patch. This commit simply
corrects this so that the command properly returns an error
in this case.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8700
2019-05-02 16:42:31 -07:00
Tomohiro Kusumi
f0ce0436aa Correct snprintf() size argument
The size argument of snprintf(3) in glibc and snprintf() in Linux
kernel includes trailing \0, as snprintf(3) man page explains it as
"write at most size bytes (including the trailing null byte ('\0'))",
i.e. snprintf() can just take buffer size.

e.g. For snprintf() in module/zfs/zfs_ctldir.c, a buffer size is
MAXPATHLEN, and a caller is passing MAXPATHLEN to snprintf(), so size
should just be `path_len` to do what the caller is trying to do.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8692
2019-04-30 19:41:12 -07:00
Brian Behlendorf
c12ea77865
Linux 5.0 compat: Remove incorrect ASSERT
Not all block devices, notably scsi_debug, set a root_blkg on the
request queue.  Remove this assertion and allow the the existing
call to blkg_tryget() to gracefully handle the NULL (which it does).

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8678
2019-04-29 18:20:21 -07:00
Tomohiro Kusumi
126d0fa733 Use SEEK_{SET,CUR,END} for file seek "whence"
Use either SEEK_* or 0,1,2..., but not both.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8656
2019-04-25 10:17:27 -07:00
Tom Caputi
f4c594da94 Fixes for the DMU free throttle
This patch fixes 2 issues with the DMU free throttle implemented
in dmu_free_long_range(). The first issue is that get_next_chunk()
was calculating the number of L1 blocks the free would dirty
incorrectly. In some cases involving extremely large files, this
code would greatly overestimate the number of effected L1 blocks,
causing excessive calls to txg_wait_open(). This patch corrects
the calculation.

The second issue is that the free throttle uses the total number
of free'd blocks in all (open, quiescing, and syncing) txgs to
determine whether to throttle. This causes large frees (such as
those created by the first issue) to cause 4 txg syncs before
any further frees were allowed to proceed. This patch ensures
that the accounting is done entirely in a per-txg fashion, so
that frees from a given txg don't affect those that immediately
follow it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8655
2019-04-25 10:16:24 -07:00
Tomohiro Kusumi
34d343c2a8 Drop unused ZNODE_STATS and ZNODE_STAT_ADD()
Unused since 5649246dd3("Remove znode move functionality"),
and ZNODE_STAT_ADD() will never be needed.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8636
2019-04-19 12:05:15 -07:00
Tomohiro Kusumi
a35c12073c Fix incorrect "[UNUSED]" comments
These aren't unused.
`flag` in zfs_create() also isn't to indicate large file.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8635
2019-04-19 12:03:32 -07:00
cfzhu
5090f72743 Code improvement and bug fixes for QAT support
1. Support QAT when ZFS is root file-system:
   When ZFS module is loaded before QAT started, the QAT can
   be started again in post-process, e.g.:
   echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
2. Verify alder checksum of the de-compress result
3. Allocate Digest, IV and AAD buffer in physical contiguous
   memory by QAT_PHYS_CONTIG_ALLOC.
4. Update the documentation for zfs_qat_compress_disable,
   zfs_qat_checksum_disable, zfs_qat_encrypt_disable.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Closes #8323 
Closes #8610
2019-04-16 12:38:36 -07:00
Richard Laager
fcf21f8fcb Update a comment to match the code
GRUB supports large_blocks.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:33 -07:00
Richard Laager
393363c5ec Consistently captialize GUID for features
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:51 -07:00
Richard Laager
612c4930dd Fix the spelling of deferred
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:45 -07:00
Tom Caputi
c2c6eadf29 Fix issues with truncated files in raw sends
When receiving a raw send stream only reallocated objects
whose contents were not freed by the standard indicators
should call dmu_free_long_range().

Furthermore, if calling dmu_free_long_range() is required
then the objects current block size must be used and not
the new block size.

Two additional test cases were added to provided realistic
test coverage for processing reallocated objects which are
part of a raw receive.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8528
Closes #8607
2019-04-15 15:28:48 -07:00
Tom Caputi
7dcd318832 Cleanup nits from ab7615d92
This patch simply up cleans up a nit and corrects an error message
issue that were introduced in the Multiple DVA scrub patch.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8619
2019-04-14 11:03:06 -07:00
Brian Behlendorf
b92f5d9f82
Fix issue in receive_object() during reallocation
When receiving an object to a previously allocated interior slot
the new object should be "allocated" by setting DMU_NEW_OBJECT,
not "reallocated" with dnode_reallocate().  For resilience verify
the slot is free as required in case the stream is malformed.

Add a test case to generate more realistic incremental send streams
that force reallocation to occur during the receive.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8067 
Closes #8614
2019-04-12 14:28:04 -07:00
Brian Behlendorf
3fa93bb8d3
Fix TXG_MASK cstyle
Fix style issue for 'tx->tx_txg&TXG_MASK'.  There should be white
space around the '&' character.  Split the dnode_reallocate() ASSERT
to make it more readable to clearly separate the checks.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8606
2019-04-12 11:30:59 -07:00
Jorgen Lundman
48ed0f9da0 Always call rw_init in zio_crypt_key_unwrap
The error path in zio_crypt_key_unwrap would call zio_crypt_key_destroy which
calls rw_destroy(&key->zk_salt_lock); which has not yet been initialized.

We move the rw_init() call to the start of zio_crypt_key_unwrap instead.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #8604
Closes #8605
2019-04-10 15:39:40 -07:00
Tim Chase
8cb34421e0 Avoid stack overwrite in zfs_setattr_dir()
The bulk[] array index, count, must be reset per-iteration in order to
not overwrite the stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8072
Closes #8597
Closes #8601
2019-04-10 15:38:21 -07:00
Brian Behlendorf
d93d4b1acd
Revert "Fix issues with truncated files in raw sends"
This partially reverts commit 5dbf8b4ed.  This change resolved
the issues observed with truncated files in raw sends.  However,
the required changes to dnode_allocate() introduced a regression
for non-raw streams which needs to be understood.

The additional debugging improvements from the original patch
were not reverted.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7378
Issue #8528
Issue #8540
Issue #8565
Close #8584
2019-04-05 17:32:56 -07:00
Matthew Ahrens
944a37248a predictive prefetch disabled on new pools until export/reboot
When a pool is initially created (by `zpool create`), predictive
prefetch is inadvertently disabled, until the pool is export/import-ed,
or the machine is rebooted.

When device removal was introduced, we added some code to disable
predictive prefetching until indirect vdevs have been loaded.  This
resulted in the "default state" of prefetch being disabled, until we
proactively enable it after indirect vdevs are loaded.  Unfortunately
this resulted in a few bugs where in some code paths we neglect to
enable predictive prefetch.  The first of these was fixed by
20507534d4

This commit fixes another case where we also need to explicitly enable
predictive prefetch, when the pool is initially created.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8577
2019-04-05 10:12:02 -07:00
Don Brady
b4ddec7af6 features.kernel layout should match features.pool
The features.kernel layout should match features.pool.

Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8566
2019-04-04 19:00:55 -07:00
Sara Hartse
a887d653b3 Restrict kstats and print real pointers
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8467 
Closes #8476
2019-04-04 18:57:06 -07:00
Brian Behlendorf
f4e35b165c
Fix txg_wait_open() load average inflation
Callers of txg_wait_open() which set should_quiesce=B_TRUE should be
accounted for as iowait time.  Otherwise, the caller is understood
to be idle and cv_wait_sig() is used to prevent incorrectly inflating
the system load average.

Similarly txg_wait_wait() has been updated to use cv_wait_io() to
be accounted against iowait.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8550
Closes #8558
2019-04-04 09:44:46 -07:00
Brian Behlendorf
1b939560be
Add TRIM support
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
2019-03-29 09:13:20 -07:00