Commit Graph

1965 Commits

Author SHA1 Message Date
Chunwei Chen
035e96118b Fix zil replay panic when TX_REMOVE followed by TX_CREATE
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.

We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
2019-08-28 10:42:02 -07:00
Tom Caputi
e7a2fa70c3 Fix deadlock in 'zfs rollback'
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.

This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9203
2019-08-27 09:55:51 -07:00
Matthew Ahrens
325d288c5d Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj
When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing.  With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.

By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58348
Closes #8744
2019-08-20 11:34:52 -07:00
Paul Dagnelie
f09fda5071 Cap metaslab memory usage
On systems with large amounts of storage and high fragmentation, a huge 
amount of space can be used by storing metaslab range trees. Since 
metaslabs are only unloaded during a txg sync, and only if they have 
been inactive for 8 txgs, it is possible to get into a state where all 
of the system's memory is consumed by range trees and metaslabs, and 
txgs cannot sync. While ZFS knows how to evict ARC data when needed, 
it has no such mechanism for range tree data. This can result in boot 
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing 
context. Second, we store a multilist of all loaded metaslabs, sorted 
by their selection txg, so we can quickly identify the oldest 
metaslabs.  We use a multilist to reduce lock contention during heavy 
write workloads. Finally, we add logic that will unload a metaslab 
when we're loading a new metaslab, if we're using more than a certain 
fraction of the available memory on range trees.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9128
2019-08-16 09:08:21 -06:00
Paul Dagnelie
dc04a8c757 Prevent race in blkptr_verify against device removal
When we check the vdev of the blkptr in zfs_blkptr_verify, we can run 
into a race condition where that vdev is temporarily unavailable. This 
happens when a device removal operation and the old vdev_t has been 
removed from the array, but the new indirect vdev has not yet been 
inserted.

We hold the spa_config_lock while doing our sensitive verification. 
To ensure that we don't deadlock, we only grab the lock if we don't 
have config_writer held. In addition, I had to const the tags of the 
refcounts and the spa_config_lock arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9112
2019-08-13 21:24:43 -06:00
Chunwei Chen
8e556c5ebc Fix out-of-order ZIL txtype lost on hardlinked files
We should only call zil_remove_async when an object is removed. However,
in current implementation, it is called whenever TX_REMOVE is called. In
the case of hardlinked file, every unlink will generate TX_REMOVE and
causing operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #8769
Closes #9061
2019-08-13 21:21:27 -06:00
George Wilson
c8242a96ba spa_load_verify() may consume too much memory
When a pool is imported it will scan the pool to verify the integrity 
of the data and metadata. The amount it scans will depend on the 
import flags provided. On systems with small amounts of memory or 
when importing a pool from the crash kernel, it's possible for 
spa_load_verify to issue too many I/Os that it consumes all the memory 
of the system resulting in an OOM message or a hang.

To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: George Wilson george.wilson@delphix.com
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146
2019-08-13 08:11:57 -06:00
Tomohiro Kusumi
a43570c5f3 Change boolean-like uint8_t fields in znode_t to boolean_t
Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9092
2019-08-13 07:58:02 -06:00
Richard Yao
fccbd1d6e2 Drop KMC_NOEMERGENCY
This is not implemented. If it were implemented, using it would risk
deadlocks on pre-3.18 kernels. Lets just drop it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #9119
2019-08-13 07:46:12 -06:00
Paul Dagnelie
c81f1790e2 Metaslab max_size should be persisted while unloaded
When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.

This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9055
2019-08-05 14:34:27 -07:00
Sara Hartse
37f03da8ba Fast Clone Deletion
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.

This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.

We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8416
2019-07-26 10:54:14 -07:00
Matthew Ahrens
1ff46825e2 Replace zf_rwlock with a mutex
The rwlock implementation on linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex.  Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9062
2019-07-25 11:57:58 -07:00
Serapheim Dimitropoulos
43a8536260 Race condition between spa async threads and export
In the past we've seen multiple race conditions that have
to do with open-context threads async threads and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).

This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, eliminating this way any race condition
bugs introduced by concurrent calls to this function.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9015 
Closes #9044
2019-07-18 13:02:33 -07:00
Brian Behlendorf
d64dd3b62a Retire unused spl_{mutex,rwlock}_{init_fini}
These functions are unused and can be removed along
with the spl-mutex.c and spl-rwlock.c source files.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-07-17 15:13:53 -07:00
Brian Behlendorf
e7a99dab2b Linux 5.3 compat: retire rw_tryupgrade()
The Linux kernel's rwsem's have never provided an interface to
allow a reader to be upgraded to a writer.  Historically, this
functionality has been implemented by a SPL wrapper function.
However, this approach depends on internal knowledge of the
rw_semaphore and is therefore rather brittle.

Since the ZFS code must always be able to fallback to rw_exit()
and rw_enter() when an rw_tryupgrade() fails; this functionality
isn't critical.  Furthermore, the only potentially performance
sensitive consumer is dmu_zfetch() and no decrease in performance
was observed with this change applied.  See the PR comments for
additional testing details.

Therefore, it is being retired to make the build more robust and
to simplify the rwlock implementation.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-07-17 15:08:56 -07:00
Brian Behlendorf
041205afee Linux 5.3 compat: rw_semaphore owner
Commit https://github.com/torvalds/linux/commit/94a9717b updated the
rwsem's owner field to contain additional flags describing the rwsem's
state.  Rather then update the wrappers to mask out these bits, the
code no longer relies on the owner stored by the kernel.  This does
increase the size of a krwlock_t but it makes the implementation
less sensitive to future kernel changes.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-07-17 15:07:46 -07:00
jdike
a649768a17 Fix lockdep recursive locking false positive in dbuf_destroy
lockdep reports a possible recursive lock in dbuf_destroy.

It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode.  However,
it is impossible for these to be the same dnode because,
among other things,dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.

This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.

In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.

This is the lockdep report:

[  122.950921] ============================================
[  122.950921] WARNING: possible recursive locking detected
[  122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G           O
[  122.950921] --------------------------------------------
[  122.950921] dbu_evict/1457 is trying to acquire lock:
[  122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]
               but task is already holding lock:
[  122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               other info that might help us debug this:
[  122.950921]  Possible unsafe locking scenario:

[  122.950921]        CPU0
[  122.950921]        ----
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]
                *** DEADLOCK ***

[  122.950921]  May be due to missing lock nesting notation

[  122.950921] 1 lock held by dbu_evict/1457:
[  122.950921]  #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               stack backtrace:
[  122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G           O      4.19.29-4.19.0-debug-d69edad5368c1166 #1
[  122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011  03/13/2009
[  122.950921] Call Trace:
[  122.950921]  dump_stack+0x91/0xeb
[  122.950921]  __lock_acquire+0x2ca7/0x4f10
[  122.950921]  lock_acquire+0x153/0x330
[  122.950921]  dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]  dbuf_evict_one+0x1cc/0x3d0 [zfs]
[  122.950921]  dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[  122.950921]  dnode_evict_dbufs+0x3a6/0x740 [zfs]
[  122.950921]  dmu_objset_evict+0x7a/0x500 [zfs]
[  122.950921]  dsl_dataset_evict_async+0x70/0x480 [zfs]
[  122.950921]  taskq_thread+0x979/0x1480 [spl]
[  122.950921]  kthread+0x2e7/0x3e0
[  122.950921]  ret_from_fork+0x27/0x50

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #8984
2019-07-17 09:18:24 -07:00
Brian Behlendorf
095b5412b3
Fix CONFIG_X86_DEBUG_FPU build failure
When CONFIG_X86_DEBUG_FPU is defined the alternatives_patched symbol
is pulled in as a dependency which results in a build failure.  To
prevent this undefine CONFIG_X86_DEBUG_FPU to disable the WARN_ON_FPU()
macro and rely on WARN_ON_ONCE debugging checks which were previously
added.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9041
Closes #9049
2019-07-17 09:14:36 -07:00
Brian Behlendorf
8062b7686a
Minor style cleanup
Resolve an assortment of style inconsistencies including
use of white space, typos, capitalization, and line wrapping.
There is no functional change.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9030
2019-07-16 17:22:31 -07:00
Serapheim Dimitropoulos
93e28d661e Log Spacemap Project
= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe

Change target size of metaslabs from 256GB to 16GB
c853f382db

zdb -L should skip leak detection altogether
21e7cf5da8

vs_alloc can underflow in L2ARC vdevs
7558997d2f

Simplify log vdev removal code
6c926f426a

Get rid of space_map_update() for ms_synced_length
425d3237ee

Introduce auxiliary metaslab histograms
928e8ad47d

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 10:11:49 -07:00
Tomohiro Kusumi
ff9630d1a8 Disable unused pathname::pn_path* (unneeded in Linux)
struct pathname is originally from Solaris VFS, and it has been used
in ZoL to merely call VOP from Linux VFS interface without API change,
therefore pathname::pn_path* are unused and unneeded. Technically,
struct pathname is a wrapper for C string in ZoL.

Saves stack a bit on lookup and unlink.

(#if0'd members instead of removing since comments refer to them.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9025
2019-07-15 13:57:56 -07:00
Brian Behlendorf
e5db313494
Linux 5.0 compat: SIMD compatibility
Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS,
and 5.0 and newer kernels.  This is accomplished by leveraging
the fact that by definition dedicated kernel threads never need
to concern themselves with saving and restoring the user FPU state.
Therefore, they may use the FPU as long as we can guarantee user
tasks always restore their FPU state before context switching back
to user space.

For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used.  All non-kernel
threads will restore the preserved user FPU state.

For 5.2 and latter kernels the user FPU state restoration will be
skipped if the kernel determines the registers have not changed.
Therefore, for these kernels we need to perform the additional
step of saving and restoring the FPU registers.  Invalidating the
per-cpu global tracking the FPU state would force a restore but
that functionality is private to the core x86 FPU implementation
and unavailable.

In practice, restricting SIMD to kernel threads is not a major
restriction for ZFS.  The vast majority of SIMD operations are
already performed by the IO pipeline.  The remaining cases are
relatively infrequent and can be handled by the generic code
without significant impact.  The two most noteworthy cases are:

  1) Decrypting the wrapping key for an encrypted dataset,
     i.e. `zfs load-key`.  All other encryption and decryption
     operations will use the SIMD optimized implementations.

  2) Generating the payload checksums for a `zfs send` stream.

In order to avoid making any changes to the higher layers of ZFS
all of the `*_get_ops()` functions were updated to take in to
consideration the calling context.  This allows for the fastest
implementation to be used as appropriate (see kfpu_allowed()).

The only other notable instance of SIMD operations being used
outside a kernel thread was at module load time.  This code
was moved in to a taskq in order to accommodate the new kernel
thread restriction.

Finally, a few other modifications were made in order to further
harden this code and facilitate testing.  They include updating
each implementations operations structure to be declared as a
constant.  And allowing "cycle" to be set when selecting the
preferred ops in the kernel as well as user space.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8754 
Closes #8793 
Closes #8965
2019-07-12 09:31:20 -07:00
Paul Dagnelie
f664f1ee7f Decrease contention on dn_struct_rwlock
Currently, sequential async write workloads spend a lot of time 
contending on the dn_struct_rwlock. This lock is responsible for 
protecting the entire block tree below it; this naturally results 
in some serialization during heavy write workloads. This can be 
resolved by having per-dbuf locking, which will allow multiple 
writers in the same object at the same time.

We introduce a new rwlock, the db_rwlock. This lock is responsible 
for protecting the contents of the dbuf that it is a part of; when 
reading a block pointer from a dbuf, you hold the lock as a reader. 
When writing data to a dbuf, you hold it as a writer. This allows 
multiple threads to write to different parts of a file at the same 
time.

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens matt@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
External-issue: DLPX-52564
External-issue: DLPX-53085
External-issue: DLPX-57384
Closes #8946
2019-07-08 13:18:50 -07:00
Brad Lewis
cb70964221 8659 static dtrace probes unavailable on non-GPL modules
ZFS tracing efforts are hampered by the inability to access zfs static
probes(probes using DTRACE_PROBE macros). The probes are available via
tracepoints for GPL modules only.  The build could be modified to
generate a function for each unique DTRACE_PROBE invocation. These could
be then accessed via kprobes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>
Closes #8659 
Closes #8663
2019-07-08 11:20:53 -07:00
loli10K
3b5fe2c351 Fix zfs "redact" misc issues
* zfs redact error messages do not end with newline character
 * 30af21b0 inadvertently removed some ZFS_PROP comments
 * man/zfs: zfs redact <redaction_snapshot> is not optional

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8988
2019-07-05 16:38:17 -07:00
Mike Gerdts
341166c843 OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocks
When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Mike Gerdts <mike.gerdts@joyent.com>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0
Closes #8973
2019-07-05 15:35:15 -07:00
Alexander Motin
fc7546777b Avoid extra taskq_dispatch() calls by DMU
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes.  Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.

This change adds check for empty sublists to avoid this.

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8909
2019-06-25 12:03:38 -07:00
loli10K
746d4a451e Fix bp_embedded_type enum definition
With the addition of BP_EMBEDDED_TYPE_REDACTED in 30af21b0 a couple of
codepaths make wrong assumptions and could potentially result in errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8951
2019-06-24 18:02:17 -07:00
Matthew Ahrens
59ec30a329 Remove code for zfs remap
The "zfs remap" command was disabled by
6e91a72fe3, because it has little utility
and introduced some tricky bugs.  This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.

Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8944
2019-06-24 16:44:01 -07:00
Don Brady
186898bbb5 OpenZFS 9425 - channel programs can be interrupted
Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.

Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal.In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete.  Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
a EINTR to indicate that the channel program run was canceled.

Porting notes: Added missing return value from cv_wait_sig()

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9425
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926
Closes #8904
2019-06-22 16:51:46 -07:00
Paul Dagnelie
2b09628b59 Fix comments on zfs_bookmark_phys
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8945
2019-06-22 16:32:26 -07:00
Tom Caputi
da68988708 Allow unencrypted children of encrypted datasets
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.

Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8737 
Closes #8870
2019-06-20 12:29:51 -07:00
Matthew Ahrens
050d720c43 Remove dedupditto functionality
If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x).  The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.

ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ).  Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful.  The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad.  This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.

Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.

Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).

In summary, this seldom-used feature doesn't work, and even if it did it
wouldn't provide useful data protection.  It has a non-trivial
maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270).

We should remove the dedupditto functionality.  For backwards
compatibility with the existing CLI, "zpool set dedupditto" will still
"succeed" (exit code zero), but won't have any effect.  For backwards
compatibility with existing pools that had dedupditto enabled at some
point, the code will still be able to understand dedupditto blocks and
free them when appropriate.  However, ZFS won't write any new dedupditto
blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Issue #8270 
Closes #8310
2019-06-19 14:54:02 -07:00
Paul Dagnelie
30af21b025 Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to 
a target system. One possible use case for this feature is to not 
transmit sensitive information to a data warehousing, test/dev, or 
analytics environment. Another is to save space by not replicating 
unimportant data within a given dataset, for example in backup tools 
like zrepl.

Redacted send/receive is a three-stage process. First, a clone (or 
clones) is made of the snapshot to be sent to the target. In this 
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction 
snapshot" (or snapshots). Second, the new zfs redact command is used 
to create a redaction bookmark. The redaction bookmark stores the 
list of blocks in a snapshot that were modified by the redaction 
snapshot(s). Finally, the redaction bookmark is passed as a parameter 
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive 
or unwanted information, and those blocks are not included in the send 
stream.  When sending from the redaction bookmark, the blocks it 
contains are considered as candidate blocks in addition to those 
blocks in the destination snapshot that were modified since the 
creation_txg of the redaction bookmark.  This step is necessary to 
allow the target to rehydrate data in the case where some blocks are 
accidentally or unnecessarily modified in the redaction snapshot.

The changes to bookmarks to enable fast space estimation involve 
adding deadlists to bookmarks. There is also logic to manage the 
life cycles of these deadlists.

The new size estimation process operates in cases where previously 
an accurate estimate could not be provided. In those cases, a send 
is performed where no data blocks are read, reducing the runtime 
significantly and providing a byte-accurate size estimate.

Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 09:48:12 -07:00
Matthew Ahrens
7218b29e4b lz4_decompress_abd declared but not defined
`lz4_decompress_abd` is declared in zio_compress.h but it is not defined
anywhere. The declaration should be removed.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-47477
Closes #8894
2019-06-13 13:14:34 -07:00
Matthew Ahrens
53dce5acc6 panic in removal_remap test on 4K devices
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.

On debug bits this assertion will fail in spa_vdev_copy_segment():
    ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);

On nondebug, the system panics with a stack like:
    metaslab_free_concrete()
    metaslab_free_impl()
    metaslab_free_impl_cb()
    vdev_indirect_remap()
    free_from_removing_vdev()
    metaslab_free_impl()
    metaslab_free_dva()
    metaslab_free()

Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default.  We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.

This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-61342
Closes #8893
2019-06-13 13:12:39 -07:00
Tulsi Jain
9c7da9a95a Restrict filesystem creation if name referred either '.' or '..'
This change restricts filesystem creation if the given name
contains either '.' or '..'

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes #8842 
Closes #8564
2019-06-13 08:56:15 -07:00
Matthew Ahrens
d9b4bf0665 fat zap should prefetch when iterating
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.

For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58347
Closes #8862
2019-06-12 13:13:09 -07:00
Matthew Ahrens
5662fd5794 single-chunk scatter ABDs can be treated as linear
Scatter ABD's are allocated from a number of pages.  In contrast to
linear ABD's, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer.  Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer.  This can have a
measurable performance overhead on some workloads.

https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABD's, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter.  For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path.  For example, the
sequential_reads_arc_cached tests in the test suite were reduced by
around 5% (this is doing reads of 8K-logical blocks, compressed to 3K,
which are cached in the ARC).

This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.

All single-page (4K) ABD's can be represented this way.  Some multi-page
ABD's can also be represented this way, if we were able to allocate a
single "chunk" (higher-order "page" which represents a power-of-2 series
of physically-contiguous pages).  This is often the case for 2-page (8K)
ABD's.

Representing a single-entry scatter ABD as a linear ABD has the
performance advantage of avoiding the copy (and allocation) in
abd_borrow_buf_copy / abd_return_buf_copy.  A performance increase of
around 5% has been observed for ARC-cached reads (of small blocks which
can take advantage of this), fixing the regression introduced by
87c25d567.

Note that this optimization is only possible because all physical memory
is always mapped into the kernel's address space.  This is not the case
for HIGHMEM pages, so the optimization can not be made on 32-bit
systems.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8580
2019-06-11 09:02:31 -07:00
Matthew Ahrens
b8738257c2 make zil max block size tunable
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations.  The large allocations are for ZIL blocks.  If there is a
lot of fragmentation, the large allocations can be hard to satisfy.

The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more.  In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.

To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.

External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8865
2019-06-10 11:48:42 -07:00
Paul Dagnelie
893a6d62c1 Allow metaslab to be unloaded even when not freed from
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure, 
on heavily fragmented pools they can still consume a significant 
amount of memory. This problem is amplified when we fail to unload 
metaslabs that we aren't using. Currently, we only unload a metaslab 
during metaslab_sync_done; in order for that function to be called 
on a given metaslab in a given txg, we have to have dirtied that 
metaslab in that txg. If the dirtying was the result of an allocation, 
we wouldn't be unloading it (since it wouldn't be 8 txgs since it 
was selected), so in effect we only unload a metaslab during txgs 
where it's being freed from.

We move the unload logic from sync_done to a new function, and 
call that function on all metaslabs in a given vdev during 
vdev_sync_done().

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8837
2019-06-06 19:10:43 -07:00
Tomohiro Kusumi
fe0c9f409a Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod)
Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and
vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading
zfs.ko.

The rest of initialization functions being called here after cwd set
to / don't depend on cwd of the process except for spa_config_load().
spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when
`rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /,
so just unconditionally use the absolute path without "./", so that
`vn_set_pwd("/")` as well as the entire functions can be removed.
This is also what FreeBSD does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8826
2019-05-29 16:18:14 -07:00
loli10K
d28b492ab3 VERIFY3P() message is missing a space character
This commit just reintroduces a [space] character inadvertently removed
in a887d653.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8786
2019-05-24 14:06:53 -07:00
Tomohiro Kusumi
8708fd888f Linux 2.6.39 compat: Test if kstrtoul() exists
kstrtoul() exists only after torvalds/linux@33ee3b2e2e in 2.6.39.
Use strict_strtoul() if kstrtoul() doesn't exist.
Note that strict_strtoul() has existed as an alias for kstrtoul()
for a while, but removed in torvalds/linux@3db2e9cdc0.

It looks like RHEL6 (2.6.32 based) has backported kstrtoul(),
and this caused build CI to pass compilation test.
It should fail on vanilla < 2.6.39 kernels or distro kernels without
backport as reported in #8760.

--
 # grep "kstrtoul(" /lib/modules/2.6.32-754.12.1.el6.x86_64/build/ \
 include/linux/kernel.h >/dev/null
 # echo $?
 0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8760 
Closes #8761
2019-05-24 12:26:18 -07:00
Rafael Kitover
8b8b44d06f kernel timer API rework
In `config/kernel-timer.m4` refactor slightly to check more generally
for the new `timer_setup()` APIs, but also check the callback signature
because some kernels (notably 4.14) have the new `timer_setup()` API but
use the old callback signature. Also add a check for a `flags` member in
`struct timer_list`, which was added in 4.1-rc8.

Add compatibility shims to `include/spl/sys/timer.h` to allow using the
new timer APIs with the only two caveats being that the callback
argument type must be declared as `spl_timer_list_t` and an explicit
assignment is required to get the timer variable for the `timer_of()`
macro. So the callback would look like this:

```c
__cv_wakeup(spl_timer_list_t t)
{
        struct timer_list *tmr = (struct timer_list *)t;
	struct thing *parent = from_timer(parent, tmr,
		parent_timer_field);
	... /* do stuff with parent */
```

Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the
new timer APIs instead of conditional code.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes #8647
2019-05-23 14:40:28 -07:00
Brian Behlendorf
bff2361aeb
Linux 5.2 compat: rw_tryupgrade()
Commit torvalds/linux@46ad0840b has removed the architecture specific
rwsem source and headers leaving only the generic version.  As part
of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS
macros were moved to the private kernel/locking/rwsem.h header.
This results in a build failure because these macros were required
to implement the rw_tryupgrade() compatibility function.

In practice, this isn't a major problem because there are only a
few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade
should be written to retry using rw_enter(RW_WRITER).

After auditing all of the callers only dmu_zfetch() was determined
not to perform a retry.  It has been updated in this commit to
resolve this issue.

That said, the rw_tryupgrade() functionality should be considered
for possible removal in a future release due to the difficultly
in supporting the interface.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8730
2019-05-23 13:46:33 -07:00
Olaf Faaland
ca95f70dff zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import make take a long time.  The user may
not know why, or be able to tell whether the import is progressing or is
hung.

Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one).  The kstat is /proc/spl/kstat/zfs/import_progress.

The kstat contents are as follows:
pool_guid         load_state multihost_secs  max_txg pool_name
16667015954387398 3          15              0       tank3

load_state: the value of spa_load_state
multihost_secs:  seconds until the end of the multihost activity
                 check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring

This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting.  The
zpool import subcommand could also be modified to report this
information.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 10:08:05 -07:00
Brian Behlendorf
caf9dd209f
Fix send/recv lost spill block
When receiving a DRR_OBJECT record the receive_object() function
needs to determine how to handle a spill block associated with the
object.  It may need to be removed or kept depending on how the
object was modified at the source.

This determination is currently accomplished using a heuristic which
takes in to account the DRR_OBJECT record and the existing object
properties.  This is a problem because there isn't quite enough
information available to do the right thing under all circumstances.
For example, when only the block size changes the spill block is
removed when it should be kept.

What's needed to resolve this is an additional flag in the DRR_OBJECT
which indicates if the object being received references a spill block.
The DRR_OBJECT_SPILL flag was added for this purpose.  When set then
the object references a spill block and it must be kept.  Either
it is update to date, or it will be replaced by a subsequent DRR_SPILL
record.  Conversely, if the object being received doesn't reference
a spill block then any existing spill block should always be removed.

Since previous versions of ZFS do not understand this new flag
additional DRR_SPILL records will be inserted in to the stream.
This has the advantage of being fully backward compatible.  Existing
ZFS systems receiving this stream will recreate the spill block if
it was incorrectly removed.  Updated ZFS versions will correctly
ignore the additional spill blocks which can be identified by
checking for the DRR_SPILL_UNMODIFIED flag.

The small downside to this approach is that is may increase the size
of the stream and of the received snapshot on previous versions of
ZFS.  Additionally, when receiving streams generated by previous
unpatched versions of ZFS spill blocks may still be lost.

OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8668
2019-05-07 15:18:44 -07:00
Tomohiro Kusumi
9c53e51616 Fix zfs set atime|relatime=off|on behavior on inherited datasets
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from parent, until
a dataset is once unmounted and mounted again.

(The properties start to work properly if a dataset is once unmounted
and mounted again. The difference comes from regular mount process,
e.g. via zpool import, uses mount options based on properties read
from ondisk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts a specified dataset.)

--
 # zpool create p1 <device>
 # zfs create p1/f1
 # zfs set atime=off p1
 # echo test > /p1/f1/test
 # sync
 # zfs list
 NAME    USED  AVAIL     REFER  MOUNTPOINT
 p1      176K  18.9G     25.5K  /p1
 p1/f1    26K  18.9G       26K  /p1/f1
 # zfs get atime
 NAME   PROPERTY  VALUE  SOURCE
 p1     atime     off    local
 p1/f1  atime     off    inherited from p1
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:33.741205192 +0900
 # cat /p1/f1/test
 test
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:50.173231861 +0900
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--

The problem is that zfsvfs::z_atime which was probably intended to keep
incore atime state just gets updated by a callback function of "atime"
property change, atime_changed_cb(), and never used for anything else.

Since now that all file read and atime update use a common function
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(),
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true on `zfs set atime=off`.

Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super
block depending on a new atime value, so that atime_needs_update() works
as expected after property change.

The same problem applies to "relatime" except that a self contained
relatime test is needed. This is because relatime_need_update() is based
on a mount option flag MNT_RELATIME, which doesn't exist in datasets
with inherited "relatime" property via `zfs set relatime=...`, hence it
needs its own relatime test zfs_relatime_need_update().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8674 
Closes #8675
2019-05-07 10:06:30 -07:00
Tomohiro Kusumi
75346937de Linux 5.1 compat: Drop ULLONG_MAX and LLONG_MAX definitions
Linux kernel commit 54d50897d544c874562253e2a8f70dfcad22afe8
"linux/kernel.h: split *_MAX and *_MIN macros into <linux/limits.h>"

which first appeared in 5.1 has moved several macros from
<linux/kernel.h> to <linux/limits.h>. This broke compilation due to
header inclusion order against the local header include/spl/sys/types.h
which also defines ULLONG_MAX and LLONG_MAX if undefined.

It looks like local ULLONG_MAX and LLONG_MAX were never needed
(or after spl integration ?) as <linux/kernel.h> has had the same
definitions since an upstream commit
111ebb6e6f7bd7de6d722c5848e95621f43700d9 in 2.6.18, so drop them.

--
linux/include/linux/limits.h:17: error: "LLONG_MAX" redefined [-Werror]
 #define LLONG_MAX ((long long)(~0ULL >> 1))
zfs/include/spl/sys/types.h:35: note: this is the location of the previous definition
 #define LLONG_MAX  ((long long)(~0ULL>>1))

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8714
2019-05-07 09:55:40 -07:00
Tomohiro Kusumi
de3e0b914b Linux 5.0 compat: Use totalhigh_pages()
Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839
"mm: convert totalram_pages and totalhigh_pages variables to atomic"

replaced `totalhigh_pages` with an inline function `totalhigh_pages()`.
This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages`
on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8677
Closes #8701
2019-05-04 16:40:48 -07:00
Tom Caputi
fa24166074 Add feature check for 'zpool resilver' command
The 'zpool resilver' command requires that the resilver_defer
feature is active on the pool. Unfortunately, the check for
this was left out of the original patch. This commit simply
corrects this so that the command properly returns an error
in this case.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8700
2019-05-02 16:42:31 -07:00
Matthew Ahrens
6bdefad311 Remove incorrect (and inappropriate) comment in dprintf_dnode
This comment seems to misunderstand the ## preprocessor token, which
does token concatenation.  It is not needed here, since we are
concatenating string literals, which is performed by putting the
literals next to each other.

Additionally, the comment uses offensive language.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8698
Closes #8699
2019-05-01 17:32:54 -07:00
TerraTech
50478c6dad Add option [-V|--version] to emit version string
Add the 'zfs version' and 'zpool version' subcommands to display
the version of the user space utilities and loaded zfs kernel
module.  For example:

$ zfs version
zfs-0.8.0-rc3_169_g67e0366b88
zfs-kmod-0.8.0-rc3_169_g67e0366b88

The '-V' and '--version' aliases were added to support the
common convention of using 'zfs --version` to obtain the version
information.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: TerraTech <1118433+TerraTech@users.noreply.github.com>
Closes #2501
Closes #8567
2019-04-16 12:24:06 -07:00
Richard Laager
8dda07b33c Reference zfeature.c in a SPA_VERSION comment
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:19 -07:00
Richard Laager
7698c4eca9 Remove zfs.h comments about GRUB
Nobody is going to be bumping SPA_VERSION again, as OpenZFS has moved on
to feature flags.  Also, there is no requirement to keep GRUB
up-to-date, nor has that been happening.

The ZPL_VERSION could be bumped, but that would likely be handled in a
similar way, by adding filesystem feature flags.  In any event, we do
not need this comment, and we certainly don't need a reference to the
GRUB 0.97 source code in a Solaris tree.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:14 -07:00
Tomohiro Kusumi
96e51d2773 Sync reserved Illumos ioctl comment with actual number
It's 81 now.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8598
2019-04-14 11:12:07 -07:00
Tomohiro Kusumi
9a65234c8b Unbreak build on Linux kernel < 3.10
d12614521a("Fixes for procfs files backed by linked lists")
uses PDE_DATA(), but since PDE_DATA() (public interface which
replaced old public interface PDE()) first appeared in upstream
kernel 3.10, it lacks visible local definition for kernel < 3.10.

Move the local PDE_DATA() definition to a ZoL header, to unbreak
build on kernel < 3.10.

--
module/spl/spl-procfs-list.c: In function 'procfs_list_open':
module/spl/spl-procfs-list.c:166: error: implicit declaration of function 'PDE_DATA'
module/spl/spl-procfs-list.c:166: warning: assignment makes pointer from integer without a cast

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8599
2019-04-08 14:59:24 -07:00
Sara Hartse
a887d653b3 Restrict kstats and print real pointers
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8467 
Closes #8476
2019-04-04 18:57:06 -07:00
Brian Behlendorf
1b939560be
Add TRIM support
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
2019-03-29 09:13:20 -07:00
George Wilson
2efea7c82c ZFS Reads may result in unneccesary calls to zil_commit
ZFS supports O_RSYNC for read operations and when specified will ensure
the same level of data integrity that O_DSYNC and O_SYNC provides for
writes. O_RSYNC by itself has no effect so it must be combined with
either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC
and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect
and causes unnecessary calls to zil_commit. Only platforms which
support O_RSYNC should implement the zil_commit functionality in the
read code path.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8523
2019-03-22 13:09:11 -07:00
Olaf Faaland
060f0226e6 MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 12:47:57 -07:00
Tom Caputi
ab7615d92c Multiple DVA Scrubbing Fix
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.

This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.

To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8453
2019-03-15 14:14:31 -07:00
Tom Caputi
f00ab3f22c Detect and prevent mixed raw and non-raw sends
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.

This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.

This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8308
2019-03-13 11:00:43 -07:00
Tom Caputi
579ce7c5ae Add bookmark v2 on-disk feature
This patch adds the bookmark v2 feature to the on-disk format. This
feature will be needed for the upcoming redacted sends and for an
upcoming fix that for raw receives. The feature is not currently
used by any code and thus this change is a no-op, aside from the
fact that the user can now enable the feature.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8308
2019-03-13 10:58:39 -07:00
Tom Caputi
369aa501d1 Fix handling of maxblkid for raw sends
Currently, the receive code can create an unreadable dataset from
a correct raw send stream. This is because it is currently
impossible to set maxblkid to a lower value without freeing the
associated object. This means truncating files on the send side
to a non-0 size could result in corruption. This patch solves this
issue by adding a new 'force' flag to dnode_new_blkid() which will
allow the raw receive code to force the DMU to accept the provided
maxblkid even if it is a lower value than the existing one.

For testing purposes the send_encrypted_files.ksh test has been
extended to include a variety of truncated files and multiple
snapshots. It also now leverages the xattrtest command to help
ensure raw receives correctly handle xattrs.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8168 
Closes #8487
2019-03-13 10:52:01 -07:00
Olaf Faaland
db2af93d72 Increase default zfs_multihost_fail_intervals and import_intervals
By default, when multihost is enabled for a pool, the pool is
suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms
pass without a successful MMP write.  This is the recommended
configuration.

The default value for zfs_multihost_fail_intervals has been 5, and the
default value for zfs_multihost_interval has been 1000, so pool
suspension occurred at 5 seconds.

There have been multiple cases where a single misbehaving device in a
pool triggered a SCSI reset, and all I/O paused for 5-6 seconds.  This
in turn caused MMP to suspend the pool.

In the cases observed, the rest of the devices were healthy and the
pool was otherwise correctly performing I/O.  The reset was handled
correctly by ZFS, and by suspending the pool MMP made replacing the
device more difficult as well as forcing the host to be rebooted.

Increase the default value of zfs_multihost_fail_intervals to 10, so
that MMP tolerates up to 10 seconds of failed MMP writes before
suspending the pool.

Increase the default value of zfs_multihost_import_intervals to 20, to
maintain the 2:1 safety factor.  This results in a force import taking
approximately 20 seconds when MMP is enabled, with default values.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7709 
Closes #8495
2019-03-13 09:50:48 -07:00
Alek P
4c0883fb4a Avoid retrieving unused snapshot props
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8077
2019-03-12 13:13:22 -07:00
Olaf Faaland
3d31aad83e MMP writes rotate over leaves
Instead of choosing a leaf vdev quasi-randomly, by starting at the root
vdev and randomly choosing children, rotate over leaves to issue MMP
writes.  This fixes an issue in a pool whose top-level vdevs have
different numbers of leaves.

The issue is that the frequency at which individual leaves are chosen
for MMP writes is based not on the total number of leaves but based on
how many siblings the leaves have.

For example, in a pool like this:

       root-vdev
   +------+---------------+
vdev1                   vdev2
  |                       |
  |                +------+-----+-----+----+
disk1             disk2 disk3 disk4 disk5 disk6

vdev1 and vdev2 will each be chosen 50% of the time.  Every time vdev1
is chosen, disk1 will be chosen.  However, every time vdev2 is chosen,
disk2 is chosen 20% of the time.  As a result, disk1 will be sent 5x as
many MMP writes as disk2.

This may create wear issues in the case of SSDs.  It also reduces the
effectiveness of MMP as it depends on the writes being evenly
distributed for the case where some devices fail or are partitioned.

The new code maintains a list of leaf vdevs in the pool.  MMP records
the last leaf used for an MMP write in mmp->mmp_last_leaf.  To choose
the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list,
continuing from the head if the tail is reached.  It stops when a
suitable leaf is found or all leaves have been examined.

Added a test to verify MMP write distribution is even.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7953
2019-03-12 10:37:06 -07:00
Lorenz Brun
bf90948daf Reorder ZFS ioctls to fix cross-version compatibility
Reorder ZFS ioctls to fix cross-version compatibility.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Lorenz Brun <lorenz@dolansoft.org>
Closes #8484
2019-03-09 13:39:31 -08:00
Tony Hutter
becdcec7b9 kernel_fpu fixes
This patch fixes a few issues when detecting which kernel_fpu functions
are available.

- Use kernel_fpu_begin() if it's exported on newer kernels.

- Use ZFS_LINUX_TRY_COMPILE_SYMBOL() to choose the right kernel_fpu
  function when using --enable-linux-builtin.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8259
Closes #8363
2019-03-06 16:03:03 -08:00
Paul Zuchowski
a73e8fdb93 Stack overflow in recursive bpobj_iterate_impl
The function bpobj_iterate_impl overflows the stack when bpobjs
are deeply nested. Rewrite the function to eliminate the recursion.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7674
Closes #7675 
Closes #7908
2019-03-06 09:50:55 -08:00
lidongyang
8d9e51c084 Fix dnode_hold_impl() soft lockup
Soft lockups could happen when multiple threads trying
to get zrl on the same dnode handle in order to allocate
and initialize the dnode marked as DN_SLOT_ALLOCATED.

Don't loop from beginning when we can't get zrl, otherwise
we would increase the zrl refcount and nobody can actually
lock it.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Closes #8433
2019-02-22 09:48:37 -08:00
Serapheim Dimitropoulos
928e8ad47d Introduce auxiliary metaslab histograms
This patch introduces 3 new histograms per metaslab. These
histograms track segments that have made it to the metaslab's
space map histogram (and are part of the spacemap) but have
not yet reached the ms_allocatable tree on loaded metaslab's
because these metaslab's are currently syncing and haven't
gone through metaslab_sync_done() yet.

The histograms help when we decide whether to load an unloaded
metaslab in-order to allocate from it. When calculating the
weight of an unloaded metaslab traditionally, we look at the
highest bucket of its spacemap's histogram.  The problem is
that we are not guaranteed to be able to allocated that
segment when we load the metaslab because it may still be at
the freeing, freed, or defer trees. The new histograms are
used when we try to calculate an unloaded metaslab's weight
to deal with this issue by removing segments that have would
not be in the allocatable tree at runtime. Note, that this
method of dealing with this is not completely accurate as
adjacent segments are not always consolidated in the space
map histogram of a metaslab.

In addition and to make things deterministic, we always reset
the weight of unloaded metaslabs based on their space map
weight (instead of doing that on a need basis). Thus, every
time a metaslab is loaded and its weight is reset again (from
the weight based on its space map to the one based on its
allocatable range tree) we expect (and assert) that this
change in weight can only get better if it doesn't stay the
same.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8358
2019-02-20 09:59:56 -08:00
Paul Zuchowski
9c5e88b1de zfs should optionally send holds
Add -h switch to zfs send command to send dataset holds. If
holds are present in the stream, zfs receive will create them
on the target dataset, unless the zfs receive -h option is used
to skip receive of holds.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7513
2019-02-15 12:41:38 -08:00
Tony Hutter
e73ab1b38c Linux 4.20 compat: Fix VERIFY(RW_READ_HELD(&hash->mh_contents))
The 4.20 kernel changed the meaning of the rw_semaphore.owner bits,
causing an assertion when loading the module under the 4.20 kernel.
This patch fixes the issue.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8360 
Closes #8389
2019-02-15 12:37:20 -08:00
Alek P
dcec0a12c8 port async unlinked drain from illumos-nexenta
This patch is an async implementation of the existing sync
zfs_unlinked_drain() function. This function is called at mount time and
is responsible for freeing znodes that we didn't get to freeing before.
We don't have to hold mounting of the dataset until the unlinked list is
fully drained as is done now. Since we can process the unlinked set
asynchronously this results in a better user experience when mounting a
dataset with entries in the unlinked set.

Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8142
2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos
425d3237ee Get rid of space_map_update() for ms_synced_length
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
furthermore, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).

This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.

As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8328
2019-02-12 10:38:11 -08:00
loli10K
d8d418ff0c ZVOLs should not be allowed to have children
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
2019-02-08 15:44:15 -08:00
Tony Hutter
0c593296e9 Linux 5.0 compat: Disable vector instructions on 5.0+ kernels
The 5.0 kernel no longer exports the functions we need to do vector
(SSE/SSE2/SSE3/AVX...) instructions.  Disable vector-based checksum
algorithms when building against those kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8259
2019-01-28 10:11:45 -08:00
Tony Hutter
031cea17a3 Linux 5.0 compat: Use totalram_pages()
totalram_pages() was converted to an atomic variable in 5.0:

https://patchwork.kernel.org/patch/10652795/

Its value should now be read though the totalram_pages() helper
function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8263
2019-01-28 10:11:14 -08:00
Tony Hutter
77e50c3070 Linux 5.0 compat: access_ok() drops 'type' parameter
access_ok no longer needs a 'type' parameter in the 5.0 kernel.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8261
2019-01-28 10:11:10 -08:00
Tony Hutter
5cb46f6a66 Linux 4.18 compat: Use ktime_get_coarse_real_ts64()
Newer kernels remove current_kernel_time64().  Use
ktime_get_coarse_real_ts64() in its place.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8258
2019-01-28 10:11:03 -08:00
Serapheim Dimitropoulos
df72b8bebe Rename range_tree_verify to range_tree_verify_not_present
The range_tree_verify function looks for a segment in a
range tree and panics if the segment is present on the
tree. This patch gives the function a more descriptive
name.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8327
2019-01-25 09:51:24 -08:00
Serapheim Dimitropoulos
b194fab0fb Factor metaslab_load_wait() in metaslab_load()
Most callers that need to operate on a loaded metaslab, always
call metaslab_load_wait() before loading the metaslab just in
case someone else is already doing the work.

Factoring metaslab_load_wait() within metaslab_load() makes the
later more robust, as callers won't have to do the load-wait
check explicitly every time they need to load a metaslab.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8290
2019-01-18 11:10:32 -08:00
Serapheim Dimitropoulos
1a759200e5 Document guidelines for usage of zfs_dbgmsg
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8299
2019-01-18 10:16:56 -08:00
Tom Caputi
305781da4b Fix error handling incallers of dbuf_hold_level()
Currently, the functions dbuf_prefetch_indirect_done() and
dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot
fail. In the event of an error the former will cause a NULL pointer
dereference and the later will trigger a VERIFY. This patch adds
error handling to these functions and their callers where necessary.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8291
2019-01-17 15:47:08 -08:00
Serapheim Dimitropoulos
75058f3303 Remove unused vdev_t fields
The following fields from the vdev_t struct are not used anywhere.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8285
2019-01-17 15:41:12 -08:00
Serapheim Dimitropoulos
61c3391acc Serialize ZTHR operations to eliminate races
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8229
2019-01-13 10:09:46 -08:00
Brian Behlendorf
6955b40138
Provide more flexible object allocation interface
Object allocation performance can be improved for complex operations
by providing an interface which returns the newly allocated dnode.
This allows the caller to immediately use the dnode without incurring
the expense of looking up the dnode by object number.

The functions dmu_object_alloc_hold(), zap_create_hold(), and
dmu_bonus_hold_by_dnode() were added for this purpose.

The zap_create_* functions have been updated to take advantage of
this new functionality.  The dmu_bonus_hold_impl() function should
really have never been included in sys/dmu.h and was removed.
It's sole caller was converted to use dmu_bonus_hold_by_dnode().

The new symbols have been exported for use by Lustre.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8015
2019-01-10 14:37:43 -08:00
George Wilson
c10d37dd9f zfs initialize performance enhancements
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the intialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.

This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist.  The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s.  This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.

Updated the innvl and outnvl comments to match the formatting used
for all other new sytle ioctls.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8230
2019-01-07 11:03:08 -08:00
George Wilson
619f097693 OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========

The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting a new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.

SOLUTION
=========

This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
        - new subcommand:zpool initialize [-cs] <pool> [<vdev> ...]
                - start, suspend, or cancel initialization
        - Creates new open-context thread for each vdev
        - Thread iterates through all metaslabs in this vdev
        - Each metaslab:
                - select a metaslab
                - load the metaslab
                - mark the metaslab as being zeroed
                - walk all free ranges within that metaslab and translate
                  them to ranges on the leaf vdev
                - issue a "zeroing" I/O on the leaf vdev that corresponds to
                  a free range on the metaslab we're working on
                - continue until all free ranges for this metaslab have been
                  "zeroed"
                - reset/unmark the metaslab being zeroed
                - if more metaslabs exist, then repeat above tasks.
                - if no more metaslabs, then we're done.

        - progress for the initialization is stored on-disk in the vdev’s
          leaf zap object. The following information is stored:
                - the last offset that has been initialized
                - the state of the initialization process (i.e. active,
                  suspended, or canceled)
                - the start time for the initialization

        - progress is reported via the zpool status command and shows
          information for each of the vdevs that are initializing

Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
  written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2019-01-07 10:37:26 -08:00
bunder2015
5365b0747a Add missing MMP status code to libzfs_status
When MMP was merged the status codes in libzfs_status were not
updated to add the status code for ZPOOL_STATUS_IO_FAILURE_MMP.  This
commit corrects this and adds comments to help keep track of which
code is used for which status.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #8148
Closes #8222
2019-01-03 12:15:46 -08:00
Brad Lewis
3ec34e5527 OpenZFS 9284 - arc_reclaim_thread has 2 jobs
Following the fix for 9018 (Replace kmem_cache_reap_now() with
kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks
while reaping.  However, the code is still confusing and error-prone,
because this thread has two responsibilities.  We should instead
separate this into two threads each with their own responsibility:

 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
    improves `arc_is_overflowing()`

 2. keep enough free memory in the system, by calling
    `arc_kmem_reap_now()` plus `arc_shrink()`, which improves
    `arc_available_memory()`.

Furthermore, we can use the zthr infrastructure to separate the
"should we do something" from "do it" parts of the logic, and
normalize the start up / shut down of the threads.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by:  Brad Lewis <brad.lewis@delphix.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9284
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9
Closes #8165
2018-12-26 13:22:28 -08:00
Andriy Gapon
dc1c630b8a OpenZFS 9630 - add lzc_rename and lzc_destroy to libzfs_core
Porting Notes:
* Additional changes to recv_rename_impl() were required due to
  encryption code not being merged in OpenZFS yet.
* libzfs_core python bindings (pyzfs) were updated to fully support
  both lzc_rename() and lzc_destroy()

Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: loli10K <ezomori.nozomu@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/9630
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/049ba63
Closes #8207
2018-12-14 09:49:45 -08:00
Prakash Surya
900d09b285 OpenZFS 9962 - zil_commit should omit cache thrash
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.

Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).

Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intenion is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.

The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).

For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.

Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to acheive
the best possible performance out of the pool's underlying storage.

This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).

Thus, in our prior example with 10 lwbs, it's only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).

Porting Notes:
* Reconciled conflicts with the fastwrite feature.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6
Closes #8188
2018-12-07 11:09:42 -08:00
Sebastien Roy
a10d50f999 OpenZFS 8115 - parallel zfs mount
Porting Notes:
* Use thread pools (tpool) API instead of introducing taskq interfaces
  to libzfs.
* Use pthread_mutext for locks as mutex_t isn't available.
* Ignore alternative libshare initialization since OpenZFS-7955 is
  not present on zfsonlinux.

Authored by: Sebastien Roy <seb@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Authored by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8115
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3f0e2b569
Closes #8092
2018-11-15 11:33:58 -08:00
loli10K
d48091de81 zed: detect and offline physically removed devices
This commit adds a new test case to the ZFS Test Suite to verify ZED
can detect when a device is physically removed from a running system:
the device will be offlined if a spare is not available in the pool.

We implement this by using the existing libudev functionality and
without relying solely on the FM kernel module capabilities which have
been observed to be unreliable with some kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1537
Closes #7926
2018-11-09 11:17:24 -08:00
Tony Hutter
ad796b8a3b Add zpool status -s (slow I/Os) and -p (parseable)
This patch adds a new slow I/Os (-s) column to zpool status to show the
number of VDEV slow I/Os. This is the number of I/Os that didn't
complete in zio_slow_io_ms milliseconds. It also adds a new parsable
(-p) flag to display exact values.

 	NAME         STATE     READ WRITE CKSUM  SLOW
 	testpool     ONLINE       0     0     0     -
	  mirror-0   ONLINE       0     0     0     -
 	    loop0    ONLINE       0     0     0    20
 	    loop1    ONLINE       0     0     0     0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7756
Closes #6885
2018-11-08 16:47:24 -08:00
Don Brady
95692927f2 Fix libudev dependency in libzutil
ZFS should be able to build without libudev installed. The recent
change for libzutil inadvertently broke that.  Make the libudev code
conditional in zutil_import.c to resolve the build failure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8097
2018-11-06 17:47:52 -08:00
Don Brady
e89f1295d4 Add libzutil for libzfs or libzpool consumers
Adds a libzutil for utility functions that are common to libzfs and
libzpool consumers (most of what was in libzfs_import.c).  This
removes the need for utilities to link against both libzpool and
libzfs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8050
2018-11-05 11:22:33 -08:00
Brian Behlendorf
82c0a050fc
Linux 4.20 compat: current_kernel_time()
Commit torvalds/linux@976516404 removed the current_kernel_time()
function (and several others).  All callers are expected to use
current_kernel_time64().  Update the gethrestime_sec() wrapper
accordingly.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8074
2018-10-31 11:50:42 -05:00
Brian Behlendorf
b3d7725c94
Remove zfs_gitrev.h
This generated file was accidentally included in previous commit,
80a91e7, and should not be included in the repository.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8054
2018-10-24 14:48:14 -07:00
Tom Caputi
ab4c009e3d Fix dbgmsg printing in ztest and zdb
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:50 -07:00
Matthew Ahrens
d637db98e1 OpenZFS 9681 - ztest failure in spa_history_log_internal due to spa_rename()
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9681
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6aee0ad7
Closes #8041
2018-10-19 12:02:28 -07:00
Tom Caputi
80a91e7469 Defer new resilvers until the current one ends
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.

This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee 
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7732
2018-10-18 21:06:18 -07:00
Allan Jude
9f438c5f94 OpenZFS 9862 - fix typo in comment in vdev_impl.h
Authored by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tony Hutter <hutter2@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/9862
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/84927f52
Closes #8036
2018-10-18 15:09:27 -07:00
Matthew Ahrens
49394a7708 Linux does not HAVE_SMB_SHARE
Since Linux does not have an in-kernel SMB server, we don't need the
code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8032
2018-10-17 10:31:38 -07:00
Paul Dagnelie
d52d80b700 Add types to featureflags in zfs
The boolean featureflags in use thus far in ZFS are extremely useful,
but because they take advantage of the zap layer, more interesting data
than just a true/false value can be stored in a featureflag. In redacted
send/receive, this is used to store the list of redaction snapshots for
a redacted dataset.

This change adds the ability for ZFS to store types other than a boolean
in a featureflag. The only other implemented type is a uint64_t array.
It also modifies the interfaces around dataset features to accomodate
the new capabilities, and adds a few new functions to increase
encapsulation.

This functionality will be used by the Redacted Send/Receive feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7981
2018-10-16 11:15:04 -07:00
ilbsmart
779a6c0bf6 deadlock between mm_sem and tx assign in zfs_write() and page fault
The bug time sequence:
1. thread #1, `zfs_write` assign a txg "n".
2. In a same process, thread #2, mmap page fault (which means the
   `mm_sem` is hold) occurred, `zfs_dirty_inode` open a txg failed,
   and wait previous txg "n" completed.
3. thread #1 call `uiomove` to write, however page fault is occurred
   in `uiomove`, which means it need `mm_sem`, but `mm_sem` is hold by
   thread #2, so it stuck and can't complete,  then txg "n" will
   not complete.

So thread #1 and thread #2 are deadlocked.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes #7939
2018-10-16 11:11:24 -07:00
Brian Behlendorf
27f80e85c2 Improved error handling for extreme rewinds
The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and
vdev_obsolete_counts_are_precise() functions assume that the
only way a zap_lookup() can fail is if the requested entry is
missing.  While this is the most common cause, it's not the only
cause.  Attemping to access a damaged ZAP will result in other
errors.

The most likely scenario for accessing a damaged ZAP is during
an extreme rewind pool import.  Under these conditions the pool
is expected to contain damaged objects and the import code was
updated to handle this gracefully.  Getting an ECKSUM error from
these ZAPs after the pool in import a far less likely, therefore
the behavior for call paths was not modified.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7921
2018-10-12 11:24:04 -07:00
Tony Hutter
3c94dd7b7b Define timestruc_t for Lustre compatibility
Lustre 2.8 (and possibly other versions) are still using timestruc_t,
which was removed in spl-0.7.10 in favor of inode_timespec_t.  Add
in a backwards compatibility #define for timestruc_t so that Lustre
builds.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8014
2018-10-12 11:13:34 -07:00
Matt Ahrens
5d43cc9a59 OpenZFS 9689 - zfs range lock code should not be zpl-specific
The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific
data structures, specifically znode_t.  However, it's also used by
the ZVOL code, which uses a "dummy" znode_t to pass to the range
locking code.

We should clean this up so that the range locking code is generic
and can be used equally by ZPL and ZVOL, and also can be used by
future consumers that may need to run in userland (libzpool) as
well as the kernel.

Porting notes:
* Added missing sys/avl.h include to sys/zfs_rlock.h.
* Removed 'dbuf is within the locked range' ASSERTs from dmu_sync().
  This was needed because ztest does not yet use a locked_range_t.
* Removed "Approved by:" tag requirement from OpenZFS commit
  check to prevent needless warnings when integrating changes
  which has not been merged to illumos.
* Reverted free_list range lock changes which were originally
  needed to defer the cv_destroy() which was called immediately
  after cv_broadcast().  With d2733258 this should be safe but
  if not we may need to reintroduce this logic.
* Reverts: The following two commits were reverted and squashed in
  to this change in order to make it easier to apply OpenZFS 9689.
  - d88895a0, which removed the dummy znode from zvol_state
  - e3a07cd0, which updated ztest to use range locks
* Preserved optimized rangelock comparison function.  Preserved the
  rangelock free list.  The cv_destroy() function will block waiting
  for all processes in cv_wait() to be scheduled and drop their
  reference.  This is done to ensure it's safe to free the condition
  variable.  However, blocking while holding the rl->rl_lock mutex
  can result in a deadlock on Linux.  A free list is introduced to
  defer the cv_destroy() and kmem_free() until after the mutex is
  released.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9689
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680
External-issue: DLPX-58662
Closes #7980
2018-10-11 10:19:33 -07:00
Alek P
50a343d85c Fix changelist mounted-dataset iteration
Commit 0c6d093 caused a regression in the inherit codepath.
The fix is to restrict the changelist iteration on mountpoints and
add proper handling for 'legacy' mountpoints

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7988 
Closes #7991
2018-10-10 21:13:13 -07:00
Paul Dagnelie
0391690583 Refactor dmu_recv into its own file
This change moves the bottom half of dmu_send.c (where the receive
logic is kept) into a new file, dmu_recv.c, and does similarly
for receive-related changes in header files.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7982
2018-10-09 14:05:13 -07:00
Brian Behlendorf
d7e4b30a67 Add zfs_refcount_transfer_ownership_many()
When debugging is enabled and a zfs_refcount_t contains multiple holders
using the same key, but different ref_counts, the wrong reference_t may
be transferred.  Add a zfs_refcount_transfer_ownership_many() function,
like the existing zfs_refcount_*_many() functions, to match and transfer
the correct refcount_t;

This issue may occur when using encryption with refcount debugging
enabled.  An arc_buf_hdr_t can have references for both the
hdr->b_l1hdr.b_pabd and hdr->b_crypt_hdr.b_rabd both of which use
the hdr as the reference holder.  When unsharing the buffer the
p_abd should be transferred.

This issue does not impact production builds because refcount holders
are not tracked.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7219
Closes #8000
2018-10-09 10:05:48 -07:00
Matthew Ahrens
4cbde2ecbf Create /proc/sys/kernel/spl/gitrev with git hash
The existing mechanisms for determining what code is running in the
kernel do not always correctly report the git hash.  The versions
reported there do not reflect changes made since `configure` was run
(i.e. incremental builds do not update the version) and they are
misleading if git tags are not set up properly.  This applies to
`modinfo zfs`, `dmesg`, and `/sys/module/zfs/version`.

There are complicated requirements on how the existing version is
generated.  Therefore we are leaving that alone, and adding a new
mechanism to record and retrieve the git hash:
`cat /proc/sys/kernel/spl/gitrev`

The gitrev is re-generated at compile time, when running `make`
(including for incremental builds).  The value is the output of `git
describe` (or "unknown" if not in a git repo or there are uncommitted
changes).

We're also removing /proc/sys/kernel/spl/version, which was never very
useful.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7931 
Closes #7965
2018-10-08 21:57:02 -07:00
Matthew Ahrens
dfbe267503 OpenZFS 9617 - too-frequent TXG sync causes excessive write inflation
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
  changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba
Closes #7976
2018-10-04 13:13:28 -07:00
Paul Dagnelie
6e8b268875 Fix ASSERT macros to not over-expand
The code reuse in the definitions of the ASSERT and VERIFY macros result
in expansion of their arguments before they are stringified, which
produces ugly and undesirable output.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7884
2018-10-03 20:16:45 -07:00
Paul Dagnelie
95542372e6 Add new fnvlist_lookup_* functions
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7977
2018-10-03 15:30:55 -07:00
Tom Caputi
52ce99dd61 Refcounted DSL Crypto Key Mappings
Since native ZFS encryption was merged, we have been fighting
against a series of bugs that come down to the same problem: Key
mappings (which must be present during all I/O operations) are
created and destroyed based on dataset ownership, but I/Os can
have traditionally been allowed to "leak" into the next txg after
the dataset is disowned.

In the past we have attempted to solve this problem by trying to
ensure that datasets are disowned ater all I/O is finished by
calling txg_wait_synced(), but we have repeatedly found edge cases
that need to be squashed and code paths that might incur a high
number of txg syncs. This patch attempts to resolve this issue
differently, by adding a reference to the key mapping for each txg
it is dirtied in. By doing so, we can remove many of the
unnecessary calls to txg_wait_synced() we have added in the past
and ensure we don't need to deal with this problem in the future.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7949
2018-10-03 09:47:11 -07:00
Alek P
0c6d09361d changelist should be able to iter on mounts
Modified changelist_gather()ing for the mountpoint property.
Now instead of iterating on all dataset descendants, we read
/proc/self/mounts and iterate on the mounted descendant datasets only.

Switched changelist implementation from a uu_list_* to uu_avl_* in
order to  reduce changlist code-path's worst case time complexity.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7967
2018-10-02 12:30:58 -07:00
Tim Schumacher
424fd7c3e0 Prefix all refcount functions with zfs_
Recent changes in the Linux kernel made it necessary to prefix
the refcount_add() function with zfs_ due to a name collision.

To bring the other functions in line with that and to avoid future
collisions, prefix the other refcount functions as well.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7963
2018-10-01 10:42:05 -07:00
Matthew Ahrens
fc23d59fa0 Remove duplicate macro in dsl_dir.h
The DD_FIELD_LAST_REMAP_TXG macro was added twice (with the same value).
This change removes one of them.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7968
2018-10-01 10:40:11 -07:00
John Gallagher
d12614521a Fixes for procfs files backed by linked lists
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):

* We don't account for the fact that seq_file sometimes visits a node
  multiple times, which results in missing messages when read through
  procfs.
* We don't keep separate state for each reader of a file, so concurrent
  readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
  the list between read syscalls, so reading from these files in procfs
  can cause the system to crash.

This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes #7819
2018-09-26 11:08:12 -07:00
Tim Schumacher
c13060e478 Linux 4.19-rc3+ compat: Remove refcount_t compat
torvalds/linux@59b57717f ("blkcg: delay blkg destruction until
after writeback has finished") added a refcount_t to the blkcg
structure. Due to the refcount_t compatibility code, zfs_refcount_t
was used by mistake.

Resolve this by removing the compatibility code and replacing the
occurrences of refcount_t with zfs_refcount_t.

Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7885 
Closes #7932
2018-09-26 10:29:26 -07:00
Brian Behlendorf
e897a23eb1
Fix statfs(2) for 32-bit user space
When handling a 32-bit statfs() system call the returned fields,
although 64-bit in the kernel, must be limited to 32-bits or an
EOVERFLOW error will be returned.

This is less of an issue for block counts since the default
reported block size in 128KiB. But since it is possible to
set a smaller block size, these values will be scaled as
needed to fit in a 32-bit unsigned long.

Unlike most other filesystems the total possible file counts
are more likely to overflow because they are calculated based
on the available free space in the pool. In order to prevent
this the reported value must be capped at 2^32-1. This is
only for statfs(2) reporting, there are no changes to the
internal ZFS limits.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7927 
Closes #7122 
Closes #7937
2018-09-24 17:11:25 -07:00
Don Brady
73a5ec30bf Fix in-kernel sysfs entries
The recent sysfs zfs properties feature breaks the in-kernel
builds of zfs (sans module).  When not built as a module add
the sysfs entries under /sys/fs/zfs/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7868 
Closes #7872
2018-09-06 21:44:52 -07:00
Don Brady
cc99f275a2 Pool allocation classes
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #5182
2018-09-05 18:33:36 -07:00
Don Brady
b83a0e2dc1 Add basic zfs ioc input nvpair validation
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).

Programmatically document, via a zfs_ioc_key_t, the valid arguments 
for the ioc commands that rely on nvpair input arguments (i.e. non 
legacy commands from libzfs_core). Automatically verify the expected 
pairs before dispatching a command.

This initial phase focuses on the non-legacy ioctls. A follow-on 
change can address the legacy ioctl input from the zfs_cmd_t.

The zfs_ioc_key_t for zfs_keys_channel_program looks like:

static const zfs_ioc_key_t zfs_keys_channel_program[] = {
       {"program",     DATA_TYPE_STRING,               0},
       {"arg",         DATA_TYPE_UNKNOWN,              0},
       {"sync",        DATA_TYPE_BOOLEAN_VALUE,        ZK_OPTIONAL},
       {"instrlimit",  DATA_TYPE_UINT64,               ZK_OPTIONAL},
       {"memlimit",    DATA_TYPE_UINT64,               ZK_OPTIONAL},
};

Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE, 
EBADF, and E2BIG).

ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7780
2018-09-02 12:14:01 -07:00
Don Brady
e8bcb693d6 Add zfs module feature and property info to sysfs
This extends our sysfs '/sys/module/zfs' entry to include feature 
and property attributes. The primary consumer of this information 
is user processes, like the zfs CLI, that need to know what the 
current loaded ZFS module supports. The libzfs binary will consult 
this information when instantiating the zfs and zpool property 
tables and the pool features table.

This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
  features.runtime
  features.pool
  properties.dataset
  properties.pool

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7706
2018-09-02 12:09:53 -07:00
Tom Caputi
47ab01a18f Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is comletely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
2018-08-27 10:16:28 -07:00
Tom Caputi
8c4fb36a24 Small rework of txg_list code
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7795
2018-08-27 10:16:01 -07:00
Serapheim Dimitropoulos
a448a2557e Introduce read/write kstats per dataset
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #7705
2018-08-20 09:52:37 -07:00
Tom Caputi
1fff937a4c Check encrypted dataset + embedded recv earlier
This patch fixes a bug where attempting to receive a send stream
with embedded data into an encrypted dataset would not cleanup
that dataset when the error was reached. The check was moved into
dmu_recv_begin_check(), preventing this issue.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:49:19 -07:00
Tom Caputi
d9c460a0b6 Added encryption support for zfs recv -o / -x
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:

* Receiving an unencrypted stream as encrypted without
  needing to create a "dummy" encrypted parent so that
  encryption can be inheritted.

* Allowing users to change their keylocation on receive,
  so long as the receiving dataset is an encryption root.

* Allowing users to explicitly exclude or override the
  encryption property from an unencrypted properties stream,
  allowing it to be received as encrypted.

* Receiving a recursive heirarchy of unencrypted datasets,
  encrypting the top-level one and forcing all children to
  inherit the encryption.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:48:49 -07:00
Matthew Ahrens
62840030a7 Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching.  We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736
2018-08-02 15:51:45 -07:00
John Gallagher
499b5497cb Add missing checks to zpl_xattr_* functions
Linux specific zpl_* entry points, such as xattrs, must include
the same unmounted and sa handle checks as the common zfs_ entry
points. The additional ZPL_* wrappers are identical to their
ZFS_ counterparts except the errno is negated since they are
expected to be used at the zpl_ layer.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #5866 
Closes #7761
2018-08-02 14:03:56 -07:00
Nathan Lewis
010d12474c Add support for selecting encryption backend
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl)
  that control the crypto implementation.  At the moment there is a
  choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family
  15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated
  with as key schedules generated with various implementations
  are not necessarily interchangable.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Closes #7102 
Closes #7103
2018-08-02 11:59:24 -07:00
George Wilson
3d503a76e8 Fix OpenZFS 9337 mismerge
This change reintroduces logic required by OpenZFS 9577. When
OpenZFS 9337, zfs get all is slow due to uncached metadata, was
merged in it ended up removing logic required by OpenZFS 9577,
remove zfs_dbuf_evict_key, and inadvertently reintroduced the
bug that 9577 was designed to fix.

This change re-enables the "evicting" flag to dbuf_rele_and_unlock
and dnode_rele_and_unlock and updates all callers to provide the
correct parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #7758
2018-08-02 10:21:48 -07:00
Paul Dagnelie
492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and 8 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
Don Brady
dae3e9ea21 OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system
In the case of one pool being built on another pool, we want
to make sure we don't end up throttling the lower (backing)
pool when the upper pool is the majority contributor to dirty
data. To insure we make forward progress during throttling, we
also check the current pool's net dirty data and only throttle
if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
data in the cache.

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent,
  zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent
  were intentially not added as tunable module parameters.

OpenZFS-issue: https://illumos.org/issues/9465
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef
Closes #7749
2018-07-30 11:30:41 -07:00
Serapheim Dimitropoulos
6b64382b17 OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations
= Motivation

While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.

An example of the above scenario can be seen in the following
flamegraph, where more than have the time of the zfsdev_ioctl() is spent
on constructing nvlists.  Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg

Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.

= What this patch does

In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# number of entries) to
O(1)* (the start is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).

= Performance Analysis

To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.

== Results (before patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        5.3
user        0.4
sys         2.3

[Trial 5]
real        8.2
user        0.4
sys         2.4

[Trial 6]
real        6.0
user        0.5
sys         2.3
```

== Results (after patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        4.9
user        0.0
sys         0.9

[Trial 5]
real        3.8
user        0.0
sys         0.9

[Trial 6]
real        3.6
user        0.0
sys         0.9
```

== Analysis

The results between the trials are consistent so in this sections I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.

From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996
samples counts.  Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfs_dev_ioctl().

From trial-4 we see that the user time dropped to 0 secods. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).

Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
  zpool_do_events_nvprint().

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1
Closes #7748
2018-07-30 11:30:03 -07:00
Brian Behlendorf
11d0525cbb
Add rwsem_tryupgrade for 4.9.20-rt16 kernel
The RT rwsem implementation was changed to allow multiple readers
as of the 4.9.20-rt16 patch set.  This results in a build failure
because the existing implementation was forced to directly access
the rwsem structure which has changed.

While this could be accommodated by adding additional compatibility
code.  This patch resolves the build issue by simply assuming the
rwsem can never be upgraded.  This functionality is a performance
optimization and all callers must already handle this case.

Converting the last remaining use of __SPIN_LOCK_UNLOCKED to
spin_lock_init() was additionally required to get a clean build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7589
2018-07-30 09:22:30 -07:00
Matthew Ahrens
3a549dc7a1 OpenZFS 9442 - decrease indirect block size of spacemaps
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation.  Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
  and dmu_object_alloc_dnsize() functions to use a common shared
  dmu_object_alloc_impl() function.

OpenZFS-issue: https://www.illumos.org/issues/9442
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408b
Closes #7712
2018-07-25 14:11:35 -07:00
Feng Sun
750e1f88d3 Introduce kstat dmu_tx_dirty_frees_delay
It is helpful to tune zfs_per_txg_dirty_frees_percent for commit
539d33c7(OpenZFS 6569 - large file delete can starve out write ops).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #7718
2018-07-25 09:52:27 -07:00
Brian Behlendorf
d441e85dd7
Add support for autoexpand property
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure.  Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.

Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev.  The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid.  If
a match is found the ZED reopens the pool vdev.  This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read.  Otherwise, it wouldn't
be possible to report the maximum possible expansion size.

Finally, if the property autoexpand=on a vdev expansion will be
attempted.  After performing some sanity checks on the disk to
verify that it is safe to expand,  the primary partition (-part1)
will be expanded and the partition table updated.  The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.

In order to make all of the above possible the following changes
were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
  These tests now create a pool which is layered on a loopback,
  scsi_debug, and file vdev.  This allows for testing of non-
  partitioned block device (loopback), a partition block device
  (scsi_debug), and a file which does not receive udev change
  events.  This provided for better test coverage, and by removing
  the layering on ZFS volumes there issues surrounding layering
  one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
  This allows for matching by guid rather than path which is a
  more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result
  in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality which can be abandoned
  in favor of kernel provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being
  reopened.  This is important to prevent errors from occurring
  for any configuration related IOs which bypass the SCL_ZIO lock.
  The zpool_reopen_007_pos.ksh test case was added to verify IO
  error are never observed when reopening.  This is not expected
  to impact IO performance.

Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
  ZFS volumes.  This is as good as a unique physical path, while the
  volumes are not used in the test cases anymore for other reasons
  this improvement was included.

Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #120
Closes #2437
Closes #5771
Closes #7366
Closes #7582
Closes #7629
2018-07-23 15:40:15 -07:00
Matthew Ahrens
2e5dc449c1 OpenZFS 9337 - zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:

The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf.  Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.

Performance Testing:

 - Disabled kmem_flags
 - Tuned dbuf_cache_max_bytes very low (128K)
 - Tuned zfs_arc_max very low (64M)

Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.

Results:

    | Test                   | Time (trunk / diff) | I/Os (trunk / diff) |
    +------------------------+---------------------+---------------------+
    | zpool import           |     0:05 / 0:06     |    12.9k / 12.9k    |
    | zfs get all (uncached) |     1:36 / 0:53     |    16.7k / 5.7k     |
    | zfs get all (cached)   |     1:36 / 0:51     |    16.0k / 6.0k     |

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
2018-07-12 10:49:27 -07:00
Brian Behlendorf
ac09630d8b
Fix zpl_mount() deadlock
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock.  This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7598 
Closes #7659 
Closes #7691 
Closes #7693
2018-07-11 15:49:10 -07:00
Brian Behlendorf
33a19e0fd9
Fix kernel unaligned access on sparc64
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time.  Otherwise
fallback to using the sa_copy_data() function.

The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space.  In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7642 
Closes #7684
2018-07-11 13:10:40 -07:00
Serapheim Dimitropoulos
a7ed98d8b5 OpenZFS 9330 - stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.

The problem can be seen by attempting to create a dataset using the -p
option with many levels:

    panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff

    fffffffffbc3aa60 unix:die+100 ()
    fffffffffbc3ab70 unix:trap+157d ()
    ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
    ffffff00083d7050 zfs:dbuf_rele+2e ()
    ...
    ffffff00083d7080 zfs:dsl_dir_close+32 ()
    ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
    ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
    ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
    ffffff00083d7130 zfs:dbuf_rele+2e ()
    ... The block above repeats once per directory in the ...
    ... create -p command, working towards the root ...
    ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
    ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
    ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
    ffffff00083dbaa0 zfs:findfunc+23 ()
    ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
    ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
    ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
    ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
    ffffff00083dbd00 genunix:cdev_ioctl+45 ()
    ffffff00083dbd40 specfs:spec_ioctl+5a ()
    ffffff00083dbdc0 genunix:fop_ioctl+7b ()
    ffffff00083dbec0 genunix:ioctl+18e ()
    ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()

Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K.  Enough time has passed
  that 16K can be reasonably assumed to be the default value.  It
  was increased in the 3.15 kernel released in June of 2014.

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>

OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a
Closes #7681
2018-07-09 13:02:50 -07:00
Serapheim Dimitropoulos
4d044c4c1d OpenZFS 9238 - ZFS Spacemap Encoding V2
Motivation
==========

The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
    This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
    device removal, zpool checkpoint) we've started imposing limits in the
    vdevs that can be used with them based on the maximum addressable offset
    (currently 64PB for a top-level vdev).

New encoding
============

The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.

Other references:
=================

The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
2018-07-05 12:02:34 -07:00
Chunwei Chen
edf60b8645 Enforce PROP_ONETIME on zpool properties
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #7661
2018-06-28 14:49:17 -07:00
Serapheim Dimitropoulos
d2734cce68 OpenZFS 9166 - zfs storage pool checkpoint
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in big block comment of
spa_checkpoint.c

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames members of "struct metaslab trees to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defauls to 4KB. This is ok for metaslab
  space maps in terms of bandwirdth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_chedkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Porting notes:

* The part of dsl_scan_sync() which handles async destroys has
  been moved into the new dsl_process_async_destroys() function.

* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
  to block device backed pools.

* ZTS:
  * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

  * Don't use large dd block sizes on /dev/urandom under Linux in
    checkpoint_capacity.

  * Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
    SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
    its attempts to fill the pool

  * Create the base and nested pools with sync=disabled to speed up
    the "setup" phase.

  * Clear labels in test pool between checkpoint tests to avoid
    duplicate pool issues.

  * The import_rewind_device_replaced test has been marked as "known
    to fail" for the reasons listed in its DISCLAIMER.

  * New module parameters:

      zfs_spa_discard_memory_limit,
      zfs_remove_max_bytes_pause (not documented - debugging only)
      vdev_max_ms_count (formerly metaslabs_per_vdev)
      vdev_min_ms_count

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8
Closes #7570
2018-06-26 10:07:42 -07:00
Brian Behlendorf
1c38ac61e1
Linux 4.14 compat: blk_queue_stackable()
The blk_queue_stackable() function was replaced in the 4.14 kernel
by queue_is_rq_based(), commit torvalds/linux@5fdee212.  This change
resulted in the default elevator being used which can negatively
impact performance.

Rather than adding additional compatibility code to detect the
new interface unconditionally attempt to set the elevator.  Since
we expect this to fail for block devices without an elevator the
error message has been moved in to zfs_dbgmsg().

Finally, it was observed that the elevator_change() was removed
from the 4.12 kernel, commit torvalds/linux@c033269.  Update the
comment to clearly specify which are expected to export the
elevator_change() symbol.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7645
2018-06-19 21:52:45 -07:00
Brian Behlendorf
6413c95fbd
Linux 4.18 compat: inode timespec -> timespec64
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe.  As part of this change the current_time() function was
also updated to return the timespec64 type.

Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode.  It should
be used when working with inode timestamps to ensure matching types.

The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t.  Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.

Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other.  They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7643
2018-06-19 21:51:18 -07:00
Tom Caputi
cd32e5db8b Add ASSERT to debug encryption key mapping issues
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7637
2018-06-18 14:10:54 -07:00
John Gallagher
917f475fba Add tunables for channel programs
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes #7618
2018-06-15 15:10:42 -07:00
Brian Behlendorf
7b98f0d91f
Linux compat 4.18: check_disk_size_change()
Added support for the bops->check_events() interface which was
added in the 2.6.38 kernel to replace bops->media_changed().
Fully implementing this functionality allows the volume resize
code to rely on revalidate_disk(), which is the preferred
mechanism, and removes the need to use check_disk_size_change().

In order for bops->check_events() to lookup the zvol_state_t
stored in the disk->private_data the zvol_state_lock needs to
be held.  Since the check events interface may poll the mutex
has been converted to a rwlock for better concurrently.  The
rwlock need only be taken as a writer in the zvol_free() path
when disk->private_data is set to NULL.

The configure checks for the block_device_operations structure
were consolidated in a single kernel-block-device-operations.m4
file.

The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks
and assoicated dead code was removed.  This interface was added
to the 2.6.28 kernel which predates the oldest supported 2.6.32
kernel and will therefore always be available.

Updated maximum Linux version in META file.  The 4.17 kernel
was released on 2018-06-03 and ZoL is compatible with the
finalized kernel.

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7611
2018-06-15 15:05:21 -07:00
Allan Jude
29445fe3a0 Reserve DMU_BACKUP_FEATURE for ZSTD
Reserve bit 25 for the ZSTD compression feature from FreeBSD.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #7626
2018-06-14 09:47:26 -07:00
Matthew Ahrens
1fac63e56f OpenZFS 9577 - remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary -
we can instead pass a flag down in a few places to prevent recursive
dbuf eviction. Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing
   TSD values (by setting to NULL vs non-NULL) is expensive, and
   we do it very often.
3. According to Nexenta, the current semantics can cause a
   deadlock when concurrently calling dmu_objset_evict_dbufs()
   (which is rare today, but they are working on a "parallel
   unmount" change that triggers this more easily):

Porting Notes:
* Minor conflict with OpenZFS 9337 which has not yet been ported.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9577
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645
External-issue: DLPX-58547
Closes #7602
2018-06-13 11:05:06 -07:00
Tom Caputi
e7504d7a18 Raw receive functions must not decrypt data
This patch fixes a small bug found where receive_spill() sometimes
attempted to decrypt spill blocks when doing a raw receive. In
addition, this patch fixes another small issue in arc_buf_fill()'s
error handling where a decryption failure (which could be caused by
the first bug) would attempt to set the arc header's IO_ERROR flag
without holding the header's lock.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7564 
Closes #7584 
Closes #7592
2018-06-06 10:16:41 -07:00
Paul Dagnelie
37fb3e4318 OpenZFS 8484 - Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Paul Dagnelie <pcd@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7028a8b92b7
Issue #3752
Closes #7462
2018-06-06 09:35:59 -07:00
Tony Hutter
f0ed6c7448 Add pool state /proc entry, "SUSPENDED" pools
1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7331 
Closes #7563
2018-06-06 09:33:54 -07:00
Brian Behlendorf
2d9142c9d4
Remove rwlock wrappers
The only remaining consumer of the rwlock compatibility wrappers
is ztest.  Remove the wrappers and convert the few remaining
calls to the underlying pthread functions.

    rwlock_init()    -> pthread_rwlock_init()
    rwlock_destroy() -> pthread_rwlock_destroy()
    rw_rdlock()      -> pthread_rwlock_rdlock()
    rw_wrlock()      -> pthread_rwlock_wrlock()
    rw_unlock()      -> pthread_rwlock_unlock()

Note pthread_rwlock_init() defaults to PTHREAD_PROCESS_PRIVATE
which is equivilant to the USYNC_THREAD behavior.  There is no
functional change.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7591
2018-06-04 16:52:10 -07:00
Serapheim Dimitropoulos
e48afbc4eb OpenZFS 9464 - txg_kick() fails to see that we are quiescing
txg_kick() fails to see that we are quiescing, forcing transactions to
their next stages without leaving them accumulate changes

Creating a fragmented pool in a DCenter VM and continuously writing to it with
multiple instances of randwritecomp, we get the following output from txg.d:

    0ms   311MB in  4114ms (95% p1)  75MB/s  544MB (76%)  336us   153ms     0ms
    0ms     8MB in    51ms ( 0% p1) 163MB/s  474MB (66%)  129us    34ms     0ms
    0ms   366MB in  4454ms (93% p1)  82MB/s  572MB (79%)  498us    20ms     0ms
    0ms   406MB in  5212ms (95% p1)  77MB/s  591MB (82%)  661us    37ms     0ms
    0ms   340MB in  5110ms (94% p1)  66MB/s  622MB (86%) 1048us    41ms     1ms
    0ms     3MB in    61ms ( 0% p1)  51MB/s  419MB (58%)   33us     0ms     0ms
    0ms   361MB in  3555ms (88% p1) 101MB/s  542MB (75%)  335us    40ms     0ms
    0ms   356MB in  4592ms (92% p1)  77MB/s  561MB (78%)  430us    89ms     1ms
    0ms    11MB in   129ms (13% p1)  90MB/s  507MB (70%)  222us    15ms     0ms
    0ms   281MB in  2520ms (89% p1) 111MB/s  542MB (75%)  334us    42ms     0ms
    0ms   383MB in  3666ms (91% p1) 104MB/s  557MB (77%)  411us   133ms     0ms
    0ms   404MB in  5757ms (94% p1)  70MB/s  635MB (88%) 1274us   123ms     2ms
    4ms   367MB in  4172ms (89% p1)  88MB/s  556MB (77%)  401us    51ms     0ms
    0ms    42MB in   470ms (44% p1)  90MB/s  557MB (77%)  412us    43ms     0ms
    0ms   261MB in  2273ms (88% p1) 114MB/s  556MB (77%)  407us    27ms     0ms
    0ms   394MB in  3646ms (85% p1) 108MB/s  552MB (77%)  393us   304ms     0ms
    0ms   275MB in  2416ms (89% p1) 113MB/s  510MB (71%)  200us    53ms     0ms
    0ms     9MB in    53ms ( 0% p1) 169MB/s  483MB (67%)  140us   100ms     1ms

The TXGs that are getting synced and don't have lots of changes are pushed by
txg_kick() which basically forces the current open txg to get to the quiesced
state:

        if (tx->tx_syncing_txg == 0 &&
        tx->tx_quiesce_txg_waiting <= tx->tx_open_txg &&
        tx->tx_sync_txg_waiting <= tx->tx_synced_txg &&
        tx->tx_quiesced_txg <= tx->tx_synced_txg) {
        tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1;
        cv_broadcast(&tx->tx_quiesce_more_cv);
    }

The problem is that the above code doesn't check if we are currently quiescing
anything (only if a quiesce or a sync has been requested, ..etc) so the
following scenario can happen:

1] We have an open txg A that had enough dirty data (more than
   zfs_dirty_data_sync) and it was pushed to the quiesced state, and opened
   a new txg B. No txg is currently being synced.
2] Immediately after the opening of B, txg_kick() was run by some other write
   (and because of A's dirty data) and saw that we are not currently syncing
   any txg and no one has requested quiescing so it requests one by bumping
   tx_quiesce_txg_waiting and broadcasts the quiesce thread.
3] The quiesce thread just passed txg A to be synced and sees that a quiescing
   request has been sent to it so it immediately grabs B without letting it
   gather enough data, putting it in a quiesced state and opening a new txg C.

In this scenario txg B, is an example of how the entries of interest show up in
the txg.d output.

Ideally we would like txg_kick() to get triggered only when we are sure that
we are not syncing AND not quiescing any txg. This way we can kick an open TXG
to the quiescing state when we are sure that there is nothing going on and we
would benefit from the different states running concurrently.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9464
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1cd7635b
Closes #7587
2018-06-04 14:56:06 -07:00
Pavel Zakharov
8a393be353 OpenZFS 9235 - rename zpool_rewind_policy_t to zpool_load_policy_t
We want to be able to pass various settings during import/open of a
pool, which are not only related to rewind. Instead of adding a new
policy and duplicate a bunch of code, we should just rename
rewind_policy to a more generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we
want those flags not only for the import case, but also for the open
case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb)
which would allow zfs to open a pool when logs are missing.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44
Closes #7532
2018-06-04 14:54:20 -07:00
Sara Hartse
74d42600d8 zpool reopen should detect expanded devices
Update bdev_capacity to have wholedisk vdevs query the
size of the underlying block device (correcting for the size
of the efi parition and partition alignment) and therefore detect
expanded space.

Correct vdev_get_stats_ex so that the expandsize is aligned
to metaslab size and new space is only reported if it is large
enough for a new metaslab.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes #7546 
Issue #7582
2018-05-31 10:36:37 -07:00
Brian Behlendorf
93ce2b4ca5 Update build system and packaging
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.

Build system and packaging:
  * Renamed SPL_* autoconf m4 macros to ZFS_*.
  * Removed redundant SPL_* autoconf m4 macros.
  * Updated the RPM spec files to remove SPL package dependency.
  * The zfs package obsoletes the spl package, and the zfs-kmod
    package obsoletes the spl-kmod package.
  * The zfs-kmod-devel* packages were updated to add compatibility
    symlinks under /usr/src/spl-x.y.z until all dependent packages
    can be updated.  They will be removed in a future release.
  * Updated copy-builtin script for in-kernel builds.
  * Updated DKMS package to include the spl.ko.
  * Updated stale AUTHORS file to include all contributors.
  * Updated stale COPYRIGHT and included the SPL as an exception.
  * Renamed README.markdown to README.md
  * Renamed OPENSOLARIS.LICENSE to LICENSE.
  * Renamed DISCLAIMER to NOTICE.

Required code changes:
  * Removed redundant HAVE_SPL macro.
  * Removed _BOOT from nvpairs since it doesn't apply for Linux.
  * Initial header cleanup (removal of empty headers, refactoring).
  * Remove SPL repository clone/build from zimport.sh.
  * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
    to build issues when forcing C99 compilation.
  * Replaced legacy ACCESS_ONCE with READ_ONCE.
  * Include needed headers for `current` and `EXPORT_SYMBOL`.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7556
2018-05-29 16:00:33 -07:00
Brian Behlendorf
1272941f49 Merge branch 'zfsonlinux/merge-spl'
Merge a minimal version of the zfsonlinux/spl repository in to the
zfsonlinux/zfs repository.  Care was taken to prevent file conflicts
when merging and to preserve the spl repository history.  The spl
kernel module remains under the GPLv2 license as documented by the
additional THIRDPARTYLICENSE.gplv2 file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-05-29 14:57:55 -07:00
Brian Behlendorf
a91258913f Prepare SPL repo to merge with ZFS repo
This commit removes everything from the repository except the core
SPL implementation for Linux.  Those files which remain have been
moved to non-conflicting locations to facilitate the merge.
The README.md and associated files have been updated accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-05-29 14:51:39 -07:00
Matthew Ahrens
0dc2f70c5c OpenZFS 9486 - reduce memory used by device removal on fragmented pools
Device removal allocates a new location for each allocated segment on
the disk that's being removed.  Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location.  When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table.  In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.

We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries.  For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts).  In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use).  We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.

This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible.  But it results in drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed.  Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

Porting notes:

* Add the following as module parameters:
    * zfs_condense_indirect_vdevs_enable
    * zfs_condense_max_obsolete_bytes

* Document the following module parameters:
   * zfs_condense_indirect_vdevs_enable
   * zfs_condense_max_obsolete_bytes
   * zfs_condense_min_mapping_bytes

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c
External-issue: DLPX-57962
Closes #7536
2018-05-24 10:18:07 -07:00
Pavel Zakharov
6cb8e5306d OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

	7638 Refactor spa_load_impl into several functions
	8961 SPA load/import should tell us why it failed
	7277 zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Porting notes (ZTS):
* Fix 'make dist' target in zpool_import

* The maximum path length allowed by tar is 99 characters.  Several
  of the new test cases exceeded this limit resulting in them not
  being included in the tarball.  Shorten the names slightly.

* Set/get tunables using accessor functions.

* Get last synced txg via the "zfs_txg_history" mechanism.

* Clear zinject handlers in cleanup for import_cache_device_replaced
  and import_rewind_device_replaced in order that the zpool can be
  exported if there is an error.

* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
  ext4 file system to be created on ZFS_DISK2.  Also, there's
  no need to partition ZFS_DISK2 at all.  The partitioning had
  already been disabled for multipath devices.  Among other things,
  the partitioning steals some space from the ext4 file system,
  makes it difficult to accurately calculate the paramters to
  parted and can make some of the tests fail.

* Increase FS_SIZE and FILE_SIZE in the zpool_import test
  configuration now that FILESIZE is larger.

* Write more data in order that device evacuation take lonnger in
  a couple tests.

* Use mkdir -p to avoid errors when the directory already exists.

* Remove use of sudo in import_rewind_config_changed.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2018-05-08 21:35:27 -07:00
Pavel Zakharov
4a0ee12af8 OpenZFS 8961 - SPA load/import should tell us why it failed
Problem
=======

When we fail to open or import a storage pool, we typically don't
get any additional diagnostic information, just "no pool found" or
"can not import".

While there may be no additional user-consumable information, we should
at least make this situation easier to debug/diagnose for developers
and support.  For example, we could start by using `zfs_dbgmsg()`
to log each thing that we try when importing, and which things
failed. E.g. "tried uberblock of txg X from label Y of device Z". Also,
we could log each of the stages that we go through in `spa_load_impl()`.

Solution
========

Following the cleanup to `spa_load_impl()`, debug messages have been
added to every point of failure in that function. Additionally,
debug messages have been added to strategic places, such as
`vdev_disk_open()`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8961
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/418079e0
Closes #7459
2018-05-08 21:30:10 -07:00
Tom Caputi
be9a5c355c Add support for decryption faults in zinject
This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.

This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7474
2018-05-02 15:36:20 -07:00
Brian Behlendorf
9464b9591e
RHEL 7.5 compat: FMODE_KABI_ITERATE
As of RHEL 7.5 the mainline fops.iterate() method was added to
the file_operations structure and is correctly detected by the
configure script.

Normally this is what we want, but in order to maintain KABI
compatibility the RHEL change additionally does the following:

* Requires that callers intending to use this extended interface
  set the FMODE_KABI_ITERATE flag on the file structure when
  opening the directory.
* Adds the fops.iterate() method to the end of the structure,
  without removing fops.readdir().

This change updates the configure check to ignore the RHEL 7.5+
variant of fops.iterate() when detected.  Instead fallback to
the fops.readdir() interface which will be available.

Finally, add the 'zpl_' prefix to the directory context wrappers
to avoid colliding with the kernel provided symbols when both
the fops.iterate() and fops.readdir() are provided by the kernel.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7460 
Closes #7463
2018-05-02 15:01:24 -07:00
Matthew Ahrens
964c2d69a9 OpenZFS 9236 - nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and
rare enough to always log. It's possible that the message in
zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Patch Notes:
* Removed ZFS_DEBUG_SPA from zfs-module-parameters.5

OpenZFS-issue: https://www.illumos.org/issues/9236
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668
Closes #7467
2018-04-30 10:19:48 -07:00
Chunwei Chen
599b864813 Fix ENOSPC in "Handle zap_add() failures in ..."
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7401 
Closes #7421
2018-04-18 14:19:50 -07:00
Matthew Ahrens
0c03d21ac9 assertion in arc_release() during encrypted receive
In the existing code, when doing a raw (encrypted) zfs receive, 
we call arc_convert_to_raw() from open context. This creates a 
race condition between arc_release()/arc_change_state() and 
writing out the block from syncing context (arc_write_ready/done()).

This change makes it so that when we are doing a raw (encrypted) 
zfs receive, we save the crypt parameters (salt, iv, mac) of dnode 
blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw() 
from syncing context when writing out the block of dnodes.

Additionally, we can eliminate dr_raw and associated setters, and 
instead know that dnode blocks are always raw when doing a zfs 
receive (see the new field os_raw_receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7424 
Closes #7429
2018-04-17 11:06:54 -07:00
Serapheim Dimitropoulos
9d5b524597 OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:

[1] Thread A is about to finish condesing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertions fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically tights condensing to scrubing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() and exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr to does the condensing of
indirect mappings replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awaken by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr looks if there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also sets up the path on upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2018-04-14 12:23:53 -07:00
Matthew Ahrens
9e052db462 OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-04-14 12:21:39 -07:00
Matthew Ahrens
a1d477c24c OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility, the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refectoring is
  merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2018-04-14 12:16:17 -07:00
Seth Forshee
93b43af10d Allow mounting datasets more than once
Currently mounting an already mounted zfs dataset results in an
error, whereas it is typically allowed with other filesystems.
This causes some bad interactions with mount namespaces. Take
this sequence for example:

- Create a dataset
- Create a snapshot of the dataset
- Create a clone of the snapshot
- Create a new mount namespace
- Rename the original dataset

The rename results in unmounting and remounting the clone in the
original mount namespace, however the remount fails because the
dataset is still mounted in the new mount namespace. (Note that
this means the mount in the new mount namespace is never being
unmounted, so perhaps the unmount/remount of the clone isn't
actually necessary.)

The problem here is a result of the way mounting is implemented
in the kernel module. Since it is not mounting block devices it
uses mount_nodev() instead of the usual mount_bdev(). However,
mount_nodev() is written for filesystems for which each mount is
a new instance (i.e. a new super block), and zfs should be able
to detect when a mount request can be satisfied using an existing
super block.

Change zpl_mount() to call sget() directly with it's own test
callback. Passing the objset_t object as the fs data allows
checking if a superblock already exists for the dataset, and in
that case we just need to return a new reference for the sb's
root dentry.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Closes #5796
Closes #7207
2018-04-13 10:44:05 -07:00
Brian Behlendorf
d6bb22171b
Linux compat 4.16: blk_queue_flag_{set,clear}
The HAVE_BLK_QUEUE_WRITE_CACHE_GPL_ONLY case was overlooked in
the original 10f88c5c commit because blk_queue_write_cache()
was available for the in-kernel builds.

Update the blk_queue_flag_{set,clear} wrappers to call the locked
versions to avoid confusion.  This is safe for all existing callers.

The blk_queue_set_write_cache() function has been updated to use
these wrappers.  This means setting/clearing both QUEUE_FLAG_WC
and QUEUE_FLAG_FUA is no longer atomic but this only done early
in zvol_alloc() prior to any requests so there is no issue.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7428 
Closes #7431
2018-04-12 19:46:14 -07:00
Tom Caputi
edc1e713c2 Fix race in dnode_check_slots_free()
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7147 
Closes #7388
2018-04-10 11:15:05 -07:00
Giuseppe Di Natale
10f88c5cd5 Linux compat 4.16: blk_queue_flag_{set,clear}
queue_flag_{set,clear}_unlocked are now private interfaces in
the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14).
Use blk_queue_flag_{set,clear} interfaces which were introduced as
of https://github.com/torvalds/linux/commit/8814ce8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7410
2018-04-10 10:32:14 -07:00
Giuseppe Di Natale
9125f8f5bd Linux compat 4.16: SECTOR_SIZE
As of https://github.com/torvalds/linux/commit/233bde21,
SECTOR_SIZE is defined in linux/blkdev.h. Define SECTOR_SIZE
in sunldi.h only if it's not already defined.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #697
2018-04-09 17:20:06 -07:00
Tony Hutter
4f301661df Revert "Handle zap_add() failures in mixed ... "
This reverts commit cc63068e95.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401 
Cloes #7416
2018-04-09 14:24:46 -07:00
Matthew Ahrens
5c27ec1088 Fixes for SNPRINTF_BLKPTR with encrypted BP's
mdb doesn't have dmu_ot[], so we need a different mechanism for its
SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated.

Additionally, since it already relies on BP_IS_ENCRYPTED (etc),
SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own,
rather than making the caller do so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7390
2018-04-06 13:30:26 -07:00
Brian Behlendorf
581bc01a07
Remove sysevents
These headers are provided in the ZFS repository and never used
by the SPL.  Remove them to ensure the right ones are included.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #696
2018-04-04 09:54:20 -07:00
Tom Caputi
a2c2ed1bd4 Decryption error handling improvements
Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produces and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.

In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7372
2018-03-31 11:12:51 -07:00
LOLi
77d8a0f1a4 Fix hung z_zvol tasks during 'zfs receive'
During a receive operation zvol_create_minors_impl() can wait
needlessly for the prefetch thread because both share the same tasks
queue.  This results in hung tasks:

<3>INFO: task z_zvol:5541 blocked for more than 120 seconds.
<3>      Tainted: P           O  3.16.0-4-amd64
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first z_zvol:5541 (zvol_task_cb) is waiting for the long running
traverse_prefetch_thread:260

root@linux:~# cat /proc/spl/taskq
taskq                       act  nthr  spwn  maxt   pri  mina
spl_system_taskq/0            1     2     0    64   100     1
	active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40)
	wait: 5541
spl_delay_taskq/0             0     1     0     4   100     1
	delay: spa_deadman [zfs](0xffff880039924000)
z_zvol/1                      1     1     0     1   120     1
	active: [5541]zvol_task_cb [zfs](0xffff88001fde6400)
	pend: zvol_task_cb [zfs](0xffff88001fde6800)

This change adds a dedicated, per-pool, prefetch taskq to prevent the
traverse code from monopolizing the global (and limited) system_taskq by
inappropriately scheduling long running tasks on it.

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6330 
Closes #6890 
Closes #7343
2018-03-30 12:10:01 -07:00
Andriy Gapon
5e00213e43 OpenZFS 9164 - assert: newds == os->os_dsl_dataset
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
* Re-enabled and tweaked the zpool_upgrade_007_pos test case
  to successfully run in under 5 minutes.

OpenZFS-issue: https://www.illumos.org/issues/9164
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a
Closes #6112
Closes #7336
2018-03-30 12:00:40 -07:00
Matthew Ahrens
2fd92c3d6c enable zfs_dbgmsg() by default, without dprintf()
zfs_dbgmsg() should record a message by default.  As a general
principal, these messages shouldn't be too verbose.  Furthermore, the
amount of memory used is limited to 4MB (by default).

dprintf() should only record a message if this is a debug build, and
ZFS_DEBUG_DPRINTF is set in zfs_flags.  This flag is not set by default
(even on debug builds).  These messages are extremely verbose, and
sometimes nontrivial to compute.

SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set
in zfs_flags.  This flag is not set by default (even on debug builds).

This brings our behavior in line with illumos.  Note that the message
format is unchanged (including file, line, and function, even though
these are not recorded on illumos).

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7314
2018-03-21 15:37:32 -07:00
Tom Caputi
089fbf313c Add comments for portable dnode / objset flags
This patch adds some comments describing the purpose of "portable"
dnode and objset flags so that it is clear when new flags should
be added to the repective flag masks. This patch includes no
functional changes.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7313
2018-03-20 11:55:21 -07:00
Alek P
272b5d730f Add JSON output support to channel programs
The changes piggyback JSON output support on top of channel programs 
(#6558).  This way the JSON output support is targeted to scripting 
use cases and is easily maintainable since it really only touches 
one function (zfs_do_channel_program()).

This patch ports Joyent's JSON nvlist library from illumos to enable 
easy JSON printing of channel program output nvlist.  To keep the 
delta small I also took advantage of the fact that printing in
zfs_do_channel_program() was almost always done before exiting 
the program.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7281
2018-03-19 12:40:58 -07:00
Olaf Faaland
cec3a0a1bb Report pool suspended due to MMP
When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.

Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.

When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.

In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.

In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7296
2018-03-15 10:56:55 -07:00
Chunwei Chen
9470cbd4f9 Fix race in trace point in zrl_add_impl
We hit an illegal memory access in the zrlock trace point. The problem
is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if
zrl->zr_owner got assigned a longer string between when __string()
calculate the strlen, and when __assign_str() does strcpy. The copy will
overflow the buffer.

==
For example:

Initial condition:
zrl->zr_owner = A
zrl->zr_caller = "abc"

Thread A                                 Thread B
-------------------------------------------------
if (zrl->zr_owner == A) {
  DTRACE_PROBE2() {
    __string() {
      strlen(zrl->zr_caller) -> 3
      allocate buf[4]
    }

                                        zrl->zr_owner = B
				        zrl->zr_caller = "abcd"

    __assign_str() {
      strcpy(buf, zrl->zr_caller) <- buffer overflow
==

Dereferencing zrl->zr_owner->pid may also be problematic, in that the
zrl->zr_owner got changed to other task, and that task exits, freeing
the task_struct. This should be very unlikely, as the other task need to
zrl_remove and exit between the dereferencing zr->zr_owner and
zr->zr_owner->pid. Nevertheless, we'll deal with it as well.

To fix the zrl->zr_caller issue, instead of copy the string content, we
just copy the pointer, this is safe because it always points to
__func__, which is static. As for the zrl->zr_owner issue, we pass in
curthread instead of using zrl->zr_owner.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7291
2018-03-12 11:27:02 -07:00
Tomohiro Kusumi
6b8655ad3f Change functions which return literals to return const char*
get_format_prompt_string() and zpool_state_to_name() return
a string literal which is read-only, thus they should return
`const char*`.

zpool_get_prop_string() returns a non-const string after
successful nv-lookup, and returns a string literal otherwise.
Since this function is designed to be used for read-only purpose,
the return type should also be `const char*`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7285
2018-03-09 13:47:32 -08:00
Olaf Faaland
d2160d0538 Record skipped MMP writes in multihost_history
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:15:15 -08:00
Giuseppe Di Natale
dd3e1e3083 Linux 4.16 compat: get_disk_and_module()
As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk()
is now get_disk_and_module(). Add a configure check to determine
if we need to use get_disk_and_module().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7264
2018-03-05 12:44:35 -08:00
Tony Hutter
80d52c3919 Change checksum & IO delay ratelimit values
Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec.
This allows zed to actually trigger if a bunch of these events arrive in
a short period of time (zed has a threshold of 10 events in 10 sec).
Previously, if you had, say, 100 checksum errors in 1 sec, it would get
ratelimited to 5/sec which wouldn't trigger zed to fault the drive.

Also, convert the checksum and IO delay thresholds to module params for
easy testing.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7252
2018-03-04 17:34:51 -08:00
Tom Caputi
095495e008 Raw DRR_OBJECT records must write raw data
b1d21733 made it possible for empty metadnode blocks to be
compressed to a hole, fixing a bug that would cause invalid
metadnode MACs when a send stream attempted to free objects
and allowing the blocks to be reclaimed when they were no
longer needed. However, this patch also introduced a race
condition; if a txg sync occurred after a DRR_OBJECT_RANGE
record was received but before any objects were added, the
metadnode block would be compressed to a hole and lose all
of its encryption parameters. This would cause subsequent
DRR_OBJECT records to fail when they attempted to write
their data into an unencrypted block. This patch defers the
DRR_OBJECT_RANGE handling to receive_object() so that the
encryption parameters are set with each object that is
written into that block.

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7215 
Closes #7236
2018-02-27 09:04:05 -08:00
Tim Chase
8b5814393f Incorrect maximum DVA value in DDE_GET_NDVAS()
The conditional was reversed which caused garbage values to be used when
calculating dds_ref_dsize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7234
2018-02-26 14:20:12 -08:00
chrisrd
e9a7729008 Fix free memory calculation on v3.14+
Provide infrastructure to auto-configure to enum and API changes in the
global page stats used for our free memory calculations.

arc_free_memory has been broken since an API change in Linux v3.14:

2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node
2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node
  vmstats

These commits moved some of global_page_state() into
global_node_page_state(). The API change was particularly egregious as,
instead of breaking the old code, it silently did the wrong thing and we
continued using global_page_state() where we should have been using
global_node_page_state(), thus indexing into the wrong array via
NR_SLAB_RECLAIMABLE et al.

There have been further API changes along the way:

2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to
  node counters
2017-09-06 v4.14 c41f012a mm: rename global_page_state to
  global_zone_page_state

...and various (incomplete, as it turns out) attempts to accomodate
these changes in ZoL:

2017-08-24 2209e409 Linux 4.8+ compatibility fix for vm stats
2017-09-16 787acae0 Linux 3.14 compat: IO acct, global_page_state, etc
2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc

The config infrastructure provided here resolves these issues going back
to the original API change in v3.14 and is robust against further Linux
changes in this area.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7170
2018-02-23 08:50:06 -08:00
Olaf Faaland
7088545d01 Report duration and error in mmp_history entries
After an MMP write completes, update the relevant mmp_history entry
with the time between submission and completion, and the error
status of the write.

[faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost
39 0 0x01 100 8800 69147946270893 72723903122926
id       txg     timestamp  error  duration   mmp_delay    vdev_guid
10607    1166    1518985089 0      138301     637785455    4882...
10608    1166    1518985089 0      136154     635407747    1151...
10609    1166    1518985089 0      803618560  633048078    9740...
10610    1166    1518985090 0      144826     633048078    4882...
10611    1166    1518985090 0      164527     666187671    1151...

Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and
error = zio->io_error.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7190
2018-02-22 15:34:34 -08:00
Tony Hutter
a5369b61a2 Linux 4.16 compat: use correct *_dec_and_test()
Use refcount_dec_and_test() on 4.16+ kernels, atomic_dec_and_test()
on older kernels.  https://lwn.net/Articles/714974/

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #7179 
Closes: #7211
2018-02-22 09:02:06 -08:00
DeHackEd
2b5cd5990f Fix multiple evaluations of VERIFY() and ASSERT() on failures
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #684 
Closes #685
2018-02-21 14:54:26 -08:00
LOLi
faa97c1619 Want 'zfs send -b'
This change implements 'zfs send -b' which can be used to send only
received property values whether or not they are overridden by local
settings.

This can be very useful during "restore" operations from a backup pool
because it allows to send only the property values originally sent
from the backup source, even though they were later modified on the
destination either by a 'zfs set' operation, explicit 'zfs inherit' or
overridden during the receive process via 'zfs receive -o|-x'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7156
2018-02-21 12:32:06 -08:00
Tom Caputi
b0918402dc Raw receive should change key atomically
Currently, raw zfs sends transfer the encrypted master keys and
objset_phys_t encryption parameters in the DRR_BEGIN payload of
each send file. Both of these are processed as soon as they are
read in dmu_recv_stream(), meaning that the new keys are set
before the new snapshot is received. In addition to the fact that
this changes the user's keys for the dataset earlier than they
might expect, the keys were never reset to what they originally
were in the event that the receive failed. This patch splits the
processing into objset handling and key handling, the later of
which is moved to dmu_recv_end() so that they key change can be
done atomically.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7200
2018-02-21 12:31:03 -08:00
Tom Caputi
b1d217338a Raw receives must compress metadnode blocks
Currently, the DMU relies on ZIO layer compression to free LO
dnode blocks that no longer have objects in them. However,
raw receives disable all compression, meaning that these blocks
can never be freed. In addition to the obvious space concerns,
this could also cause incremental raw receives to fail to mount
since the MAC of a hole is different from that of a completely
zeroed block.

This patch corrects this issue by adding a special case in
zio_write_compress() which will attempt to compress these blocks
to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also
removes the zfs_mdcomp_disable tunable, since tuning it could
cause these same issues.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7198
2018-02-21 12:28:52 -08:00
Tom Caputi
163a8c28dd ZIL claiming should not start user accounting
Currently, ZIL claiming dirties objsets which causes
dsl_pool_sync() to attempt to perform user accounting on
them. This causes problems for encrypted datasets that were
raw received before the system went offline since they
cannot perform user accounting until they have their keys
loaded. This triggers an ASSERT in zio_encrypt(). Since
encryption was added, the code now depends on the fact that
data should only be written when objsets are owned. This
patch adds a check in dmu_objset_do_userquota_updates()
to ensure that useraccounting is only done when the objsets
are actually owned for write. As part of this work, the
zfsvfs and zvol code was updated so that it no longer lies
about owning objsets readonly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6916 
Closes #7163
2018-02-20 16:27:31 -08:00
Don Brady
cbce581353 Fix coverity defects: zfs channel programs
CID 173243, 173245:  Memory - corruptions  (OVERRUN)
 Added size argument to lcompat_sprintf() to avoid use of INT_MAX

CID 173244:  Integer handling issues  (OVERFLOW_BEFORE_WIDEN)
 Added cast to uint64_t to avoid a 32 bit overflow warning

CID 173242:  Integer handling issues  (CONSTANT_EXPRESSION_RESULT)
 Conditionally removed unused luai_numisnan() floating point check

CID 173241:  Resource leaks  (RESOURCE_LEAK)
 Added missing close(fd) on error path

CID 173240:    (UNINIT)
Fixed uninitialized variable in get_special_prop()

CID 147560:  Null pointer dereferences  (NULL_RETURNS)
Cleaned up bad code merge in dsl_dataset_promote_check()

CID 28475:  Memory - illegal accesses  (OVERRUN)
Fixed lcompat_sprintf() to use a size paramater

CID 28418, 28422:  Error handling issues  (CHECKED_RETURN)
Added function result cast to (void) to avoid warning

CID 23935, 28411, 28412:  Memory - corruptions  (ARRAY_VS_SINGLETON)
Added casts to avoid exposing result as an array

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7181
2018-02-20 11:19:42 -08:00
George Wilson
ddc751d56b OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio
PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c
Issue #5918
Closes #7168
2018-02-14 15:30:09 -08:00
Nasf-Fan
9c5167d19f Project Quota on ZFS
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.

Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').

By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.

Support the following commands and functionalities:

zfs set projectquota@project
zfs set projectobjquota@project

zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project

zfs projectspace

zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused

zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused

chattr +/-P
chattr -p project_id
lsattr -p

This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>

For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.

Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by  Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes #6290
2018-02-13 14:54:54 -08:00
sanjeevbagewadi
cc63068e95 Handle zap_add() failures in mixed case mode
With "casesensitivity=mixed", zap_add() could fail when the number of
files/directories with the same name (varying in case) exceed the
capacity of the leaf node of a Fatzap. This results in a ASSERT()
failure as zfs_link_create() does not expect zap_add() to fail. The fix
is to handle these failures and rollback the transactions.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #7011 
Closes #7054
2018-02-09 10:15:53 -08:00
Brian Behlendorf
f54976dc88 Linux 4.11 compat: avoid refcount_t name conflict
Related to commit 4859fe796, when directly using the kernel's
refcount functions in kernel compatibility code do not map
refcount_t to zfs_refcount_t.  This leads to a type mismatch.

Longer term we should consider renaming refcount_t to
zfs_refcount_t in the zfs code base.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
2018-02-08 21:25:51 -08:00
Brian Behlendorf
18f57327e0 Linux 4.16 compat: inode_set_iversion()
A new interface was added to manipulate the version field of an
inode.  Add a inode_set_iversion() wrapper for older kernels and
use the new interface when available.

The i_version field was dropped from the trace point due to the
switch to an atomic64_t i_version type.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
2018-02-08 21:25:19 -08:00
Serapheim Dimitropoulos
5b72a38d68 OpenZFS 8677 - Open-Context Channel Programs
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.

=== What is implemented?

This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
  to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
  the context.
- Appropriate error handling whenever we try a channel program in
  open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

=== How do we handle zfs.sync functions in open context?

When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...

arg = ...
fs = arg["argv"][1]
err = zfs.sync.destroy(fs)
msg = "destroying " .. fs .. " err=" .. err
return msg

if we run it in open context, we will get back the following error:

Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
            [C]: in function 'destroy'
            [string "channel program"]:3: in main chunk

=== What about testing?

We've introduced new wrappers for all channel program tests that
run each channel program as both (startard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.

OpenZFS-issue: https://www.illumos.org/issues/8677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15
Closes #6558
2018-02-08 16:05:57 -08:00
Serapheim Dimitropoulos
8d103d8856 OpenZFS 8604 - Simplify snapshots unmounting code
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

Every time we want to unmount a snapshot (happens during snapshot
deletion or renaming) we unnecessarily iterate through all the
mountpoints in the VFS layer (see zfs_get_vfs).

The current patch completely gets rid of that code and changes
the approach while keeping the behavior of that code path the
same. Specifically, it puts a hold on the dataset/snapshot and
gets its vfs resource reference directly, instead of linearly
searching for it. If that reference exists we attempt to amount
it.

With the above change, it became obvious that the nvlist
manipulations that we do (add_boolean and add_nvlist) take a
significant amount of time ensuring uniqueness of every new
element even though they don't have too. Thus, we updated the
patch so those nvlists are not trying to enforce the uniqueness
of their elements.

A more complete analysis of the problem solved by this patch
can be found below:
https://sdimitro.github.io/post/snap-unmount-perf/

OpenZFS-issue: https://www.illumos.org/issues/8604
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/126118fb
2018-02-08 15:29:44 -08:00
Chris Williamson
234c91c508 OpenZFS 8600 - ZFS channel programs - snapshot
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this entails extra
logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.

OpenZFS-issue: https://www.illumos.org/issues/8600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b
2018-02-08 15:29:24 -08:00
Brad Lewis
af07368986 OpenZFS 8592 - ZFS channel programs - rollback
Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to perform a rollback.

OpenZFS-issue: https://www.illumos.org/issues/8592
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6
2018-02-08 15:29:14 -08:00
Chris Williamson
d99a015343 OpenZFS 7431 - ZFS Channel Programs
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Don Brady <don.brady@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533

Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjump available in stock Linux kernel.
  Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
2018-02-08 15:28:18 -08:00
Brian Behlendorf
5461eefe50
Fix cstyle warnings
This patch contains no functional changes.  It is solely intended
to resolve cstyle warnings in order to facilitate moving the spl
source code in to the zfs repository.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #681
2018-02-07 11:49:38 -08:00
Tom Caputi
1b66810bad Change os->os_next_write_raw to work per txg
Currently, os_next_write_raw is a single boolean used for determining
whether or not the next call to dmu_objset_sync() should write out
the objset_phys_t as a raw buffer. Since the boolean is not associated
with a txg, the work simply happens during the next txg, which is not
necessarily the correct one. In the current implementation this issue
was misdiagnosed, resulting in a small hack in dmu_objset_sync() which
seemed to resolve the problem.

This patch changes os_next_write_raw to be an array of booleans, one
for each txg in TXG_OFF and removes the hack.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
2018-02-02 11:44:53 -08:00
Tom Caputi
047116ac76 Raw sends must be able to decrease nlevels
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.

This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.

This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6821
Closes #6864
2018-02-02 11:43:11 -08:00
Tom Caputi
ae76f45cda Encryption Stability and On-Disk Format Fixes
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.

Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.

This patch zero's out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.

This patch also contains minor bug fixes and cleanups.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6845
Closes #6864
Closes #7052
2018-02-02 11:37:16 -08:00
Dr. András Korn
4c46b99d24 tx_waited -> tx_dirty_delayed in trace_dmu.h
This change was missed in 0735ecb334.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: András Korn <korn-github.com@elan.rulez.org>
Closes #7096
2018-01-31 16:13:26 -08:00
Prakash Surya
0735ecb334 OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue
PROBLEM
=======

When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible
for either `ERESTART` or `EIO` to be returned.

If `ERESTART` is returned, this will cause an assertion to fail directly
in `zil_lwb_write_issue`, where the code assumes the return value is
`EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the
SPA is suspended when `dmu_tx_assign` is called, and most often occurs
when running `zloop`.

If `EIO` is returned, this can cause assertions to fail elsewhere in the
ZIL code. For example, `zil_commit_waiter_timeout` contains the
following logic:

    lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
    ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);

In this case, if `dmu_tx_assign` returned `EIO` from within
`zil_lwb_write_issue`, the `lwb` variable passed in will not be issued
to disk. Thus, it's `lwb_state` field will remain `LWB_STATE_OPENED` and
this assertion will fail. `zil_commit_waiter_timeout` assumes that after
it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and
doesn't handle the case where this is not true; i.e. it doesn't handle
the case where `dmu_tx_assign` returns `EIO`.

SOLUTION
========

This change modifies the `dmu_tx_assign` function such that `txg_how` is
a bitmask, rather than of the `txg_how_t` enum type. Now, the previous
`TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with
specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics.

Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was
automatically invoked. This was not ideal when using `TXG_WAITED` within
`zil_lwb_write_issued`, leading the problem described above. Rather, we
want to achieve the semantics of `TXG_WAIT`, while also preventing the
`tx` from being penalized via the dirty delay throttling.

With this change, `zil_lwb_write_issued` can acheive the semtantics that
it requires by passing in the value `TXG_WAIT | TXG_NOTHROTTLE` to
`dmu_tx_assign`.

Further, consumers of `dmu_tx_assign` wishing to achieve the old
`TXG_WAITED` semantics can pass in the value `TXG_NOWAIT | TXG_NOTHROTTLE`.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE`

OpenZFS-issue: https://www.illumos.org/issues/8997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9
Closes #7084
2018-01-26 20:19:46 -08:00
Brian Behlendorf
8fb1ede146 Extend deadman logic
The intent of this patch is extend the existing deadman code
such that it's flexible enough to be used by both ztest and
on production systems.  The proposed changes include:

* Added a new `zfs_deadman_failmode` module option which is
  used to dynamically control the behavior of the deadman.  It's
  loosely modeled after, but independant from, the pool failmode
  property.  It can be set to wait, continue, or panic.

    * wait     - Wait for the "hung" I/O (default)
    * continue - Attempt to recover from a "hung" I/O
    * panic    - Panic the system

* Added a new `zfs_deadman_ziotime_ms` module option which is
  analogous to `zfs_deadman_synctime_ms` except instead of
  applying to a pool TXG sync it applies to zio_wait().  A
  default value of 300s is used to define a "hung" zio.

* The ztest deadman thread has been re-enabled by default,
  aligned with the upstream OpenZFS code, and then extended
  to terminate the process when it takes significantly longer
  to complete than expected.

* The -G option was added to ztest to print the internal debug
  log when a fatal error is encountered.  This same option was
  previously added to zdb in commit fa603f82.  Update zloop.sh
  to unconditionally pass -G to obtain additional debugging.

* The FM_EREPORT_ZFS_DELAY event which was previously posted
  when the deadman detect a "hung" pool has been replaced by
  a new dedicated FM_EREPORT_ZFS_DEADMAN event.

* The proposed recovery logic attempts to restart a "hung"
  zio by calling zio_interrupt() on any outstanding leaf zios.
  We may want to further restrict this to zios in either the
  ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages.
  Calling zio_interrupt() is expected to only be useful for
  cases when an IO has been submitted to the physical device
  but for some reasonable the completion callback hasn't been
  called by the lower layers.  This shouldn't be possible but
  has been observed and may be caused by kernel/driver bugs.

* The 'zfs_deadman_synctime_ms' default value was reduced from
  1000s to 600s.

* Depending on how ztest fails there may be no cache file to
  move.  This should not be considered fatal, collect the logs
  which are available and carry on.

* Add deadman test cases for spa_deadman() and zio_wait().

* Increase default zfs_deadman_checktime_ms to 60s.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
2018-01-25 13:40:38 -08:00
Brian Behlendorf
23602fdb39
Add cv_timedwait_io()
Add missing helper function cv_timedwait_io(), it should be used
when waiting on IO with a specified timeout.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #674
2018-01-24 11:33:47 -08:00
Brian Behlendorf
31864e3d8c
OpenZFS 8652 - Tautological comparisons with ZPROP_INVAL
usr/src/uts/common/sys/fs/zfs.h
	Change ZPROP_INVAL and ZPROP_CONT from macros to enum values.  Clang
	and GCC both prefer to use unsigned ints to store enums.  That was
	causing tautological comparison warnings (and likely eliminating
	error handling code at compile time) whenever a zfs_prop_t or
	zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT.  Making the
	error flags be explicity enum values forces the enum types to be
	signed.

	ZPROP_INVAL was also compared against two different enum types.  I
	had to change its name to ZPOOL_PROP_INVAL whenever its compared to
	a zpool_prop_t.  There are still some places where ZPROP_INVAL or
	ZPROP_CONT is compared to a plain int, in code that doesn't know
	whether the int is storing a zfs_prop_t or a zpool_prop_t.

usr/src/uts/common/fs/zfs/spa.c
	s/ZPROP_INVAL/ZPOOL_PROP_INVAL/

Authored by: Alan Somers <asomers@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8652
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74
Closes #7061
2018-01-19 09:22:37 -08:00
Sean Eric Fagan
43cb30b3ce OpenZFS 8959 - Add notifications when a scrub is paused or resumed
Authored by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
- Brought #defines in eventdefs.h in line with ZFS on Linux format.
- Updated zfs-events.5 with the new events.

OpenZFS-issue: https://www.illumos.org/issues/8959
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c862b93eea
Closes #7049
2018-01-17 10:31:00 -08:00
Brian Behlendorf
e1a0850c35
Force ztest to always use /dev/urandom
For ztest, which is solely for testing, using a pseudo random
is entirely reasonable.  Using /dev/urandom ensures the system
entropy pool doesn't get depleted thus stalling the testing.
This is a particular problem when testing in VMs.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7017 
Closes #7036
2018-01-12 09:36:26 -08:00
Nathaniel Wesley Filardo
cba6fc61a2 Revert raidz_map and _col structure types
As part of the refactoring of ab9f4b0b82,
several uint64_t-s and uint8_t-s were changed to other types.  This
caused ZoL github issue #6981, an overflow of a size_t on a 32-bit ARM
machine.  In absense of any strong motivation for the type changes, this
simply puts them back, modulo the changes accumulated for ABD.

Compile-tested on amd64 and run-tested on armhf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Closes #6981 
Closes #7023
2018-01-09 14:46:52 -08:00
Brian Behlendorf
0873bb6337
Fix ARC hit rate
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified.  As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf.  Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list.  This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6171 
Closes #6852 
Closes #6989
2018-01-08 09:52:36 -08:00
Prakash Surya
2fe61a7ecc OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

PROBLEM
=======

There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race *is* present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

SOLUTION
========

The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.

Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
2017-12-28 10:18:04 -08:00
lidongyang
823d48bfb1 Call commit callbacks from the tail of the list
Our zfs backed Lustre MDT had soft lockups while under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is zfs is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.

This patch corrects this, it also has a positive impact on metadata
performance on Lustre with osd_zfs, plus some cleanup in the headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes #6986
2017-12-22 10:19:51 -08:00
Tom Caputi
a8b2e30685 Support re-prioritizing asynchronous prefetches
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after the issue. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.

This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6921 
Closes #6926
2017-12-21 09:13:06 -08:00
Brian Behlendorf
c28a67733c
Suppress incorrect objtool warnings
Suppress incorrect warnings from versions of objtool which are not
aware of x86 EVEX prefix instructions used for AVX512.

  module/zfs/vdev_raidz_math_avx512bw.o: warning:
  objtool: <func+offset>: can't find jump dest instruction at .text

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6928
2017-12-07 10:28:50 -08:00
Prakash Surya
1b2b0acab5 OpenZFS 8603 - rename zilog's "zl_writer_lock" to "zl_issuer_lock"
This is a purely cosmetic change. The zilog's "zl_writer_lock" field is
being renamed to "zl_issuer_lock" to try and make the code easier to
understand; no other changes are made.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8603
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2daf06546b
Closes #6927
2017-12-06 11:38:10 -08:00
Prakash Surya
1ce23dcaff OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

 1. When there's outstanding ZIL blocks being written (i.e. there's
    already a "writer thread" in progress), then any new calls to
    zil_commit() will block waiting for the currently oustanding ZIL
    blocks to complete. The blocks written for each "writer thread" is
    coined a "batch", and there can only ever be a single "batch" being
    written at a time. When a batch is being written, any new ZIL
    transactions will have to wait for the next batch to be written,
    which won't occur until the current batch finishes.

    As a result, the underlying storage may not be used as efficiently
    as possible. While "new" threads enter zil_commit() and are blocked
    waiting for the next batch, it's possible that the underlying
    storage isn't fully utilized by the current batch of ZIL blocks. In
    that case, it'd be better to allow these new threads to generate
    (and issue) a new ZIL block, such that it could be serviced by the
    underlying storage concurrently with the other ZIL blocks that are
    being serviced.

 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
    to complete, prior to zil_commit() returning. The size of any given
    batch is proportional to the number of ZIL transaction in the queue
    at the time that the batch starts processing the queue; which
    doesn't occur until the previous batch completes. Thus, if there's a
    lot of transactions in the queue, the batch could be composed of
    many ZIL blocks, and each call to zil_commit() will have to wait for
    all of these writes to complete (even if the thread calling
    zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

 3. If a given batch takes longer to complete than normal, this results
    in larger batch sizes, which then take longer to complete and
    further drive up the latency of zil_commit(). This can occur for a
    number of reasons, including (but not limited to): transient changes
    in the workload, and storage latency irregularites.

Solution
========

The solution attempted by this change has the following goals:

 1. no on-disk changes; maintain current on-disk format.
 2. modify the "batch size" to be equal to the "ZIL block size".
 3. allow new batches to be generated and issued to disk, while there's
    already batches being serviced by the disk.
 4. allow zil_commit() to wait for as few ZIL blocks as possible.
 5. use as few ZIL blocks as possible, for the same amount of ZIL
    transactions, without introducing significant latency to any
    individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new allgorithm will allow the
following improvements:

 1. new ZIL blocks can be generated and issued, while there's already
    oustanding ZIL blocks being serviced by the storage.
 2. the latency of zil_commit() should be proportional to the underlying
    storage latency, rather than the incoming synchronous workload.

Porting Notes
=============

Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

  * A list of itxs was added to the lwb structure. This list contains
    all of the itxs that have been committed to the lwb, such that the
    callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
    after the data for the itxs is committed to disk.

  * A list of itxs was added on the stack of the zil_process_commit_list()
    function; the "nolwb_itxs" list. In some circumstances, an itx may
    not be committed to an lwb (e.g. if allocating the "next" ZIL block
    on disk fails), so this list is used to keep track of which itxs
    fall into this state, such that their callbacks can be called after
    the ZIL's writer pipeline is "stalled".

  * The logic to actually call the itx's callback was moved into the
    zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
    were effectively performing the same logic (i.e. if callback is
    non-null, call the callback), it seemed like useful code cleanup to
    consolidate this logic into a single function.

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

  * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
    removed from "trace_zil.h" as well.

  * Some of the zilog structure's fields were removed, which affected
    the tracepoint definitions of the structure.

  * New tracepoints had to be added for the following 3 new probes:
      * zil__process__commit__itx
      * zil__process__normal__itx
      * zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 09:39:16 -08:00
Brian Behlendorf
72841b9fd9
Preserve itx alloc size for zio_data_buf_free()
Using zio_data_buf_alloc() to allocate the itx's may be unsafe
because the itx->itx_lr.lrc_reclen field is not constant from
allocation to free.  Using a different itx->itx_lr.lrc_reclen
size in zio_data_buf_free() can result in the allocation being
returned to the wrong kmem cache.

This issue can be avoided entirely by storing the allocation size
in itx->itx_size and using that for zio_data_buf_free().

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6912
2017-12-04 11:44:39 -08:00
Tom Caputi
d4677269f2 Unbreak the scan status ABI
When d4a72f23 was merged, pss_pass_issued was incorrectly
added to the middle of the pool_scan_stat_t structure
instead of the end. This patch simply corrects this issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6909
2017-11-30 09:40:13 -08:00
Tom Caputi
d4a72f2386 Sequential scrub and resilvers
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #3625 
Closes #6256
2017-11-15 17:27:01 -08:00
Brian Behlendorf
ed19bccfb6
Linux 4.14 compat: vfs_read & vfs_write
The kernel_read & kernel_write functions have always wrapped the
vfs_read & vfs_write functions respectively.  However, they could
not be used by vn_rdwr() since the offset wasn't passed as a
pointer.  This prevented us from being able to properly update
the file offset.

Linux 4.14 unexported vfs_read & vfs_write but also changed the
signature of kernel_read & kernel_write to provide the needed
functionality.  Use these updated functions when available.

Reviewed-by: Pritam Baral <pritam@pritambaral.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #656 
Closes #667
2017-11-15 17:19:23 -08:00
benrubson
7c351e31d5 OpenZFS 7531 - Assign correct flags to prefetched buffers
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: abraunegg <alex.braunegg@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/468008cb
2017-11-11 20:24:34 -08:00
Don Brady
1c27024e22 Undo c89 workarounds to match with upstream
With PR 5756 the zfs module now supports c99 and the
remaining past c89 workarounds can be undone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #6816
2017-11-04 13:25:13 -07:00
Jason King
f3c8c9e6f0 OpenZFS 640 - number_to_scaled_string is duplicated in several commands
Porting Notes:
- The OpenZFS patch added nicenum_scale() and nicenum() to a
  library not used by ZFS.  Rather than pull in a new dependency
  the version of nicenum in lib/libzpool/util.c was simply
  replaced with the new one.

Reviewed by: Sebastian Wiedenroth <wiedi@frubar.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Authored by: Jason King <jason.brian.king@gmail.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/640
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0a055120
Closes #6796
2017-10-30 14:47:20 -07:00
Brian Behlendorf
8be3688999
Remove vn_rename and vn_remove
Both vn_rename and vn_remove have been historically problematic
to implement reliably.  Rather than fixing them yet again they
are being removed.

Reviewed-by: Arkadiusz Bubala <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648 
Closes #661
2017-10-27 15:49:14 -07:00
Brian Behlendorf
867959b588
OpenZFS 8081 - Compiler warnings in zdb
Fix compiler warnings in zdb.  With these changes, FreeBSD can compile
zdb with all compiler warnings enabled save -Wunused-parameter.

usr/src/cmd/zdb/zdb.c
usr/src/cmd/zdb/zdb_il.c
usr/src/uts/common/fs/zfs/sys/sa.h
usr/src/uts/common/fs/zfs/sys/spa.h
	Fix numerous warnings, including:
	* const-correctness
	* shadowing global definitions
	* signed vs unsigned comparisons
	* missing prototypes, or missing static declarations
	* unused variables and functions
	* Unreadable array initializations
	* Missing struct initializers

usr/src/cmd/zdb/zdb.h
	Add a header file to declare common symbols

usr/src/lib/libzpool/common/sys/zfs_context.h
usr/src/uts/common/fs/zfs/arc.c
usr/src/uts/common/fs/zfs/dbuf.c
usr/src/uts/common/fs/zfs/spa.c
usr/src/uts/common/fs/zfs/txg.c
	Add a function prototype for zk_thread_create, and ensure that every
	callback supplied to this function actually matches the prototype.

usr/src/cmd/ztest/ztest.c
usr/src/uts/common/fs/zfs/sys/zil.h
usr/src/uts/common/fs/zfs/zfs_replay.c
usr/src/uts/common/fs/zfs/zvol.c
	Add a function prototype for zil_replay_func_t, and ensure that
	every function of this type actually matches the prototype.

usr/src/uts/common/fs/zfs/sys/refcount.h
	Change FTAG so it discards any constness of __func__, necessary
	since existing APIs expect it passed as void *.

Porting Notes:
- Many of these fixes have already been applied to Linux.  For
  consistency the OpenZFS version of a change was applied if the
  warning was addressed in an equivalent but different fashion.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8081
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843abe1b8a
Closes #6787
2017-10-27 12:46:35 -07:00
Brian Behlendorf
a032ac4b38 OpenZFS 8558, 8602 - lwp_create() returns EAGAIN
8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems

On a system with more than 80K ZFS filesystems, we've seen cases
where lwp_create() will start to fail by returning EAGAIN. The
problem being, for each of those 80K ZFS filesystems, a taskq will
be created for each dataset as part of the ZIL for each dataset.

Porting Notes:
- The new nomem taskq kstat was dropped.
- Added module options and documentation for new tunings
  zfs_zil_clean_taskq_nthr_pct, zfs_zil_clean_taskq_minalloc,
  zfs_zil_clean_taskq_maxalloc, and zfs_sync_taskq_batch_pct.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8558
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/216d772

8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8602
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2bcb545
Closes #6779
2017-10-26 12:57:53 -07:00
Arkadiusz Bubała
d3f2cd7e3b Added no_scrub_restart flag to zpool reopen
Added -n flag to zpool reopen that allows a running scrub
operation to continue if there is a device with Dirty Time Log.

By default if a component device has a DTL and zpool reopen
is executed all running scan operations will be restarted.

Added functional tests for `zpool reopen`

Tests covers following scenarios:
* `zpool reopen` without arguments,
* `zpool reopen` with pool name as argument,
* `zpool reopen` while scrubbing,
* `zpool reopen -n` while scrubbing,
* `zpool reopen -n` while resilvering,
* `zpool reopen` with bad arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #6076 
Closes #6746
2017-10-26 12:26:09 -07:00
Brian Behlendorf
d5e024cba2 Emit history events for 'zpool create'
History commands and events were being suppressed for the
'zpool create' command since the history object did not
yet exist.  Create the object earlier so this history
doesn't get lost.

Split the pool_destroy event in to pool_destroy and
pool_export so they may be distinguished.

Updated events_001_pos and events_002_pos test cases.  They
now check for the expected history events and were reworked
to be more reliable.

Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6712 
Closes #6486
2017-10-23 09:45:59 -07:00
John
6044cf59cd Add convenience 'zfs_get' functions
Add get functions to match existing ones.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Ramsden <johnramsden@riseup.net>
Closes #6308
2017-10-19 11:18:42 -07:00
Brian Behlendorf
21a932b83c Post-Encryption Followup
This PR includes fixes for bugs and documentation issues found 
after the encryption patch was merged and general code improvements 
for long-term maintainability.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6526
Closes #6639
Closes #6703
Cloese #6706
Closes #6714
Closes #6595
2017-10-13 10:02:39 -07:00
Damian Wojsław
cdc15a7604 Typo in dsl_dataset.h
The parameters dsl_dataset_t *os in function prototype should be
renamed to dsl_dataset_t *ds.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Damian Wojsław <damian@wojslaw.pl>
Closes #6756 
Closes #6273
2017-10-12 17:10:38 -07:00
Tom Caputi
440a3eb939 Fixes for #6639
Several issues were uncovered by running stress tests with zfs
encryption and raw sends in particular. The issues and their
associated fixes are as follows:

* arc_read_done() has the ability to chain several requests for
  the same block of data via the arc_callback_t struct. In these
  cases, the ARC would only use the first request's dsobj from
  the bookmark to decrypt the data. This is problematic because
  the first request might be a prefetch zio which is able to
  handle the key not being loaded, while the second might use a
  different key that it is sure will work. The fix here is to
  pass the dsobj with each individual arc_callback_t so that each
  request can attempt to decrypt the data separately.

* DRR_FREE and DRR_FREEOBJECT records in a send file were not
  having their transactions properly tagged as raw during raw
  sends, which caused a panic when the dbuf code attempted to
  decrypt these blocks.

* traverse_prefetch_metadata() did not properly set
  ZIO_FLAG_SPECULATIVE when issuing prefetch IOs.

* Added a few asserts and code cleanups to ensure these issues
  are more detectable in the future.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:55:50 -04:00
Tom Caputi
4807c0badb Encryption patch follow-up
* PBKDF2 implementation changed to OpenSSL implementation.

* HKDF implementation moved to its own file and tests
  added to ensure correctness.

* Removed libzfs's now unnecessary dependency on libzpool
  and libicp.

* Ztest can now create and test encrypted datasets. This is
  currently disabled until issue #6526 is resolved, but
  otherwise functions as advertised.

* Several small bug fixes discovered after enabling ztest
  to run on encrypted datasets.

* Fixed coverity defects added by the encryption patch.

* Updated man pages for encrypted send / receive behavior.

* Fixed a bug where encrypted datasets could receive
  DRR_WRITE_EMBEDDED records.

* Minor code cleanups / consolidation.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:54:48 -04:00
Fabian Grünbichler
48fbb9ddbf Free objects when receiving full stream as clone
All objects after the last written or freed object are not supposed to
exist after receiving the stream.  Free them accordingly, as if a
freeobjects record for them had been included in the stream.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #5699
Closes #6507
Closes #6616
2017-10-10 15:30:51 -07:00
Brian Behlendorf
0cefc9dbcd Add parenthesis to btop and ptob macros
Add missing parenthesis around btop and ptob macros to ensure
operation ordering is preserved after expansion.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #660
2017-10-10 08:59:17 -07:00
Olaf Faaland
ce319db57b Make include/linux/ conform to ZFS style standard
No semantic changes.

Fix the following types of style issues:
	blank after preprocessor #
	#define followed by space instead of tab
	improper first line of block comment
	indent by spaces instead of tabs
	last line in file is blank
	missing blank after open comment
	missing space before left brace
	non-continuation indented 4 spaces
	spaces instead of tabs
	unparenthesized return expression

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
2017-10-09 14:27:27 -07:00
Olaf Faaland
4b393c50ae Make file headers conform to ZFS style standard
No semantic changes.

Change
 /************\
and
 \************/

to

 /*
and
  */

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
2017-10-09 14:27:27 -07:00
Brian Behlendorf
c11f1004d1 Remove dead code from AVL tree
The avl_update_* functions are never used by ZFS and are therefore
being removed.  They're barely even used in Illumos.  Additionally,
simplify avl_add() by using a VERIFY which produces exactly the same
behavior under Linux.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6716
2017-10-05 19:28:00 -07:00
chrisrd
e71cade67d Scale the dbuf cache with arc_c
Commit d3c2ae1 introduced a dbuf cache with a default size of the
minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted
using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache
is counted as metadata for the purposes of ARC size calculations.

On a 1GB box the ARC maximum size defaults to c_max 493M which gives a
dbuf cache default minimum size of 15.4M, and the ARC metadata defaults
to minimum 16M. I.e. the dbuf cache is an significant proportion of the
minimum metadata size. With other overheads involved this actually means
the ARC metadata doesn't get down to the minimum.

This patch dynamically scales the dbuf cache to the target ARC size
instead of statically scaling it to the maximum ARC size. (The scale is
still set by dbuf_cache_max_shift and the maximum size is still fixed by
dbuf_cache_max_bytes.) Using the target ARC size rather than the current
ARC size is done to help the ARC reach the target rather than simply
focusing on the current size.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #6506 
Closes #6561
2017-09-29 15:49:19 -07:00
Giuseppe Di Natale
34d00e7aba Correct cppcheck errors
ZFS buildbot STYLE builder was moved to Ubuntu 17.04
which has a newer version of cppcheck. Handle the
new cppcheck errors.

uu_* functions removed in this commit were unused
and effectively dead code. They are now retired.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6653
2017-09-19 12:17:29 -07:00
David Quigley
a9a2bf7152 Remove FRU and LIBTOPO Support
FRU and LIBTOPO support are illumos only features that will not be ported to
Linux and make the code more complicated than necessary. This commit
makes way for further cleanups of the zed/FMA code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #6641
2017-09-18 17:06:40 -07:00
Giuseppe Di Natale
787acae0b5 Linux 3.14 compat: IO acct, global_page_state, etc
generic_start_io_acct/generic_end_io_acct in the master
branch of the linux kernel requires that the request_queue
be provided.

Move the logic from freemem in the spl to arc_free_memory
in arc.c. Do this so we can take advantage of global_page_state
interface checks in zfs.

Upstream kernel replaced struct block_device with
struct gendisk in struct bio. Determine if the
function bio_set_dev exists during configure
and have zfs use that if it exists.

bio_set_dev https://github.com/torvalds/linux/commit/74d4699
global_node_page_state https://github.com/torvalds/linux/commit/75ef718
io acct https://github.com/torvalds/linux/commit/d62e26b

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6635
2017-09-16 11:00:19 -07:00
LOLi
835db58592 Add -vnP support to 'zfs send' for bookmarks
This leverages the functionality introduced in cf7684b to expose
verbose, dry-run and parsable 'zfs send' options for bookmarks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3666 
Closes #6601
2017-09-08 15:24:31 -07:00
Olaf Faaland
4c5b89f59e Improved dnode allocation and dmu_hold_impl()
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately.  This
eliminates most lock contention when taking the hold and creating the
dnode.

* Correct detection of the chunk boundary to work properly with large
dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk.  Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh.  Since the test is
purely creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.

With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.

name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    3804690
dnode_hold_alloc_misses         4    216
dnode_hold_alloc_interior       4    3
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_alloc_type_none      4    0
dnode_hold_free_hits            4    203105
dnode_hold_free_misses          4    4
dnode_hold_free_lock_misses     4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    57
dnode_hold_free_txg             4    0
dnode_allocate                  4    203154
dnode_reallocate                4    0
dnode_buf_evict                 4    23918
dnode_alloc_next_chunk          4    4887
dnode_alloc_race                4    0
dnode_alloc_next_block          4    18

The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5396 
Closes #6522 
Closes #6414 
Closes #6564
2017-09-05 16:15:04 -07:00
Matthew Ahrens
1e0457e7f5 Enhance comments for large dnode project
Fix a few nits in the comments from large dnodes. Also import
some of the commit message as a comment in the code, making
it more accessible.

Reviewed-by: @rottegift 
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matt Ahrens <mahrens@delphix.com>
Closes #6551
2017-08-29 09:00:28 -07:00
Tom Caputi
9b8407638d Send / Recv Fixes following b52563
This patch fixes several issues discovered after
the encryption patch was merged:

* Fixed a bug where encrypted datasets could attempt
  to receive embedded data records.

* Fixed a bug where dirty records created by the recv
  code wasn't properly setting the dr_raw flag.

* Fixed a typo where a dmu_tx_commit() was changed to
  dmu_tx_abort()

* Fixed a few error handling bugs unrelated to the
  encryption patch in dmu_recv_stream()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6512 
Closes #6524 
Closes #6545
2017-08-23 16:54:24 -07:00
Gvozden Neskovic
551905dd47 vdev_mirror: kstat observables for preferred vdev
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6461
2017-08-21 10:05:54 -07:00
Gvozden Neskovic
d6c6590c5d vdev_mirror: load balancing fixes
vdev_queue:
- Track the last position of each vdev, including the io size,
  in order to detect linear access of the following zio.
- Remove duplicate `vq_lastoffset`

vdev_mirror:
- Correctly calculate the zio offset (signedness issue)
- Deprecate `vdev_queue_register_lastoffset()`
- Add `VDEV_LABEL_START_SIZE` to zio offset of leaf vdevs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6461
2017-08-21 10:05:16 -07:00
Brian Behlendorf
c8f9061fc7 Retire legacy test infrastructure
* Removed zpios kmod, utility, headers and man page.

* Removed unused scripts zpios-profile/*, zpios-test/*,
  zpool-config/*, smb.sh, zpios-sanity.sh, zpios-survey.sh,
  zpios.sh, and zpool-create.sh.

* Removed zfs-script-config.sh.in.  When building 'make' generates
  a common.sh with in-tree path information from the common.sh.in
  template.  This file and sourced by the test scripts and used
  for in-tree testing, it is not included in the packages.  When
  building packages 'make install' uses the same template to
  create a new common.sh which is appropriate for the packaging.

* Removed unused functions/variables from scripts/common.sh.in.
  Only minimal path information and configuration environment
  variables remain.

* Removed unused scripts from scripts/ directory.

* Remaining shell scripts in the scripts directory updated to
  cleanly pass shellcheck and added to checked scripts.

* Renamed tests/test-runner/cmd/ to tests/test-runner/bin/ to
  match install location name.

* Removed last traces of the --enable-debug-dmu-tx configure
  options which was retired some time ago.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6509
2017-08-15 17:26:38 -07:00
Don Brady
d977122da9 Add corruption failure option to zinject(8)
Added a 'corrupt' error option that will flip a bit in the data
after a read operation.  This is useful for generating checksum
errors at the device layer (in a mirror config for example). It
is also used to validate the diagnosis of checksum errors from
the zfs diagnosis engine.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6345
2017-08-14 15:17:15 -07:00
Tom Caputi
b525630342 Native Encryption for ZFS on Linux
This change incorporates three major pieces:

The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.

The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.

The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494 
Closes #5769
2017-08-14 10:36:48 -07:00
gaurkuma
9df9692637 Allow longer SPA names in stats
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #641
2017-08-11 08:53:35 -07:00
Brian Behlendorf
c25b8f99f8 Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks

* Update the zk_thread_create() function to use the same trick
  as Illumos.  Specifically, cast the new pthread_t to a void
  pointer and return that as the kthread_t *.  This avoids the
  issues associated with managing a wrapper structure and is
  safe as long as the callers never attempt to dereference it.

* Update all function prototypes passed to pthread_create() to
  match the expected prototype.  We were getting away this with
  before since the function were explicitly cast.

* Replaced direct zk_thread_create() calls with thread_create()
  for code consistency.  All consumers of libzpool now use the
  proper wrappers.

* The mutex_held() calls were converted to MUTEX_HELD().

* Removed all mutex_owner() calls and retired the interface.
  Instead use MUTEX_HELD() which provides the same information
  and allows the implementation details to be hidden.  In this
  case the use of the pthread_equals() function.

* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
  any non essential fields removed.  In the case of kthread_t
  and kcondvar_t they could be directly typedef'd to pthread_t
  and pthread_cond_t respectively.

* Removed all extra ASSERTS from the thread, mutex, rwlock, and
  cv wrapper functions.  In practice, pthreads already provides
  the vast majority of checks as long as we check the return
  code.  Removing this code from our wrappers help readability.

* Added TS_JOINABLE state flag to pass to request a joinable rather
  than detached thread.  This isn't a standard thread_create() state
  but it's the least invasive way to pass this information and is
  only used by ztest.

TEST_ZTEST_TIMEOUT=3600

Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547 
Closes #5503 
Closes #5523 
Closes #6377 
Closes #6495
2017-08-11 08:51:44 -07:00
Brian Behlendorf
46364cb2f3 Add libtpool (thread pools)
OpenZFS provides a library called tpool which implements thread
pools for user space applications.  Porting this library means
the zpool utility no longer needs to borrow the kernel mutex and
taskq interfaces from libzpool.  This code was updated to use
the tpool library which behaves in a very similar fashion.

Porting libtpool was relatively straight forward and minimal
modifications were needed.  The core changes were:

* Fully convert the library to use pthreads.
* Updated signal handling.
* lmalloc/lfree converted to calloc/free
* Implemented portable pthread_attr_clone() function.

Finally, update the build system such that libzpool.so is no
longer linked in to zfs(8), zpool(8), etc.  All that is required
is libzfs to which the zcommon soures were added (which is the way
it always should have been).  Removing the libzpool dependency
resulted in several build issues which needed to be resolved.

* Moved zfeature support to module/zcommon/zfeature_common.c
* Moved ratelimiting to to module/zfs/zfs_ratelimit.c
* Moved get_system_hostid() to lib/libspl/gethostid.c
* Removed use of cmn_err() in zcommon source
* Removed dprintf_setup() call from zpool_main.c and zfs_main.c
* Removed highbit() and lowbit()
* Removed unnecessary library dependencies from Makefiles
* Removed fletcher-4 kstat in user space
* Added sha2 support explicitly to libzfs
* Added highbit64() and lowbit64() to zpool_util.c

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6442
2017-08-09 15:31:08 -07:00
gaurkuma
520faf5ddc Crash in dbuf_evict_one with DTRACE_PROBE
Update the dbuf__evict__one() tracepoint so that it can safely
handle a NULL dmu_buf_impl_t pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>    
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #6463
2017-08-09 11:04:41 -07:00
Oleg Drokin
98cdcb8286 Remove misguided HAVE_MUTEX_OWNER check, take 2
It is just plain unsafe to peek inside in-kernel
mutex structure and make assumptions about what kernel
does with those internal fields like owner.

Kernel is all too happy to stop doing the expected things
like tracing lock owner once you load a tainted module
like spl/zfs that is not GPL.

As such you will get instant assertion failures like this:

  VERIFY3(((*(volatile typeof((&((&zo->zo_lock)->m_mutex))->owner) *)&
      ((&((&zo->zo_lock)->m_mutex))->owner))) ==
     ((void *)0)) failed (ffff88030be28500 == (null))
  PANIC at zfs_onexit.c:104:zfs_onexit_destroy()
  Showing stack for process 3626
  CPU: 0 PID: 3626 Comm: mkfs.lustre Tainted: P OE ------------ 3.10.0-debug #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
  dump_stack+0x19/0x1b
  spl_dumpstack+0x44/0x50 [spl]
  spl_panic+0xbf/0xf0 [spl]
  zfs_onexit_destroy+0x17c/0x280 [zfs]
  zfsdev_release+0x48/0xd0 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #639
Closes #632
2017-08-02 20:50:27 -07:00
Gvozden Neskovic
261a3151e1 spl-mutex: fix race in mutex_exit
Prevent race on accessing kmutex_t when the mutex is
embedded in a ref counted structure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes zfsonlinux/zfs#6401
Closes #637
2017-08-02 20:42:58 -07:00
Brian Behlendorf
549423c0d4 Revert "Remove misguided HAVE_MUTEX_OWNER check"
This reverts commit d89616fda8 which
introduced some build failures which need to be resolved before
this can be merged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #633
2017-08-02 15:08:02 -04:00
Oleg Drokin
d89616fda8 Remove misguided HAVE_MUTEX_OWNER check
It is just plain unsafe to peek inside in-kernel
mutex structure and make assumptions about what kernel
does with those internal fields like owner.

Kernel is all too happy to stop doing the expected things
like tracing lock owner once you load a tainted module
like spl/zfs that is not GPL.

As such you will get instant assertion failures like this:

  VERIFY3(((*(volatile typeof((&((&zo->zo_lock)->m_mutex))->owner) *)&
      ((&((&zo->zo_lock)->m_mutex))->owner))) == 
     ((void *)0)) failed (ffff88030be28500 == (null))
  PANIC at zfs_onexit.c:104:zfs_onexit_destroy()
  Showing stack for process 3626
  CPU: 0 PID: 3626 Comm: mkfs.lustre Tainted: P OE ------------ 3.10.0-debug #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
  dump_stack+0x19/0x1b
  spl_dumpstack+0x44/0x50 [spl]
  spl_panic+0xbf/0xf0 [spl]
  zfs_onexit_destroy+0x17c/0x280 [zfs]
  zfsdev_release+0x48/0xd0 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #632 
Closes #633
2017-08-02 11:45:16 -07:00
Ned Bass
8740cf4a2f Add line info and SET_ERROR() to ZFS debug log
Redefine the SET_ERROR macro in terms of __dprintf() so the error
return codes get logged as both tracepoint events (if tracepoints are
enabled) and as ZFS debug log entries.  This also allows us to use
the same definition of SET_ERROR() in kernel and user space.

Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise
or'd into zfs_flags. Setting this flag enables both dprintf() and
SET_ERROR() messages in the debug log. That is, setting
ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF|ZFS_DEBUG_SET_ERROR are
equivalent (this was done for sake of simplicity). Leaving
ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which
helps avoid cluttering up the logs.

To enable SET_ERROR() logging, run:

  echo 1 >   /sys/module/zfs/parameters/zfs_dbgmsg_enable
  echo 512 > /sys/module/zfs/parameters/zfs_flags

Remove the zfs_set_error_class tracepoints event class since
SET_ERROR() now uses __dprintf(). This sacrifices a bit of
granularity when selecting individual tracepoint events to enable but
it makes the code simpler.

Include file, function, and line number information in debug log
entries.  The information is now added to the message buffer in
__dprintf() and as a result the zfs_dprintf_class tracepoints event
class was changed from a 4 parameter interface to a single parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6400
2017-07-25 23:09:48 -07:00
Olaf Faaland
0582e40322 Add callback for zfs_multihost_interval
Add a callback to wake all running mmp threads when
zfs_multihost_interval is changed.

This is necessary when the interval is changed from a very large value
to a significantly lower one, while pools are imported that have the
multihost property enabled.

Without this commit, the mmp thread does not wake up and detect the new
interval until after it has waited the old multihost interval time.  A
user monitoring mmp writes via the provided kstat would be led to
believe that the changed setting did not work.

Added a test in the ZTS under mmp to verify the new functionality is
working.

Added a test to ztest which starts and stops mmp threads, and calls into
the code to signal sleeping mmp threads, to test for deadlocks or
similar locking issues.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6387
2017-07-25 13:22:20 -04:00
Serapheim Dimitropoulos
829f9251cf OpenZFS 8491 - uberblock on-disk padding to reserve space for smoothly merging zpool checkpoint & MMP in ZFS
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds three new fields
in the uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed
and deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.

Porting Notes: Preserved MMP comments in uberblock struct.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8491
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d84fa5f
Closes #6390
2017-07-24 13:47:51 -04:00
Brian Behlendorf
36ba27e9e0 Linux 4.13 compat: bio->bi_status and blk_status_t
Commit torvalds/linux@4e4cbee9.  The bio->bi_error field was
replaced with bio->bi_status which is an enum that describes
all possible error types.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6351
2017-07-23 19:37:12 -07:00
Brian Behlendorf
944117514d Linux 4.13 compat: wait queues
Commit torvalds/linux@ac6424b9
- Renamed struct wait_queue -> struct wait_queue_entry.

Commit torvalds/linux@2055da97
- Renamed wait_queue_head::task_list -> wait_queue_head::head
- Renamed wait_queue_entry::task_list -> wait_queue_entry::entry

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #629
2017-07-23 19:32:14 -07:00
Olaf Faaland
379ca9cf2b Multi-modifier protection (MMP)
Add multihost=on|off pool property to control MMP.  When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp.  Property defaults to off.

During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock.  Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".

Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval.  The period is specified in milliseconds.  The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.

Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path.  Abbreviated
output below.

$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg   timestamp  mmp_delay   vdev_guid   vdev_label vdev_path
20468    261337  250274925   68396651780       3    /dev/sda
20468    261339  252023374   6267402363293     1    /dev/sdc
20468    261340  252000858   6698080955233     1    /dev/sdx
20468    261341  251980635   783892869810      2    /dev/sdy
20468    261342  253385953   8923255792467     3    /dev/sdd
20468    261344  253336622   042125143176      0    /dev/sdab
20468    261345  253310522   1200778101278     2    /dev/sde
20468    261346  253286429   0950576198362     2    /dev/sdt
20468    261347  253261545   96209817917       3    /dev/sds
20468    261349  253238188   8555725937673     3    /dev/sdb

Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.

When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test.  For example, the
pool is exported to run zdb and then imported again.  Add a new ztest
function, "-M", to alter ztest behavior to prevent this.

Add new tests to verify the new functionality.  Tests provided by
Giuseppe Di Natale.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #745
Closes #6279
2017-07-13 13:54:00 -04:00
Prakash Surya
dfbd813ec7 Add ASSERT3B/VERIFY3B/USEC2NSEC/NSEC2USEC macros
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #627
2017-07-13 13:19:15 -04:00
Dave Eddy
12fa0466df OpenZFS 6939 - add sysevents to zfs core for commands
Authored by: Dave Eddy <dave@daveeddy.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6939
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce1577b
Closes #6328
2017-07-12 21:28:13 -07:00
LOLi
cf8738d853 Add port of FreeBSD 'volmode' property
The volmode property may be set to control the visibility of ZVOL
block devices.

This allow switching ZVOL between three modes:
   full - existing fully functional behaviour (default)
   dev  - hide partitions on ZVOL block devices
   none - not exposing volumes outside ZFS

Additionally the new zvol_volmode module parameter can be used to
control the default behaviour.

This functionality can be used, for instance, on "backup" pools to
avoid cluttering /dev with unneeded zd* devices.

Original-patch-by: mav <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb
Closes #1796 
Closes #3438 
Closes #6233
2017-07-12 13:05:37 -07:00
Yuri Pankov
e19572e4cc OpenZFS 5428 - provide fts(), reallocarray(), and strtonum()
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* All hunks unrelated to ZFS were dropped.

OpenZFS-issue: https://www.illumos.org/issues/5428
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130
Closes #6326
2017-07-08 20:35:35 -07:00
Matthew Ahrens
a896468c78 OpenZFS 8067 - zdb should be able to dump literal embedded block pointer
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8067
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085
Closes #6319
2017-07-07 11:28:01 -07:00
Alek P
0ea05c64f8 Implemented zpool scrub pause/resume
Currently, there is no way to pause a scrub. Pausing may
be useful when the pool is busy with other I/O to preserve
bandwidth.

This patch adds the ability to pause and resume scrubbing.
This is achieved by maintaining a persistent on-disk scrub state.
While the state is 'paused' we do not scrub any more blocks.
We do however perform regular scan housekeeping such as
freeing async destroyed and deadlist blocks while paused.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Serapheim Dimitropoulos <serapheimd@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6167
2017-07-06 22:16:13 -07:00
Andriy Gapon
8ca78ab002 OpenZFS 7600 - zfs rollback should pass target snapshot to kernel
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

The existing kernel-side code only provides a method to rollback to a
latest snapshot, whatever it happens to be at the time when the rollback
is actually done.  That could be unsafe or confusing in environments
where concurrent DSL changes are possible as the resulting state could
correspond to a newer or older snapshot than the originally requested
one.
This change allows to amend that method such that the rollback is
performed only when the latest snapshot has a specific name.  That is,
if a new snapshot is concurrently created or the target snapshot is
destroyed, then no rollback is done and EXDEV error is returned.
New libzfs_core function lzc_rollback_to() is provided for the new
functionality.  libzfs is changed to use lzc_rollback_to() to implement
zfs rollback command.
Perhaps we should return different errors to distinguish the case where
the desired snapshot exists but it's not the latest snapshot and the
case where the desired snapshot does not exist.

OpenZFS-issue: https://www.illumos.org/issues/7600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3d645eb
Closes #6292
2017-07-04 15:29:52 -07:00
Andriy Gapon
1579dcbc91 OpenZFS 8416 - abd.h is not C++ friendly
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8416
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/589c189
Closes #6288
2017-06-30 11:11:01 -07:00
Andriy Gapon
2a53d6a12a OpenZFS 8426 - mark immutable buffer arguments as such in abd.h
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8426
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/37359a6
Closes #6287
2017-06-30 11:11:01 -07:00
Giuseppe Di Natale
d12f91fde3 OpenZFS 8264 - want support for promoting datasets in libzfs_core
Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@kebe.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8264
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a4b8c9a
Closes #6254
2017-06-26 16:56:09 -07:00
Tony Hutter
29eb494285 Dashes for zero latency values in zpool iostat -p
This prints dashes instead of zeros for zero latency values in
'zpool iostat -p'.  You'll get zero latencies reported when the
disk is idle, but technically a zero latency is invalid, since you
can't measure the latency of doing nothing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6210
2017-06-22 09:39:01 -07:00
Don Brady
0241e491a0 Inject zinject(8) a percentage amount of dev errs
In the original form of device error injection, it was an all or nothing
situation.  To help simulate intermittent error conditions, you can now
specify a real number percentage value. This is also very useful for our
ZFS fault diagnosis testing and for injecting intermittent errors during
load testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6227
2017-06-16 17:21:11 -07:00
Richard Yao
5228cf0116 Make zvol operations use _by_dnode routines
This continues what was started in
0eef1bde31 by fully converting zvols
to avoid unnecessary dnode_hold() calls. This saves a small amount
of CPU time and slightly improves latencies of operations on zvols.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@prophetstor.com>
Closes #6058
2017-06-13 09:18:08 -07:00
Matthew Ahrens
dbeb879699 OpenZFS 8199 - multi-threaded dmu_object_alloc()
dmu_object_alloc() is single-threaded, so when multiple threads are
creating files in a single filesystem, they spend a lot of time waiting
for the os_obj_lock.  To improve performance of multi-threaded file
creation, we must make dmu_object_alloc() typically not grab any
filesystem-wide locks.

The solution is to have a "next object to allocate" for each CPU. Each
of these "next object"s is in a different block of the dnode object, so
that concurrent allocation holds dnodes in different dbufs.  When a
thread's "next object" reaches the end of a chunk of objects (by default
4 blocks worth -- 128 dnodes), it will be reset to the per-objset
os_obj_next, which will be increased by a chunk of objects (128).  Only
when manipulating the os_obj_next will we need to grab the os_obj_lock.
This decreases lock contention dramatically, because each thread only
needs to grab the os_obj_lock briefly, once per 128 allocations.

This results in a 70% performance improvement to multi-threaded object
creation (where each thread is creating objects in its own directory),
from 67,000/sec to 115,000/sec, with 8 CPUs.

Work sponsored by Intel Corp.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8199
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
Closes #4703
Closes #6117
2017-06-09 09:43:26 -07:00
Giuseppe Di Natale
1b7c1e5ce9 OpenZFS 7578 - Fix/improve some aspects of ZIL writing
- After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

 - Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG.  Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.

 - Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size.  In some
cases efficiency stopped to almost as low as 50%.  In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas.  This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently.  Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

 - While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Sponsored by:   iXsystems, Inc.

Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7578
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13ac
Closes #6191
2017-06-09 09:15:37 -07:00
Matthew Ahrens
82644107c4 OpenZFS 8155 - simplify dmu_write_policy handling of pre-compressed buffers
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

When writing pre-compressed buffers, arc_write() requires that
the compression algorithm used to compress the buffer matches
the compression algorithm requested by the zio_prop_t, which is
set by dmu_write_policy(). This makes dmu_write_policy() and its
callers a bit more complicated.

We simplify this by making arc_write() trust the caller to supply
the type of pre-compressed buffer that it wants to write,
and override the compression setting in the zio_prop_t.

OpenZFS-issue: https://www.illumos.org/issues/8155
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58
Closes #6200
2017-06-07 14:16:01 -04:00
LOLi
3e6c943347 Linux 4.12 compat: fix super_setup_bdi_name() call
Provide a format parameter to super_setup_bdi_name() so we don't
create duplicate names in '/devices/virtual/bdi' sysfs namespace which
would prevent us from mounting more than one ZFS filesystem at a time.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6147
2017-05-25 09:55:55 -07:00
Alek P
bec1067d54 Implemented zpool sync command
This addition will enable us to sync an open TXG to the main pool
on demand. The functionality is similar to 'sync(2)' but 'zpool sync'
will return when data has hit the main storage instead of potentially
just the ZIL as is the case with the 'sync(2)' cmd.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6122
2017-05-19 12:33:11 -07:00
Tony Hutter
4a283c7f77 Force fault a vdev with 'zpool offline -f'
This patch adds a '-f' option to 'zpool offline' to fault a vdev
instead of bringing it offline.  Unlike the OFFLINE state, the
FAULTED state will trigger the FMA code, allowing for things like
autoreplace and triggering the slot fault LED.  The -f faults
persist across imports, unless they were set with the temporary
(-t) flag.  Both persistent and temporary faults can be cleared
with zpool clear.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6094
2017-05-19 12:30:16 -07:00
Brian Behlendorf
a9f0d7dd89 Fix large dnode send stream flag conflict
Bit 21 of the send stream flags was inadvertently used for two
different features under concurrent development.  To avoid any
future compatibility problems the large dnode flag is being
switched to bit 23 which is unused.

The large dnode feature has only been present in pre-releases of
ZoL and dnodesize defaults to legacy which is compatible with
existing OpenZFS implementations.  Users with dnodesize=auto
needing to use zfs send/recv must update ZoL on both the
source and destination systems.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6139
2017-05-18 10:02:16 -07:00
Isaac Huang
3d6da72d18 Skip spurious resilver IO on raidz vdev
On a raidz vdev, a block that does not span all child vdevs, excluding
its skip sectors if any, may not be affected by a child vdev outage or
failure. In such cases, the block does not need to be resilvered.
However, current resilver algorithm simply resilvers all blocks on a
degraded raidz vdev. Such spurious IO is not only wasteful, but also
adds the risk of overwriting good data.

This patch eliminates such spurious IOs.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #5316
2017-05-12 17:28:03 -07:00
Matthew Ahrens
4747a7d3d4 OpenZFS 8063 - verify that we do not attempt to access inactive txg
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.

Porting Notes:
- ASSERTV added to txg_sync_waiting() for unused variable.

OpenZFS-issue: https://www.illumos.org/issues/8063
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46
Closes #6109
2017-05-10 13:52:22 -04:00
Brian Behlendorf
2946a1a15a Linux 4.12 compat: CURRENT_TIME removed
Linux 4.9 added current_time() as the preferred interface to get
the filesystem time.  CURRENT_TIME was retired in Linux 4.12.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6114
2017-05-10 09:30:48 -07:00
LOLi
a3eeab2de6 Add property overriding (-o|-x) to 'zfs receive'
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.

This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.

References:
   https://www.illumos.org/issues/2745
   https://www.illumos.org/issues/3753

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1350 
Closes #5349
2017-05-09 16:21:09 -07:00
Christian Schwarz
305bc4b370 Make createtxg and guid properties public
Document the existence of `createtxg` and `guid` native properties
in man pages and zfs command output.

One of the great features of ZFS is incremental replication of
snapshots, possibly between pools on different machines.

Shell scripts are commonly used to auomate this procedure. They have to
find the most recent common snapshot between both sides and then
perform incremental send & recv.
Currently, scripts rely on the sorting order of `zfs list`, which
defaults to `createtxg`, and the assumption that snapshot names on
either side do not change.

By making `createtxg` and `guid` part of the public ZFS interface,
scripts are enabled to use

  a) `createtxg` to determine the logical & temporal order of snapshots
     (the creation property is not an equivalent substitute since
      multiple snapshots may be created within one second)
  b) `guid` to uniquely identify a snapshot, independent of its current
      display name

This has the potential of making scripts safer and correct.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #6102
2017-05-09 15:36:53 -07:00
Chunwei Chen
e624cd1959 Linux 4.12 compat: PF_FSTRANS was removed
zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related
check, so we use them accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6113
2017-05-09 10:38:46 -07:00
Chunwei Chen
8f87971e1f Linux 4.12 compat: PF_FSTRANS was removed
Change SPL_FSTRANS to optionally contains PF_FSTRANS. Also, add
__spl_pf_fstrans_check for the checks specifically for PF_FSTRANS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #614
2017-05-09 10:36:54 -07:00
Gvozden Neskovic
c17486b217 Add missing *_destroy/*_fini calls
The proposed debugging enhancements in zfsonlinux/spl#587
identified the following missing *_destroy/*_fini calls.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5428
2017-05-04 19:26:28 -04:00
Richard Yao
bc17f1047a Enable Linux read-ahead for a single page on ZVOLs
Linux has read-ahead logic designed to accelerate sequential workloads.
ZFS has its own read-ahead logic called zprefetch that operates on both
ZVOLs and datasets. Having two prefetchers active at the same time can
cause overprefetching, which unnecessarily reduces IOPS performance on
CoW filesystems like ZFS.

Testing shows that entirely disabling the Linux prefetch results in
a significant performance penalty for reads while commensurate benefits
are seen in random writes. It appears that read-ahead benefits are
inversely proportional to random write benefits, and so a single page
of Linux-layer read-ahead appears to offer the middle ground for both
workloads.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #5902
2017-05-04 18:00:27 -04:00
LOLi
dddef7d600 More ashift improvements
This commit allow higher ashift values (up to 16) in 'zpool create'

The ashift value was previously limited to 13 (8K block) in b41c990
because the limited number of uberblocks we could fit in the
statically sized (128K) vdev label ring buffer could prevent the
ability the safely roll back a pool to recover it.

Since b02fe35 the largest uberblock size we support is 8K: this
allow us to store a minimum number of 16 uberblocks in the vdev
label, even with higher ashift values.

Additionally change 'ashift' pool property behaviour: if set it will
be used as the default hint value in subsequent vdev operations
('zpool add', 'attach' and 'replace'). A custom ashift value can still
be specified from the command line, if desired.

Finally, fix a bug in add-o_ashift.ksh caused by a missing variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #2024 
Closes #4205 
Closes #4740 
Closes #5763
2017-05-03 09:31:05 -07:00
Olaf Faaland
9d3f7b8791 Write label 2,3 uberblocks when vdev expands
When vdev_psize increases, the location of labels 2 and 3 changes
because their location is relative to the end of the device.

The configs for labels 2 and 3 are written during the next spa_sync()
because the vdev is added to the dirty config list.  However, the
uberblock rings are not re-written in their new location, leaving the
device vulnerable to the beginning of the device being overwritten or
damaged.

This patch copies the uberblock ring from label 0 to labels 2 and 3,
in their new locations, at the next sync after vdev_psize increases.

Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks
are copied.

Reviewed-by: BearBabyLiu <liu.huang@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5108
2017-05-02 13:55:24 -07:00
LOLi
e7fbeb606a Add zfs_nicebytes() to print human-readable sizes
* Add zfs_nicebytes() to print human-readable sizes

Some 'zfs', 'zpool' and 'zdb' output strings can be confusing to the
user when no units are specified. This add a new zfs_nicenum_format
"ZFS_NICENUM_BYTES" used to print bytes in their human-readable form.

Additionally, update some test cases to use machine-parsable 'zfs get'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #2414 
Closes #3185 
Closes #3594 
Closes #6032
2017-05-02 13:43:53 -07:00
Brian Behlendorf
7dae2c81e7 Linux 4.12 compat: super_setup_bdi_name()
All filesystems were converted to dynamically allocated BDIs.  The
destruction of backing_dev_info structures is handled as part of
super block destruction.  Refactor the code to abstract away the
details of creating and destroying a BDI.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6089
2017-05-02 09:46:18 -07:00
Chunwei Chen
692e55b8fe Reinstate zvol_taskq to fix aio on zvol
Commit 37f9dac removed the zvol_taskq for processing zvol requests.
This was removed as part of switching to make_request_fn and was
motivated by a concern at the time over dispatch latency.

However, this also made all bio request synchronous, and caused
serious performance issues as the bio submitter would wait for
every bio it submitted, effectively making the IO depth 1.

This patch reinstate zvol_taskq, and to make sure overlapped I/Os
are ordered properly, we take range lock in zvol_request, and pass
it along with bio to the I/O functions zvol_{write,discard,read}.

In order to facilitate benchmarks a zvol_request_sync module
option was added to switch between sync and async request handling.
For the moment, the default behavior is synchronous but this is
likely to change pending additional testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5824
2017-04-26 13:54:40 -07:00
Tony Hutter
d6418de057 Prebaked scripts for zpool status/iostat -c
This patch updates the "zpool status/iostat -c" commands to only run
"pre-baked" scripts from the /etc/zfs/zpool.d directory (or wherever
you install to).  The scripts can only be run from -c as an unprivileged
user (unless the ZPOOL_SCRIPTS_AS_ROOT environment var is
set by root).  This was done to encourage scripts to be written is such
a way that normal users can use them, and to be cautious.  If your
script needs to run a privileged command, consider adding the
appropriate line in /etc/sudoers.  See zpool(8) for an example of how
to do this.

The patch also allows the scripts to output custom column names.  If
the script outputs a line like:

name=value

then "name" is used for the column name, and "value" is its value.
Multiple columns can be specified by outputting multiple lines.  Column
names and values can have spaces.  If the value is empty, a dash (-) is
printed instead.

After all the "name=value" lines are read (if any), zpool will take the
next the next line of output (if any) and print it without a column
header.  After that, no more lines will be processed. This can be
useful for printing errors.

Lastly, this patch also disables the -c option with the latency and
request size histograms, since it produced awkward output and made the
code harder to maintain.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5852
2017-04-21 09:27:04 -07:00
Brian Behlendorf
e550644f0c OpenZFS 5120 - zfs should allow large block/gzip/raidz boot pool (loader project)
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- grub-2.02-beta2-422-gcad5cc0 includes support for large blocks.
- Commit 8aab121 allowed GZIP[1-9].
- Grub allows pools with multiple top-level vdevs.

OpenZFS-issue: https://www.illumos.org/issues/5120
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c8811bd
Closes #6007
2017-04-13 09:40:00 -07:00
Yuri Pankov
dbb38f6605 OpenZFS 6865 - want zfs-tests cases for zpool labelclear command
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Updated 'zpool labelclear' and 'zdb -l' such that they attempt
  to find a vdev given solely its short name.  This behavior is
  consistent with the upstream OpenZFS code and the test cases
  depend on it.  The actual implementation differs slightly due
  to device naming conventions on Linux.
- auto_online_001_pos, auto_replace_001_pos and add-o_ashift
  test cases updated to expect failure when no label exists.
- read_efi_label() and zpool_label_disk_check() are read-only
  operations and should use O_RDONLY at open time to enforce this.
- zpool_label_disk() and zpool_relabel_disk() write the partition
  information using O_DIRECT an fsync() and page cache invalidation
  to ensure a consistent view of the device.
- dump_label() in zdb should invalidate the page cache in order
  to get the authoritative label from disk.

OpenZFS-issue: https://www.illumos.org/issues/6865
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95076c
Closes #5981
2017-04-11 09:54:39 -07:00
Giuseppe Di Natale
42db43e982 OpenZFS 2932 - support crash dumps to raidz, etc. pools
Authored by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2932
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/810e43b
Closes #5984
Closes #5216
2017-04-10 10:24:17 -07:00
George Wilson
3b7f360c96 OpenZFS 8023 - Panic destroying a metaslab deferred range tree
Authored by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

We don't want to dirty any data when we're in the final txgs of the pool
export logic. This change introduces checks to make sure that no data is
dirtied after a certain point. It also addresses the culprit of this
specific bug – the space map cannot be upgraded when we're in final
stages of pool export. If we encounter a space map that wants to be
upgraded in this phase, then we simply ignore the request as it will get
retried the next time we set the fragmentation metric on that metaslab.

OpenZFS-issue: https://www.illumos.org/issues/8023
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2ef00f5
Closes #5991
2017-04-09 16:12:35 -07:00
Toomas Soome
8aab121821 OpenZFS 7404 - rootpool_007_neg, bootfs_006_pos and bootfs_008_neg tests fail with the loader project bits
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Marcel Telka <marcel@telka.sk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Removed gzip and zle compression restriction on bootfs
  datasets.  Grub added support for these long ago.  Ay
  version of grub which understands lz4 also supports this.
- Enabled rootpool tests in runfile but skipped by default
  in setup on Linux since they modify the rootpool.
- bootfs_006_pos.ksh, striped pools are allowed as bootfs.

OpenZFS-issue: https://www.illumos.org/issues/7404
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/55a424c
Closes #5982
2017-04-07 14:18:19 -07:00
N Clark
bcdb96a3e1 Additional Information for Zedlets
* Add ZPOOL pool state to zfs_post_common to
  allow differentiation between export and destroy
  by zedlets.

* Add pool name as standard export  This ensures
  pool name is exported to zedlets.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Closes #5942
2017-04-03 14:23:02 -07:00
Gvozden Neskovic
84c07adadb Remove dependency on linear ABD
Wherever possible it's best to avoid depending on a linear ABD.
Update the code accordingly in the following areas.

- vdev_raidz
- zio, zio_checksum
- zfs_fm
- change abd_alloc_for_io() to use abd_alloc()

Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5668
2017-03-29 12:24:51 -07:00
LOLi
ff61d1a495 Check ashift validity in 'zpool add'
df83110 added the ability to specify a custom "ashift" value from the command
line in 'zpool add' and 'zpool attach'. This commit adds additional checks to
the provided ashift to prevent invalid values from being used, which could
result in disastrous consequences for the whole pool.

Additionally provide ASHIFT_MAX and ASHIFT_MIN definitions in spa.h.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #5878
2017-03-28 17:21:11 -07:00
Matthew Ahrens
64fc776208 OpenZFS 7968 - multi-threaded spa_sync()
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

spa_sync() iterates over all the dirty dnodes and processes each of them
by calling dnode_sync(). If there are many dirty dnodes (e.g. because we
created or removed a lot of files), the single thread of spa_sync()
calling dnode_sync() can become a bottleneck. Additionally, if many
dnodes are dirtied concurrently in open context (e.g. due to concurrent
file creation), the os_lock will experience lock contention via
dnode_setdirty().

The solution is to track dirty dnodes on a multilist_t, and for
spa_sync() to use separate threads to process each of the sublists in
the multilist.

OpenZFS-issue: https://www.illumos.org/issues/7968
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4a2a54c
Closes #5752
2017-03-20 18:36:00 -07:00
Olaf Faaland
a3478c0747 Linux 4.11 compat: iops.getattr and friends
In torvalds/linux@a528d35, there are changes to the getattr family of functions,
struct kstat, and the interface of inode_operations .getattr.

The inode_operations .getattr and simple_getattr() interface changed to:

int (*getattr) (const struct path *, struct dentry *, struct kstat *,
    u32 request_mask, unsigned int query_flags)

The request_mask argument indicates which field(s) the caller intends to use.
Fields the caller has not specified via request_mask may be set in the returned
struct anyway, but their values may be approximate.

The query_flags argument indicates whether the filesystem must update
the attributes from the backing store.

Currently both fields are ignored.  It is possible that getattr-related
functions within zfs could be optimized based on the request_mask.

struct kstat includes new fields:
u32               result_mask;  /* What fields the user got */
u64               attributes;   /* See STATX_ATTR_* flags */
struct timespec   btime;        /* File creation time */

Fields attribute and btime are cleared; the result_mask reflects this.  These
appear to be optional based on simple_getattr() and vfs_getattr() within the
kernel, which take the same approach.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5875
2017-03-20 17:51:16 -07:00
Olaf Faaland
9a054d54fb Linux 4.11 compat: add linux/sched/signal.h
In Linux 4.11, torvalds/linux@2a1f062, signal handling related functions
were moved from sched.h into sched/signal.h.

Add configure checks to detect this and include the new file where
needed.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608
2017-03-20 17:43:43 -07:00
Brian Behlendorf
1c2555ef92 Restructure mount option handling
Restructure the handling of mount options to be consistent with
upstream OpenZFS.  This required making the following changes.

- The zfs_mntopts_t was renamed vfs_t and adjusted to provide
  the minimal needed functionality.  This includes a pointer
  back to the associated zfsvfs_t.  Plus it made it possible
  to revert zfs_register_callbacks() and zfsvfs_create() back
  to their original prototypes.

- A zfs_mnt_t structure was added for the sole purpose of
  providing a structure to pass the osname and raw mount
  pointer to zfs_domount() without having to copy them.

- Mount option parsing was moved down from the zpl_* wrapper
  functions in to the zfs_* functions.  This allowed for the
  code to be simplied and it's where similar functionality
  appears on other platforms.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-03-10 09:51:41 -08:00
Brian Behlendorf
f298b24ddf Rename zfs_* functions
Several functions were renamed when ZFS was originally ported to
Linux.  Revert the code to the original names to minimize the
delta with upstream OpenZFS.

  zfs_sb_teardown -> zfsvfs_teardown
  zfs_sb_create -> zfsvfs_create
  zfs_sb_setup -> zfsvfs_setup
  zfs_sb_free -> zfsvfs_free
  get_zfs_sb -> getzfsvfs
  zfs_sb_hold -> zfsvfs_hold
  zfs_sb_rele -> zfsvfs_rele

  zfs_sb_prune_aliases  -> zfs_prune_aliases (Linux-only)
  zfs_sb_prune -> zfs_prune (Linux only)

Align the zfs_vnops.h and zfs_vfsops.h with upstream as much
as possible.  Several prototypes were removed and those that
remain were reordered.

Move the EXPORT_SYMBOL lines to the end of the source files
for consistency with the other source files.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-03-10 09:51:35 -08:00
Brian Behlendorf
0037b49e83 Rename zfs_sb_t -> zfsvfs_t
The use of zfs_sb_t instead of zfsvfs_t results in unnecessary
conflicts with the upstream source.  Change all instances of
zfs_sb_t to zfsvfs_t including updating the variables names.

Whenever possible the code was updated to be consistent with
hope it appears in the upstream OpenZFS source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-03-10 09:51:33 -08:00
Chunwei Chen
9b77d1c958 Fix nfs snapdir automount
The current implementation for allowing nfs to access snapdir is very buggy.
It uses a special fh for snapdirs, such that the next time nfsd does
fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd
never knows it cross a mountpoint.

The problem is that nfsd will not hold a reference on the vfsmount of the
snapshot. This cause auto unmounter to unmount the snapshot even though nfs is
still holding dentries in it.

To fix this, we return the inode for the snapdirs themselves. However, we also
trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate
and see the mountpoint and do crossmnt.

Because nfsd will now be aware that these are different filesystems users
must add crossmnt to their export options to access snapshot directories.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3794
Closes #4716
Closes #5810 
Closes #5833
2017-03-08 09:26:33 -08:00
Tony Hutter
463009865f Fix harmless "BARRIER is deprecated" kernel warning on Centos 6.8
A one time warning after module load that "BARRIER is deprecated" was seen
on the heavily patched 2.6.32-642.13.1.el6.x86_64 Centos 6.8 kernel.  It seems
that kernel had both the old BARRIER and the newer FLUSH/FUA interfaces
defined.  This fixes the warning by prefering the newer FLUSH/FUA interface
if it's available.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5739 
Closes #5828
2017-03-08 09:20:21 -08:00
bunder2015
5fc73c46f9 Fix multi-line error messages in blkdev_compat.h
Fix multi-line error messages in blkdev_compat.h by changing
error-generating multi-line error messages to single line errors.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #5860
2017-03-07 09:54:55 -08:00
Brian Behlendorf
3ec3bc2167 OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.

The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).

This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().

The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.

We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
logic attempted to predict what will be on disk when this txg syncs, to
know if the overwritten block will be freed (i.e. exists, and has no
snapshots).

OpenZFS-issue: https://www.illumos.org/issues/7793
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a
Upstream bugs: DLPX-32883a
Closes #5804 

Porting notes:
- DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(),
  Using the default dnode size would be slightly better.
- DEBUG_DMU_TX wrappers and configure option removed.
- Resolved _by_dnode() conflicts these changes have not yet been
  applied to OpenZFS.
2017-03-07 09:51:59 -08:00
Olaf Faaland
4859fe796c Linux 4.11 compat: avoid refcount_t name conflict
Linux 4.11 introduces a new type, refcount_t, which conflicts with the
type of the same name defined within ZFS.

Rename the ZFS type zfs_refcount_t.  Within the ZFS code, use a macro to
cause references to refcount_t to be changed to zfs_refcount_t at
compile time.  This reduces conflicts when later landing OpenZFS
patches.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5823 
Closes #5842
2017-02-28 16:10:18 -08:00
Matthew Ahrens
66eead53c9 Clean up by-dnode code in dmu_tx.c
0eef1bde31
introduced some changes which we slightly improved the style of when
porting to illumos.

There is also one minor error-handling fix, in zap_add() the "zap" may
become NULL in case of an error re-opening the ZAP.

Originally suggested at: https://github.com/openzfs/openzfs/pull/276

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #5805
2017-02-24 13:34:26 -08:00
Andriy Gapon
0efd97912a OpenZFS 7199 - dsl_dataset_rollback_sync may try to free already free blocks
7200 no blocks must be born in a txg after a snaphot is created
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7199
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bfaed0b
Closes #5817
2017-02-24 11:05:33 -08:00
Matthew Ahrens
c30e58c462 zfs_arc_num_sublists_per_state should be common to all multilists
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.  This tuning may be overridden
on a per-multilist basis.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #5764
2017-02-15 15:49:33 -08:00
Matthew Ahrens
d7958b4cda OpenZFS 7104 - increase indirect block size
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7104
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4b5c8e9
Closes #5679
2017-02-09 10:27:02 -08:00
Giuseppe Di Natale
d21d5b8248 OpenZFS 4521 - zfstest is trying to execute evil "zfs unmount -a"
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
- Correctly set __ZFS_POOL_RESTRICT in inherit_001_pos

OpenZFS-issue: https://www.illumos.org/issues/4521
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8808ac5
Closes #5674
2017-02-03 13:24:44 -08:00
George Melikov
9b7b9cd370 OpenZFS 1300 - filename normalization doesn't work for removes
Authored by: Kevin Crowe <kevin.crowe@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/1300
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8f1750d
Closes #5725 

Porting notes:
- zap_micro.c: all `MT_EXACT` are replaced by `0`
2017-02-02 14:13:41 -08:00
David Quigley
2fe36b0bfb Use fletcher_4 routines natively with abd_iterate_func()
This patch adds the necessary infrastructure for ABD to make use
of the vectorized fletcher 4 routines.

- export ABD compatible interface from fletcher_4
- add ABD fletcher_4 tests for data and metadata ABD types.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #5589
2017-02-01 09:34:22 -08:00
George Melikov
539d33c791 OpenZFS 6569 - large file delete can starve out write ops
Authored by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
Tested-by: kernelOfTruth <kerneloftruth@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/6569
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1bf4b6f2
Closes #5706
2017-01-31 14:44:03 -08:00
David Quigley
43b857fddb Add a PAGESHIFT definition
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #598
2017-01-31 10:36:18 -08:00
George Melikov
ed828c0c37 OpenZFS 7280 - Allow changing global libzpool variables in zdb and ztest through command line
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7280
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e60744
Closes #5676
2017-01-31 10:13:10 -08:00
George Melikov
28b40c8a6e OpenZFS 7541 - zpool import/tryimport ioctl returns ENOMEM
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>

The refresh_config() calls into the kernel with ZFS_IOC_POOL_TRYIMPORT.
This ioctl returns the config of the pool in a buffer pre-allocated in
userland. The original estimate for the size is too conservative since
it doesn't account for the large size of vdev stats that are added to
the config before returning.
This fix simply increases the size of the buffer passed. This results in
a speed up of the zpool import process, and less spam in zfs_dbgmsg.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7541
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c7690
Closes #5704
2017-01-30 13:20:54 -08:00
George Melikov
fa603f8233 OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's
Porting notes:
- 'zfs_dbgmsg_print()' reintroduced to userspace.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7277
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bdd2f
Closes #5684
2017-01-28 12:16:43 -08:00
George Melikov
a08abc1bb3 OpenZFS 7301 - zpool export -f should be able to interrupt file freeing
Authored by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7301
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb72182
Closes #5680
2017-01-27 11:46:39 -08:00
George Melikov
cc9bb3e58e OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7254
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c166b69
Closes #5670
2017-01-27 11:43:42 -08:00
Tim Chase
258553d3d7 OpenZFS 7613 - ms_freetree[4] is only used in syncing context
metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We
should replace it with two trees: the freeing tree (ranges that we are
freeing this syncing txg) and the freed tree (ranges which have been
freed this txg).

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7613
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8698da2
Closes #5598
2017-01-26 15:27:19 -08:00
George Melikov
9c9531cb6f OpenZFS 7500 - Simplify dbuf_free_range by removing dn_unlisted_l0_blkid
Authored by: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7500
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/653af1b
Closes #5639
2017-01-26 15:15:48 -08:00
George Melikov
39efbde7c5 OpenZFS 6676 - Race between unique_insert() and unique_remove() causes ZFS fsid change
Authored by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6676
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40510e8
Closes #5667
2017-01-26 14:43:28 -08:00
George Melikov
aeacdefedc OpenZFS 7386 - zfs get does not work properly with bookmarks
Authored by: Marcel Telka <marcel@telka.sk>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7386
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edb901a
Closes #5666
2017-01-26 14:42:15 -08:00
George Melikov
774ee3c7ce OpenZFS 7336 - vfork and O_CLOEXEC causes zfs_mount EBUSY
Porting notes:
- statvfs64 is replaced by statfs64.
- ZFS_SUPER_MAGIC definition moved in include/sys/fs/zfs.h
  to share it between user and kernel space.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7336
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dd862f6d
Closes #5651
2017-01-26 12:28:29 -08:00
George Melikov
e2a65adbb8 OpenZFS 6871 - libzpool implementation of thread_create should enforce length is 0
Porting notes:
- Several direct callers of zk_thread_create() are passing TS_RUN for the
length.  The `len` and `state` were inverted,this commit fixes them.

Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov mail@gmelikov.ru

OpenZFS-issue: https://www.illumos.org/issues/6871
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8fc9228
Closes #5621
2017-01-24 09:13:49 -08:00
Brian Behlendorf
e82dbae1ee Fix build-it compilation regression
Accidentally introduced by 4ea3f86.  The BEGIN CSTYLE block cannot
appear half way through a continued #define.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5643
Closes #5644
2017-01-24 08:50:15 -08:00
George Melikov
ec923db25c OpenZFS 7180 - potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7180
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/690041b
Closes #5627
2017-01-23 10:53:46 -08:00
George Melikov
e67a7ffb5d OpenZFS 6052 - decouple lzc_create() from the implementation details
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov mail@gmelikov.ru

OpenZFS-issue: https://www.illumos.org/issues/6052
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/26455f9
Closes #5622
2017-01-23 09:49:57 -08:00
George Melikov
f85c06bedf OpenZFS 7054 - dmu_tx_hold_t should use refcount_t to track space
Authored by: Igor Kozhukhov ikozhukhov@gmail.com
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov mail@gmelikov.ru

OpenZFS-issue: https://www.illumos.org/issues/7054
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c779ad
Closes #5600
2017-01-23 09:36:24 -08:00
George Melikov
4ea3f86426 codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
Chunwei Chen
040dab9939 Suspend/resume zvol for recv and rollback
When doing recv and rollback, dsl_dataset_clone_swap_sync_impl will be
called to swap out the ds_objset and do dmu_objset_evict on the old one.
However, currently zv->zv_objset will not be swapped out accordingly, so
if anyone currently holds a fd on the zvol, we risk hitting a use-after-free.

We fix this by introducing the suspend and resume mechanism of zsb to
zv.  Before recv or rollback, we use zvol_suspend to block all access to
zv_objset and shut it down. After the recv or rollback, we use zvol_resume
to swap in zv_objset with the new ds_objset and unblock the access.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4866 
Closes #5609
2017-01-19 13:56:36 -08:00
George Melikov
7330fc57b7 OpenZFS 7235 - remove unused func dsl_dataset_set_blkptr
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bd56f80
Closes #5604
2017-01-17 15:22:56 -08:00
Brian Behlendorf
648a09adc2 OpenZFS 6550 - cmd/zfs: cleanup gcc warnings
Porting Notes:
- Many of the fixes proposed by this patch were already applied.
In the cases where a different but equivalent fix was made the
code was updated with the OpenZFS version to minimize differences.

Authored by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6550
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c16bcc4
Closes #5591
2017-01-17 14:45:02 -08:00
bzzz77
0eef1bde31 Add *_by-dnode routines
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by 
(objset_t *, uint64_t object).  This change converts some but
not all of the existing consumers.  As performance-sensitive
code paths are discovered they should be converted to use
these routines.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Closes #5534 
Issue #4802
2017-01-13 14:58:41 -08:00
Don Brady
38640550f2 OpenZFS 7743 - per-vdev-zaps init path for upgrade
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Joe Stein <jas14@cs.brown.edu>
Ported-by: Don Brady <don.brady@intel.com>

When loading a pool that had been created before the existance of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.

This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.

The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.

OpenZFS-issue: https://www.illumos.org/issues/7743
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e2d29d0
Closes #5582
2017-01-13 13:50:22 -08:00
Don Brady
4e21fd060a OpenZFS 7303 - dynamic metaslab selection
This change introduces a new weighting algorithm to improve
metaslab selection. The new weighting algorithm relies on the
SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight
now encodes the type of weighting algorithm used (size-based
vs segment-based).

Porting Notes: The metaslab allocation tracing code is conditionally
removed on linux (dependent on mdb debugger).

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Don Brady <don.brady@intel.com>

OpenZFS-issue: https://www.illumos.org/issues/7303
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd
Closes #5404
2017-01-12 11:52:56 -08:00
George Melikov
e9aa730c49 OpenZFS 6328 - Fix cstyle errors in zfs codebase
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/6328
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9a686fb
Closes #5579
2017-01-12 09:42:11 -08:00
George Melikov
24d42e2221 OpenZFS 7259 - DS_FIELD_LARGE_BLOCKS is unused
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch: 241b541 Illumos 5959 - clean up per-dataset feature count code.

This patch simply removes this macro from dsl_dataset.h.

OpenZFS-issue: https://www.illumos.org/issues/7259
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/faa8036
Closes #5544
2017-01-03 12:03:05 -06:00
ka7
4e33ba4c38 Fix spelling
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5547 
Closes #5543
2017-01-03 11:31:18 -06:00
Tim Chase
a5e046eaac 4.10 compat - BIO flag changes and others
[bio] The req_op enum was changed to req_opf.  Update the "Linux 4.8 API"
autotools checks to use an int to determine whether the various REQ_OP
values are defined.  This should work properly on kernels >= 4.8.

[bio] bio_set_op_attrs() is now an inline function and can't be detected
with #ifdef.  Add a configure check to determine whether bio_set_op_attrs()
is defined.  Move the local definition of it from vdev_disk.c to
blkdev_compat.h for consistency with other related compability shims.

[bio] The read/write flags and their modifiers, including WRITE_FLUSH,
WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h.  Add the new
bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA
and set the flags appropriately for each supported kernel version.

[vfs] The generic_readlink() function has been made static.  If .readlink
in inode_operations is NULL, generic_readlink() is used.

[zol typo] Completely unrelated to 4.10 compat, fix a typo in the check
for REQ_OP_SECURE_ERASE so that the proper macro is defined:

    s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5499
2016-12-30 16:03:59 -06:00
Chunwei Chen
da8f51e16a Use a dedicated taskq for vdev_file
The introduction of parallel zvol prefetch causes deadlock when using
vdev_file.

spa_async->(spa_namespace_lock)->txg_wait_synced->(wait for txg_sync)
txg_sync->zio_wait->(wait for vdev_file_io_fsync on system_taskq)
zvol_prefetch_minors_impl (on system_taskq)->spa_open_common->(wait for spa_namespace_lock)

We fix this by using dedicated taskq for vdev_file.  This same change
was originally made in commit bc25c93 but reverted in commit aa9af22
when dynamic taskqs were added.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #5506 
Closes #5495
2016-12-21 10:47:15 -08:00
Clemens Fruhwirth
8e99d66b05 Add support for rw semaphore under PREEMPT_RT_FULL
The main complication from the RT patch set is that the RW semaphore
locks change such that read locks on an rwsem can be taken only by
a single thread.  All other threads are locked out. This single
thread can take a read lock multiple times though. The underlying
implementation changes to a mutex with an additional read_depth
count.

The implementation can be best understood by inspecting the RT
patch.  rwsem_rt.h and rt.c give the best insight into how RT
rwsem works. My implementation for rwsem_tryupgrade is basically
an inversion of rt_downgrade_write found in rt.c. Please see the
comments in the code.

Unfortunately, I have to drop SPLAT rwlock test4 completely as this
test tries to take multiple locks from different threads, which RT
rwsems do not support.  Otherwise SPLAT, zconfig.sh, zpios-sanity.sh
and zfs-tests.sh pass on my Debian-testing VM with the kernel
linux-image-4.8.0-1-rt-amd64.

Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes zfsonlinux/zfs#5491
Closes #589
Closes #308
2016-12-19 12:45:24 -08:00
Clemens Fruhwirth
6d064f7a07 Remove stale comment from rw_tryupgrade()
Commit f58040c0fc should have removed
this comment which is no longer relevant.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Issue #589
2016-12-19 11:27:27 -08:00
Chunwei Chen
05100ec8f0 Fix wrong operator in xvattr.h
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-14 14:53:56 -08:00
Brian Behlendorf
02730c333c Use cstyle -cpP in make cstyle check
Enable picky cstyle checks and resolve the new warnings.  The vast
majority of the changes needed were to handle minor issues with
whitespace formatting.  This patch contains no functional changes.

Non-whitespace changes are as follows:

* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5465
2016-12-12 10:46:26 -08:00
Brian Behlendorf
f95e647891 Speed up zvol import and export speed
Speed up import and export speed by:

* Add system delay taskq
* Parallel prefetch zvol dnodes during zvol_create_minors
* Parallel zvol_free during zvol_remove_minors
* Reduce list linear search using ida and hash

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5433
2016-12-08 14:05:02 -07:00
Chunwei Chen
f200b83673 Add system_delay_taskq for long delay
Add a dedicated system_delay_taskq for long delay like spa_deadman and
zpl_posix_acl_free. This will allow us to use system_taskq in the manner of
dispatch multiple tasks and call taskq_wait_outstanding.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #588
2016-12-08 14:00:20 -07:00
Gvozden Neskovic
e8a2014436 Cache ddt_get_dedup_dspace() value if there was no ddt changes
Save and reuse ddt dspace calculation when there have been no ddt changes.
This avoids unnecessary traversal of 168KiB of ddt histograms.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5425
2016-12-02 16:59:35 -07:00
Brian Behlendorf
baf67d15a5 Refactor txg history kstat
It was observed that even when the txg history is disabled by
setting `zfs_txg_history=0` the txg_sync thread still fetches
the vdev stats unnecessarily.

This patch refactors the code such that vdev_get_stats() is no
longer called when `zfs_txg_history=0`.  And it further reduces
the  differences between upstream and the ZoL txg_sync_thread()
function.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5412
2016-12-02 16:57:49 -07:00
cao
e2c7d3785a Remove unused sa_update_from_cb()
It looks like this was functionality which was added in the
original SA implementation and then never needed.  It can
be safely removed now and easily added back if we find a
use for it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5440
2016-12-01 16:39:06 -07:00
cao
6a8fd57fa7 Compile zio.h and zio_impl.h mutual include
zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the
header files to contain each other.  Get rid of the zio_impl.h include
in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5439
2016-12-01 16:36:25 -07:00
Chunwei Chen
57ddcda164 Use system_delay_taskq for long delay tasks
Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire.
This free system_taskq from the above long delay tasks, and allow us to do
taskq_wait_outstanding on system_taskq without being blocked forever, making
system_taskq more generic and useful.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-01 14:52:48 -08:00
Chunwei Chen
9829574834 ABD optimized page allocation code
* Convert ABD to use the Linux Kernel scatterlist implementation
  instead of the hand rolled one from illumos.

* Scatter ABDs are preferentially populated with higher order
  compound pages from a single zone.  Allocation size is
  progressively decreased until it can be satisfied without
  performing reclaim or compaction.

* An alternate page allocator is provided for kernels older
  than 3.6 and for CONFIG_HIGHMEM systems.  This allocator
  is designed as a fallback for maximum compatibility.

* Extended abdstats to provide visibility in the the allocator.

* Add cached value for PAGESIZE in userspace.

Contributions-by:
Chunwei Chen <david.chen@osnexus.com>
Gvozden Neskovic <neskovic@gmail.com>
Jinshan Xiong <jinshan.xiong@intel.com>
Isaac Huang <he.huang@intel.com>
David Quigley <david.quigley@intel.com>
Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-29 14:34:33 -08:00
Gvozden Neskovic
a206522c4f ABD changes for vectorized RAIDZ
* userspace: aligned buffers. Minimum of 32B alignment is
  needed for AVX2. Kernel buffers are aligned 512B or more.
* add abd_get_offset_size() interface
* abd_iter_map(): fix calculation of iter_mapsize
* add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-11-29 14:34:33 -08:00
Isaac Huang
b0be93e81a ABD page support to vdev_disk.c
Signed-off-by: Isaac Huang <he.huang@intel.com>
2016-11-29 14:34:32 -08:00
David Quigley
a6255b7fce DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
Tony Hutter
8720e9e748 Add -c to zpool iostat & status to run command
This patch adds a command (-c) option to zpool status and zpool iostat.  The
-c option allows you to run an arbitrary command on each vdev and display
the first line of output in zpool status/iostat.  The environment vars
VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path"
before running the command.  For device mapper, multipath, or partitioned
vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk.  This can be useful
if the command you're running requires a /dev/sd* device.

The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device
instead of using libdevmapper.  This not only removes the libdevmapper
requirement at build time, but also allows you to resolve device mapper
devices without being root.  This means that UDEV_UPATH get set correctly
when running zpool status/iostat as an unprivileged user.

Example:

$ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH'

NAME        STATE     READ WRITE CKSUM
mypool      ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    mpatha  ONLINE       0     0     0  I am /dev/mapper/mpatha, /dev/sdc
    sdb     ONLINE       0     0     0  I am /dev/sdb1, /dev/sdb

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #5368
2016-11-29 14:45:38 -07:00
LOLi
2f71caf2d9 Allow zfs unshare <protocol> -a
Allow `zfs unshare <protocol> -a` command to share or unshare all datasets
of a given protocol, nfs or smb.

Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases.
To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some
function wrappers were added around them.

Finally, fix and issue in smb_is_share_active() that would leave SMB shares
exported when invoking 'zfs unshare -a'

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3238 
Closes #5367
2016-11-29 12:22:38 -07:00
cao
3bfd95d589 Fix coverity defects: CID 147540, 147542
CID 147540: unsigned_compare
- Cast nsec to a int32_t to properly detect the expected overflow.
CID 147542: unsigned_compare
- intval can never be less than ZIO_FAILURE_MODE_WAIT which is
  defined to be zero.  Remove this useless check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5379
2016-11-09 17:35:26 -08:00
jxiong
126ae9f4e9 Export symbol dmu_objset_userobjspace_upgradable
It's used by Lustre to determine if the objset can be upgraded.
The inline version doesn't work because dmu_objset_is_snapshot()
is not exported.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Closes #5385
2016-11-09 13:51:12 -08:00
tuxoko
0420c126ce Linux 3.14 compat: assign inode->set_acl
Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come
from setxattr, which will handle by the acl xattr_handler, and we already
handles that well. However, nfsd will directly calls inode->set_acl or
return error if it doesn't exists.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Massimo Maggi <me@massimo-maggi.eu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #5371 
Closes #5375
2016-11-09 10:37:17 -08:00
Chunwei Chen
3779913b35 Use set_cached_acl and forget_cached_acl when possible
Originally, these two function are inline, so their usability is tied to
posix_acl_release. However, since Linux 3.14, they became EXPORT_SYMBOL, so we
can always use them. In this patch, we create an independent test for these
two functions so we can use them when possible.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-11-07 11:04:44 -08:00
Chunwei Chen
8e71ab99dc Batch free zpl_posix_acl_release
Currently every calls to zpl_posix_acl_release will schedule a delayed task,
and each delayed task will add a timer. This used to be fine except for
possibly bad performance impact.

However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In
this new implementation, the larger the delay, the less accuracy the timer is.
So when we have a flood of timer from zpl_posix_acl_release, they will expire
at the same time. Couple with the fact that task_expire will do linear search
with lock held. This causes an extreme amount of contention inside interrupt
and would actually lockup the system.

We fix this by doing batch free to prevent a flood of delayed task. Every call
to zpl_posix_acl_release will put the posix_acl to be freed on a lockless
list. Every batch window, 1 sec, the zpl_posix_acl_free will fire up and free
every posix_acl that passed the grace period on the list. This way, we only
have one delayed task every second.

[1] https://lwn.net/Articles/646950/

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-11-07 11:04:44 -08:00
Brian Behlendorf
83bf769d50 Fix 'zpool import' detection issues
This patch addresses multiple 'zpool import' block device
indentification problems which are most likely to occur on a
system configured to use blkid, by_vdev paths, multipath and
failover.  The symptom most commonly observed is the import
uses different path names to import the pool than would
normally be expected.

* When using blkid to identify vdevs the listed devices may
be added to the cache in any order.  In order to apply the
preferred search order heuristic a zfs_path_order() function
was added to calculate the order given full path names.

* Since it's possible to have multiple block devices with
different vdev guids which refer to the same ZPOOL_CONFIG_PATH
the slice cache must be indexed by guid and name.  By avoiding
collisions the preferred ordering can be maintaining even
when multiple block devices claim the same ZPOOL_CONFIG_PATH.
The preferred sorting by partition was never benefitial for
a Linux system and was removed as part of this change.

* When adding entries to the blkid cache avl_find/avl_insert
are used instead of avl_add because collisions are possible
and must be handled gracefully.

* For pools using multipath devices there are, at a minimum,
three devices where a vdev label may be read.  They are the
dm-* device and each underlying /dev/sd* device.  Due to the
way the block cache is implemented each of these devices may
have a different cached copy of the vdev label.  This can
result in "ghost pools" which appear to persist even after
a 'zpool labelclear' has been done to the dm-* device.  In
order to prevent this the vdev label is read with O_DIRECT
in order to bypass any caching to get the on-disk version.

* When opening a block device verify that vdev guid read from
the disk matches the expected vdev guid.  This allows for bad
labels to be filtered out.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5359
2016-11-07 10:28:57 -08:00
Romain Dolbeau
7f3194932d Add superscalar fletcher4
This is the Fletcher4 algorithm implemented in pure C, but using
multiple counters using algorithms identical to those used for
SSE/NEON and AVX2.

This allows for faster execution on core with strong superscalar
capabilities but weak SIMD capabilities.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #5317
2016-11-04 10:53:03 -07:00
Chunwei Chen
ace1eae84c Add support for O_TMPFILE
Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on
supported filesystem. It's basically doing open(2) and unlink(2) atomically.

The filesystem support is added through i_op->tmpfile. We basically copy the
create operation except we get rid of the link and name related stuff and add
the new node to unlinked set.

We also add support for linkat(2) to link tmpfile. However, since all previous
file operation will skip ZIL, we force a txg_wait_synced to make sure we are
sync safe.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-11-04 10:46:40 -07:00
Chunwei Chen
987014903f Fix unlinked file cannot do xattr operations
Currently, doing things like fsetxattr(2) on an unlinked file will result in
ENODATA. There's two places that cause this: zfs_dirent_lock and zfs_zget.

The fix in zfs_dirent_lock is pretty straightforward. In zfs_zget though, we
need it to not return error when the zp is unlinked. This is a pretty big
change in behavior, but skimming through all the callers, I don't think this
change would cause any problem. Also there's nothing preventing z_unlinked
from being set after the z_lock mutex is dropped before but before zfs_zget
returns anyway.

The rest of the stuff is to make sure we don't log xattr stuff when owner is
unlinked.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-11-04 10:46:40 -07:00
Romain Dolbeau
7f547f85fe Add parity generation/rebuild using AVX-512 for x86-64
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.

avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name>
Closes #5219
2016-11-02 12:40:23 -07:00
Brian Behlendorf
82ec9d41d8 Fix 32-bit maximum volume size
A limit of 1TB exists for zvols on 32-bit systems.  Update the code
to correctly reflect this limitation in a similar manor as the
OpenZFS implementation.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
2016-11-02 12:14:45 -07:00
Brian Behlendorf
48d3eb40c7 Add TASKQID_INVALID
Add the TASKQID_INVALID macros and update callers to use the macro
instead of testing against 0.  There is no functional change
even though the functions in zfs_ctldir.c incorrectly used -1
instead of 0.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
2016-11-02 12:14:45 -07:00
Ubuntu
cbba714667 Add TASKQID_INVALID and TASKQID_INITIAL macros
Add the TASKQID_INVALID and TASKQID_INITIAL macros and update the
taskq implementation and test cases to use them.  This is solely
for the purposes of readability and introduces no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-02 10:34:19 -07:00
cao
2bac68145f Fix coverity defects: CID 147548
CID 147548: Type:Dereference null return value

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5321
2016-10-31 16:56:10 -07:00
Hajo Möller
e02aaf17f1 Fix lookup_bdev() on Ubuntu
Ubuntu added support for checking inode permissions to lookup_bdev() in kernel
commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21).
Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517

This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and
calls the function in the correct way.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Closes #5336
2016-10-26 10:30:43 -07:00
jxiong
16fa68f07d Do not upgrade userobj accounting for snapshot dataset
'zfs recv' could disown a living objset without calling
dmu_objset_disown(). This will cause the problem that the objset
would be released while the upgrading thread is still running.

This patch avoids the problem by checking if a dataset is a snapshot
before calling dmu_objset_userobjspace_upgrade().  Snapshots
are immutable and therefore it doesn't make sense to update them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Closes #5295 
Closes #5328
2016-10-25 13:21:05 -07:00
Tony Hutter
1bbd877049 Turn on/off enclosure slot fault LED even when disk isn't present
Previously when a drive faulted, the statechange-led.sh script would lookup
the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and
turn it on.  During testing we noticed that if you pulled out a drive, or if
the drive was so badly broken that it no longer appeared to Linux, that the
/sys/block/sd* path would be removed, and the script could not lookup the
LED entry.

To fix this, this patch looks up the disks's more persistent
"/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import.  It then
passes that path to the statechange-led script to use, rather than having the
script look it up on the fly.  This allows the script to turn on/off the slot
LEDs even when the drive is missing.

Closes #5309 
Closes #2375
2016-10-24 10:45:59 -07:00
Romain Dolbeau
24cdeaf12e Fletcher4 algorithm implemented in pure NEON for Aarch64 / ARMv8 64 bits
This is not useful on micro-architecture with a weak NEON
implementation (only 64 bits); the native version is slower &
the byteswap barely faster than scalar.  On A53 or A57, it's
a small improvement on scalar but OK for byteswap.

Results from an A53 system:
0 0 0x01 -1 0 1499068294333000 1499101101878000
implementation   native         byteswap       
scalar           1008227510     755880264      
aarch64_neon     1198098720     1044818671     
fastest          aarch64_neon   aarch64_neon 

Results from a A57 system:
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation   native         byteswap       
scalar           2302071241     1124873346     
aarch64_neon     2542214946     2245570352     
fastest          aarch64_neon   aarch64_neon 

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #5248
2016-10-21 10:55:49 -07:00
cao
5a6765cf8c Fix coverity defects: CID 147472
CID 147472: Type: 'Constant' variable guards dead code

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5288
2016-10-20 11:24:01 -07:00
Brian Behlendorf
3b0ba3ba99 Linux 4.9 compat: inode_change_ok() renamed setattr_prepare()
In torvalds/linux@31051c8 the inode_change_ok() function was
renamed setattr_prepare() and updated to take a dentry ratheri
than an inode.  Update the code to call the setattr_prepare()
and add a wrapper function which call inode_change_ok() for
older kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Requires-spl: refs/pull/581/head
2016-10-20 09:39:09 -07:00
Chunwei Chen
ae7eda1dde Linux 4.9 compat: group_info changes
In Linux 4.9, torvalds/linux@81243ea, group_info changed from 2d array via
->blocks to 1d array via ->gid. We change the spl cred functions accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #581
2016-10-20 09:33:28 -07:00
Tony Hutter
6078881aa1 Multipath autoreplace, control enclosure LEDs, event rate limiting
1. Enable multipath autoreplace support for FMA.

This extends FMA autoreplace to work with multipath disks.  This
requires libdevmapper to be installed at build time.

2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online

Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED.  Your enclosure must
be supported by the Linux SES driver for this to work.  The enclosure LED
scripts work for multipath devices as well.  The scripts will clear the LED
when the fault is cleared.

3. Rate limit ZIO delay and checksum events so as not to flood ZED

ZIO delay and checksum events are rate limited to 5/sec in the zfs module.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #2449 
Closes #3017 
Closes #5159
2016-10-19 12:55:59 -07:00
Don Brady
3dfb57a35e OpenZFS 7090 - zfs should throttle allocations
OpenZFS 7090 - zfs should throttle allocations

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>

When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.

In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.

The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issued time and offset and would provide an initial amount of
allocation of work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully completed it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs metaslab groups and round
robin across all eligible metaslab groups to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7
Closes #5258 

Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
- Added module params and documentation
- Updated to use optimize AVL cmp macros
2016-10-13 17:59:18 -07:00
tuxoko
2529b3a80e Linux 4.8 compat: Fix RW_READ_HELD
Linux 4.8, starting from torvalds/linux@19c5d690e, will set owner to 1 when
read held instead of leave it NULL. So we change the condition to
`rw_owner(rwp) <= 1` in RW_READ_HELD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes zfsonlinux/zfs#5233 
Closes #577
2016-10-07 20:53:58 -07:00
Brian Behlendorf
482cd9ee69 Fletcher4: Incremental updates and ctx calculation
Fixes ABI issues with fletcher4 code, adds support for
incremental updates, and adds ztest method for testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #5164
2016-10-07 12:44:12 -07:00
Jinshan Xiong
1de321e626 Add support for user/group dnode accounting & quota
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.

ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
2016-10-07 09:45:13 -07:00
Gvozden Neskovic
5bf703b8f3 Fletcher4: save/reload implementation context
Init, compute, and fini methods are changed to work on internal context object.
This is necessary because ABI does not guarantee that SIMD registers will be preserved
on function calls. This is technically the case in Linux kernel in between
`kfpu_begin()/kfpu_end()`, but it breaks user-space tests and some kernels that
don't require disabling preemption for using SIMD (osx).

Use scalar compute methods in-place for small buffers, and when the buffer size
does not meet SIMD size alignment.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-10-05 16:41:46 +02:00
Tony Hutter
3c67d83a8a OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:

    b5e030c8db

The list of porting changes includes:

- Copied module/icp/include/sha2/sha2.h directly from illumos

- Removed from module/icp/algs/sha2/sha2.c:
	#pragma inline(SHA256Init, SHA384Init, SHA512Init)

- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
  it now takes in an extra parameter.

- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c

- Added skein & edonr to libicp/Makefile.am

- Added sha512.S.  It was generated from sha512-x86_64.pl in Illumos.

- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.

- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
  to not #include the non-existant endian.h.

- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
  around a compiler warning.

- Fixup test files:
	- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
	- Remove <note.h> and define NOTE() as NOP.
	- Define u_longlong_t
	- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
	- Rename NULL to 0 in "no test vector" array entries to get around a
	  compiler warning.
	- Remove "for isa in $($ISAINFO); do" stuff
	- Add/update Makefiles
	- Add some userspace headers like stdio.h/stdlib.h in places of
	  sys/types.h.

- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.

- Update scripts/zfs2zol-patch.sed

- include <sys/sha2.h> in sha2_impl.h

- Add sha2.h to include/sys/Makefile.am

- Add skein and edonr dirs to icp Makefile

- Add new checksums to zpool_get.cfg

- Move checksum switch block from zfs_secpolicy_setprop() to
  zfs_check_settable()

- Fix -Wuninitialized error in edonr_byteorder.h on PPC

- Fix stack frame size errors on ARM32
  	- Don't unroll loops in Skein on 32-bit to save stack space
  	- Add memory barriers in sha2.c on 32-bit to save stack space

- Add filetest_001_pos.ksh checksum sanity test

- Add option to write psudorandom data in file_write utility
2016-10-03 14:51:15 -07:00
Romain Dolbeau
62a65a654e Add parity generation/rebuild using 128-bits NEON for Aarch64
This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.

As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.

This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.

It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.

The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.

Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes #4801
2016-10-03 09:44:00 -07:00
Gvozden Neskovic
031d7c2fe6 fix: Shift exponent too large
Undefined operation is reported by running ztest (or zloop) compiled with GCC
UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection
with large enough offset values. Logically, left shift operation would work,
but bit shift semantics in C, and limitation of uint64_t, do not produce desired
result.

Issue #5059, #4883

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
2016-09-29 15:55:41 -07:00
kernelOfTruth aka. kOT, Gentoo user
51907a31bc OpenZFS 7230 - add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records
Authored by: Matt Krantz <matt.krantz@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/7230
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/12b90ee2
Closes #5112
2016-09-22 16:01:19 -07:00
cao
884385a0b2 Fix coverity defects
Fix coverity defects:
coverity scan CID:147623, Type: Resource leak.
coverity scan CID:147622, Type: Resource leak.
reason: zpool_open zhp, but not zpool_close zhp. so resource leak.

coverity scan CID:147621, Type: Resource fd leak.
coverity scan CID:147620, Type: Resource fd leak.
reason: do_write do_read open file fd,but exception not close fd.

delete unuse definition DMU_OS_IS_L2COMPRESSIBLE.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes #5137
2016-09-20 17:45:45 -07:00
tuxoko
4329bd5b73 Cleanup in cred.h
Remove the code that doesn't make any sense.

Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #569
2016-09-14 16:59:31 -07:00
Dan Kimmel
524b4217b8 DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone()
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:59:13 -07:00
Dan Kimmel
c4434877ae Remove lint suppression from dmu.h and unnecessary dmu.h include in spa.h
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:59:09 -07:00
Dan Kimmel
2aa34383b9 DLPX-40252 integrate EP-476 compressed zfs send/receive
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:58:58 -07:00
George Wilson
d3c2ae1c08 OpenZFS 6950 - ARC should cache compressed data
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.

I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.

OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
2016-09-13 09:58:33 -07:00
Don Brady
d02ca37979 Bring over illumos ZFS FMA logic -- phase 1
This first phase brings over the ZFS SLM module, zfs_mod.c, to handle
auto operations in response to disk events. Disk event monitoring is
provided from libudev and generates the expected payload schema for
zfs_mod. This work leverages the recently added devid and phys_path
strings in the vdev label.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4673
2016-09-01 11:39:45 -07:00
luozhengzheng
0b284702b7 Delete unreferenced function zfs_ereport_send_interim_checksum
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5055
2016-09-01 11:39:45 -07:00
Gvozden Neskovic
ee36c709c3 Performance optimization of AVL tree comparator functions
perf: 2.75x faster ddt_entry_compare()
    First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).

Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently.  Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:

old	6.85789 s
new	2.49089 s

perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
    Compute the result directly instead of using conditionals

perf: zfs_range_compare()
    Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.

perf: spa_error_entry_compare()
    `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5033
2016-08-31 14:35:34 -07:00
cao
8f50bafb04 Delete unused zfsctl_snapdir_inactive declaration
zfsctl_snapdir_inactive is defined in zfs-0.6.3.  In zfs-0.6.5.7
this is declaration remains even though the implementation was
removed in commit 278bee93.  Removed fastreboot_disable_highpil
which is also unused.

Signed-off-by: caoxuewen cao.xuewen@zte.com.cn
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5042
2016-08-30 14:33:40 -07:00
Alexander Motin
755065f3dc OpenZFS 6322 - ZFS indirect block predictive prefetch
For quite some time I was thinking about possibility to prefetch
ZFS indirection tables while doing sequential reads or writes.
Recent changes in predictive prefetcher made that much easier to
do. My tests on zvol with 16KB block size on 5x striped and 2x
mirrored pool of 10 disks show almost double throughput on sequential
read, and almost tripple on sequential rewrite. While for read alike
effect can be received from increasing maximal prefetch distance
(though at higher memory cost), for rewrite there is no other
solution so far.

Authored by: Alexander Motin <mav@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6322
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413
Closes #5040

Porting notes:
- Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due
  to commit 5f6d0b6 'Handle block pointers with a corrupt logical size'

- Difference from upstream in module/zfs/dmu_zfetch.c,
  uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance

- Variables have been initialized at the beginning of the function
 (void dmu_zfetch) to resemble the order of occurrence and account
 for C99, C11 mode errors.
2016-08-30 14:26:55 -07:00
Gvozden Neskovic
9cc1844a1d Linux compat: Grsecurity kernel
API Change: Module parameter set/get methods take const parameter in
Grsecurity kernel v4.7.1

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4997
Closes #5001
2016-08-22 10:05:45 -07:00
Matthew Ahrens
2bce8049c3 OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109
Closes #4641
Closes #4972
2016-08-19 12:48:03 -07:00
Matthew Ahrens
8bea981504 OpenZFS 7003 - zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7003
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108
Closes #4972
2016-08-19 12:35:23 -07:00
Paul Dagnelie
32d41fb73a OpenZFS 7176 - Yet another hole birth issue
This is another bug in the long line of hole-birth related issues. In
this particular case, it was discovered that a previous hole-birth fix
(illumos bug 6513, commit bc77ba73) did not cover as many cases as we
thought it did. While the issue worked in the case of hole-punching
(writing zeroes to a large part of a file), it did not deal with
truncation, and then writing beyond the new end of the file.

The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7176
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad
Upstream-bugs: DLPX-46009

Porting notes:
- Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ;
  keep previous position of the initialization
2016-08-18 09:26:44 -07:00
Gvozden Neskovic
fc897b24b2 Rework of fletcher_4 module
- Benchmark memory block is increased to 128kiB to reflect real block sizes more
accurately. Measurements include all three stages needed for checksum generation,
i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset
overhead of time function.

- Fastest implementation selects native and byteswap methods independently in
benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()`
are introduced.

- Implementation mutex lock is replaced by atomic variable.

- To save time, benchmark is not executed in userspace. Instead, highest supported
implementation is used for fastest. Default userspace selector is still 'cycle'.

- `fletcher_4_native/byteswap()` methods use incremental methods to finish
calculation if data size is not multiple of vector stride (currently 64B).

- Added `fletcher_4_native_varsize()` special purpose method for use when buffer size
is not known in advance. The method does not enforce 4B alignment on buffer size, and
will ignore last (size % 4) bytes of the data buffer.

- Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows
throughput for all supported implementations (in B/s), native and byteswap,
as well as the code [fastest] is running.

Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`:
implementation   native         byteswap
scalar           4768120823     3426105750
sse2             7947841777     4318964249
ssse3            7951922722     6112191941
avx2             13269714358    11043200912
fastest          avx2           avx2

Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`:
implementation   native         byteswap
scalar           1291115967     1031555336
sse2             2539571138     1280970926
ssse3            2537778746     1080016762
avx2             4950749767     1078493449
avx512f          9581379998     4010029046
fastest          avx512f        avx512f

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4952
2016-08-16 14:11:55 -07:00
Gvozden Neskovic
70b258fc96 Fletcher4 implementation using avx512f instruction set
Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per
loop iteration. Size alignment of main fletcher4 methods is adjusted
accordingly. New implementation is called 'avx512f'.

Note: byteswap method can be implemented more efficiently when avx512bw hardware
becomes available. Currently, it is ~ 2x slower than native method.

Table shows result of full (native) fletcher4 calculation for different buffer size:

fletcher4   4KB     16KB    64KB    128KB   256KB   1MB     16MB
--------------------------------------------------------------------
[scalar]    1213    1228    1231    1231    1225    1200    1160
[sse2]      2374    2442    2459    2456    2462    2250    2220
[avx2]      4288    4753    4871    4893    4900    4050    3882
[avx512f]   5975    8445    9196    9221    9262    6307    5620

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952
2016-08-16 14:11:14 -07:00
Gvozden Neskovic
32ffaa3de5 Add support for AVX-512 family of instruction sets
This patch adds compiler and runtime tests (user and kernel) for following
instruction sets: avx512f, avx512cd, avx512er, avx512pf, avx512bw, avx512dq,
avx512vl, avx512ifma, avx512vbmi.

note: Linux support for AVX-512F (Foundation) instruction set started with
linux v3.15

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952
2016-08-16 14:10:33 -07:00
Hans Rosenfeld
fb390aafc8 OpenZFS 5997 - FRU field not set during pool creation and never updated
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283

Porting Notes:

In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.

Events which would be logged as sysevents on illumos have been
been mapped to the 'sysevent' class for Linux.  In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.

The scripts provided by ZoL have been updated, however users which
provide scripts for any of the following events will need to rename
them based on the new subclass names.

  ereport.fs.zfs.config.sync         sysevent.fs.zfs.config_sync
  ereport.fs.zfs.zpool.destroy       sysevent.fs.zfs.pool_destroy
  ereport.fs.zfs.zpool.reguid        sysevent.fs.zfs.pool_reguid
  ereport.fs.zfs.vdev.remove         sysevent.fs.zfs.vdev_remove
  ereport.fs.zfs.vdev.clear          sysevent.fs.zfs.vdev_clear
  ereport.fs.zfs.vdev.check          sysevent.fs.zfs.vdev_check
  ereport.fs.zfs.vdev.spare          sysevent.fs.zfs.vdev_spare
  ereport.fs.zfs.vdev.autoexpand     sysevent.fs.zfs.vdev_autoexpand
  ereport.fs.zfs.resilver.start      sysevent.fs.zfs.resilver_start
  ereport.fs.zfs.resilver.finish     sysevent.fs.zfs.resilver_finish
  ereport.fs.zfs.scrub.start         sysevent.fs.zfs.scrub_start
  ereport.fs.zfs.scrub.finish        sysevent.fs.zfs.scrub_finish
  ereport.fs.zfs.bootfs.vdev.attach  sysevent.fs.zfs.bootfs_vdev_attach
2016-08-12 13:06:48 -07:00
luozhengzheng
834f1e426c Fix a typo in ZIL write handling comment
The following comment in zil.h

 * WR_COPIED:
 *    If we know we'll immediately be committing the
 *    transaction (FSYNC or FDSYNC), then we allocate a larger
 *    log record here for the data and copy the data in.

The word "the" should be "then".

Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4961
2016-08-12 10:30:16 -07:00
Brian Behlendorf
6eb73b0046 Reorder HAVE_BIO_RW_* checks
The HAVE_BIO_RW_* #ifdef's must appear before REQ_* #ifdef's
in the bio_is_flush() and bio_is_discard() macros.  Linux 2.6.32
era kernels defined both of values and the HAVE_BIO_RW_* must be
used in this case.  This resulted in a panic in zconfig test 5.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4951
Closes #4959
2016-08-12 09:17:40 -07:00
Chen Haiquan
d9c97ec08b Use file_dentry and file_inode wrappers
Fix bugs due to kernel change in torvalds/linux@4bacc9c923 ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").

This problem crashes system when use zfs as a layer of overlayfs.

Signed-off-by: Chen Haiquan <oc@yunify.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4914
Closes #4935
2016-08-11 12:06:37 -07:00
GeLiXin
d5884c3453 Fix indefinite article
The indefinite article before nvlist should be "an", not "a".

We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should
stay the same as we are such a strict filesystem.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4941
2016-08-11 11:23:49 -07:00
Brian Behlendorf
e5fe9ddeec Remove custom root pool import code
Non-Linux OpenZFS implementations require additional support to be
used a root pool.  This code should simply be removed to avoid
confusion and improve readability.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4951
2016-08-11 11:19:34 -07:00
Brian Behlendorf
cf41432c70 Linux 4.8 compat: Fix removal of bio->bi_rw member
All users of bio->bi_rw have been replaced with compatibility wrappers.
This allows the kernel specific logic to be abstracted away, and for
each of the supported cases to be documented with the wrapper.  The
updated interfaces are as follows:

* void blk_queue_set_write_cache(struct request_queue *, bool, bool)
* boolean_t bio_is_flush(struct bio *)
* boolean_t bio_is_fua(struct bio *)
* boolean_t bio_is_discard(struct bio *)
* boolean_t bio_is_secure_erase(struct bio *)
* VDEV_WRITE_FLUSH_FUA

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4951
2016-08-11 11:19:34 -07:00
Brian Behlendorf
4b908d3220 Linux 4.8 compat: posix_acl_valid()
The posix_acl_valid() function has been updated to require a
user namespace.  Filesystem callers should normally provide the
user_ns from the super block associcated with the ACL; the
zpl_posix_acl_valid() wrapper has been added for this purpose.
See https://github.com/torvalds/linux/commit/0d4d717f for
complete details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4922
2016-08-08 11:46:40 -07:00
Brian Behlendorf
e85a6396b0 Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHING
Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING
configure checks, all supported kernel provide this functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4922
2016-08-08 11:46:32 -07:00
Nikolay Borisov
64aefee1b8 Fix interaction between userns uid/gid and SA
* When the uid/gid change is handled in zfs_setattr we want to
actually adjust the user passed uid to a KUID and write that to disk.

* In trace points use the i_uid member without doing translation,
since it has already been performed.

* Use kuid in zfs_aclset_common

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4928
2016-08-08 10:47:43 -07:00
Nikolay Borisov
938cfeb0f2 Linux 4.8 compat: new s_user_ns member of struct super_block
Kernel 4.8 paved the way to enabling mounting a file system inside a
non-init user namespace. To facilitate this a s_user_ns member was
added holding the userns in which the filesystem's instance was
mounted. This enables doing the uid/gid translation relative to
this particular username space and not the default init_user_ns.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4928
2016-08-08 10:47:22 -07:00
Chunwei Chen
3b86aeb295 Linux 4.8 compat: REQ_OP and bio_set_op_attrs()
New REQ_OP_* definitions have been introduced to separate the
WRITE, READ, and DISCARD operations from the flags.  This included
changing the encoding of bi_rw.  It places REQ_OP_* in high order
bits and other stuff in low order bits.  This encoding is done
through the new helper function bio_set_op_attrs.  For complete
details refer to:

https://github.com/torvalds/linux/commit/f215082
https://github.com/torvalds/linux/commit/4e1b2d5

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4892
Closes #4899
2016-07-29 14:48:19 -07:00
Brian Behlendorf
76e5f6fe10 Linux 4.8 compat: REQ_PREFLUSH
The REQ_FLUSH flag was renamed REQ_PREFLUSH to avoid confusion with
REQ_OP_FLUSH.  See https://github.com/torvalds/linux/commit/28a8f0d3
for complete details.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4892
Issue #4899
2016-07-29 14:48:09 -07:00
Brian Behlendorf
b7c7008ba2 Linux 4.8 compat: rw_semaphore atomic_long_t count
For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type.  A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro.  See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #563
2016-07-29 14:17:53 -07:00
Tim Chase
25458cbef9 Limit the amount of dnode metadata in the ARC
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_mata_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made.  The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773
Closes #4858
2016-07-25 15:26:38 -07:00
Tim Chase
e6603b7c1f Fix sync behavior for disk vdevs
Prior to b39c22b, which was first generally available in the 0.6.5
release as b39c22b, ZoL never actually submitted synchronous read or write
requests to the Linux block layer.  This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start().  The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b.  In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b
was to hurry certains requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency.  This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.

For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
ise used for the same purpose.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4858
2016-07-25 14:24:47 -07:00
Nikolay Borisov
2c6abf15ff Remove znode's z_uid/z_gid member
Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
2016-07-25 13:21:49 -07:00
Nikolay Borisov
82a1b2d628 Check whether the kernel supports i_uid/gid_read/write helpers
Since the concept of a kuid and the need to translate from it to
ordinary integer type was added in kernel version 3.5 implement necessary
plumbing to be able to detect this condition during compile time. If
the kernel doesn't support the kuid then just fall back to directly
accessing the respective struct inode's members

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
2016-07-25 13:21:49 -07:00
Tom Caputi
0b04990a5d Illumos Crypto Port module added to enable native encryption in zfs
A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
2016-07-20 10:43:30 -07:00
Tom Caputi
d2f97b2a26 Added highbit() and lowbit() macros
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #562
2016-07-20 10:28:46 -07:00
Gvozden Neskovic
26a08b5ca9 RAIDZ parity kstat rework
Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4860
2016-07-19 16:43:07 -07:00
Gvozden Neskovic
c9187d867f Fixes and enhancements of SIMD raidz parity
- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
2016-07-19 16:43:07 -07:00
Tyler J. Stachecki
35a76a0366 Implementation of SSE optimized Fletcher-4
Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswaped
version. The average performance of the two is used to select the
the fastest implementation available on the host system.

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4789
2016-07-15 10:42:35 -07:00
Chris Dunlop
dfbc86309f Use native inode->i_nlink instead of znode->z_links
A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4838
Issue #227
2016-07-14 16:25:34 -07:00
Gvozden Neskovic
ae25d22235 Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.
The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower.  Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4783
2016-07-13 10:24:55 -07:00
Chunwei Chen
31b6111fd9 Kill zp->z_xattr_parent to prevent pinning
zp->z_xattr_parent will pin the parent. This will cause huge issue
when unlink a file with xattr. Because the unlinked file is pinned, it
will never get purged immediately. And because of that, the xattr
stuff will never be marked as unlinked. So the whole unlinked stuff
will stay there until shrink cache or umount.

This change partially reverts e89260a.  This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
2016-07-12 14:18:10 -07:00
Igor Kozhukhov
eca7b76001 OpenZFS 6314 - buffer overflow in dsl_dataset_name
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee
2016-06-28 13:47:03 -07:00
Brian Behlendorf
43e52eddb1 Implement zfs_ioc_recv_new() for OpenZFS 2605
Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface.  The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.

ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler.  Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-06-28 13:47:03 -07:00
Andrew Stormont
b607405fea OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS
Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b
2016-06-28 13:47:03 -07:00
Paul Dagnelie
e6d3a843d6 OpenZFS 6393 - zfs receive a full send as a clone
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e
2016-06-28 13:47:03 -07:00
Brian Behlendorf
fd41e93563 OpenZFS 6051 - lzc_receive: allow the caller to read the begin record
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6051
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322
2016-06-28 13:47:02 -07:00
Matthew Ahrens
47dfff3b86 OpenZFS 2605, 6980, 6902
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
2016-06-28 13:47:02 -07:00
Brian Behlendorf
669cf0ab29 Sync DMU_BACKUP_FEATURE_* flags
Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING.  The
DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and
then reserved in the upstream OpenZFS implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #4795
2016-06-24 15:14:27 -07:00
Ned Bass
50c957f702 Implement large_dnode pool feature
Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by an large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-06-24 13:13:21 -07:00
Ned Bass
68cbd56e18 Backfill metadnode more intelligently
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates.  As summarized by
@mahrens in #4636:

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limit the rescan to at most once per TXG.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-06-24 13:13:12 -07:00
Tony Hutter
5ad98ad097 Add _ALIGNMENT_REQUIRED to isa_defs.h for checksums
_ALIGNMENT_REQUIRED needs to be #defined in isa_defs.h in order to
port the Illumos checksum code to ZoL:

4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #561
2016-06-21 13:37:04 -07:00
Paul Dagnelie
bc77ba73fe OpenZFS 6513 - partially filled holes lose birth time
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>a
Ported by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.
2016-06-21 10:55:13 -07:00
Gvozden Neskovic
ab9f4b0b82 SIMD implementation of vdev_raidz generate and reconstruct routines
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept first 3 options, and
  the other implementations can be set once module is finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328
2016-06-21 09:27:26 -07:00
Brian Behlendorf
46ab35954c Remove libzfs_graph.c
The libzfs_graph.c source file should have been removed in 330d06f,
it is entirely unused.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4766
2016-06-16 13:53:15 -07:00
Brian Behlendorf
f74b821a66 Add zfs allow and zfs unallow support
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487
2016-06-07 09:16:52 -07:00
Jinshan Xiong
1eeb4562a7 Implementation of AVX2 optimized Fletcher-4
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle trough all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4330
2016-06-02 14:30:51 -07:00
Brian Behlendorf
8fbbc6b4cf Linux 4.7 compat: handler->set() takes both dentry and inode
Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4717 
Issue #4665
2016-06-01 18:10:06 -07:00
Chunwei Chen
f58040c0fc Implement a proper rw_tryupgrade
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then
does rw_enter(RW_READER) if it fails. This violate the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.

This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes #554
2016-05-31 11:44:15 -07:00
YunQiang Su
c60a51b640 Add isa_defs for MIPS
GCC for MIPS only defines _LP64 when 64bit,
while no _ILP32 defined when 32bit.

Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #558
2016-05-31 09:05:56 -07:00
Tony Hutter
26ef0cc7db OpenZFS 6531 - Provide mechanism to artificially limit disk performance
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
2016-05-26 10:11:51 -07:00
Tony Hutter
7e945072d1 Add request size histograms (-r) to zpool iostat, minor man page fix
Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4659
2016-05-25 15:49:35 -07:00
Chunwei Chen
9baaa7deae Linux 4.7 compat: use iterate_shared for concurrent readdir
Register iterate_shared if it exists so the kernel will used shared
lock and allowing concurrent readdir.

Also, use shared lock when doing llseek with SEEK_DATA or SEEK_HOLE
to allow concurrent seeking.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4664
Closes #4665
2016-05-20 11:09:16 -07:00
Chunwei Chen
68e8f59afb Linux 4.7 compat: replace blk_queue_flush with blk_queue_write_cache
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4665
2016-05-20 11:08:55 -07:00
Chunwei Chen
fd4c7b7a73 Linux 4.7 compat: handler->get() takes both dentry and inode
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4665
2016-05-20 11:08:21 -07:00
Chunwei Chen
fdbc1ba99d Linux 4.7 compat: inode_lock() and friends
Linux 4.7 changes i_mutex to i_rwsem, and we should used inode_lock and
inode_lock_shared to do exclusive and shared lock respectively.

We use spl_inode_lock{,_shared}() to hide the difference. Note that on older
kernel you'll always take an exclusive lock.

We also add all other inode_lock friends. And nested users now should
explicitly call spl_inode_lock_nested with correct subclass.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4665
Closes #549
2016-05-20 11:00:14 -07:00
Nikolay Borisov
278f223668 Kill znode->z_gen field
This field is a duplicate of the inode->i_generation, so just
kill it.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4538
Closes #4654
2016-05-19 13:06:14 -07:00
Brian Behlendorf
ada8258141 Revert "zhack: Add 'feature disable' command"
This reverts commit 8302528617 and
ebecfcd699 which broke the build.
While these patches do apply cleanly and passed previous test
runs they need to be updated to account for the changes made in
commit 241b541574.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 11:52:07 -07:00
Brian Behlendorf
8302528617 zhack: Add 'feature disable' command
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
2016-05-17 11:00:21 -07:00
Boris Protopopov
e3a07cd033 Use zfs range locks in ztest
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest.  This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4023
Issue #4024
2016-05-17 10:40:30 -07:00
Chunwei Chen
d88895a069 Remove dummy znode from zvol_state
struct zvol_state contains a dummy znode, which is around 1KB on x64,
only for zfs_range_lock. But in reality, other than z_range_lock and
z_range_avl, zfs_range_lock only need znode on regular file, which
means we add 1KB on a structure and gain nothing.

In this patch, we remove the dummy znode for zvol_state. In order to
do that, we also need to refactor zfs_range_lock a bit. We move
z_range_lock and z_range_avl pair out of znode_t to form zfs_rlock_t.
This new struct replaces znode_t as the main handle inside the range
lock functions.

We also add pointers to z_size, z_blksz, and z_max_blksz so range lock
code doesn't depend on znode_t.  This allows non-ZPL consumers like
Lustre to use the range locks with their equivalent znode_t structure.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4510
2016-05-17 10:29:02 -07:00
Denys Rtveliashvili
206971d234 OpenZFS 6739 - assumption in cv_timedwait_hires
Userland version of cv_timedwait_hires() always assumes absolute time.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Denys Rtveliashvili <denys@rtveliashvili.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6739
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/41c6413

Porting Notes:
The ported change has revealed a number of problems in the Linux-specific code,
as it was expecting incorrect return codes from pthread_* functions.
Reviewed and improved the usage of pthread_* function in lib/libzpool/kernel.c.
2016-05-15 15:18:25 -07:00
Chunwei Chen
a9bb2b6827 Use cv_timedwait_sig_hires in arc_reclaim_thread
The was originally using interruptible cv_timedwait_sig, but was changed
to uninterruptible cv_timedwait_hires in ae6d0c6. Use _sig_hires instead
to allow interruptible sleep.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4633
Closes #4634
2016-05-12 14:56:47 -07:00
Chunwei Chen
39cd90ef08 Add cv_timedwait_sig_hires to allow interruptible sleep
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #548
2016-05-12 14:54:15 -07:00
Brian Behlendorf
c15706490e Revert "Kill znode->z_gen field"
This reverts commit 4cd77889b6.  The
i_generation field in the inode is 32-bit and the SA code expects
64-bit fixed values.  Revert this optimization for now until
this is cleanly addressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4538
2016-05-12 13:36:22 -07:00
Tony Hutter
193a37cb24 Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues
Update the zfs module to collect statistics on average latencies, queue sizes,
and keep an internal histogram of all IO latencies.  Along with this, update
"zpool iostat" with some new options to print out the stats:

-l: Include average IO latencies stats:

 total_wait     disk_wait    syncq_wait    asyncq_wait  scrub
 read  write   read  write   read  write   read  write   wait
-----  -----  -----  -----  -----  -----  -----  -----  -----
    -   41ms      -    2ms      -   46ms      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -    5ms      -    1ms      -    1us      -    4ms      -
    -      -      -      -      -      -      -      -      -
    -   49ms      -    2ms      -   47ms      -      -      -
    -      -      -      -      -      -      -      -      -
    -    2ms      -    1ms      -      -      -    1ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  1ms    1ms    1ms  413us   16us   25us      -    5ms      -
  2ms    1ms    2ms  412us   26us   25us      -    5ms      -
    -    1ms      -  413us      -   25us      -    5ms      -
    -    1ms      -  460us      -   29us      -    5ms      -
196us    1ms  196us  370us    7us   23us      -    5ms      -
-----  -----  -----  -----  -----  -----  -----  -----  -----

-w: Print out latency histograms:

sdb           total           disk         sync_queue      async_queue
latency    read   write    read   write    read   write    read   write   scrub
-------  ------  ------  ------  ------  ------  ------  ------  ------  ------
1ns           0       0       0       0       0       0       0       0       0
...
33us          0       0       0       0       0       0       0       0       0
66us          0       0     107    2486       2     788      12      12       0
131us         2     797     359    4499      10     558     184     184       6
262us        22     801     264    1563      10     286     287     287      24
524us        87     575      71   52086      15    1063     136     136      92
1ms         152    1190       5   41292       4    1693     252     252     141
2ms         245    2018       0   50007       0    2322     371     371     220
4ms         189    7455      22  162957       0    3912    6726    6726     199
8ms         108    9461       0  102320       0    5775    2526    2526      86
17ms         23   11287       0   37142       0    8043    1813    1813      19
34ms          0   14725       0   24015       0   11732    3071    3071       0
67ms          0   23597       0    7914       0   18113    5025    5025       0
134ms         0   33798       0     254       0   25755    7326    7326       0
268ms         0   51780       0      12       0   41593   10002   10002       0
537ms         0   77808       0       0       0   64255   13120   13120       0
1s            0  105281       0       0       0   83805   20841   20841       0
2s            0   88248       0       0       0   73772   14006   14006       0
4s            0   47266       0       0       0   29783   17176   17176       0
9s            0   10460       0       0       0    4130    6295    6295       0
17s           0       0       0       0       0       0       0       0       0
34s           0       0       0       0       0       0       0       0       0
69s           0       0       0       0       0       0       0       0       0
137s          0       0       0       0       0       0       0       0       0
-------------------------------------------------------------------------------

-h: Help

-H: Scripted mode. Do not display headers, and separate fields by a single
    tab instead of arbitrary space.

-q: Include current number of entries in sync & async read/write queues,
    and scrub queue:

 syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
 pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0     78     29      0      0      0      0
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
    -      -      -      -      -      -      -      -      -      -
    0      0      0      0      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----
    0      0    227    394      0     19      0      0      0      0
    0      0    227    394      0     19      0      0      0      0
    0      0    108     98      0     19      0      0      0      0
    0      0     19     98      0      0      0      0      0      0
    0      0     78     98      0      0      0      0      0      0
    0      0     19     88      0      0      0      0      0      0
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----

-p: Display numbers in parseable (exact) values.

Also, update iostat syntax to allow the user to specify specific vdevs
to show statistics for.  The three options for choosing pools/vdevs are:

Display a list of pools:
    zpool iostat ... [pool ...]

Display a list of vdevs from a specific pool:
    zpool iostat ... [pool vdev ...]

Display a list of vdevs from any pools:
    zpool iostat ... [vdev ...]

Lastly, allow zpool command "interval" value to be floating point:
    zpool iostat -v 0.5

Signed-off-by: Tony Hutter <hutter2@llnl.gov
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4433
2016-05-12 12:36:32 -07:00
Adam Stevko
2a8b84b747 OpenZFS 3993, 4700
3993 zpool(1M) and zfs(1M) should support -p for "list" and "get"
4700 "zpool get" doesn't support -H or -o options

Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/3993
OpenZFS-issue: https://www.illumos.org/issues/4700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352

Porting notes:
I removed ZoL's zpool_get_prop_literal() in favor of
zpool_get_prop(..., boolean_t literal) since that's what OpenZFS
uses.  The functionality is the same.
2016-05-11 11:49:37 -07:00
David Quigley
5e39e4f0b2 Add a macro to convert seconds to nanoseconds and vice-versa
Required infrastructure for zfsonlinux/zfs#4600.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #546
2016-05-05 16:10:46 -07:00
Tony Hutter
f7c63cda90 OpenZFS 6544 - incorrect comment in libzfs.h about offline status
6544 incorrect comment in libzfs.h about offline status
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6544
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb605c4
Closes #4595
2016-05-05 09:30:05 -07:00
Joe Stein
e0ab3ab553 OpenZFS 6736 - ZFS per-vdev ZAPs
6736 ZFS per-vdev ZAPs
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/6736
  https://github.com/openzfs/openzfs/commit/215198a

Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4515
2016-05-02 14:27:45 -07:00
Nikolay Borisov
4cd77889b6 Kill znode->z_gen field
This field is a duplicate of the inode->i_generation, so just kill it

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4538
2016-05-02 11:22:31 -07:00
Brian Behlendorf
874bd959f4 Fix user namespaces uid/gid mapping
As described in torvalds/linux@5f3a4a2 the &init_user_ns, and
not the current user_ns, should be passed to posix_acl_from_xattr()
and posix_acl_to_xattr().  Conveniently the init_user_ns is
available through the init credential (kcred).


Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Closes #4177
2016-04-30 12:21:51 -07:00
Alex Reece
463a8cfe2b Illumos 6844 - dnode_next_offset can detect fictional holes
6844 dnode_next_offset can detect fictional holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>

dnode_next_offset is used in a variety of places to iterate over the
holes or allocated blocks in a dnode. It operates under the premise that
it can iterate over the blockpointers of a dnode in open context while
holding only the dn_struct_rwlock as reader. Unfortunately, this premise
does not hold.

When we create the zio for a dbuf, we pass in the actual block pointer
in the indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the
bp is now inconsistent from the perspective of dnode_next_offset: the bp
will appear to be a hole until zio_dva_allocate finally finishes filling
it in. In the meantime, dnode_next_offset can detect a hole in the dnode
when none exists.

I was able to experimentally demonstrate this behavior with the
following setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the
first L0 block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the
middle of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent
hole.

The fix is to ensure that it is valid to iterate over the indirect
blocks in a dnode while holding the dn_struct_rwlock by passing the zio
a copy of the BP and updating the actual BP in dbuf_write_ready while
holding the lock.

References:
  https://www.illumos.org/issues/6844
  https://github.com/openzfs/openzfs/pull/82
  DLPX-35372

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4548
2016-04-27 16:24:15 -07:00
Tim Chase
3bf657b90c Use vmem_free() in dfl_free() and add dfl_alloc()
This change was lost, somehow, in e5f9a9a.  Since the arrays can be
rather large, they need to be allocated with vmem_zalloc() via dfl_alloc()
and freed with vmem_free() via dfl_free().

The new dfl_alloc() function should be used to allocate object of type
dkioc_free_list_t in order that they're allocated from vmem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes #543
2016-04-26 11:20:14 -07:00
Chunwei Chen
cdd39dd245 Use kernel provided mutex owner
To reduce mutex footprint, we detect the existence of owner in kernel mutex,
and rely on it if it exists.

Note that before Linux 3.0, mutex owner is of type thread_info. Also note
that, in Linux 3.18, the condition for owner is changed from
CONFIG_DEBUG_MUTEXES || CONFIG_SMP to
CONFIG_DEBUG_MUTEXES || CONFIG_MUTEX_SPIN_ON_OWNER

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #540
2016-04-25 17:04:07 -07:00
Brian Behlendorf
da5e151f20 Add pn_alloc()/pn_free() functions
In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and
pn_free() functions must be implemented.  The existing illumos
implementation were used for this purpose.

The `flags` argument which was used in places wrapped by the
HAVE_PN_UTILS condition has beed added back to zfs_remove() and
zfs_link() functions.  This removes a small point of divergence
between the ZoL code and upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4522
2016-04-21 09:49:25 -07:00
Chunwei Chen
6760077194 Make zfs mount according to relatime config in dataset
Also enable lazytime in mount.zfs

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
2016-04-05 18:55:59 -07:00
Chunwei Chen
0df9673f01 Fix atime handling and relatime
The problem for atime:

We have 3 places for atime: inode->i_atime, znode->z_atime and SA. And its
handling is a mess. A huge part of mess regarding atime comes from
zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave
inconsistently with those three values.

zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you
don't pass ATTR_ATIME. Which means every write(2) operation which only updates
ctime and mtime will cause atime changes to not be written to disk.

Also zfs_inode_update from write(2) will replace inode->i_atime with what's
inside SA(stale). But doesn't touch z_atime. So after read(2) and write(2).
You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0.

Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's
inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new),
SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll
leave with a stale atime.

The problem for relatime:

We do have a relatime config inside ZFS dataset, but how it should interact
with the mount flag MS_RELATIME is not well defined. It seems it wanted
relatime mount option to override the dataset config by showing it as
temporary in `zfs get`. But at the same time, `zfs set relatime=on|off` would
also seems to want to override the mount option. Not to mention that
MS_RELATIME flag is actually never passed into ZFS, so it never really worked.

How Linux handles atime:

The Linux kernel actually handles atime completely in VFS, except for writing
it to disk. So if we remove the atime handling in ZFS, things would just work,
no matter it's strictatime, relatime, noatime, or even O_NOATIME. And whenever
VFS updates the i_atime, it will notify the underlying filesystem via
sb->dirty_inode().

And also there's one thing to note about atime flags like MS_RELATIME and
other flags like MS_NODEV, etc. They are mount point flags rather than
filesystem(sb) flags. Since native linux filesystem can be mounted at multiple
places at the same time, they can all have different atime settings. So these
flags are never passed down to filesystem drivers.

What this patch tries to do:

We remove znode->z_atime, since we won't gain anything from it. We remove most
of the atime handling and leave it to VFS. The only thing we do with atime is
to write it when dirty_inode() or setattr() is called. We also add
file_accessed() in zpl_read() since it's not provided in vfs_read().

After this patch, only the MS_RELATIME flag will have effect. The setting in
dataset won't do anything. We will make zfstuil to mount ZFS with MS_RELATIME
set according to the setting in dataset in future patch.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
2016-04-05 18:54:55 -07:00
Don Brady
39fc0cb557 Add support for devid and phys_path keys in vdev disk labels
This is foundational work for ZED.

Updates a leaf vdev's persistent device strings on Linux platform

* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES

Some examples:

    path: '/dev/sdb1'
    devid: 'scsi-350000394a8ca4fbc-part1'
    phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'

    path: '/dev/mapper/mpatha'
    devid: 'dm-uuid-mpath-35000c5006304de3f'

Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2856
Closes #3978
Closes #4416
2016-03-31 13:45:53 -07:00
Brian Behlendorf
726c4a2565 Remove complicated libspl assert wrappers
Effectively provide our own version of assert()/verify() for use
in user space.  This minimizes our dependencies and aligns the
user space assertion handling with what's used in the kernel.

Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4449
2016-03-30 12:26:42 -07:00
Carlo Landmeter
1a01c207cb Ensure correct return value type
When compiling with musl libc the return type will be incorrect.

Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4454
2016-03-29 18:33:17 -07:00
Gvozden Neskovic
fc0c72b167 Support for vectorized algorithms on x86
This is initial support for x86 vectorized implementations of ZFS parity
and checksum algorithms.

For the compilation phase, configure step checks if toolchain supports relevant
instruction sets. Each implementation must ensure that the code is not passed
to compiler if relevant instruction set is not supported. For this purpose,
following new defines are provided if instruction set is supported:
	- HAVE_SSE,
	- HAVE_SSE2,
	- HAVE_SSE3,
	- HAVE_SSSE3,
	- HAVE_SSE4_1,
	- HAVE_SSE4_2,
	- HAVE_AVX,
	- HAVE_AVX2.

For detecting if an instruction set can be used in runtime, following functions
are provided in (include/linux/simd_x86.h):
	- zfs_sse_available()
	- zfs_sse2_available()
	- zfs_sse3_available()
	- zfs_ssse3_available()
	- zfs_sse4_1_available()
	- zfs_sse4_2_available()
	- zfs_avx_available()
	- zfs_avx2_available()
	- zfs_bmi1_available()
	- zfs_bmi2_available()

These function should be called once, on module load, or initialization.
They are safe to use from user and kernel space.
If an implementation is using more than single instruction set, both compiler
and runtime support for all relevant instruction sets should be checked.

Kernel fpu methods:
	- kfpu_begin()
	- kfpu_end()

Use __get_cpuid_max and __cpuid_count from <cpuid.h>
Both gcc and clang have support for these. They also handle ebx register
in case it is used for PIC code.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #4381
2016-03-21 09:24:34 -07:00
Dimitri John Ledkov
224817e2a8 Add support for s390[x].
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #537
2016-03-17 09:54:49 -07:00
Brian Behlendorf
6bb24f4dc7 Add the ZFS Test Suite
Add the ZFS Test Suite and test-runner framework from illumos.
This is a continuation of the work done by Turbo Fredriksson to
port the ZFS Test Suite to Linux.  While this work was originally
conceived as a stand alone project integrating it directly with
the ZoL source tree has several advantages:

  * Allows the ZFS Test Suite to be packaged in zfs-test package.
    * Facilitates easy integration with the CI testing.
    * Users can locally run the ZFS Test Suite to validate ZFS.
      This testing should ONLY be done on a dedicated test system
      because the ZFS Test Suite in its current form is destructive.
  * Allows the ZFS Test Suite to be run directly in the ZoL source
    tree enabled developers to iterate quickly during development.
  * Developers can easily add/modify tests in the framework as
    features are added or functionality is changed.  The tests
    will then always be in sync with the implementation.

Full documentation for how to run the ZFS Test Suite is available
in the tests/README.md file.

Warning: This test suite is designed to be run on a dedicated test
system.  It will make modifications to the system including, but
not limited to, the following.

  * Adding new users
  * Adding new groups
  * Modifying the following /proc files:
    * /proc/sys/kernel/core_pattern
    * /proc/sys/kernel/core_uses_pid
  * Creating directories under /

Notes:
  * Not all of the test cases are expected to pass and by default
    these test cases are disabled.  The failures are primarily due
    to assumption made for illumos which are invalid under Linux.
  * When updating these test cases it should be done in as generic
    a way as possible so the patch can be submitted back upstream.
    Most existing library functions have been updated to be Linux
    aware, and the following functions and variables have been added.
    * Functions:
      * is_linux          - Used to wrap a Linux specific section.
      * block_device_wait - Waits for block devices to be added to /dev/.
    * Variables:            Linux          Illumos
      * ZVOL_DEVDIR         "/dev/zvol"    "/dev/zvol/dsk"
      * ZVOL_RDEVDIR        "/dev/zvol"    "/dev/zvol/rdsk"
      * DEV_DSKDIR          "/dev"         "/dev/dsk"
      * DEV_RDSKDIR         "/dev"         "/dev/rdsk"
      * NEWFS_DEFAULT_FS    "ext2"         "ufs"
  * Many of the disabled test cases fail because 'zfs/zpool destroy'
    returns EBUSY.  This is largely causes by the asynchronous nature
    of device handling on Linux and is expected, the impacted test
    cases will need to be updated to handle this.
  * There are several test cases which have been disabled because
    they can trigger a deadlock.  A primary example of this is to
    recursively create zpools within zpools.  These tests have been
    disabled until the root issue can be addressed.
  * Illumos specific utilities such as (mkfile) should be added to
    the tests/zfs-tests/cmd/ directory.  Custom programs required by
    the test scripts can also be added here.
  * SELinux should be either is permissive mode or disabled when
    running the tests.  The test cases should be updated to conform
    to a standard policy.
  * Redundant test functionality has been removed (zfault.sh).
  * Existing test scripts (zconfig.sh) should be migrated to use
    the framework for consistency and ease of testing.
  * The DISKS environment variable currently only supports loopback
    devices because of how the ZFS Test Suite expects partitions to
    be named (p1, p2, etc).  Support must be added to generate the
    correct partition name based on the device location and name.
  * The ZFS Test Suite is part of the illumos code base at:
    https://github.com/illumos/illumos-gate/tree/master/usr/src/test

Original-patch-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6
Closes #1534
2016-03-16 13:46:16 -07:00
Brian Behlendorf
a6ae97caed Add rw_tryupgrade()
This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms.  It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held.  On other platforms the lock is never released during
the upgrade process.  This is necessary under Linux because the kernel
does not provide an upgrade function.

There are currently no callers in the ZFS code where this change in
behavior is a problem.  In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes #534
2016-03-10 13:05:25 -08:00
Boris Protopopov
a0bd735adb Add support for asynchronous zvol minor operations
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset

zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()

Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.

* Each taskq must be single threaded to ensure tasks are always
  processed in the order in which they were dispatched.

* There is a taskq per-pool in order to keep the pools independent.
  This way if one pool is suspended it will not impact another.

* The preferred location to dispatch a zvol minor task is a sync
  task.  In this context there is easy access to the spa_t and
  minimal error handling is required because the sync task must
  succeed.

Support for asynchronous zvol minor operations address issue #3681.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2217
Closes #3678
Closes #3681
2016-03-10 09:49:22 -08:00
Thijs Cramer
95003f7098 Updated paths to scan when importing zpool(s)
Added by-partlabel and by-partuuid to the default device search
path.  Made made device names in by-label more preferable.

Signed-off-by: Thijs Cramer <thijs.cramer@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3892
2016-03-09 10:41:23 -08:00
Brian Behlendorf
7d11e37e55 Require libblkid
Historically libblkid support was detected as part of configure
and optionally enabled.  This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.

This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools.  For distributions
which include a modern version of libblkid there is no change in
behavior.  Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.

Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2448
2016-03-09 10:39:22 -08:00
smh
9f500936c8 FreeBSD r256956: Improve ZFS N-way mirror read performance by using load and locality information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by:	gibbs, mav, will
MFC after:	2 weeks
Sponsored by:	Multiplay

References:
  https://github.com/freebsd/freebsd@5c7a6f5d
  https://github.com/freebsd/freebsd@31b7f68d
  https://github.com/freebsd/freebsd@e186f564

Performance Testing:
  https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141

Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc by my
  collegue Andriy Gapon has been included. It applied perfectly, but
  added a cstyle regression.
- This replaces 556011dbec entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4334
2016-02-26 11:24:35 -08:00
Richard Yao
d2f3e292dc Add -gLp to zpool subcommands for alt vdev names
The following options have been added to the zpool add, iostat,
list, status, and split subcommands.  The default behavior was
not modified, from zfs(8).

  -g    Display vdev GUIDs  instead  of  the  normal  short
        device  names.  These GUIDs can be used in-place of
        device   names   for    the    zpool    detach/off‐
        line/remove/replace commands.

  -L    Display real paths for vdevs resolving all symbolic
        links. This can be used to lookup the current block
        device  name regardless of the /dev/disk/ path used
        to open it.

  -p    Display  full  paths  for vdevs instead of only the
        last component of the path.  This can  be  used  in
        conjunction with the -L flag.

This behavior may also be enabled using the following environment
variables.

  ZPOOL_VDEV_NAME_GUID
  ZPOOL_VDEV_NAME_FOLLOW_LINKS
  ZPOOL_VDEV_NAME_PATH

This change is based on worked originally started by Richard Yao
to add a -g option.  Then extended by @ilovezfs to add a -L option
for openzfsonosx.  Those changes have been merged, re-factored,
a -p option added and extended to all relevant zpool subcommands.

Original-patch-by: Richard Yao <ryao@gentoo.org>
Extended-by: ilovezfs <ilovezfs@icloud.com>
Extended-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2011
Closes #4341
2016-02-25 11:58:39 -08:00
Tom Caputi
18d2f56176 Changes to support zfs encryption
Unused modlinkage struct removed and ntohll functions added.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #533
2016-02-25 11:42:46 -08:00
Richard Yao
0b43696e66 random_get_pseudo_bytes() need not provide cryptographic strength entropy
Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf

Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:

http://xorshift.di.unimi.it/#speed

The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.

Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.

We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:

1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.

2.  Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.

3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.

4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.

5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.

6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPU == 1`).  In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #372
2016-02-17 09:49:09 -08:00
Chunwei Chen
8f3b403a73 Allow kicking a taskq to spawn more threads
This patch add a module parameter spl_taskq_kick. When writing non-zero value
to it, it will scan all the taskq, if a taskq contains a task pending for more
than 5 seconds, it will be forced to spawn a new thread. This is use as an
emergency recovery from deadlock, not a general solution.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #529
2016-02-05 14:08:31 -08:00
Brian Behlendorf
b6fcb792ca Illumos 6414 - vdev_config_sync could be simpler
6414 vdev_config_sync could be simpler
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6414
  https://github.com/illumos/illumos-gate/commit/eb5bb58

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
2016-01-28 12:44:39 -05:00
Richard Yao
be29e6a6e6 kobj_read_file: Return -1 on vn_rdwr() error
I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.

Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #513
2016-01-23 10:10:44 -08:00
Brian Behlendorf
519129ff4e Illumos 6815179, 6844191
6815179 zpool import with a large number of LUNs is too slow
6844191 zpool import, scanning of disks should be multi-threaded

References:
  https://github.com/illumos/illumos-gate/commit/4f67d75

Porting notes:
- This change was originally never ported to Linux due to it
  dependence on the thread pool interface.  This patch solves
  that issue by switching the code to use the existing taskq
  implementation which provides the same basic functionality.
  However, in order for this to work properly thread_init()
  and thread_fini() must be called around to taskq consumer
  to perform the needed thread initialization.

- The check_one_slice, nozpool_all_slices, and check_slices
  functions have been disabled for Linux.  They are difficult,
  but possible, to implement for Linux due to how partitions
  are get names.  Since this is only an optimization this code
  can be added at a latter date.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-22 09:39:46 -08:00
Matthew Ahrens
19d55079ae Illumos 4950 - files sometimes can't be removed from a full filesystem
4950 files sometimes can't be removed from a full filesystem
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4950
  https://github.com/illumos/illumos-gate/commit/4bb7380

Porting notes:
- ZoL currently does not log discards to zvols, so the portion of
  this patch that modifies the discard logging to mark it as
  freeing space has been discarded.

2. may_delete_now had been removed from zfs_remove() in ZoL.
   It has been reintroduced.

3. We do not try to emulate vnodes, so the following lines are
   not valid on Linux:

	mutex_enter(&vp->v_lock);
	may_delete_now = vp->v_count == 1 && !vn_has_cached_data(vp);
	mutex_exit(&vp->v_lock);

  This has been replaced with:

	mutex_enter(&zp->z_lock);
	may_delete_now = atomic_read(&ip->i_count) == 1 && !(zp->z_is_mapped);
	mutex_exit(&zp->z_lock);

Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-01-21 16:59:30 -08:00
Chunwei Chen
16522ac290 Use tsd to store tq for taskq_member
To prevent taskq_member holding tq_lock and doing linear search, thus causing
contention. We store the taskq pointer to which the thread belongs in tsd.
This way taskq_member will not need to touch tq_lock, and tsd has per slot
spinlock. So the contention should be reduced greatly.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #500
Closes #504
Closes #505
2016-01-20 13:07:45 -08:00
Brian Behlendorf
de77e24590 Linux 4.5 compat: pfn_t typedef
The pfn_t typedef was inherited from Illumos but never directly
used by any SPL consumers.  This didn't cause any issues until
the Linux 4.5 kernel introduced a typedef of the same name.
See torvalds/linux/commit/34c0fd54, this patch removes the
unused Illumos version to prevent a conflict.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #524
2016-01-20 11:39:18 -08:00
Chunwei Chen
d348f23a6a Turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark
In b4ad50a, we abandoned memalloc_noio_save in favor of spl_fstrans_mark
because earlier kernel with it doesn't turn off __GFP_FS. However, for newer
kernel, we would prefer PF_MEMALLOC_NOIO because it would work for allocation
in kernel which we cannot control otherwise. So in this patch, we turn on both
PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #523
2016-01-20 11:38:31 -08:00
Brian Behlendorf
4967a3eb9d Linux 4.5 compat: xattr list handler
The registered xattr .list handler was simplified in the 4.5 kernel
to only perform a permission check.  Given a dentry for the file it
must return a boolean indicating if the name is visible.  This
differs slightly from the previous APIs which also required the
function to copy the name in to the provided list and return its
size.  That is now all the responsibility of the caller.

This should be straight forward change to make to ZoL since we've
always required the caller to make the copy.  However, this was
slightly complicated by the need to support 3 older APIs.  Yes,
between 2.6.32 and 4.5 there are 4 versions of this interface!

Therefore, while the functional change in this patch is small it
includes significant cleanup to make the code understandable and
maintainable.  These changes include:

- Improved configure checks for .list, .get, and .set interfaces.
  - Interfaces checked from newest to oldest.
  - Strict checking for each possible known interface.
  - Configure fails when no known interface is available.
  - HAVE_*_XATTR_LIST renamed HAVE_XATTR_LIST_* for consistency
    with similar iops and fops configure checks.

- POSIX_ACL_XATTR_{DEFAULT|ACCESS} were removed forcing callers to
  move to their replacements, XATTR_NAME_POSIX_ACL_{DEFAULT|ACCESS}.
  Compatibility wrapper were added for old kernels.

- ZPL_XATTR_LIST_WRAPPER added which behaves the same as the existing
  ZPL_XATTR_{GET|SET} WRAPPERs.  Only the inode is guaranteed to be
  a valid pointer, passing NULL for the 'list' and 'name' variables
  is allowed and must be checked for.  All .list functions were
  updated to use the wrapper to aid readability.

- zpl_xattr_filldir() updated to use the .list function for its
  permission check which is consistent with the updated Linux 4.5
  interface.  If a .list function is registered it should return 0
  to indicate a name should be skipped, if there is no registered
  function the name will be added.

- Additional documentation from xattr(7) describing the correct
  behavior for each namespace was added before the relevant handlers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
2016-01-20 11:36:56 -08:00
Josef 'Jeff' Sipek
bc89ac8479 Illumos 5045 - use atomic_{inc,dec}_* instead of atomic_add_*
5045 use atomic_{inc,dec}_* instead of atomic_add_*
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5045
  https://github.com/illumos/illumos-gate/commit/1a5e258

Porting notes:
- All changes to non-ZFS files dropped.
- Changes to zfs_vfsops.c dropped because they were Illumos specific.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4220
2016-01-15 15:38:36 -08:00
Joe Stein
82f6f6e654 Illumos 6298 - zfs_create_008_neg and zpool_create_023_neg
6298 zfs_create_008_neg and zpool_create_023_neg need to be updated
for large block support
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/6298
  https://github.com/illumos/illumos-gate/commit/e9316f7

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4217
2016-01-15 15:38:35 -08:00
Richard Yao
546f38433a SET_ERROR should print strings
When debugging with tracepoints, we see string pointers:

zfs  3017 [006]  8878.728915: zfs:zfs_set__error: ffffffffa0eec3fc:3013:ffffffffa0ebcd60(): error 0x2
        ffffffffa0e1ca43 spa_open_common (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffffa0e1cbe3 spa_open (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffffa0e6f6ef zfs_ioc_stable (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffffa0e6f2a9 zfsdev_ioctl (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffff811909dd do_vfs_ioctl ([kernel.kallsyms])
        ffffffff81190c41 sys_ioctl ([kernel.kallsyms])
        ffffffff8156e2e9 system_call_fastpath ([kernel.kallsyms])
            7ff7d8be69c7 __GI___ioctl (/lib64/libc-2.19.so)
            7ff7d90cac53 lzc_ioctl.constprop.3 (/lib64/libzfs_core.so.1.0.0)
        636f695f637a6c00 [unknown] ([unknown])

Printing the actual strings is more convenient:

zfs  3461 [001] 10599.847692: zfs:zfs_set__error: spa.c:3013:spa_open_common(): error 0x2
        ffffffffa116ba43 spa_open_common (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffffa116bbe3 spa_open (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffffa11be8df zfs_ioc_stable (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffffa11be499 zfsdev_ioctl (/lib/modules/3.12.21-gentoo-r1/extra/zfs/zfs.ko)
        ffffffff811909dd do_vfs_ioctl ([kernel.kallsyms])
        ffffffff81190c41 sys_ioctl ([kernel.kallsyms])
        ffffffff8156e2e9 system_call_fastpath ([kernel.kallsyms])
            7f11b843c9c7 __GI___ioctl (/lib64/libc-2.19.so)
            7f11b8920c53 lzc_ioctl.constprop.3 (/lib64/libzfs_core.so.1.0.0)
        636f695f637a6c00 [unknown] ([unknown])

A few other tracepoints have strings as well, so switch to printing
the actual string values at the same time.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4212
2016-01-15 15:38:35 -08:00
Richard Yao
b10695c8f1 Remove fastwrite mutex
The fast write mutex is intended to protect accounting, but it is
redundant because all accounting is performed through atomic operations.
It also serializes all metaslab IO behind a mutex, which introduces a
theoretical scaling regression that the Illumos developers did not like
when we showed this to them. Removing it makes the selection of the
metaslab_group lock free as it is on Illumos. The selection is not quite
the same without the lock because the loop races with IO completions,
but any imbalances caused by this are likely to be corrected by
subsequent metaslab group selections.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3643
2016-01-15 15:38:35 -08:00
Brian Behlendorf
c96c36fa22 Fix zsb->z_hold_mtx deadlock
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexs and AVLs trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4106
2016-01-15 15:33:45 -08:00
Brian Behlendorf
0720116d4d Add zfs_object_mutex_size module option
Add a zfs_object_mutex_size module option to facilitate resizing the
the per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in #4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #4106
2016-01-15 15:33:44 -08:00
Brian Behlendorf
d21f279a94 Illumos 3465, 3466, 3467, 3468, 3470, 3473
3465 ::walk ... | ::<dcmd> misinterprets input as symbol names
3466 ::tsd should handle missing/NULL values better
3467 mdb_ctf_vread() could be more useful
3468 mdb enhancements for zfs development
3470 ::whatis does not print callers from KMF_LITE
3473 mdb_get_module() returns wrong module
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3468
  https://github.com/illumos/illumos-gate/commit/28e4da2

Porting notes:
- The only portion of this patch which applies to ZoL is a small
  change to types used in the refcount structure.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4216
2016-01-13 13:56:27 -08:00
Brian Behlendorf
89666a8e1c Increase default user space stack size
Under RHEL6/CentOS6 the default stack size must be increased to 32K
to prevent overflowing the stack when running ztest.  This isn't an
issue for other distributions due to either the version of pthreads
or perhaps the compiler.  Doubling the stack size resolves the
issue safely for all distribution and leaves us some headroom.

$ sudo -E ztest -V -T 300 -f /var/tmp/
5 vdevs, 7 datasets, 23 threads, 300 seconds...

loading space map for vdev 0 of 1, metaslab 0 of 30 ...
...
loading space map for vdev 0 of 1, metaslab 14 of 30 ...
child died with signal 11
Exited ztest with error 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4215
2016-01-13 13:55:12 -08:00