mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-22 00:41:29 +03:00

Author	SHA1	Message	Date
Tom Caputi	2c24b5b148	Fix issues found with zfs diff Two deadlocks / ASSERT failures were introduced in `a2c2ed1b` which would occur whenever arc_buf_fill() failed to decrypt a block of data. This occurred because the call to arc_buf_destroy() which was responsible for cleaning up the newly created buffer would attempt to take out the hdr lock that it was already holding. This was resolved by calling the underlying functions directly without retaking the lock. In addition, the dmu_diff() code did not properly ensure that keys were loaded and mapped before begining dataset traversal. It turns out that this code does not need to look at any encrypted values, so the code was altered to perform raw IO only. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7354 Closes #7456	2018-05-01 11:24:20 -07:00
Tomohiro Kusumi	d6133fc500	Silence compile-time warning on unused variable ASSERT3U() could be NOP which then leads to having unused pointer spa. metaslab.c: In function 'metaslab_condense': metaslab.c:2075:9: warning: unused variable 'spa' [-Wunused-variable] spa_t spa = msp->ms_group->mg_vd->vdev_spa; Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #7489	2018-05-01 11:15:54 -07:00
loli10K	85ce3f4fd1	Adopt pyzfs from ClusterHQ This commit introduces several changes: * Update LICENSE and project information * Give a good PEP8 talk to existing Python source code * Add RPM/DEB packaging for pyzfs * Fix some outstanding issues with the existing pyzfs code caused by changes in the ABI since the last time the code was updated * Integrate pyzfs Python unittest with the ZFS Test Suite * Add missing libzfs_core functions: lzc_change_key, lzc_channel_program, lzc_channel_program_nosync, lzc_load_key, lzc_receive_one, lzc_receive_resumable, lzc_receive_with_cmdprops, lzc_receive_with_header, lzc_reopen, lzc_send_resume, lzc_sync, lzc_unload_key, lzc_remap Note: this commit slightly changes zfs_ioc_unload_key() ABI. This allow to differentiate the case where we tried to unload a key on a non-existing dataset (ENOENT) from the situation where a dataset has no key loaded: this is consistent with the "change" case where trying to zfs_ioc_change_key() from a dataset with no key results in EACCES. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7230	2018-05-01 10:33:35 -07:00
Alexander Motin	20507534d4	OpenZFS 9434 - Speculative prefetch is blocked by device removal code Device removal code does not set spa_indirect_vdevs_loaded for pools that never experienced device removal. At least one visual consequence of it is completely blocked speculative prefetcher. This patch sets the variable in such situations. Authored by: Alexander Motin <mav@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9434 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/16127b627b Closes #7480	2018-04-30 13:05:55 -07:00
Matthew Ahrens	964c2d69a9	OpenZFS 9236 - nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Patch Notes: * Removed ZFS_DEBUG_SPA from zfs-module-parameters.5 OpenZFS-issue: https://www.illumos.org/issues/9236 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668 Closes #7467	2018-04-30 10:19:48 -07:00
Mark Wright	089500e792	Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build Fix build errors with gcc 7.3.0 on Gentoo with kernel 4.16.3 built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as: module/zfs/vdev_indirect.c:296:2: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init] vdev_indirect_map_free, ^~~~~~~~~~~~~~~~~~~~~~ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Mark Wright <gienah@gentoo.org> Closes #7464	2018-04-20 09:53:25 -07:00
Chunwei Chen	599b864813	Fix ENOSPC in "Handle zap_add() failures in ..." Commit `cc63068` caused ENOSPC error when copy a large amount of files between two directories. The reason is that the patch limits zap leaf expansion to 2 retries, and return ENOSPC when failed. The intent for limiting retries is to prevent pointlessly growing table to max size when adding a block full of entries with same name in different case in mixed mode. However, it turns out we cannot use any limit on the retry. When we copy files from one directory in readdir order, we are copying in hash order, one leaf block at a time. Which means that if the leaf block in source directory has expanded 6 times, and you copy those entries in that block, by the time you need to expand the leaf in destination directory, you need to expand it 6 times in one go. So any limit on the retry will result in error where it shouldn't. Note that while we do use different salt for different directories, it seems that the salt/hash function doesn't provide enough randomization to the hash distance to prevent this from happening. Since `cc63068` has already been reverted. This patch adds it back and removes the retry limit. Also, as it turn out, failing on zap_add() has a serious side effect for mzap_upgrade(). When upgrading from micro zap to fat zap, it will call zap_add() to transfer entries one at a time. If it hit any error halfway through, the remaining entries will be lost, causing those files to become orphan. This patch add a VERIFY to catch it. Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7401 Closes #7421	2018-04-18 14:19:50 -07:00
Tom Caputi	b0ee5946aa	Fix issues with raw sends of spill blocks This patch fixes 2 issues in how spill blocks are processed during raw sends. The first problem is that compressed spill blocks were using the logical length rather than the physical length to determine how much data to dump into the send stream. The second issue is a typo that caused the spill record's object number to be used where the objset's ID number was required. Both issues have been corrected, and the payload_size is now printed in zstreamdump for future debugging. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7378 Closes #7432	2018-04-17 11:19:03 -07:00
Tom Caputi	e14a32b1c8	Fix object reclaim when using large dnodes Currently, when the receive_object() code wants to reclaim an object, it always assumes that the dnode is the legacy 512 bytes, even when the incoming bonus buffer exceeds this length. This causes a buffer overflow if --enable-debug is not provided and triggers an ASSERT if it is. This patch resolves this issue and adds an ASSERT to ensure this can't happen again. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7097 Closes #7433	2018-04-17 11:13:57 -07:00
Matthew Ahrens	0c03d21ac9	assertion in arc_release() during encrypted receive In the existing code, when doing a raw (encrypted) zfs receive, we call arc_convert_to_raw() from open context. This creates a race condition between arc_release()/arc_change_state() and writing out the block from syncing context (arc_write_ready/done()). This change makes it so that when we are doing a raw (encrypted) zfs receive, we save the crypt parameters (salt, iv, mac) of dnode blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw() from syncing context when writing out the block of dnodes. Additionally, we can eliminate dr_raw and associated setters, and instead know that dnode blocks are always raw when doing a zfs receive (see the new field os_raw_receive). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #7424 Closes #7429	2018-04-17 11:06:54 -07:00
Matthew Ahrens	7f96cc23ac	OpenZFS 9192 - explicitly pass good_writes to vdev_uberblock/label_sync Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume that its io_private is a pointer to the good_writes count. They should instead accept this argument explicitly. OpenZFS-issue: https://www.illumos.org/issues/9192 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f4c0b602d Closes #7446	2018-04-17 10:45:47 -07:00
Matt Ahrens	d830d4795a	OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices Authored by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9280 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c Closes #7445	2018-04-17 10:44:50 -07:00
megari	d68ac65eb6	Revert "OpenZFS 9036 - zfs: duplicate 'const' declaration specifier" This reverts commit `cbb8933215`. The original change in OpenZFS 9036 did remove duplicate 'const' specifiers, but the ZoL port had already done what should have been done in OpenZFS 9036, which is to make the pointers themselves const. The port of the change to ZoL ended up doing an unnecessary removal of the constness of the pointers. Undo that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Ari Sundholm <ari@tuxera.com> Closes #7444	2018-04-16 12:44:40 -07:00
Pavel Zakharov	9eb7b46ed0	OpenZFS 7638 - Refactor spa_load_impl into several functions Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7638 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fd3785ff6 Closes #7437	2018-04-16 12:24:23 -07:00
Tim Chase	5284f43a1e	Avoid Linux hung task message in ZTHR Use an interruptible to avoid Linux hung task message in ZTHR and to prevent inflating the load average. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7440 Closes #7441	2018-04-15 15:12:28 -07:00
Toomas Soome	5e567da987	OpenZFS 9213 - zfs: sytem typo Authored by: Toomas Soome <tsoome@me.com> Reviewed by: C Fraire <cfraire@me.com> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Joshua M. Clulow <josh@sysmgr.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: * The additional instances of this typo addressed in the OpenZFS patch were already resolved. OpenZFS-issue: https://illumos.org/issues/9213 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edc8ef7d92 Closes #7436	2018-04-15 10:59:13 -07:00
Toomas Soome	cbb8933215	OpenZFS 9036 - zfs: duplicate 'const' declaration specifier Authored by: Toomas Soome <tsoome@me.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9036 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f02c28e434 Closes #6900	2018-04-14 12:40:52 -07:00
Prakash Surya	eecdd8e884	OpenZFS 9084 - spa_*_ashift must ignore spare devices It's possible for the following assertion to be tripped when running ztest: assertion failed for thread 0xf09fca40, thread-id 549: spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9), file ../../../uts/common/fs/zfs/vdev_removal.c, line 965 > $c libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e) libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000) libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0) libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74, ebddef68, 8992dc0, ebe10a00, fef073c0) libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a) libc.so.1`_thrp_setup+0x8c(f09fca40) libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0) > ::spa -v ADDR STATE NAME 08723000 ACTIVE ztest ADDR STATE AUX DESCRIPTION 087466c0 HEALTHY - root 087450c0 HEALTHY - /rpool/tmp/ztest.0a 08745640 HEALTHY - indirect 08745bc0 HEALTHY - /rpool/tmp/ztest.2a 08746140 HEALTHY - /rpool/tmp/ztest.3a - - - spares 08744b40 HEALTHY - /rpool/tmp/ztest.spares.0 Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/9084 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/18acba7 Closes #6900	2018-04-14 12:40:52 -07:00
Serapheim Dimitropoulos	4bf8108ede	OpenZFS 9080 - recursive enter of vdev_indirect_rwlock from vdev_indirect_remap() Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9080 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bdfded42e6 Closes #6900	2018-04-14 12:40:47 -07:00
Serapheim Dimitropoulos	9d5b524597	OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs The timeline of the race condition is the following: [1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning). The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically tights condensing to scrubing when it comes to pausing and resuming those actions during spa export. This commit introduces the ZTHR infrastructure, which is basically threads created during spa_load()/spa_create() and exist until we export or destroy the pool. ZTHRs sleep the majority of the time, until they are notified to wake up and do some predefined type of work. In the context of the current bug, a zthr to does the condensing of indirect mappings replacing the older code that used bare kthreads. When a pool is created, the condensing zthr is spawned but sleeps right away, until it is awaken by a signal from spa_sync(). If an existing pool is loaded, the condensing zthr looks if there is anything to condense before going to sleep, in case we were condensing mappings in the pool before it got exported. The benefits of this solution are the following: - The current bug is fixed - spa_condensing_indirect is the sole indicator of whether we are currently condensing or not - condensing is more decoupled from the spa_async_thread related functionality. As a final note, this commit also sets up the path on upstreaming other features that use the ZTHR code like zpool checkpoint and fast clone deletion. Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9079 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee Closes #6900	2018-04-14 12:23:53 -07:00
Brian Behlendorf	4589f3ae4c	Optimize possible split block search space Remove duplicate segment copies to minimize the possible search space for reconstruction. Once reduced an accurate assessment can be made regarding the difficulty in reconstructing the block. Also, ztest will now run zdb with zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt to avoid checksum errors. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6900	2018-04-14 12:22:43 -07:00
Matthew Ahrens	9e052db462	OpenZFS 9290 - device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. Note that in no case is incorrect data returned, but we might get a checksum error when we should have been able to find the right data. There are two underlying problems: 1. When we remove a mirror device, we only read one side of the mirror. Since we can't verify the checksum, this side may be silently bad, but the good data is on the other side of the mirror (which we didn't read). This can cause the removal to "bake in" the busted data – all copies of the data in the new location are the same, busted version, while we left the good version behind. The fix for this is to read and copy both sides of the mirror. If the old and new vdevs are mirrors, we will read both sides of the old mirror, and write each copy to the corresponding side of the new mirror. (If the old and new vdevs have a different number of children, we will do this as best as possible.) Even though we aren't verifying checksums, this ensures that as long as there's a good copy of the data, we'll have a good copy after the removal, even if there's silent damage to one side of the mirror. If we're removing a mirror that has some silent damage, we'll have exactly the same damage in the new location (assuming that the new location is also a mirror). 2. When we read from an indirect vdev that points to a mirror vdev, we only consider one copy of the data. This can lead to reduced effective redundancy, because we might read a bad copy of the data from one side of the mirror, and not retry the other, good side of the mirror. Note that the problem is not with the removal process, but rather after the removal has completed (having copied correct data to both sides of the mirror), if one side of the new mirror is silently damaged, we encounter the problem when reading the relocated data via the indirect vdev. Also note that the problem doesn't occur when ZFS knows that one side of the mirror is bad, e.g. when a disk entirely fails or is offlined. The impact is that reads (from indirect vdevs that point to mirrors) may return a checksum error even though the good data exists on one side of the mirror, and scrub doesn't repair all data on the mirror (if some of it is pointed to via an indirect vdev). The fix for this is complicated by "split blocks" - one logical block may be split into two (or more) pieces with each piece moved to a different new location. In this case we need to read all versions of each split (one from each side of the mirror), and figure out which combination of versions results in the correct checksum, and then repair the incorrect versions. This ensures that we supply the same redundancy whether you use device removal or not. For example, if a mirror has small silent errors on all of its children, we can still reconstruct the correct data, as long as those errors are at sufficiently-separated offsets (specifically, separated by the largest block size - default of 128KB, but up to 16MB). Porting notes: * A new indirect vdev check was moved from dsl_scan_needs_resilver_cb() to dsl_scan_needs_resilver(), which was added to ZoL as part of the sequential scrub work. * Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t parameter. The extra parameter is unique to ZoL. * When posting indirect checksum errors the ABD can be passed directly, zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9290 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591 Closes #6900	2018-04-14 12:21:39 -07:00
Matthew Ahrens	a1d477c24c	OpenZFS 7614, 9064 - zfs device evacuation/removal OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alex Reece <alex@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900	2018-04-14 12:16:17 -07:00
Seth Forshee	93b43af10d	Allow mounting datasets more than once Currently mounting an already mounted zfs dataset results in an error, whereas it is typically allowed with other filesystems. This causes some bad interactions with mount namespaces. Take this sequence for example: - Create a dataset - Create a snapshot of the dataset - Create a clone of the snapshot - Create a new mount namespace - Rename the original dataset The rename results in unmounting and remounting the clone in the original mount namespace, however the remount fails because the dataset is still mounted in the new mount namespace. (Note that this means the mount in the new mount namespace is never being unmounted, so perhaps the unmount/remount of the clone isn't actually necessary.) The problem here is a result of the way mounting is implemented in the kernel module. Since it is not mounting block devices it uses mount_nodev() instead of the usual mount_bdev(). However, mount_nodev() is written for filesystems for which each mount is a new instance (i.e. a new super block), and zfs should be able to detect when a mount request can be satisfied using an existing super block. Change zpl_mount() to call sget() directly with it's own test callback. Passing the objset_t object as the fs data allows checking if a superblock already exists for the dataset, and in that case we just need to return a new reference for the sb's root dentry. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Closes #5796 Closes #7207	2018-04-13 10:44:05 -07:00
beren12	7403d0743e	Fix zfs_arc_max minimum tuning When setting `zfs_arc_max` its minimum value is allowed to be 64 MiB. There was an off-by-1 error which can matter on tiny systems. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Zubrzycki <github@mid-earth.net> Closes #7417	2018-04-12 10:47:32 -07:00
Tom Caputi	edc1e713c2	Fix race in dnode_check_slots_free() Currently, dnode_check_slots_free() works by checking dn->dn_type in the dnode to determine if the dnode is reclaimable. However, there is a small window of time between dnode_free_sync() in the first call to dsl_dataset_sync() and when the useraccounting code is run when the type is set DMU_OT_NONE, but the dnode is not yet evictable, leading to crashes. This patch adds the ability for dnodes to track which txg they were last dirtied in and adds a check for this before performing the reclaim. This patch also corrects several instances when dn_dirty_link was treated as a list_node_t when it is technically a multilist_node_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7147 Closes #7388	2018-04-10 11:15:05 -07:00
Giuseppe Di Natale	10f88c5cd5	Linux compat 4.16: blk_queue_flag_{set,clear} queue_flag_{set,clear}_unlocked are now private interfaces in the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14). Use blk_queue_flag_{set,clear} interfaces which were introduced as of https://github.com/torvalds/linux/commit/8814ce8. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #7410	2018-04-10 10:32:14 -07:00
Tony Hutter	4f301661df	Revert "Handle zap_add() failures in mixed ... " This reverts commit `cc63068e95`. Under certain circumstances this change can result in an ENOSPC error when adding new files to a directory. See #7401 for full details. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #7401 Cloes #7416	2018-04-09 14:24:46 -07:00
Brian Behlendorf	3b0d99289a	Fix 'zfs send/recv' hang with 16M blocks When using 16MB blocks the send/recv queue's aren't quite big enough. This change leaves the default 16M queue size which a good value for most pools. But it additionally ensures that the queue sizes are at least twice the allowed zfs_max_recordsize. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7365 Closes #7404	2018-04-08 19:41:15 -07:00
Matthew Ahrens	5c27ec1088	Fixes for SNPRINTF_BLKPTR with encrypted BP's mdb doesn't have dmu_ot[], so we need a different mechanism for its SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated. Additionally, since it already relies on BP_IS_ENCRYPTED (etc), SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own, rather than making the caller do so. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #7390	2018-04-06 13:30:26 -07:00
Olaf Faaland	0ba106e75c	Fix divide-by-zero in mmp_delay_update() vdev_count_leaves() in the denominator may return 0, caught by Coverity. Introduced by * `533ea04` Update mmp_delay on sync or skipped, failed write Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7391	2018-04-06 13:29:11 -07:00
Olaf Faaland	533ea0415b	Update mmp_delay on sync or skipped, failed write When an MMP write is skipped, or fails, and time since mts->mmp_last_write is already greater than mts->mmp_delay, increase mts->mmp_delay. The original code only updated mts->mmp_delay when a write succeeded, but this results in the write(s) after delays and failed write(s) reporting an ub_mmp_delay which is too low. Update mmp_last_write and mmp_delay if a txg sync was successful. At least one uberblock was written, thus extending the time we can be sure the pool will not be imported by another host. Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) / vdev_count_leaves()) so that a period of frequent successful MMP writes, e.g. due to frequent txg syncs, does not result in an import activity check so short it is not reliable based on mmp thread writes alone. Remove unnecessary local variable, start. We do not use the start time of the loop iteration. Add a debug message in spa_activity_check() to allow verification of the import_delay value and to prove the activity check occurred. Alter the tests that import pools and attempt to detect an activity check. Calculate the expected duration of spa_activity_check() based on module parameters at the time the import is performed, rather than a fixed time set in mmp.cfg. The fixed time may be wrong. Also, use the default zfs_multihost_interval value so the activity check is longer and easier to recognize. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7330	2018-04-04 16:38:44 -07:00
Tony Hutter	21a4f5cc86	Fedora 28: Fix misc bounds check compiler warnings Fix a bunch of (mostly) sprintf/snprintf truncation compiler warnings that show up on Fedora 28 (GCC 8.0.1). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7361 Closes #7368	2018-04-04 10:16:47 -07:00
LOLi	1724eb62de	Fix spa reference leak in zfs_ioc_pool_scan zfs_ioc_pool_scan leaks a spa reference when zc->zc_flags is not a valid pool_scrub_cmd_t: this could happen if the userland binaries and ZFS kernel module differ in version and would prevent the pool from being exported. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7380	2018-04-03 17:31:30 -07:00
Tim Chase	10adee27ce	Remove ASSERT() in l2arc_apply_transforms() The ASSERT was erroneously copied from the next section of code. The buffer's size should be expanded from "psize" to "asize" if necessary. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7375	2018-03-31 15:14:21 -07:00
Tom Caputi	a2c2ed1bd4	Decryption error handling improvements Currently, the decryption and block authentication code in the ZIO / ARC layers is a bit inconsistent with regards to the ereports that are produces and the error codes that are passed to calling functions. This patch ensures that all of these errors (which begin as ECKSUM) are converted to EIO before they leave the ZIO or ARC layer and that ereports are correctly generated on each decryption / authentication failure. In addition, this patch fixes a bug in zio_decrypt() where ECKSUM never gets written to zio->io_error. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7372	2018-03-31 11:12:51 -07:00
Tom Caputi	4515b1d01c	Encrypted dnode blocks should be prefetched raw Encrypted dnode blocks are always initially read as raw data and converted to decrypted data when an encrypted bonus buffer is needed. This allows the DMU to be used for things like fetching the DMU master node without requiring keys to be loaded. However, dbuf_issue_final_prefetch() does not currently read the data as raw. The end result of this is that prefetched dnode blocks are read twice from disk: once decrypted and then again as raw data. This patch corrects the issue by adding the flag when appropriate. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7362	2018-03-31 11:11:48 -07:00
LOLi	77d8a0f1a4	Fix hung z_zvol tasks during 'zfs receive' During a receive operation zvol_create_minors_impl() can wait needlessly for the prefetch thread because both share the same tasks queue. This results in hung tasks: <3>INFO: task z_zvol:5541 blocked for more than 120 seconds. <3> Tainted: P O 3.16.0-4-amd64 <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. The first z_zvol:5541 (zvol_task_cb) is waiting for the long running traverse_prefetch_thread:260 root@linux:~# cat /proc/spl/taskq taskq act nthr spwn maxt pri mina spl_system_taskq/0 1 2 0 64 100 1 active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40) wait: 5541 spl_delay_taskq/0 0 1 0 4 100 1 delay: spa_deadman [zfs](0xffff880039924000) z_zvol/1 1 1 0 1 120 1 active: [5541]zvol_task_cb [zfs](0xffff88001fde6400) pend: zvol_task_cb [zfs](0xffff88001fde6800) This change adds a dedicated, per-pool, prefetch taskq to prevent the traverse code from monopolizing the global (and limited) system_taskq by inappropriately scheduling long running tasks on it. Reviewed-by: Albert Lee <trisk@forkgnu.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6330 Closes #6890 Closes #7343	2018-03-30 12:10:01 -07:00
Andriy Gapon	5e00213e43	OpenZFS 9164 - assert: newds == os->os_dsl_dataset Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <don.brady@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: * Re-enabled and tweaked the zpool_upgrade_007_pos test case to successfully run in under 5 minutes. OpenZFS-issue: https://www.illumos.org/issues/9164 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a Closes #6112 Closes #7336	2018-03-30 12:00:40 -07:00
Tom Caputi	32dce2da0c	Resolve QAT issues with incompressible data Currently, when ZFS wants to accelerate compression with QAT, it passes a destination buffer of the same size as the source buffer. Unfortunately, if the data is incompressible, QAT can actually "compress" the data to be larger than the source buffer. When this happens, the QAT driver will return a FAILED error code and print warnings to dmesg. This patch fixes these issues by providing the QAT driver with an additional buffer to work with so that even completely incompressible source data will not cause an overflow. This patch also resolves an error handling issue where incompressible data attempts compression twice: once by QAT and once in software. To fix this issue, a new (and fake) error code CPA_STATUS_INOMPRESSIBLE has been added so that the calling code can correctly account for the difference between a hardware failure and data that simply cannot be compressed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7338	2018-03-29 17:40:34 -07:00
Tom Caputi	13a2ff2727	Fix ASSERT in dsl_scan_fini() and cleanup comments This patch fixes an issue where dsl_scan_prefetch_cb() might add more prefetch I/Os to the prefetch queue after prefetching has been completed. This was happening because that code was checking scn->scn_suspending instead of scn->scn_prefetch_stop. This occasionally triggered an ASSERT during ztest runs in dsl_scan_fini() when the code attempted to destroy an AVL tree that still had entires in it. This patch also includes a number of spelling corrections and comment cleanups throughout dsl_scan.c Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7353	2018-03-28 18:30:44 -07:00
Brian Behlendorf	b2ab468dde	Fix mmap / libaio deadlock Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7335 Closes #7339	2018-03-28 10:19:22 -07:00
Allan Jude	5152a74088	OpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value Authored by: Allan Jude <allanjude@freebsd.org> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9321 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18 Closes #7333	2018-03-26 20:40:15 -07:00
Tom Caputi	157ef7f6a5	Don't count embedded bps in read stats Currently, ZFS tracks statistics about calls to arc_read() via the /proc/spl/kstat/zfs/<pool>/reads file for debugging. Unfortunately, this file currently counts embedded bps as disk reads since they are technically processed by the ZIO layer. This pollutes the log since the ARC will never cache embedded bps. This patch corrects this issue by preventing the logging of embedded bp reads. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7334	2018-03-23 21:35:19 -07:00
Matthew Ahrens	2fd92c3d6c	enable zfs_dbgmsg() by default, without dprintf() zfs_dbgmsg() should record a message by default. As a general principal, these messages shouldn't be too verbose. Furthermore, the amount of memory used is limited to 4MB (by default). dprintf() should only record a message if this is a debug build, and ZFS_DEBUG_DPRINTF is set in zfs_flags. This flag is not set by default (even on debug builds). These messages are extremely verbose, and sometimes nontrivial to compute. SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set in zfs_flags. This flag is not set by default (even on debug builds). This brings our behavior in line with illumos. Note that the message format is unchanged (including file, line, and function, even though these are not recorded on illumos). Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #7314	2018-03-21 15:37:32 -07:00
Tom Caputi	8d9e7c8fbe	Fix spelling errors in comments This patch simply corrects some spelling / grammar errors in the QAT and encryption code comments. No functional changes Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7319	2018-03-21 08:42:13 -07:00
Brian Behlendorf	a76f3d0437	Fix deadlock in IO pipeline In vdev_queue_aggregate() the zio_execute() bypass should not be called under the vdev queue lock. This can result in a deadlock as shown in the stack traces below. Drop the vdev queue lock then walk the parents of the aggregate IO to determine the list of component IOs to be bypassed. This can be done safely without holding the io_lock since the new aggregate IO has not yet been returned and its parents cannot change. --- THREAD 1 --- arc_read() zio_nowait() zio_vdev_io_start() vdev_queue_io() <--- mutex_enter(vq->vq_lock) vdev_queue_io_to_issue() vdev_queue_aggregate() zio_execute() zio_vdev_io_assess() zio_wait_for_children() <- mutex_enter(zio->io_lock) --- THREAD 2 --- (inverse order) arc_read() zio_change_priority() <- mutex_enter(zio->zio_lock) vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock) Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7307	2018-03-16 16:46:06 -07:00
Tom Caputi	17dd88352e	Allow QAT to handle non page-aligned buffers This patch adds some handling to the QAT acceleration functions that allows them to handle buffers that are not aligned with the page cache. At the moment this never happens since callers only happen to work with page-aligned buffers, but this code should prevent headaches if this isn't always true in the future. This patch also adds some cleanups to align the QAT compression code with the encryption and checksumming code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7305	2018-03-16 16:02:49 -07:00
Olaf Faaland	cec3a0a1bb	Report pool suspended due to MMP When the pool is suspended, record whether it was due to an I/O error or due to MMP writes failing to succeed within the required time. Change spa_suspended from uint8_t to zio_suspend_reason_t to store the reason. When userspace queries pool status via spa_tryimport(), report the reason the pool was suspended in a new key, ZPOOL_CONFIG_SUSPENDED_REASON. In libzfs, when interpreting the returned config nvlist, report suspension due to MMP with a new pool status enum value, ZPOOL_STATUS_IO_FAILURE_MMP. In status_callback(), which generates and emits the message when 'zpool status' is executed, add a case to print an appropriate message for the new pool status enum value. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7296	2018-03-15 10:56:55 -07:00
Tom Caputi	3874220932	SHA256 QAT acceleration This patch enables acceleration of SHA256 checksums using Intel Quick Assist Technology. This patch also fixes up and refactors some of the code from QAT encryption to make the behavior consistent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7295	2018-03-15 10:53:58 -07:00
Paul Zuchowski	1a2342784a	receive_spill does not byte swap spill contents In zfs receive, the function receive_spill should account for spill block endian conversion as a defensive measure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7300	2018-03-15 10:29:51 -07:00
Brian Behlendorf	de4f8d5d26	OpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Porting Notes: * Added modules options to zfs-module-parameters.5 man page. * Preserved scaling based on target ARC size rather than max ARC size. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9188 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564 Upstream bug: DLPX-46942 Closes #7273	2018-03-13 10:52:48 -07:00
Brian Behlendorf	a6cc97566c	Add kernel module auto-loading Historically a dynamic misc minor number was registered for the /dev/zfs device in order to prevent minor number collisions. This was fine but it prevented us from being able to use the kernel module auto-loaded which requires a known reserved value. Resolve this issue by adding a configure test to find an available misc minor number which can then be used in MODULE_ALIAS_MISCDEV at build time. By adding this alias the zfs kmod is added to the list of known static-nodes and the systemd-tmpfiles-setup-dev service will create a /dev/zfs character device at boot time. This in turn allows us to update the 90-zfs.rules file to make it aware this is a static node. The upshot of this is that whenever a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be automatic loaded. This even works for unprivileged users so there is no longer a need to manually load the modules at boot time. As an additional bonus the zed now no longer needs to start after the zfs-import.service since it will trigger the module load. In the unlikely event the minor number we selected conflicts with another out of tree unregistered minor number the code falls back to dynamically allocating it. In this case the modules again must be manually loaded. Note that due to the change in the method of registering the minor number the zimport.sh test case may incorrectly fail when the static node for the installed packages is created instead of the dynamic one. This issue will only transiently impact zimport.sh for this single commit when we transition and are mixing and matching methods. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes" Closes #7287	2018-03-13 10:45:55 -07:00
Tim Chase	02638a30ef	Add zfs_scan_ignore_errors tunable When it's set, a DTL range will be cleared even if its scan/scrub had errors. This allows to work around resilver/scrub upon import when the pool has errors. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7293	2018-03-13 10:43:14 -07:00
Chunwei Chen	9470cbd4f9	Fix race in trace point in zrl_add_impl We hit an illegal memory access in the zrlock trace point. The problem is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if zrl->zr_owner got assigned a longer string between when __string() calculate the strlen, and when __assign_str() does strcpy. The copy will overflow the buffer. == For example: Initial condition: zrl->zr_owner = A zrl->zr_caller = "abc" Thread A Thread B ------------------------------------------------- if (zrl->zr_owner == A) { DTRACE_PROBE2() { __string() { strlen(zrl->zr_caller) -> 3 allocate buf[4] } zrl->zr_owner = B zrl->zr_caller = "abcd" __assign_str() { strcpy(buf, zrl->zr_caller) <- buffer overflow == Dereferencing zrl->zr_owner->pid may also be problematic, in that the zrl->zr_owner got changed to other task, and that task exits, freeing the task_struct. This should be very unlikely, as the other task need to zrl_remove and exit between the dereferencing zr->zr_owner and zr->zr_owner->pid. Nevertheless, we'll deal with it as well. To fix the zrl->zr_caller issue, instead of copy the string content, we just copy the pointer, this is safe because it always points to __func__, which is static. As for the zrl->zr_owner issue, we pass in curthread instead of using zrl->zr_owner. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7291	2018-03-12 11:27:02 -07:00
Brian Behlendorf	b7eec00f9f	Fix MMP write frequency for large pools When a single pool contains more vdevs than the CONFIG_HZ for for the kernel the mmp thread will not delay properly. Switch to using cv_timedwait_sig_hires() to handle higher resolution delays. This issue was reported on Arch Linux where HZ defaults to only 100 and this could be fairly easily reproduced with a reasonably large pool. Most distribution kernels set CONFIG_HZ=250 or CONFIG_HZ=1000 and thus are unlikely to be impacted. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7205 Closes #7289	2018-03-12 11:26:05 -07:00
Olaf Faaland	743253df70	Hold SCL_VDEV when counting leaves A config lock should be held while vdev_count_leaves() walks the tree, otherwise the pointers reference may become invalid during the walk. SCL_VDEV is a minimal lock provided for such uses cases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7286	2018-03-09 15:42:54 -08:00
Olaf Faaland	ebed90a598	Handle zio_resume and mmp => off When multihost is disabled on a pool, and the pool is resumed via zpool clear, within a single cycle of the mmp thread's loop (e.g. while it's in the cv_timedwait call), both mmp_last_write and mmp_delay should be updated. The original code mistakenly treated the two cases as if they could not occur at the same time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7286	2018-03-09 15:42:11 -08:00
Tom Caputi	cf63739191	QAT support for AES-GCM This patch adds support for acceleration of AES-GCM encryption with Intel Quick Assist Technology. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7282	2018-03-09 13:37:15 -08:00
Wolfgang Bumiller	0e85048f53	Take user namespaces into account in policy checks Change file related checks to use user namespaces and make sure involved uids/gids are mappable in the current namespace. Note that checks without file ownership information will still not take user namespaces into account, as some of these should be handled via 'zfs allow' (otherwise root in a user namespace could issue commands such as `zpool export`). This also adds an initial user namespace regression test for the setgid bit loss, with a user_ns_exec helper usable in further tests. Additionally, configure checks for the required user namespace related features are added for: * ns_capable * kuid/kgid_has_mapping() * user_ns in cred_t Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Closes #6800 Closes #7270	2018-03-07 15:40:42 -08:00
Olaf Faaland	d2160d0538	Record skipped MMP writes in multihost_history Once per pass through the MMP thread's loop, the vdev tree is walked to find a suitable leaf to write the next MMP block to. If no such leaf is found, the thread sleeps for a while and resumes at the top of the loop. Add an entry to multihost_history when no leaf can be found, and record the reason in the error column. The error code for such entries is a bitfield, displayed in hex: 0x1 At least one vdev (interior or leaf) was not writeable. 0x2 At least one writeable leaf vdev was found, but it had a pending MMP write. timestamp = the time in seconds since the epoch when no leaf could be found originally. duration = the time (in ns) during which no MMP block was written for this reason. This does not include the preceeding inter-write period nor the following inter-write period. vdev_guid = the number of sequential cycles of the MMP thread looop when this occurred. Sample output, truncated to fit: For records of skipped MMP writes the right-most column, vdev_path, is reported as "-". id txg timestamp error duration mmp_delay vdev_guid ... 936 11 1520036441 0 146264 891422313 1740883117838 ... 937 11 1520036441 0 163956 888356657 7320395061548 ... 938 11 1520036442 0 130690 885314969 7320395061548 ... 939 11 1520036442 0 2001068577 882296582 1740883117838 ... 940 11 1520036443 0 161806 882296582 7320395061548 ... 941 11 1520036443 0x2 0 998020546 1 ... 942 11 1520036444 0 136585 998020546 7320395061548 ... 943 11 1520036444 0x2 0 998020257 1 ... 944 11 1520036445 5 2002662964 994160219 1740883117838 ... 945 11 1520036445 0x2 998073118 994160219 3 ... 946 11 1520036447 0 247136 994160219 7320395061548 ... Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7212	2018-03-06 15:15:15 -08:00
Olaf Faaland	14c240cede	Detect long config lock acquisition in mmp If something holds the config lock as a writer for too long, MMP will fail to issue MMP writes in a timely manner. This will result either in the pool being suspended, or in an extreme case, in the pool not being protected. If the time to acquire the config lock exceeds 1/10 of the minimum zfs_multihost_interval, report it in the zfs debug log. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7212	2018-03-06 15:14:39 -08:00
Nasf-Fan	2705ebf0a7	Misc fixes and cleanup for project quota 1) The Coverity Scan reports some issues for the project quota patch, including: 1.1) zfs_prop_get_userquota() directly uses the const quota type value as the condition check by wrong. 1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid to be overwritten by dnode::dn->dn_oldprojid. 2) This patch fixes related issues. It also enhances the logic for zfs_project_item_alloc() to avoid buffer overflow. 3) Skip project quota ability check if does not change project quota related things (id or flag). Otherwise, it will cause chattr (for other non project quota flags) operation failed if project quota disabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> Closes #7251 Closes #7265	2018-03-05 12:56:27 -08:00
Giuseppe Di Natale	dd3e1e3083	Linux 4.16 compat: get_disk_and_module() As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk() is now get_disk_and_module(). Add a configure check to determine if we need to use get_disk_and_module(). Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #7264	2018-03-05 12:44:35 -08:00
Tony Hutter	80d52c3919	Change checksum & IO delay ratelimit values Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec. This allows zed to actually trigger if a bunch of these events arrive in a short period of time (zed has a threshold of 10 events in 10 sec). Previously, if you had, say, 100 checksum errors in 1 sec, it would get ratelimited to 5/sec which wouldn't trigger zed to fault the drive. Also, convert the checksum and IO delay thresholds to module params for easy testing. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7252	2018-03-04 17:34:51 -08:00
chrisrd	5666a994f2	Increment zil_itx_needcopy_bytes properly In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw) into the log write block (lwb) and send it off using zil_lwb_add_txg(). If we also have WR_NEED_COPY, we additionally copy the lwr's data into the lwb to be sent off. If the lwr + data doesn't fit into the lwb, we send the lrw and as much data as will fit (dnow bytes), then go back and do the same with the remaining data. Each time through this loop we're sending dnow data bytes. I.e. zil_itx_needcopy_bytes should be incremented by dnow. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #6988 Closes #7176	2018-03-02 10:01:53 -08:00
Tom Caputi	095495e008	Raw DRR_OBJECT records must write raw data `b1d21733` made it possible for empty metadnode blocks to be compressed to a hole, fixing a bug that would cause invalid metadnode MACs when a send stream attempted to free objects and allowing the blocks to be reclaimed when they were no longer needed. However, this patch also introduced a race condition; if a txg sync occurred after a DRR_OBJECT_RANGE record was received but before any objects were added, the metadnode block would be compressed to a hole and lose all of its encryption parameters. This would cause subsequent DRR_OBJECT records to fail when they attempted to write their data into an unencrypted block. This patch defers the DRR_OBJECT_RANGE handling to receive_object() so that the encryption parameters are set with each object that is written into that block. Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7215 Closes #7236	2018-02-27 09:04:05 -08:00
Brian Behlendorf	3673d03285	Fix more cstyle warnings This patch contains no functional changes. It is solely intended to resolve cstyle warnings in order to facilitate moving the spl source code in to the zfs repository. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #687	2018-02-24 10:05:37 -08:00
chrisrd	e9a7729008	Fix free memory calculation on v3.14+ Provide infrastructure to auto-configure to enum and API changes in the global page stats used for our free memory calculations. arc_free_memory has been broken since an API change in Linux v3.14: 2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node 2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node vmstats These commits moved some of global_page_state() into global_node_page_state(). The API change was particularly egregious as, instead of breaking the old code, it silently did the wrong thing and we continued using global_page_state() where we should have been using global_node_page_state(), thus indexing into the wrong array via NR_SLAB_RECLAIMABLE et al. There have been further API changes along the way: 2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to node counters 2017-09-06 v4.14 c41f012a mm: rename global_page_state to global_zone_page_state ...and various (incomplete, as it turns out) attempts to accomodate these changes in ZoL: 2017-08-24 `2209e409` Linux 4.8+ compatibility fix for vm stats 2017-09-16 `787acae0` Linux 3.14 compat: IO acct, global_page_state, etc 2017-09-19 `661907e6` Linux 4.14 compat: IO acct, global_page_state, etc The config infrastructure provided here resolves these issues going back to the original API change in v3.14 and is robust against further Linux changes in this area. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #7170	2018-02-23 08:50:06 -08:00
Olaf Faaland	7088545d01	Report duration and error in mmp_history entries After an MMP write completes, update the relevant mmp_history entry with the time between submission and completion, and the error status of the write. [faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost 39 0 0x01 100 8800 69147946270893 72723903122926 id txg timestamp error duration mmp_delay vdev_guid 10607 1166 1518985089 0 138301 637785455 4882... 10608 1166 1518985089 0 136154 635407747 1151... 10609 1166 1518985089 0 803618560 633048078 9740... 10610 1166 1518985090 0 144826 633048078 4882... 10611 1166 1518985090 0 164527 666187671 1151... Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and error = zio->io_error. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7190	2018-02-22 15:34:34 -08:00
Olaf Faaland	0d398b2564	Do not initiate MMP writes while pool is suspended While the pool is suspended on host A, it may be imported on host B. If host A continued to write MMP blocks, it would be blindly overwriting MMP blocks written by host B, and the blocks written by host A would have outdated txg information. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7182	2018-02-22 09:14:46 -08:00
Tom Caputi	f8478fc2ca	Fix bounds check in zio_crypt_do_objset_hmacs The current bounds check in zio_crypt_do_objset_hmacs() does not properly handle the possible sizes of the objset_phys_t and can therefore read outside the buffer's memory. If that memory happened to match what the check was actually looking for, the objset would fail to be owned, complaining that the MAC was invalid. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7210	2018-02-22 08:50:14 -08:00
Tomohiro Kusumi	378c6ed549	Fix function name typos vn_init() and vn_fini() had been renamed by `12ff95ff` in 2011. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #686	2018-02-21 15:12:10 -08:00
Tomohiro Kusumi	68386b0503	Staticize kstat_default_update() This is only used via ->ks_update of `kstat_t *`. This isn't exported nor do headers have its prototype. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #686	2018-02-21 15:11:38 -08:00
Toomas Soome	09302a4ca8	OpenZFS 9035 - zfs: this statement may fall through Authored by: Toomas Soome <tsoome@me.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9035 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/46ac8fdfc5 Closes #7206	2018-02-21 14:55:34 -08:00
Tom Caputi	b0918402dc	Raw receive should change key atomically Currently, raw zfs sends transfer the encrypted master keys and objset_phys_t encryption parameters in the DRR_BEGIN payload of each send file. Both of these are processed as soon as they are read in dmu_recv_stream(), meaning that the new keys are set before the new snapshot is received. In addition to the fact that this changes the user's keys for the dataset earlier than they might expect, the keys were never reset to what they originally were in the event that the receive failed. This patch splits the processing into objset handling and key handling, the later of which is moved to dmu_recv_end() so that they key change can be done atomically. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7200	2018-02-21 12:31:03 -08:00
Tom Caputi	4a385862b7	Prevent raw zfs recv -F if dataset is unencrypted The current design of ZFS encryption only allows a dataset to have one DSL Crypto Key at a time. As a result, it is important that the zfs receive code ensures that only one key can be in use at a time for a given DSL Directory. zfs receive -F complicates this, since the new dataset is received as a clone of the existing one so that an atomic switch can be done at the end. To prevent confusion about which dataset is actually encrypted a check was added to ensure that encrypted datasets cannot use zfs recv -F to completely replace existing datasets. Unfortunately, the check did not take into account unencrypted datasets being overriden by encrypted ones as a case. Along the same lines, the code also failed to ensure that raw recieves could not be done on top of existing unencrypted datasets, which causes amny problems since the new stream cannot be decrypted. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7199	2018-02-21 12:30:11 -08:00
Tom Caputi	b1d217338a	Raw receives must compress metadnode blocks Currently, the DMU relies on ZIO layer compression to free LO dnode blocks that no longer have objects in them. However, raw receives disable all compression, meaning that these blocks can never be freed. In addition to the obvious space concerns, this could also cause incremental raw receives to fail to mount since the MAC of a hole is different from that of a completely zeroed block. This patch corrects this issue by adding a special case in zio_write_compress() which will attempt to compress these blocks to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also removes the zfs_mdcomp_disable tunable, since tuning it could cause these same issues. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7198	2018-02-21 12:28:52 -08:00
Tom Caputi	5121c4fb0c	Remove unnecessary txg syncs from receive_object() `1b66810b` introduced serveral changes which improved the reliability of zfs sends when large dnodes were involved. However, these fixes required adding a few calls to txg_wait_synced() in the DRR_OBJECT handling code. Although most of them are currently necessary, this patch allows the code to continue without waiting in some cases where it doesn't have to. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7197	2018-02-21 12:26:51 -08:00
Tom Caputi	478b3150de	Add omitted set for os->os_next_write_raw This one line patch adds adds a set to os->os_next_write_raw that was omitted when the code was updated in `1b66810`. Without it, the code (in some instances) could attempt to write raw encrypted data as regular unencrypted data without the keys being loaded, triggering an ASSERT in zio_encrypt(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7196	2018-02-21 12:24:37 -08:00
Tom Caputi	163a8c28dd	ZIL claiming should not start user accounting Currently, ZIL claiming dirties objsets which causes dsl_pool_sync() to attempt to perform user accounting on them. This causes problems for encrypted datasets that were raw received before the system went offline since they cannot perform user accounting until they have their keys loaded. This triggers an ASSERT in zio_encrypt(). Since encryption was added, the code now depends on the fact that data should only be written when objsets are owned. This patch adds a check in dmu_objset_do_userquota_updates() to ensure that useraccounting is only done when the objsets are actually owned for write. As part of this work, the zfsvfs and zvol code was updated so that it no longer lies about owning objsets readonly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6916 Closes #7163	2018-02-20 16:27:31 -08:00
Don Brady	cbce581353	Fix coverity defects: zfs channel programs CID 173243, 173245: Memory - corruptions (OVERRUN) Added size argument to lcompat_sprintf() to avoid use of INT_MAX CID 173244: Integer handling issues (OVERFLOW_BEFORE_WIDEN) Added cast to uint64_t to avoid a 32 bit overflow warning CID 173242: Integer handling issues (CONSTANT_EXPRESSION_RESULT) Conditionally removed unused luai_numisnan() floating point check CID 173241: Resource leaks (RESOURCE_LEAK) Added missing close(fd) on error path CID 173240: (UNINIT) Fixed uninitialized variable in get_special_prop() CID 147560: Null pointer dereferences (NULL_RETURNS) Cleaned up bad code merge in dsl_dataset_promote_check() CID 28475: Memory - illegal accesses (OVERRUN) Fixed lcompat_sprintf() to use a size paramater CID 28418, 28422: Error handling issues (CHECKED_RETURN) Added function result cast to (void) to avoid warning CID 23935, 28411, 28412: Memory - corruptions (ARRAY_VS_SINGLETON) Added casts to avoid exposing result as an array Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #7181	2018-02-20 11:19:42 -08:00
Tom Caputi	7b30ee6baf	Project dnode should be protected by local MAC This patch corrects a small security issue with `9c5167d1`. When the project dnode was added to the objset_phys_t, it was not included in the local MAC for cryptographic protection, allowing an attacker to modify this data without the consent of the key holder. This patch does represent an on-disk format change for anyone using project dnodes on an encrypted dataset. Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7177	2018-02-20 09:41:07 -08:00
Don Brady	62d5c55313	Address objtool check failures in lua module The use of void __attribute__((noreturn)) in kernel builds was causing lots of warnings if CONFIG_STACK_VALIDATION is active. For now we just remove this attribute to achieve clean builds for the Lua module. There was no significant increase in the time to run the full channel_program ZTS tests. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #7173	2018-02-15 09:53:04 -08:00
George Wilson	ddc751d56b	OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio PROBLEM ======= It's possible for a parent zio to complete even though it has children which have not completed. This can result in the following panic: > $C ffffff01809128c0 vpanic() ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80) ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80) ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908) ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0) ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0) ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140) ffffff0180912b40 thread_start+8() > ::status debugging crash dump vmcore.2 (64-bit) from batfs0390 operating system: 5.11 joyent_20170911T171900Z (i86pc) image uuid: (not set) panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80 owner=ffffff3c59b39480 thread=ffffff0180912c40 dump content: kernel pages only The problem is that dbuf_prefetch along with l2arc can create a zio tree which confuses the parent zio and allows it to complete with while children still exist. Here's the scenario: zio tree: pio \|--- lio The parent zio, pio, has entered the zio_done stage and begins to check its children to see there are still some that have not completed. In zio_done(), the children are checked in the following order: zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE) If pio, finds any child which has not completed then it stops executing and goes to sleep. Each call to zio_wait_for_children() will grab the io_lock while checking the particular child. In this scenario, the pio has completed the first call to zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since the only zio in the zio tree right now is the logical zio, lio, then it completes that call and prepares to check the next child type. In the meantime, the lio completes and in its callback creates a child vdev zio, cio. The zio tree looks like this: zio tree: pio \|--- lio \|--- cio The lio then grabs the parent's io_lock and removes itself. zio tree: pio \|--- cio The pio continues to run but has already completed its check for ZIO_CHILD_VDEV and will erroneously complete. When the child zio, cio, completes it will panic the system trying to reference the parent zio which has been destroyed. SOLUTION ======== The fix is to rework the zio_wait_for_children() logic to accept a bitfield for all the children types that it's interested in checking. The io_lock will is held the entire time we check all the children types. Since the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided to allow for the conversion between a ZIO_CHILD type and the bitfield used by the zio_wiat_for_children logic. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Youzhong Yang <youzhong@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8857 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c Issue #5918 Closes #7168	2018-02-14 15:30:09 -08:00
Nasf-Fan	9c5167d19f	Project Quota on ZFS Project quota is a new ZFS system space/object usage accounting and enforcement mechanism. Similar as user/group quota, project quota is another dimension of system quota. It bases on the new object attribute - project ID. Project ID is a numerical value to indicate to which project an object belongs. An object only can belong to one project though you (the object owner or privileged user) can change the object project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly. The object also can inherit the project ID from its parent when created if the parent has the project inherit flag (that can be set via 'chattr +P' or 'zfs project -s [-p]'). By accounting the spaces/objects belong to the same project, we can know how many spaces/objects used by the project. And if we set the upper limit then we can control the spaces/objects that are consumed by such project. It is useful when multiple groups and users cooperate for the same project, or a user/group needs to participate in multiple projects. Support the following commands and functionalities: zfs set projectquota@project zfs set projectobjquota@project zfs get projectquota@project zfs get projectobjquota@project zfs get projectused@project zfs get projectobjused@project zfs projectspace zfs allow projectquota zfs allow projectobjquota zfs allow projectused zfs allow projectobjused zfs unallow projectquota zfs unallow projectobjquota zfs unallow projectused zfs unallow projectobjused chattr +/-P chattr -p project_id lsattr -p This patch also supports tree quota based on the project quota via "zfs project" commands set as following: zfs project [-d\|-r] <file\|directory ...> zfs project -C [-k] [-r] <file\|directory ...> zfs project -c [-0] [-d\|-r] [-p id] <file\|directory ...> zfs project [-p id] [-r] [-s] <file\|directory ...> For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on the $DIR, then the proejct [obj]quota and [obj]used values for the $DIR's project ID will be shown as the total/free (avail) resource. Keep the same behavior as EXT4/XFS does. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by Ned Bass <bass6@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master" Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c Closes #6290	2018-02-13 14:54:54 -08:00
sanjeevbagewadi	918dbe35b5	mmp should use a fixed tag for spa_config locks mmp_write_uberblock() and mmp_write_done() should the same tag for spa_config_locks. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #6530 Closes #7155	2018-02-12 11:30:38 -08:00
Andriy Gapon	1334283225	OpenZFS 8520 - lzc_rollback 8520 lzc_rollback_to should support rolling back to origin 7198 libzfs should gracefully handle EINVAL from lzc_rollback lzc_rollback_to() should support rolling back to a clone's origin. The current checks in zfs_ioc_rollback() would not allow that because the origin snapshot belongs to a different filesystem. The overly restrictive check was in introduced in 7600, but it was not a regression as none of the existing tools provided a way to rollback to the origin. Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8520 OpenZFS-issue: https://www.illumos.org/issues/7198 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/78a5a1a25a Closes #7150	2018-02-09 10:27:58 -08:00
sanjeevbagewadi	cc63068e95	Handle zap_add() failures in mixed case mode With "casesensitivity=mixed", zap_add() could fail when the number of files/directories with the same name (varying in case) exceed the capacity of the leaf node of a Fatzap. This results in a ASSERT() failure as zfs_link_create() does not expect zap_add() to fail. The fix is to handle these failures and rollback the transactions. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #7011 Closes #7054	2018-02-09 10:15:53 -08:00
Chunwei Chen	f108a49236	Fix zle_decompress out of bound access Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-02-09 10:08:05 -08:00
Chunwei Chen	d3190c5f29	Fix zdb -c traverse stop on damaged objset root If a corruption happens to be on a root block of an objset, zdb -c will not correctly report the error, and it will not traverse the datasets that come after. This is because traverse_visitbp, which does the callback and reset error for TRAVERSE_HARD, is skipped when traversing zil is failed in traverse_impl. Here's example of what 'zdb -eLcc' command looks like on a pool with damaged objset root: == before patch: Traversing all blocks to verify checksums ... Error counts: errno count block traversal size 379392 != alloc 33987072 (unreachable 33607680) bp count: 172 ganged count: 0 bp logical: 1678336 avg: 9757 bp physical: 130560 avg: 759 compression: 12.85 bp allocated: 379392 avg: 2205 compression: 4.42 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 71 Dittoed blocks on same vdev: 101 == after patch: Traversing all blocks to verify checksums ... zdb_blkptr_cb: Got error 52 reading <54, 0, -1, 0> -- skipping Error counts: errno count 52 1 block traversal size 33963520 != alloc 33987072 (unreachable 23552) bp count: 447 ganged count: 0 bp logical: 36093440 avg: 80745 bp physical: 33699840 avg: 75391 compression: 1.07 bp allocated: 33963520 avg: 75981 compression: 1.06 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 76 Dittoed blocks on same vdev: 115 == Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-02-09 10:05:25 -08:00
Brian Behlendorf	18f57327e0	Linux 4.16 compat: inode_set_iversion() A new interface was added to manipulate the version field of an inode. Add a inode_set_iversion() wrapper for older kernels and use the new interface when available. The i_version field was dropped from the trace point due to the switch to an atomic64_t i_version type. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7148	2018-02-08 21:25:19 -08:00
Serapheim Dimitropoulos	5b72a38d68	OpenZFS 8677 - Open-Context Channel Programs Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> We want to be able to run channel programs outside of synching context. This would greatly improve performance for channel programs that just gather information, as they won't have to wait for synching context anymore. === What is implemented? This feature introduces the following: - A new command line flag in "zfs program" to specify our intention to run in open context. (The -n option) - A new flag/option within the channel program ioctl which selects the context. - Appropriate error handling whenever we try a channel program in open-context that contains zfs.sync* expressions. - Documentation for the new feature in the manual pages. === How do we handle zfs.sync functions in open context? When such a function is found by the interpreter and we are running in open context we abort the script and we spit out a descriptive runtime error. For example, given the script below ... arg = ... fs = arg["argv"][1] err = zfs.sync.destroy(fs) msg = "destroying " .. fs .. " err=" .. err return msg if we run it in open context, we will get back the following error: Channel program execution failed: [string "channel program"]:3: running functions from the zfs.sync submodule requires passing sync=TRUE to lzc_channel_program() (i.e. do not specify the "-n" command line argument) stack traceback: [C]: in function 'destroy' [string "channel program"]:3: in main chunk === What about testing? We've introduced new wrappers for all channel program tests that run each channel program as both (startard & open-context) and expect the appropriate behavior depending on the program using the zfs.sync module. OpenZFS-issue: https://www.illumos.org/issues/8677 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15 Closes #6558	2018-02-08 16:05:57 -08:00
Serapheim Dimitropoulos	8d103d8856	OpenZFS 8604 - Simplify snapshots unmounting code Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> Every time we want to unmount a snapshot (happens during snapshot deletion or renaming) we unnecessarily iterate through all the mountpoints in the VFS layer (see zfs_get_vfs). The current patch completely gets rid of that code and changes the approach while keeping the behavior of that code path the same. Specifically, it puts a hold on the dataset/snapshot and gets its vfs resource reference directly, instead of linearly searching for it. If that reference exists we attempt to amount it. With the above change, it became obvious that the nvlist manipulations that we do (add_boolean and add_nvlist) take a significant amount of time ensuring uniqueness of every new element even though they don't have too. Thus, we updated the patch so those nvlists are not trying to enforce the uniqueness of their elements. A more complete analysis of the problem solved by this patch can be found below: https://sdimitro.github.io/post/snap-unmount-perf/ OpenZFS-issue: https://www.illumos.org/issues/8604 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/126118fb	2018-02-08 15:29:44 -08:00
Don Brady	fc5d4b6737	Increase code coverage for Lua libraries Add test coverage for lua libraries Remove dead code in Lua implementation Signed-off-by: Don Brady <don.brady@delphix.com>	2018-02-08 15:29:38 -08:00
Chris Williamson	234c91c508	OpenZFS 8600 - ZFS channel programs - snapshot Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> ZFS channel programs should be able to create snapshots. In addition to the base snapshot functionality, this entails extra logic to handle edge cases which were formerly not possible, such as creating then destroying a snapshot in the same transaction sync. OpenZFS-issue: https://www.illumos.org/issues/8600 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b	2018-02-08 15:29:24 -08:00
Brad Lewis	af07368986	OpenZFS 8592 - ZFS channel programs - rollback Authored by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> ZFS channel programs should be able to perform a rollback. OpenZFS-issue: https://www.illumos.org/issues/8592 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6	2018-02-08 15:29:14 -08:00
Chris Williamson	475eca4908	OpenZFS 8605 - zfs channel programs fix zfs.exists Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> zfs.exists() in channel programs doesn't return any result, and should have a man page entry. This patch corrects zfs.exists so that it returns a value indicating if the dataset exists or not. It also adds documentation about it in the man page. OpenZFS-issue: https://www.illumos.org/issues/8605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1e85e111	2018-02-08 15:28:52 -08:00
Chris Williamson	d99a015343	OpenZFS 7431 - ZFS Channel Programs Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Don Brady <don.brady@delphix.com> Ported-by: John Kennedy <john.kennedy@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/7431 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533 Porting Notes: * The CLI long option arguments for '-t' and '-m' don't parse on linux * Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc * Lua implementation is built as its own module (zlua.ko) * Lua headers consumed directly by zfs code moved to 'include/sys/lua/' * There is no native setjmp/longjump available in stock Linux kernel. Brought over implementations from illumos and FreeBSD * The get_temporary_prop() was adapted due to VFS platform differences * Use of inline functions in lua parser to reduce stack usage per C call * Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow	2018-02-08 15:28:18 -08:00
WHR	8824a7f133	OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable Authored by: WHR <msl0000023508@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8966 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95549fcdc Closes #7141	2018-02-08 10:33:45 -08:00

1 2 3 4 5 ...

2330 Commits