mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-15 16:04:52 +03:00

Author	SHA1	Message	Date
Paul Dagnelie	66c8b2f65a	Don't activate metaslabs with weight 0 We return ENOSPC in metaslab_activate if the metaslab has weight 0, to avoid activating a metaslab with no space available. For sanity checking, we also assert that there is no free space in the range tree in that case. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8968	2020-01-22 13:48:57 -08:00
Paul Dagnelie	350646563f	Concurrent small allocation defeats large allocation With the new parallel allocators scheme, there is a possibility for a problem where two threads, allocating from the same allocator at the same time, conflict with each other. There are two primary cases to worry about. First, another thread working on another allocator activates the same metaslab that the first thread was trying to activate. This results in the first thread needing to go back and reselect a new metaslab, even though it may have waited a long time for this metaslab to load. Second, another thread working on the same allocator may have activated a different metaslab while the first thread was waiting for its metaslab to load. Both of these cases can cause the first thread to be significantly delayed in issuing its IOs. The second case can also cause metaslab load/unload churn; because the metaslab is loaded but not fully activated, we never set the selected_txg, which results in the metaslab being immediately unloaded again. This process can repeat many times, wasting disk and cpu resources. This is more likely to happen when the IO of the first thread is a larger one (like a ZIL write) and the other thread is doing a smaller write, because it is more likely to find an acceptable metaslab quickly. There are two primary changes. The first is to always proceed with the allocation when returning from metaslab_activate if we were preempted in either of the ways described in the previous section. The second change is to set the selected_txg before we do the call to activate so that even if the metaslab is not used for an allocation, we won't immediately attempt to unload it. Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-61314 Closes #8843	2020-01-22 13:48:56 -08:00
loli10K	d47ee5ad1c	Fix bp_embedded_type enum definition With the addition of BP_EMBEDDED_TYPE_REDACTED in `30af21b0` a couple of codepaths make wrong assumptions and could potentially result in errors. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8951 Conflicts: include/sys/spa.h	2020-01-22 13:48:56 -08:00
Don Brady	e625030c11	OpenZFS 9425 - channel programs can be interrupted Problem Statement ================= ZFS Channel program scripts currently require a timeout, so that hung or long-running scripts return a timeout error instead of causing ZFS to get wedged. This limit can currently be set up to 100 million Lua instructions. Even with a limit in place, it would be desirable to have a sys admin (support engineer) be able to cancel a script that is taking a long time. Proposed Solution ================= Make it possible to abort a channel program by sending an interrupt signal.In the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to catch the signal. Once a signal is encountered, the dsl_sync_task function can install a Lua hook that will get called before the Lua interpreter executes a new line of code. The dsl_sync_task can resume with a standard txg_wait_sync call and wait for the txg to complete. Meanwhile, the hook will abort the script and indicate that the channel program was canceled. The kernel returns a EINTR to indicate that the channel program run was canceled. Porting notes: Added missing return value from cv_wait_sig() Authored by: Don Brady <don.brady@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/9425 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926 Closes #8904	2020-01-22 13:48:56 -08:00
Matthew Ahrens	cbb9154958	looping in metaslab_block_picker impacts performance on fragmented pools On fragmented pools with high-performance storage, the looping in metaslab_block_picker() can become the performance-limiting bottleneck. When looking for a larger block (e.g. a 128K block for the ZIL), we may search through many free segments (up to hundreds of thousands) to find one that is large enough to satisfy the allocation. This can take a long time (up to dozens of ms), and is done while holding the ms_lock, which other threads may spin waiting for. When this performance problem is encountered, profiling will show high CPU time in metaslab_block_picker, as well as in mutex_enter from various callers. The problem is very evident on a test system with a sync write workload with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage, 84% full and 88% fragmented. It has also been observed on production systems with 90TB of storage, 76% full and 87% fragmented. The fix is to change metaslab_df_alloc() to search only up to 16MB from the previous allocation (of this alignment). After that, we will pick a segment that is of the exact size requested (or larger). This reduces the number of iterations to a few hundred on fragmented pools (a ~100x improvement). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-62324 Closes #8877	2020-01-22 13:48:56 -08:00
Matthew Ahrens	8805abb8fc	single-chunk scatter ABDs can be treated as linear Scatter ABD's are allocated from a number of pages. In contrast to linear ABD's, these pages are disjoint in the kernel's virtual address space, so they can't be accessed as a contiguous buffer. Therefore routines that need a linear buffer (e.g. abd_borrow_buf() and friends) must allocate a separate linear buffer (with zio_buf_alloc()), and copy the contents of the pages to/from the linear buffer. This can have a measurable performance overhead on some workloads. https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e ("abd_alloc should use scatter for >1K allocations") increased the use of scatter ABD's, specifically switching 1.5K through 4K (inclusive) buffers from linear to scatter. For workloads that access blocks whose compressed sizes are in this range, that commit introduced an additional copy into the read code path. For example, the sequential_reads_arc_cached tests in the test suite were reduced by around 5% (this is doing reads of 8K-logical blocks, compressed to 3K, which are cached in the ARC). This commit treats single-chunk scattered buffers as linear buffers, because they are contiguous in the kernel's virtual address space. All single-page (4K) ABD's can be represented this way. Some multi-page ABD's can also be represented this way, if we were able to allocate a single "chunk" (higher-order "page" which represents a power-of-2 series of physically-contiguous pages). This is often the case for 2-page (8K) ABD's. Representing a single-entry scatter ABD as a linear ABD has the performance advantage of avoiding the copy (and allocation) in abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of around 5% has been observed for ARC-cached reads (of small blocks which can take advantage of this), fixing the regression introduced by `87c25d567`. Note that this optimization is only possible because all physical memory is always mapped into the kernel's address space. This is not the case for HIGHMEM pages, so the optimization can not be made on 32-bit systems. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8580	2020-01-22 13:48:56 -08:00
Matthew Ahrens	bee5738f77	make zil max block size tunable We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8865	2020-01-22 13:48:56 -08:00
Kody A Kantor	c37fa0d5a8	Disabled resilver_defer feature leads to looping resilvers When a disk is replaced with another on a pool with the resilver_defer feature present, but not enabled the resilver activity restarts during each spa_sync. This patch checks to make sure that the resilver_defer feature is first enabled before requesting a deferred resilver. This was originally fixed in illumos-joyent as OS-7982. Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Signed-off-by: Kody A Kantor <kody@kkantor.com> External-issue: illumos-joyent OS-7982 Closes #9299 Closes #9338	2019-09-25 11:27:51 -07:00
Andriy Gapon	12a78fbb4f	Fix dsl_scan_ds_clone_swapped logic The was incorrect with respect to swapping dataset IDs both in the on-disk ZAP object and the in-memory queue. In both cases, if ds1 was already present, then it would be first replaced with ds2 and then ds would be replaced back with ds1. Also, both cases did not properly handle a situation where both ds1 and ds2 are already queued. A duplicate insertion would be attempted and its failure would result in a panic. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #9140 Closes #9163	2019-09-25 11:27:51 -07:00
loli10K	63d8f57fe7	Scrubbing root pools may deadlock on kernels without elevator_change() (#9321 ) Originally the zfs_vdev_elevator module option was added as a convenience so the requested elevator would be automatically set on the underlying block devices. At the time this was simple because the kernel provided an API function which did exactly this. This API was then removed in the Linux 4.12 kernel which prompted us to add compatibly code to set the elevator via a usermodehelper. Unfortunately changing the evelator via usermodehelper requires reading some userland binaries, most notably modprobe(8) or sh(1), from a zfs dataset on systems with root-on-zfs. This can deadlock the system if used during the following call path because it may need, if the data is not already cached in the ARC, reading directly from disk while holding the spa config lock as a writer: zfs_ioc_pool_scan() -> spa_scan() -> spa_scan() -> vdev_reopen() -> vdev_elevator_switch() -> call_usermodehelper() While the usermodehelper waits sh(1), modprobe(8) is blocked in the ZIO pipeline trying to read from disk: INFO: task modprobe:2650 blocked for more than 10 seconds. Tainted: P OE 5.2.14 modprobe D 0 2650 206 0x00000000 Call Trace: ? __schedule+0x244/0x5f0 schedule+0x2f/0xa0 cv_wait_common+0x156/0x290 [spl] ? do_wait_intr_irq+0xb0/0xb0 spa_config_enter+0x13b/0x1e0 [zfs] zio_vdev_io_start+0x51d/0x590 [zfs] ? tsd_get_by_thread+0x3b/0x80 [spl] zio_nowait+0x142/0x2f0 [zfs] arc_read+0xb2d/0x19d0 [zfs] ... zpl_iter_read+0xfa/0x170 [zfs] new_sync_read+0x124/0x1b0 vfs_read+0x91/0x140 ksys_read+0x59/0xd0 do_syscall_64+0x4f/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This commit changes how we use the usermodehelper functionality from synchronous (UMH_WAIT_PROC) to asynchronous (UMH_NO_WAIT) which prevents scrubs, and other vdev_elevator_switch() consumers, from triggering the aforementioned issue. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Issue #8664 Closes #9321	2019-09-25 11:27:51 -07:00
Chengfei ZHu	9fa8b5b55b	QAT related bug fixes 1. Fix issue: Kernel BUG with QAT during decompression #9276. Now it is uninterruptible for a specific given QAT request, but Ctrl-C interrupt still works in user-space process. 2. Copy the digest result to the buffer only when doing encryption, and vise-versa for decryption. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com> Closes #9276 Closes #9303	2019-09-25 11:27:51 -07:00
Brian Behlendorf	97d4986214	Fix /etc/hostid on root pool deadlock Accidentally introduced by `dc04a8c` which now takes the SCL_VDEV lock as a reader in zfs_blkptr_verify(). A deadlock can occur if the /etc/hostid file resides on a dataset in the same pool. This is because reading the /etc/hostid file may occur while the caller is holding the SCL_VDEV lock as a writer. For example, to perform a `zpool attach` as shown in the abbreviated stack below. To resolve the issue we cache the system's hostid when initializing the spa_t, or when modifying the multihost property. The cached value is then relied upon for subsequent accesses. Call Trace: spa_config_enter+0x1e8/0x350 [zfs] zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock zio_read+0x6c/0x140 [zfs] ... vfs_read+0xfc/0x1e0 kernel_read+0x50/0x90 ... spa_get_hostid+0x1c/0x38 [zfs] spa_config_generate+0x1a0/0x610 [zfs] vdev_label_init+0xa0/0xc80 [zfs] vdev_create+0x98/0xe0 [zfs] spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9256 Closes #9285	2019-09-25 11:27:51 -07:00
Andriy Gapon	beb21db3c6	Always refuse receving non-resume stream when resume state exists This fixes a hole in the situation where the resume state is left from receiving a new dataset and, so, the state is set on the dataset itself (as opposed to %recv child). Additionally, distinguish incremental and resume streams in error messages. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #9252	2019-09-25 11:27:51 -07:00
loli10K	13e5e396a3	Fix Intel QAT / ZFS compatibility on v4.7.1+ kernels This change use the compat code introduced in `9cc1844a`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9268 Closes #9269	2019-09-25 11:27:51 -07:00
Chunwei Chen	c7a4255f12	Fix zil replay panic when TX_REMOVE followed by TX_CREATE If TX_REMOVE is followed by TX_CREATE on the same object id, we need to make sure the object removal is completely finished before creation. The current implementation relies on dnode_hold_impl with DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work fine before, in current version it does not guarantee the object removal is completed. We fix this by checking if DNODE_MUST_BE_FREE returns successful instead. Also add test and remove dead code in dnode_hold_impl. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7151 Closes #8910 Closes #9123 Closes #9145	2019-09-25 11:27:51 -07:00
Andriy Gapon	931bef81c8	zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets Previously, the permissions were checked on the pool which was obviously incorrect. After this change, zfs_check_userprops() only validates the properties without any permission checks. The permissions are checked individually for each snapshotted dataset. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #9179 Closes #9180	2019-09-25 11:27:50 -07:00
Tom Caputi	95319fc569	Fix deadlock in 'zfs rollback' Currently, the 'zfs rollback' code can end up deadlocked due to the way the kernel handles unreferenced inodes on a suspended fs. Essentially, the zfs_resume_fs() code path may cause zfs to spawn new threads as it reinstantiates the suspended fs's zil. When a new thread is spawned, the kernel may attempt to free memory for that thread by freeing some unreferenced inodes. If it happens to select inodes that are a a part of the suspended fs a deadlock will occur because freeing inodes requires holding the fs's z_teardown_inactive_lock which is still held from the suspend. This patch corrects this issue by adding an additional reference to all inodes that are still present when a suspend is initiated. This prevents them from being freed by the kernel for any reason. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9203	2019-09-25 11:27:50 -07:00
Tony Hutter	023ab67a64	Linux 5.3: Fix switch() fall though compiler errors Fix some switch() fall-though compiler errors: abd.c:1504:9: error: this statement may fall through Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #9170	2019-09-25 11:27:50 -07:00
Dominic Pearson	65469f6e30	Linux 5.3 compat: Makefile subdir-m no longer supported Uses obj-m instead, due to kernel changes. See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Dominic Pearson <dsp@technoanimal.net> Closes #9169	2019-09-25 11:27:50 -07:00
Chunwei Chen	569f5d5d05	Fix out-of-order ZIL txtype lost on hardlinked files We should only call zil_remove_async when an object is removed. However, in current implementation, it is called whenever TX_REMOVE is called. In the case of hardlinked file, every unlink will generate TX_REMOVE and causing operations to be dropped even when the object is not removed. We fix this by only calling zil_remove_async when the file is fully unlinked. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #8769 Closes #9061	2019-09-25 11:27:50 -07:00
Matthew Ahrens	6c9882d5db	Improve performance by using dmu_tx_hold__by_dnode() In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold__by_dnode() instead of dmu_tx_hold_*(), since we already have a dbuf from the target dnode in hand. This eliminates some calls to dnode_hold(), which can be expensive. This is especially impactful if several threads are accessing objects that are in the same block of dnodes, because they will contend for that dbuf's lock. We are seeing 10-20% performance wins for the sequential_writes tests in the performance test suite, when doing >=128K writes to files with recordsize=8K. This also removes some unnecessary casts that are in the area. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #9081	2019-09-25 11:27:50 -07:00
Brian Behlendorf	8c00159411	Fix channel programs on s390x When adapting the original sources for s390x the JMP_BUF_CNT was mistakenly halved due to an incorrect assumption of the size of a unsigned long. They are 8 bytes for the s390x architecture. Increase JMP_BUF_CNT accordingly. Authored-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Colin Ian King <canonical.com> Tested-by: Colin Ian King <canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8992 Closes #9080	2019-09-25 11:27:50 -07:00
Tomohiro Kusumi	6c68594675	Implement secpolicy_vnode_setid_retain() Don't unconditionally return 0 (i.e. retain SUID/SGID). Test CAP_FSETID capability. https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t which expects SUID/SGID to be dropped on write(2) by non-owner fails without this. Most filesystems make this decision within VFS by using a generic file write for fops. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9035 Closes #9043	2019-09-25 11:27:50 -07:00
Tomohiro Kusumi	4f951b183c	Don't directly cast unsigned long to void* Cast to uintptr_t first for portability on integer to/from pointer conversion. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9065	2019-09-25 11:27:50 -07:00
Tomohiro Kusumi	65a0b28b42	Fix module_param() type for zfs_read_chunk_size zfs_read_chunk_size is unsigned long. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9051	2019-09-25 11:27:50 -07:00
Serapheim Dimitropoulos	1c4b0fc745	Race condition between spa async threads and export In the past we've seen multiple race conditions that have to do with open-context threads async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue #9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating this way any race condition bugs introduced by concurrent calls to this function. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #9015 Closes #9044	2019-09-25 11:27:50 -07:00
Serapheim Dimitropoulos	bbbe4b0a98	hdr_recl calls zthr_wakeup() on destroyed zthr There exists a race condition were hdr_recl() calls zthr_wakeup() on a destroyed zthr. The timeline is the following: [1] hdr_recl() runs first and goes intro zthr_wakeup() because arc_initialized is set. [2] arc_fini() is called by another thread, zeroes that flag, destroying the zthr, and goes into buf_init(). [3] hdr_recl() tries to enter the destroyed mutex and we blow up. This patch ensures that the ARC's zthrs are not offloaded any new work once arc_initialized is set and then destroys them after all of the ARC state has been deleted. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #9047	2019-09-25 11:27:50 -07:00
Tomohiro Kusumi	3c144b9267	Fix wrong comment on zcr_blksz_{min,max} These aren't tunable; illumos has this comment fixed in "3742 zfs comments need cleaner, more consistent style", so sync with that. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9052	2019-09-25 11:27:50 -07:00
Brian Behlendorf	428a63cc62	Retire unused spl_{mutex,rwlock}_{init_fini} These functions are unused and can be removed along with the spl-mutex.c and spl-rwlock.c source files. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9029	2019-09-25 11:27:49 -07:00
Brian Behlendorf	3982d959c5	Linux 5.3 compat: retire rw_tryupgrade() The Linux kernel's rwsem's have never provided an interface to allow a reader to be upgraded to a writer. Historically, this functionality has been implemented by a SPL wrapper function. However, this approach depends on internal knowledge of the rw_semaphore and is therefore rather brittle. Since the ZFS code must always be able to fallback to rw_exit() and rw_enter() when an rw_tryupgrade() fails; this functionality isn't critical. Furthermore, the only potentially performance sensitive consumer is dmu_zfetch() and no decrease in performance was observed with this change applied. See the PR comments for additional testing details. Therefore, it is being retired to make the build more robust and to simplify the rwlock implementation. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9029	2019-09-25 11:27:49 -07:00
Brian Behlendorf	54561073e7	Linux 5.3 compat: rw_semaphore owner Commit https://github.com/torvalds/linux/commit/94a9717b updated the rwsem's owner field to contain additional flags describing the rwsem's state. Rather then update the wrappers to mask out these bits, the code no longer relies on the owner stored by the kernel. This does increase the size of a krwlock_t but it makes the implementation less sensitive to future kernel changes. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9029	2019-09-25 11:27:49 -07:00
jdike	4c98586daf	Fix lockdep recursive locking false positive in dbuf_destroy lockdep reports a possible recursive lock in dbuf_destroy. It is true that dbuf_destroy is acquiring the dn_dbufs_mtx on one dnode while holding it on another dnode. However, it is impossible for these to be the same dnode because, among other things,dbuf_destroy checks MUTEX_HELD before acquiring the mutex. This fix defines a class NESTED_SINGLE == 1 and changes that lock to call mutex_enter_nested with a subclass of NESTED_SINGLE. In order to make the userspace code compile, include/sys/zfs_context.h now defines mutex_enter_nested and NESTED_SINGLE. This is the lockdep report: [ 122.950921] ============================================ [ 122.950921] WARNING: possible recursive locking detected [ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O [ 122.950921] -------------------------------------------- [ 122.950921] dbu_evict/1457 is trying to acquire lock: [ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] but task is already holding lock: [ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] other info that might help us debug this: [ 122.950921] Possible unsafe locking scenario: [ 122.950921] CPU0 [ 122.950921] ---- [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] * DEADLOCK * [ 122.950921] May be due to missing lock nesting notation [ 122.950921] 1 lock held by dbu_evict/1457: [ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] stack backtrace: [ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1 [ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009 [ 122.950921] Call Trace: [ 122.950921] dump_stack+0x91/0xeb [ 122.950921] __lock_acquire+0x2ca7/0x4f10 [ 122.950921] lock_acquire+0x153/0x330 [ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs] [ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs] [ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs] [ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs] [ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs] [ 122.950921] taskq_thread+0x979/0x1480 [spl] [ 122.950921] kthread+0x2e7/0x3e0 [ 122.950921] ret_from_fork+0x27/0x50 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jeff Dike <jdike@akamai.com> Closes #8984	2019-09-25 11:27:49 -07:00
Michael Niewöhner	ceb516ac2f	Add missing __GFP_HIGHMEM flag to vmalloc Make use of __GFP_HIGHMEM flag in vmem_alloc, which is required for some 32-bit systems to make use of full available memory. While kernel versions >=4.12-rc1 add this flag implicitly, older kernels do not. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #9031	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	2b9f73e5e6	Use zfsctl_snapshot_hold() wrapper zfs_refcount_() are to be wrapped by zfsctl_snapshot_() in this file. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9039	2019-09-25 11:27:49 -07:00
Brian Behlendorf	984bfb373f	Minor style cleanup Resolve an assortment of style inconsistencies including use of white space, typos, capitalization, and line wrapping. There is no functional change. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9030	2019-09-25 11:27:49 -07:00
Brian Behlendorf	446d08fba4	Fix get_special_prop() build failure The cast of the size_t returned by strlcpy() to a uint64_t by the VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE is set. This is due to the additional hardening. Since the token is expected to always fit in strval the VERIFY3U has been removed. If somehow it doesn't, it will still be safely truncated. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8999 Closes #9020	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	73e50a7d5d	Drop redundant POSIX ACL check in zpl_init_acl() ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(), so no need to test again on POSIX ACL access. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9009	2019-09-25 11:27:49 -07:00
Brian Behlendorf	d751b12a9d	Export dnode symbols External consumers such as Lustre require access to the dnode interfaces in order to correctly manipulate dnodes. Reviewed-by: James Simmons <uja.ornl@yahoo.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8994 Closes #9027	2019-09-25 11:27:49 -07:00
Tom Caputi	78831d4290	Ensure dsl_destroy_head() decrypts objsets This patch corrects a small issue where the dsl_destroy_head() code that runs when the async_destroy feature is disabled would not properly decrypt the dataset before beginning processing. If the dataset is not able to be decrypted, the optimization code now simply does not run and the dataset is completely destroyed in the DSL sync task. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9021	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	0a223246e1	Disable unused pathname::pn_path* (unneeded in Linux) struct pathname is originally from Solaris VFS, and it has been used in ZoL to merely call VOP from Linux VFS interface without API change, therefore pathname::pn_path* are unused and unneeded. Technically, struct pathname is a wrapper for C string in ZoL. Saves stack a bit on lookup and unlink. (#if0'd members instead of removing since comments refer to them.) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9025	2019-09-25 11:27:49 -07:00
Nick Mattis	cf966cb19a	Fixes: #8934 Large kmem_alloc Large allocation over the spl_kmem_alloc_warn value was being performed. Switched to vmem_alloc interface as specified for large allocations. Changed the subsequent frees to match. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: nmattis <nickm970@gmail.com> Closes #8934 Closes #9011	2019-09-25 11:27:49 -07:00
Tom Caputi	1f72a18f59	Remove VERIFY from dsl_dataset_crypt_stats() This patch fixes an issue where dsl_dataset_crypt_stats() would VERIFY that it was able to hold the encryption root. This function should instead silently continue without populating the related field in the nvlist, as is the convention for this code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8976	2019-09-25 11:27:49 -07:00
Paul Zuchowski	14a11bf2f6	Improve "Unable to automount" error message. Having the mountpoint and dataset name both in the message made it confusing to read. Additionally, convert this to a zfs_dbgmsg rather than sending it to the console. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #8959	2019-09-25 11:27:49 -07:00
Brian Behlendorf	7a03d7c73c	Check b_freeze_cksum under ZFS_DEBUG_MODIFY conditional The b_freeze_cksum field can only have data when ZFS_DEBUG_MODIFY is set. Therefore, the EQUIV check must be wrapped accordingly. For the same reason the ASSERT in arc_buf_fill() in unsafe. However, since it's largely redundant it has simply been removed. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8979	2019-09-25 11:27:49 -07:00
Tomohiro Kusumi	093bb64461	Don't use d_path() for automount mount point for chroot'd process Chroot'd process fails to automount snapshots due to realpath(3) failure in mount.zfs(8). Construct a mount point path from sb of the ctldir inode and dirent name, instead of from d_path(), so that chroot'd process doesn't get affected by its view of fs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8903 Closes #8966	2019-09-25 11:27:48 -07:00
George Wilson	7d2489cfad	nopwrites on dmu_sync-ed blocks can result in a panic After device removal, performing nopwrites on a dmu_sync-ed block will result in a panic. This panic can show up in two ways: 1. an attempt to issue an IOCTL in vdev_indirect_io_start() 2. a failed comparison of zio->io_bp and zio->io_bp_orig in zio_done() To resolve both of these panics, nopwrites of blocks on indirect vdevs should be ignored and new allocations should be performed on concrete vdevs. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #8957	2019-09-25 11:27:48 -07:00
Alexander Motin	04d4df89f4	Avoid extra taskq_dispatch() calls by DMU DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes and os_synced_dnodes. Since the number of sublists by default is equal to number of CPUs, it will dispatch equal, potentially large, number of tasks, waking up many CPUs to handle them, even if only one or few of sublists actually have any work to do. This change adds check for empty sublists to avoid this. Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8909	2019-09-25 11:27:48 -07:00
Tom Caputi	bfe5f029cf	Fix error message on promoting encrypted dataset This patch corrects the error message reported when attempting to promote a dataset outside of its encryption root. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8905 Closes #8935	2019-09-25 11:27:48 -07:00
Brian Behlendorf	cc7fe8a599	Fix out-of-tree build failures Resolve the incorrect use of srcdir and builddir references for various files in the build system. These have crept in over time and went unnoticed because when building in the top level directory srcdir and builddir are identical. With this change it's again possible to build in a subdirectory. $ mkdir obj $ cd obj $ ../configure $ make Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8921 Closes #8943	2019-09-25 11:27:48 -07:00
Matthew Ahrens	7d64595c25	dn_struct_rwlock can not be held in dmu_tx_try_assign() The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock while assigning the tx, because this can lead to deadlock. Specifically, if this dnode is already assigned to an earlier txg, this thread may need to wait for that txg to sync (the ERESTART case below). The other thread that has assigned this dnode to an earlier txg prevents this txg from syncing until its tx can complete (calling dmu_tx_commit()), but it may need to acquire the dn_struct_rwlock to do so (e.g. via dmu_buf_hold*()). This commit adds an assertion to dmu_tx_try_assign() to ensure that this deadlock is not inadvertently introduced. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8929	2019-09-25 11:27:48 -07:00
Paul Dagnelie	1fd28bd8d4	Add SCSI_PASSTHROUGH to zvols to enable UNMAP support When exporting ZVOLs as SCSI LUNs, by default Windows will not issue them UNMAP commands. This reduces storage efficiency in many cases. We add the SCSI_PASSTHROUGH flag to the zvol's device queue, which lets the SCSI target logic know that it can handle SCSI commands. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8933	2019-09-25 11:27:48 -07:00
Tomohiro Kusumi	ab24c9cd4c	Prevent pointer to an out-of-scope local variable `show_str` could be a pointer to a local variable in stack which is out-of-scope by the time `return (snprintf(buf, buflen, "%s\n", show_str));` is called. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8924 Closes #8940	2019-09-25 11:27:48 -07:00
Matthew Ahrens	3c2a42fd25	dedup=verify doesn't clear the blkptr's dedup flag The logic to handle strong checksum collisions where the data doesn't match is incorrect. It is not clearing the dedup bit of the blkptr, which can cause a panic later in zio_ddt_free() due to the dedup table not matching what is in the blkptr. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-48097 Closes #8936	2019-09-25 11:27:48 -07:00
Igor K	9af524b0ee	Update vdev_ops_t from illumos Align vdev_ops_t from illumos for better compatibility. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org> Closes #8925	2019-09-25 11:27:48 -07:00
Tom Caputi	b96ceeead2	Allow unencrypted children of encrypted datasets When encryption was first added to ZFS, we made a decision to prevent users from creating unencrypted children of encrypted datasets. The idea was to prevent users from inadvertently leaving some of their data unencrypted. However, since the release of 0.8.0, some legitimate reasons have been brought up for this behavior to be allowed. This patch simply removes this limitation from all code paths that had checks for it and updates the tests accordingly. Reviewed-by: Jason King <jason.king@joyent.com> Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8737 Closes #8870	2019-09-25 11:27:48 -07:00
Alexander Motin	ed7b0d357a	Minimize aggsum_compare(&arc_size, arc_c) calls. For busy ARC situation when arc_size close to arc_c is desired. But then it is quite likely that aggsum_compare(&arc_size, arc_c) will need to flush per-CPU buckets to find exact comparison result. Doing that often in a hot path penalizes whole idea of aggsum usage there, since it replaces few simple atomic additions with dozens of lock acquisitions. Replacing aggsum_compare() with aggsum_upper_bound() in code increasing arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles allows to save ~5% of CPU time in aggsum code during sequential write to 12 ZVOLs with 16KB block size on large dual-socket system. I suppose there some minor arc_p behavior change due to lower precision of the new code, but I don't think it is a big deal, since it should affect only very small window in time (aggsum buckets are flushed every second) and in ARC size (buckets are limited to 10 average ARC blocks per CPU). Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8901	2019-09-25 11:27:47 -07:00
Matthew Ahrens	6083f40387	panic in removal_remap test on 4K devices If the zfs_remove_max_segment tunable is changed to be not a multiple of the sector size, then the device removal code will malfunction and try to create mappings that are smaller than one sector, leading to a panic. On debug bits this assertion will fail in spa_vdev_copy_segment(): ASSERT3U(DVA_GET_ASIZE(&dst), ==, size); On nondebug, the system panics with a stack like: metaslab_free_concrete() metaslab_free_impl() metaslab_free_impl_cb() vdev_indirect_remap() free_from_removing_vdev() metaslab_free_impl() metaslab_free_dva() metaslab_free() Fortunately, the default for zfs_remove_max_segment is 1MB, so this can't occur by default. We hit it during this test because removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on 4KB-sector disks, we hit the bug. This change makes the zfs_remove_max_segment tunable more robust, automatically rounding it up to a multiple of the sector size. We also turn some key assertions into VERIFY's so that similar bugs would be caught before they are encoded on disk (and thus avoid a panic-reboot-loop). Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-61342 Closes #8893	2019-09-25 11:27:47 -07:00
Matthew Ahrens	592ee2e6dd	compress metadata in later sync passes Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable compression (including of metadata). Ostensibly this helps the sync passes to converge (i.e. for a sync pass to not need to allocate anything because it is 100% overwrites). However, in practice it increases the average number of sync passes, because when we turn compression off, a lot of block's size will change and thus we have to re-allocate (not overwrite) them. It also increases the number of 128KB allocations (e.g. for indirect blocks and spacemaps) because these will not be compressed. The 128K allocations are especially detrimental to performance on highly fragmented systems, which may have very few free segments of this size, and may need to load new metaslabs to satisfy 128K allocations. We should increase zfs_sync_pass_dont_compress. In practice on a highly fragmented system we see a few 5-pass txg's, a tiny number of 6-pass txg's, and no txg's with more than 6 passes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-63431 Closes #8892	2019-09-25 11:27:47 -07:00
Alexander Motin	cab7d856ea	Move write aggregation memory copy out of vq_lock Memory copy is too heavy operation to do under the congested lock. Moving it out reduces congestion by many times to almost invisible. Since the original zio removed from the queue, and the child zio is not executed yet, I don't see why would the copy need protection. My guess it just remained like this from the time when lock was not dropped here, which was added later to fix lock ordering issue. Multi-threaded sequential write tests with both HDD and SSD pools with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major reduction of lock congestion, saving from 15% to 35% of CPU time and increasing throughput from 10% to 40%. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8890	2019-09-25 11:27:47 -07:00
Tulsi Jain	19cebf0518	Restrict filesystem creation if name referred either '.' or '..' This change restricts filesystem creation if the given name contains either '.' or '..' Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: TulsiJain <tulsi.jain@delphix.com> Closes #8842 Closes #8564	2019-09-25 11:27:47 -07:00
Matthew Ahrens	77e64c6fff	ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread() When running zloop, we occasionally see the following crash: dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0) ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3] The error value 0x1c is ENOSPC. The transaction used by spa_vdev_remove_thread() should not be able to fail due to being out of space. i.e. we should not call dmu_tx_hold_space(). This will allow the removal thread to schedule its work even when the pool is low on space. The "slop space" will provide enough free space to sync out the txg. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-37853 Closes #8889	2019-09-25 11:27:47 -07:00
Tomohiro Kusumi	4f809bddc6	Fix lockdep warning on insmod sysfs_attr_init() is required to make lockdep happy for dynamically allocated sysfs attributes. This fixed #8868 on Fedora 29 running kernel-debug. This requirement was introduced in 2.6.34. See include/linux/sysfs.h for what it actually does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8868 Closes #8884	2019-09-25 11:27:47 -07:00
Matthew Ahrens	516a08ebb4	fat zap should prefetch when iterating When iterating over a ZAP object, we're almost always certain to iterate over the entire object. If there are multiple leaf blocks, we can realize a performance win by issuing reads for all the leaf blocks in parallel when the iteration begins. For example, if we have 10,000 snapshots, "zfs destroy -nv pool/fs@1%9999" can take 30 minutes when the cache is cold. This change provides a >3x performance improvement, by issuing the reads for all ~64 blocks of each ZAP object in parallel. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-58347 Closes #8862	2019-09-25 11:27:47 -07:00
Matthew Ahrens	812c36fc71	Target ARC size can get reduced to arc_c_min Sometimes the target ARC size is reduced to arc_c_min, which impacts performance. We've seen this happen as part of the random_reads performance regression test, where the ARC size is reduced before the reads test starts which impacts how long it takes for system to reach good IOPS performance. We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE, and arc_available_memory() is less than arc_c>>arc_shrink_shift. However, arc_available_memory() could easily be low, even when arc_c is low, because we can have tons of unused bufs in the abd kmem cache. This would be especially true just after the DMU requests a bunch of stuff be evicted from the ARC (e.g. due to "zpool export"). To fix this, the ARC should reduce arc_c by the requested amount, not all the way down to arc_size (or arc_c_min), which can be very small. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-59431 Closes #8864	2019-09-25 11:27:47 -07:00
bnjf	fe11968bbf	Fix typo in vdev_raidz_math.c Fix typo in vdev_raidz_math.c Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brad Forschinger <github@bnjf.id.au> Closes #8875 Closes #8880	2019-09-25 11:27:47 -07:00
Paul Dagnelie	6f7bc75825	Allow metaslab to be unloaded even when not freed from On large systems, the memory used by loaded metaslabs can become a concern. While range trees are a fairly efficient data structure, on heavily fragmented pools they can still consume a significant amount of memory. This problem is amplified when we fail to unload metaslabs that we aren't using. Currently, we only unload a metaslab during metaslab_sync_done; in order for that function to be called on a given metaslab in a given txg, we have to have dirtied that metaslab in that txg. If the dirtying was the result of an allocation, we wouldn't be unloading it (since it wouldn't be 8 txgs since it was selected), so in effect we only unload a metaslab during txgs where it's being freed from. We move the unload logic from sync_done to a new function, and call that function on all metaslabs in a given vdev during vdev_sync_done(). Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8837	2019-09-25 11:27:47 -07:00
Allan Jude	60cbc18136	l2arc_apply_transforms: Fix typo in comment Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #8822	2019-09-25 11:27:46 -07:00
Serapheim Dimitropoulos	b63ed49c29	Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold Historically while doing performance testing we've noticed that IOPS can be significantly reduced when all vdevs in the pool are hitting the zfs_mg_fragmentation_threshold percentage. Specifically in a hypothetical pool with two vdevs, what can happen is the following: Vdev A would go above that threshold and only vdev B would be used. Then vdev B would pass that threshold but vdev A would go below it (we've been freeing from A to allocate to B). The allocations would go back and forth utilizing one vdev at a time with IOPS taking a hit. Empirically, we've seen that our vdev selection for allocations is good enough that fragmentation increases uniformly across all vdevs the majority of the time. Thus we set the threshold percentage high enough to avoid hitting the speed bump on pools that are being pushed to the edge. We effectively disable its effect in the majority of the cases but we don't remove (at least for now) just in case we hit any weird behavior in the future. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8859	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	fafe72712a	Drop objid argument in zfs_znode_alloc() (sync with OpenZFS) Since zfs_znode_alloc() already takes dmu_buf_t*, taking another uint64_t argument for objid is redundant. inode's ->i_ino does and needs to match znode's ->z_id. zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument since vnode doesn't have vnode# in VFS (hence ->z_id exists). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8841	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	328c95e391	Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod) Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading zfs.ko. The rest of initialization functions being called here after cwd set to / don't depend on cwd of the process except for spa_config_load(). spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when `rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /, so just unconditionally use the absolute path without "./", so that `vn_set_pwd("/")` as well as the entire functions can be removed. This is also what FreeBSD does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8826	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	e5a877c5d0	Update descriptions for vnops These descriptions are not uptodate with the code. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8767	2019-09-25 11:27:46 -07:00
Tomohiro Kusumi	4933b0a25b	Drop local definition of MOUNT_BUSY It's accessible via <sys/mntent.h>. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8765	2019-09-25 11:27:46 -07:00
Rafael Kitover	e0cd6c28a3	kernel timer API rework In `config/kernel-timer.m4` refactor slightly to check more generally for the new `timer_setup()` APIs, but also check the callback signature because some kernels (notably 4.14) have the new `timer_setup()` API but use the old callback signature. Also add a check for a `flags` member in `struct timer_list`, which was added in 4.1-rc8. Add compatibility shims to `include/spl/sys/timer.h` to allow using the new timer APIs with the only two caveats being that the callback argument type must be declared as `spl_timer_list_t` and an explicit assignment is required to get the timer variable for the `timer_of()` macro. So the callback would look like this: ```c __cv_wakeup(spl_timer_list_t t) { struct timer_list tmr = (struct timer_list )t; struct thing parent = from_timer(parent, tmr, parent_timer_field); ... / do stuff with parent */ ``` Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the new timer APIs instead of conditional code. Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rafael Kitover <rkitover@gmail.com> Closes #8647	2019-09-25 11:27:46 -07:00
Alexander Motin	72888812b0	Fix comparison signedness in arc_is_overflowing() When ARC size is very small, aggsum_lower_bound(&arc_size) may return negative values, that due to unsigned comparison caused delays, waiting for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use of signed comparison there fixes the problem. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8873	2019-06-11 10:45:23 -07:00
Tom Caputi	581c77e725	Fix incorrect error message for raw receive This patch fixes an incorrect error message that comes up when doing a non-forcing, raw, incremental receive into a dataset that has a newer snapshot than the "from" snapshot. In this case, the current code prints a confusing message about an IVset guid mismatch. This functionality is supported by non-raw receives as an undocumented feature, but was never supported by the raw receive code. If this is desired in the future, we can probably figure out a way to make it work. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #8758 Closes #8863	2019-06-11 10:45:17 -07:00
Tom Caputi	8dc8bbde6e	Reinstate raw receive check when truncating This patch re-adds a check that was removed in `369aa50`. The check confirms that a raw receive is not occuring before truncating an object's dn_maxblkid. At the time, it was believed that all cases that would hit this code path would be handled in other places, but that was not the case. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8852 Closes #8857	2019-06-07 12:45:40 -07:00
Tom Caputi	35050ef39e	Fix integer overflow of ZTOI(zp)->i_generation The ZFS on-disk format stores each inode's generation ID as a 64 bit number on disk and in-core. However, the Linux kernel's inode is only a 32 bit number. In most places, the code handles this correctly, but the cast is missing in zfs_rezget(). For many pools, this isn't an issue since the generation ID is computed as the current txg when the inode is created and many pools don't have more than 2^32 txgs. For the pools that have more txgs, this issue causes any inode with a high enough generation number to report IO errors after a call to "zfs rollback" while holding the file or directory open. This patch simply adds the missing cast. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8858	2019-06-07 12:45:40 -07:00
DeHackEd	58b2de6420	Wait in 'S' state when send/recv pipe is blocking Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #8733 Closes #8752	2019-06-07 12:45:40 -07:00
TulsiJain	11ad06d1d8	Make zfs_async_block_max_blocks handle zero correctly Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: TulsiJain <tulsi.jain@delphix.com> Closes #8829 Closes #8289	2019-06-07 12:45:40 -07:00
Brian Behlendorf	4f8eef29e0	Revert "Report holes when there are only metadata changes" This reverts commit `ec4f9b8f30` which introduced a narrow race which can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO. Resolve the issue by revering this change to restore the previous behavior which depends solely on checking the dirty list. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8816 Closes #8834	2019-06-07 12:45:40 -07:00
Brian Behlendorf	a1eaf0dde0	Exclude log device ashift from normal class When opening a log device during import its allocation bias will not yet have been set by vdev_load(). This results in the log device's ashift being incorrectly applied to the maximum ashift of the vdevs in the normal class. Which in turn prevents the removal of any top-level devices due to the ashift check in the spa_vdev_remove_top_check() function. This issue is resolved by including vdev_islog in the check since it will be set correctly during vdev_open(). Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8735	2019-06-07 12:45:40 -07:00
madz	580256045b	Fix integer overflow in get_next_chunk() dn->dn_datablksz type is uint32_t and need to be casted to uint64_t to avoid an overflow when the record size is greater than 4 MiB. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olivier Mazouffre <olivier.mazouffre@ims-bordeaux.fr> Closes #8778 Closes #8797	2019-06-07 12:45:40 -07:00
loli10K	aaf3b30dcf	Double-free of encryption wrapping key due to invalid pool properties This commits fixes a double-free in zfs_ioc_pool_create() triggered by specifying an unsupported combination of properties when creating a pool with encryption enabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8791	2019-06-07 12:45:40 -07:00
Tom Caputi	69ae34076f	Fix embedded bp accounting in count_block() Currently, count_block() does not correctly account for the possibility that the bp that is passed to it could be embedded. These blocks shouldn't be counted since the work of scanning these blocks in already handled when the containing block is scanned. This patch simply resolves this issue by returning early in this case. Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Bill Sommerfeld <sommerfeld@alum.mit.edu> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8800 Closes #8766	2019-06-07 12:39:13 -07:00
Tomohiro Kusumi	2fb37bcadd	Linux 5.2 compat: Directly call wait_on_page_bit() wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd. Directly call wait_on_page_bit() without using wait_on_page_writeback() interface, given zfs_putpage() is the only caller for now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8794	2019-06-07 12:39:13 -07:00
Tomohiro Kusumi	8ec352be1f	Linux 5.2 compat: Remove config/kernel-set-fs-pwd.m4 This failed on 5.2-rc1 with "error: unknown" message, for set_fs_pwd() not being visible in both const and non-const tests. This is caused by torvalds/linux@83da1bed86. It's configurable, but we would want to be able to compile with default kbuild setting. set_fs_pwd() has never been exported with exception of some distro kernels, and set_fs_pwd() wasn't used in ZoL to begin with. The test result was used for a spl function vn_set_fs_pwd(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8777	2019-06-07 12:39:13 -07:00
loli10K	f91e7e6284	Device removal panics on 32-bit systems The issue is caused by an incorrect usage of the sizeof() operator in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on 32-bit systems pointers are 4 bytes long which is not supported by zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system will cause the following failure: VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22) PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete() Showing stack for process 1315 CPU: 6 PID: 1315 Comm: txg_sync Tainted: P OE 4.4.69+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0 d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341 da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76 Call Trace: [<>] dump_stack+0x58/0x7c [<>] spl_dumpstack+0x23/0x27 [spl] [<>] spl_panic.cold.0+0x5/0x41 [spl] [<>] ? dbuf_rele+0x3e/0x90 [zfs] [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs] [<>] ? zap_lookup+0x57/0x70 [zfs] [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs] [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs] [<>] ? txg_verify+0x1d/0x160 [zfs] [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs] [<>] vdev_sync+0xbf/0x550 [zfs] [<>] ? mutex_lock+0x10/0x30 [<>] ? txg_list_remove+0x9f/0x1a0 [zfs] [<>] ? zap_contains+0x4d/0x70 [zfs] [<>] spa_sync+0x9f1/0x1b10 [zfs] ... [<>] ? kthread_stop+0x110/0x110 This commit simply corrects the "integer_size" parameter used to lookup the vdev's ZAP object. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8790	2019-06-07 12:39:13 -07:00
loli10K	cc434dcf45	Fix coverity defects: CID 186143 CID 186143: Memory - illegal accesses (USE_AFTER_FREE) This patch fixes an use-after-free in spa_import_progress_destroy() moving the kmem_free() call at the end of the function. Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8788	2019-06-07 12:39:13 -07:00
Richard Elling	78fac8d925	Fix kstat state update during pool transition When reading kstats, the health (aka state) of the pool is stored into /proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name(). However, during import/export there is a case where the spa exists, but the root vdev does not exist. This fix checks that case and sets the state to "TRANSITIONING" Unfortunately, it is not easy to reproduce a test for this. It was detected randomly during ZTS runs while kstats were also being sampled regularly. After this change, further testing did not trip on the case and the TRANSITIONING state was collected at least once by the kstats. For posterity, the backtrace prior to this fix is: [Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs] ... Mon May 13 17:21:00 2019] Call Trace: [Mon May 13 17:21:00 2019] spa_state_data+0x1a/0x40 [zfs] [Mon May 13 17:21:00 2019] kstat_seq_show+0x117/0x440 [spl] [Mon May 13 17:21:00 2019] seq_read+0xe5/0x430 [Mon May 13 17:21:00 2019] proc_reg_read+0x45/0x70 [Mon May 13 17:21:00 2019] __vfs_read+0x1b/0x40 [Mon May 13 17:21:00 2019] vfs_read+0x8e/0x130 [Mon May 13 17:21:00 2019] SyS_read+0x55/0xc0 [Mon May 13 17:21:00 2019] ? SyS_fcntl+0x5d/0xb0 [Mon May 13 17:21:00 2019] do_syscall_64+0x73/0x130 [Mon May 13 17:21:00 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com> Closes #8746	2019-05-23 14:28:53 -07:00
Brian Behlendorf	bff2361aeb	Linux 5.2 compat: rw_tryupgrade() Commit torvalds/linux@46ad0840b has removed the architecture specific rwsem source and headers leaving only the generic version. As part of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS macros were moved to the private kernel/locking/rwsem.h header. This results in a build failure because these macros were required to implement the rw_tryupgrade() compatibility function. In practice, this isn't a major problem because there are only a few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade should be written to retry using rw_enter(RW_WRITER). After auditing all of the callers only dmu_zfetch() was determined not to perform a retry. It has been updated in this commit to resolve this issue. That said, the rw_tryupgrade() functionality should be considered for possible removal in a future release due to the difficultly in supporting the interface. Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8730	2019-05-23 13:46:33 -07:00
Paul Dagnelie	e61b53475e	Fix incorrect assertion in dnode_dirty_l1range The db_dirtycnt of an EVICTING dbuf is always 0. However, it still appears in the dn_dbufs tree. If we call dnode_dirty_l1range on a range that contains an EVICTING dbuf, we will attempt to mark it dirty (which will fail because it's EVICTING, resulting in a new dbuf being created and dirtied). Later, in ZFS_DEBUG mode, we assert that all the dbufs in the range are dirty. If the EVICTING dbuf is still present, this will trip the assertion erroneously. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Sara Hartse <sara.hartse@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8745	2019-05-19 17:30:33 -07:00
Olaf Faaland	ca95f70dff	zpool import progress kstat When an import requires a long MMP activity check, or when the user requests pool recovery, the import make take a long time. The user may not know why, or be able to tell whether the import is progressing or is hung. Add a kstat which lists all imports currently being processed by the kernel (currently only one at a time is possible, but the kstat allows for more than one). The kstat is /proc/spl/kstat/zfs/import_progress. The kstat contents are as follows: pool_guid load_state multihost_secs max_txg pool_name 16667015954387398 3 15 0 tank3 load_state: the value of spa_load_state multihost_secs: seconds until the end of the multihost activity check; if over, or none required, this is 0 max_txg: current spa_load_max_txg, if rewind is occurring This could be used by outside tools, such as a pacemaker resource agent, to report import progress, or as a part of manual troubleshooting. The zpool import subcommand could also be modified to report this information. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #8696	2019-05-09 10:08:05 -07:00
Tomohiro Kusumi	b689de85e8	Add missing trailing '\n' in printk() messages These messages will want '\n' like any other regular printk() messages. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8726	2019-05-08 16:43:55 -07:00
Tomohiro Kusumi	97aa3ba44f	Fix link count of root inode when snapdir is visible Given how zfs_getattr() is implemented, zfs_getattr_fast() (used by ->getattr() of zpl inodes) also needs to consider an additional link count if "snapdir" property is set to "visible". Without this, # of directories in root inode of each dataset doesn't match the link count when snapdir is visible. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8727	2019-05-08 16:40:51 -07:00
Brian Behlendorf	a20f43b51b	Linux 5.0 compat: ASM_BUG macro The 5.0 kernel defines the macro ASM_BUG. In order to prevent a conflict and build failure rename ASM_BUG to ZFS_ASM_BUG. This is currently only an issue on aarch64 but all instances of ASM_BUG we're renamed to avoid any future conflict on x86_64. Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8725 Issue #8545	2019-05-08 10:18:40 -07:00
Brian Behlendorf	515ddf6504	Fix errant EFAULT during writes (#8719 ) Commit `98bb45e` resolved a deadlock which could occur when handling a page fault in zfs_write(). This change added the uio_fault_disable field to the uio structure but failed to initialize it to B_FALSE. This uninitialized field would cause uiomove_iov() to call __copy_from_user_inatomic() instead of copy_from_user() resulting in unexpected EFAULTs. Resolve the issue by fully initializing the uio, and clearing the uio_fault_disable flags after it's used in zfs_write(). Additionally, reorder the uio_t field assignments to match the order the fields are declared in the structure. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8640 Closes #8719	2019-05-08 10:04:04 -07:00
DeHackEd	1f02ecc5a5	Make zfs_special_class_metadata_reserve_pct into a parameter Exported and documented a new module parameter. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #8706	2019-05-07 15:34:42 -07:00
Brian Behlendorf	caf9dd209f	Fix send/recv lost spill block When receiving a DRR_OBJECT record the receive_object() function needs to determine how to handle a spill block associated with the object. It may need to be removed or kept depending on how the object was modified at the source. This determination is currently accomplished using a heuristic which takes in to account the DRR_OBJECT record and the existing object properties. This is a problem because there isn't quite enough information available to do the right thing under all circumstances. For example, when only the block size changes the spill block is removed when it should be kept. What's needed to resolve this is an additional flag in the DRR_OBJECT which indicates if the object being received references a spill block. The DRR_OBJECT_SPILL flag was added for this purpose. When set then the object references a spill block and it must be kept. Either it is update to date, or it will be replaced by a subsequent DRR_SPILL record. Conversely, if the object being received doesn't reference a spill block then any existing spill block should always be removed. Since previous versions of ZFS do not understand this new flag additional DRR_SPILL records will be inserted in to the stream. This has the advantage of being fully backward compatible. Existing ZFS systems receiving this stream will recreate the spill block if it was incorrectly removed. Updated ZFS versions will correctly ignore the additional spill blocks which can be identified by checking for the DRR_SPILL_UNMODIFIED flag. The small downside to this approach is that is may increase the size of the stream and of the received snapshot on previous versions of ZFS. Additionally, when receiving streams generated by previous unpatched versions of ZFS spill blocks may still be lost. OpenZFS-issue: https://www.illumos.org/issues/9952 FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277 Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8668	2019-05-07 15:18:44 -07:00
Tomohiro Kusumi	9c53e51616	Fix `zfs set atime\|relatime=off\|on` behavior on inherited datasets `zfs set atime\|relatime=off\|on` doesn't disable or enable the property on read for datasets whose property was inherited from parent, until a dataset is once unmounted and mounted again. (The properties start to work properly if a dataset is once unmounted and mounted again. The difference comes from regular mount process, e.g. via zpool import, uses mount options based on properties read from ondisk layout for each dataset, whereas `zfs set atime\|relatime=off\|on` just remounts a specified dataset.) -- # zpool create p1 <device> # zfs create p1/f1 # zfs set atime=off p1 # echo test > /p1/f1/test # sync # zfs list NAME USED AVAIL REFER MOUNTPOINT p1 176K 18.9G 25.5K /p1 p1/f1 26K 18.9G 26K /p1/f1 # zfs get atime NAME PROPERTY VALUE SOURCE p1 atime off local p1/f1 atime off inherited from p1 # stat /p1/f1/test \| grep Access \| tail -1 Access: 2019-04-26 23:32:33.741205192 +0900 # cat /p1/f1/test test # stat /p1/f1/test \| grep Access \| tail -1 Access: 2019-04-26 23:32:50.173231861 +0900 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2) -- The problem is that zfsvfs::z_atime which was probably intended to keep incore atime state just gets updated by a callback function of "atime" property change, atime_changed_cb(), and never used for anything else. Since now that all file read and atime update use a common function zpl_iter_read_common() -> file_accessed(), and whether to update atime via ->dirty_inode() is determined by atime_needs_update(), atime_needs_update() needs to return false once atime is turned off. It currently continues to return true on `zfs set atime=off`. Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super block depending on a new atime value, so that atime_needs_update() works as expected after property change. The same problem applies to "relatime" except that a self contained relatime test is needed. This is because relatime_need_update() is based on a mount option flag MNT_RELATIME, which doesn't exist in datasets with inherited "relatime" property via `zfs set relatime=...`, hence it needs its own relatime test zfs_relatime_need_update(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8674 Closes #8675	2019-05-07 10:06:30 -07:00
Tomohiro Kusumi	a762893269	Fix typo/etc in module/zfs/zfs_ctldir.c Drop duplicated phrases in comments. Also drop an obsolete comment "Perform a mount of the associated...", as all it does now is get objid from DMU and lookup incore inode. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8707	2019-05-05 15:57:23 -07:00

1 2 3 4 5 ...

2716 Commits