mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	9e17e6f254	Remove zfs_vdev_elevator module option As described in commit `f81d5ef6` the zfs_vdev_elevator module option is being removed. Users who require this functionality should update their systems to set the disk scheduler using a udev rule. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9417 Closes #9609	2019-11-27 10:35:49 -08:00
jwpoduska	3c819a2c7d	Prevent unnecessary resilver restarts If a device is participating in an active resilver, then it will have a non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the resilver to be restarted (or deferred to be restarted later), which is unnecessary if the DTL is still covered by the current scan range. This is similar to the logic in vdev_dtl_should_excise() where the DTL can only be excised if it's max txg is in the resilvered range. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: John Poduska <jpoduska@datto.com> Issue #840 Closes #9155 Closes #9378 Closes #9551 Closes #9588	2019-11-27 10:15:01 -08:00
Matthew Macy	da92d5cbb3	Add zfs_file_* interface, remove vnodes Provide a common zfs_file_* interface which can be implemented on all platforms to perform normal file access from either the kernel module or the libzpool library. This allows all non-portable vnode_t usage in the common code to be replaced by the new portable zfs_file_t. The associated vnode and kobj compatibility functions, types, and macros have been removed from the SPL. Moving forward, vnodes should only be used in platform specific code when provided by the native operating system. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9556	2019-11-21 09:32:57 -08:00
Brian Behlendorf	7ae3f8dc8f	Partially revert `5a6ac4c` Reinstate the zpl_revalidate() functionality to resolve a regression where dentries for open files during a rollback are not invalidated. The unrelated functionality for automatically unmounting .zfs/snapshots was not reverted. Nor was the addition of shrink_dcache_sb() to the zfs_resume_fs() function. This issue was not immediately caught by the CI because the test case intended to catch it was included in the list of ZTS tests which may occasionally fail for unrelated reasons. Remove all of the rollback tests from this list to help identify the frequency of any spurious failures. The rollback_003_pos.ksh test case exposes a real issue with the long standing code which needs to be investigated. Regardless, it has been enable with a small workaround in the test case itself. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9587 Closes #9592	2019-11-18 13:05:56 -08:00
Michael Niewöhner	6d948c3519	Add kmem_cache flag for forcing kvmalloc This adds a new KMC_KVMEM flag was added to enforce use of the kvmalloc allocator in kmem_cache_create even for large blocks, which may also increase performance in some specific cases (e.g. zstd), too. Default to KVMEM instead of VMEM in spl_kmem_cache_create. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #9034	2019-11-13 10:05:23 -08:00
Michael Niewöhner	66955885e2	Make use of kvmalloc if available and fix vmem_alloc implementation This patch implements use of kvmalloc for GFP_KERNEL allocations, which may increase performance if the allocator is able to allocate physical memory, if kvmalloc is available as a public kernel interface (since v4.12). Otherwise it will simply fall back to virtual memory (vmalloc). Also fix vmem_alloc implementation which can lead to slow allocations since the first attempt with kmalloc does not make use of the noretry flag but tells the linux kernel to retry several times before it fails. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #9034	2019-11-13 10:05:10 -08:00
Matthew Macy	1f2f46be95	Add wrapper stub for zfs_cmd ioctl to libzpool FreeBSD needs a wrapper for handling zfs_cmd ioctls. In libzfs this is handled by zfs_ioctl. However, here we need to wrap the call directly. Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9511	2019-11-12 10:40:39 -08:00
Brian Behlendorf	066e825221	Linux compat: Minimum kernel version 3.10 Increase the minimum supported kernel version from 2.6.32 to 3.10. This removes support for the following Linux enterprise distributions. Distribution \| Kernel \| End of Life ---------------- \| ------ \| ------------- Ubuntu 12.04 LTS \| 3.2 \| Apr 28, 2017 SLES 11 \| 3.0 \| Mar 32, 2019 RHEL / CentOS 6 \| 2.6.32 \| Nov 30, 2020 The following changes were made as part of removing support. * Updated `configure` to enforce a minimum kernel version as specified in the META file (Linux-Minimum: 3.10). configure: error: * Cannot build against kernel version 2.6.32. * The minimum supported kernel version is 3.10. * Removed all `configure` kABI checks and matching C code for interfaces which solely predate the Linux 3.10 kernel. * Updated all `configure` kABI checks to fail when an interface is missing which was in the 3.10 kernel up to the latest 5.1 kernel. Removed the HAVE_* preprocessor defines for these checks and updated the code to unconditionally use the verified interface. * Inverted the detection logic in several kABI checks to match the new interface as it appears in 3.10 and newer and not the legacy interface. * Consolidated the following checks in to individual files. Due the large number of changes in the checks it made sense to handle this now. It would be desirable to group other related checks in the same fashion, but this as left as future work. - config/kernel-blkdev.m4 - Block device kABI checks - config/kernel-blk-queue.m4 - Block queue kABI checks - config/kernel-bio.m4 - Bio interface kABI checks * Removed the kABI checks for sops->nr_cached_objects() and sops->free_cached_objects(). These interfaces are currently unused. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9566	2019-11-12 08:59:06 -08:00
Ryan Moeller	035ebb3653	Allow platform dependent path stripping for vdevs On Linux the full path preceding devices is stripped when formatting vdev names. On FreeBSD we only want to strip "/dev/". Hide the implementation details of path stripping behind zfs_strip_path(). Make zfs_strip_partition_path() static in Linux implementation while here, since it is never used outside of the file it is defined in. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9565	2019-11-11 12:15:44 -08:00
Pavel Snajdr	5a6ac4cffc	Remove zpl_revalidate This patch removes the need for zpl_revalidate altogether. There were 3 main reasons why we used d_revalidate: 1. periodic automounted snapshots umount deferral 2. negative dentries created before snapshot rollback 3. stale inodes referenced by dentry cache after snapshot rollback Periodic snapshots deferral solution introduces zfs_exit_fs function, which is called as a part of ZFS_EXIT(zfsvfs_t) macro. Negative dentries and stale inodes are solved by flushing the dcache for the particular dataset on zfs_resume_fs call. This patch also removes now unused HAVE_S_D_OP configure test. Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #8774 Closes #9549	2019-11-11 09:34:21 -08:00
Romain Dolbeau	4254e40729	Preliminary support for RV64G This adds basic support for RISC-V, specifically RV64G. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu> Closes #9540	2019-11-06 10:56:09 -08:00
Matthew Macy	27ece2ee4d	Move platform specific parts of zfs_znode.h to platform code Some of the znode fields are different and functions consuming an inode don't exist on FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9536	2019-11-06 10:54:25 -08:00
Prakash Surya	ae38e00968	Add tracepoints for taskq entry lifetime events This adds some new DTRACE_PROBE* endpoints so that we can observe taskq latencies on a system. Additionally, a new "taskqlatency.bt" script is added to do this observation via "bpftrace". Lastly, a "zfs-trace.sh" script is added to wrap "bpftrace" with the proper options required to run and use "taskqlatency.bt". For example, with these changes in place, a user can run the following: $ cd ./contrib/bpftrace $ sudo ./zfs-trace.sh taskqlatency.bt Attaching 6 probes... ^C Here's some example output, showing latency information for time spent executing the taskq entry's function: @exec_lat_us[dp_sync_taskq, userquota_updates_task]: [2, 4) 5 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [4, 8) 0 \| \| [8, 16) 1 \|@@@@@@@@@@ \| [16, 32) 2 \|@@@@@@@@@@@@@@@@@@@@ \| @exec_lat_us[z_wr_int_h, zio_execute]: [8, 16) 16 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [16, 32) 2 \|@@@@@@ \| @exec_lat_us[z_wr_iss_h, zio_execute]: [16, 32) 4 \|@@@@@@@@@@@@@@@@ \| [32, 64) 13 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [64, 128) 1 \|@@@@ \| @exec_lat_us[z_ioctl_int, zio_execute]: [2, 4) 1 \|@@@@ \| [4, 8) 11 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [8, 16) 8 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| @exec_lat_us[dp_sync_taskq, sync_dnodes_task]: [2, 4) 1 \|@@@@@@ \| [4, 8) 7 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [8, 16) 8 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [16, 32) 2 \|@@@@@@@@@@@@@ \| [32, 64) 4 \|@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [64, 128) 1 \|@@@@@@ \| [128, 256) 0 \| \| [256, 512) 1 \|@@@@@@ Here's some example output, showing latency information for time spent waiting on the taskq, prior to starting execution of entry's function: @queue_lat_us[dp_sync_taskq]: [2, 4) 1 \|@@@@ \| [4, 8) 7 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [8, 16) 2 \|@@@@@@@@ \| [16, 32) 3 \|@@@@@@@@@@@@@ \| [32, 64) 12 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [64, 128) 6 \|@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [128, 256) 0 \| \| [256, 512) 1 \|@@@@ \| @queue_lat_us[z_wr_iss]: [4, 8) 4 \|@@@@ \| [8, 16) 13 \|@@@@@@@@@@@@@@@ \| [16, 32) 6 \|@@@@@@@ \| [32, 64) 2 \|@@ \| [64, 128) 12 \|@@@@@@@@@@@@@@ \| [128, 256) 15 \|@@@@@@@@@@@@@@@@@@ \| [256, 512) 33 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [512, 1K) 27 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [1K, 2K) 7 \|@@@@@@@@ \| [2K, 4K) 14 \|@@@@@@@@@@@@@@@@ \| [4K, 8K) 14 \|@@@@@@@@@@@@@@@@ \| [8K, 16K) 23 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [16K, 32K) 43 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| @queue_lat_us[z_wr_int]: [2, 4) 10 \|@@@@@ \| [4, 8) 71 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [8, 16) 88 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [16, 32) 50 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [32, 64) 65 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [64, 128) 43 \|@@@@@@@@@@@@@@@@@@@@@@@@@ \| [128, 256) 19 \|@@@@@@@@@@@ \| [256, 512) 3 \|@ \| [512, 1K) 1 \| \| Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #9525	2019-11-01 13:14:54 -07:00
Prakash Surya	e5d1c27e30	Enable use of DTRACE_PROBE* macros in "spl" module This change modifies some of the infrastructure for enabling the use of the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module. Currently, when the DTRACE_PROBE* macros are used, they get expanded to create new functions, and these dynamically generated functions become part of the "zfs" module. Since the "spl" module does not depend on the "zfs" module, the use of DTRACE_PROBE* in the "spl" module would result in undefined symbols being used in the "spl" module. Specifically, DTRACE_PROBE* would turn into a function call, and the function being called would be a symbol only contained in the "zfs" module; which results in a linker and/or runtime error. Thus, this change adds the necessary logic to the "spl" module, to mirror the tracing functionality available to the "zfs" module. After this change, we'll have a "trace_zfs.h" header file which defines the probes available only to the "zfs" module, and a "trace_spl.h" header file which defines the probes available only to the "spl" module. Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #9525	2019-11-01 13:13:43 -07:00
Matthew Macy	4a2ed90013	Wrap Linux module macros MODULE_VERSION is already defined on FreeBSD. Wrap all of the used MODULE_* macros for the sake of consistency and portability. Add a user space noop version to reduce the need for _KERNEL ifdefs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9542	2019-11-01 10:41:03 -07:00
Matthew Macy	bd4dde8ef7	Prefix struct rangelock A struct rangelock already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. This change is a follow up to `2cc479d0`. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9534	2019-11-01 10:37:33 -07:00
Matthew Macy	156f74fc03	Return an error code from zfs_acl_chmod_setattr The FreeBSD implementation can fail, allow this function to fail and add the required error handling for Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9541	2019-11-01 10:19:11 -07:00
Matthew Macy	c4ae27c763	Move sha2.h to platform code FreeBSD has its own sha routines that the port uses. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9530	2019-10-31 15:45:58 -07:00
Matthew Macy	a5308e252d	Don't cast away const Follow up to `511fce6b` which missed a cast. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9533	2019-10-31 10:38:03 -07:00
Matthew Macy	2a3aa5a109	Factor Linux specific code out of spa_misc.c Move these Linux module parameter get/set helpers in to platform specific code. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9457	2019-10-31 09:52:22 -07:00
Romain Dolbeau	0b2a642351	Add AVX512BW variant of fletcher It is much faster than AVX512F when byteswapping on Skylake-SP and newer, as we can do the byteswap in a single vshufb instead of many instructions. Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #9517	2019-10-30 12:26:14 -07:00
Matthew Macy	d46f0deb03	Add wrapper for Linux BLKFLSBUF ioctl FreeBSD has no analog. Buffered block devices were removed a decade plus ago. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9508	2019-10-28 09:53:39 -07:00
Matthew Macy	4a22ba5be0	Minor spa portability fixes - FreeBSD's rootpool import code uses spa_config_parse - Move the zvol_create_minors call out from under the spa_namespace_lock in spa_import. It isn't needed and it causes a lock order reversal on FreeBSD. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9499	2019-10-28 09:51:53 -07:00
Chunwei Chen	7125a109dc	Fix zpool history unbounded memory usage In original implementation, zpool history will read the whole history before printing anything, causing memory usage goes unbounded. We fix this by breaking it into read-print iterations. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #9516	2019-10-28 09:49:44 -07:00
loli10K	e35704647e	Fix for ARC sysctls ignored at runtime This change leverage module_param_call() to run arc_tuning_update() immediately after the ARC tunable has been updated as suggested in `cffa8372` code review. A simple test case is added to the ZFS Test Suite to prevent future regressions in functionality. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9487 Closes #9489	2019-10-26 15:22:19 -07:00
Matthew Macy	1952fe0e25	Move platform dependent errno aliases EBADE, EBADR, and ENOANO do not exist on FreeBSD The libspl errno.h is similarly platform dependent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9498	2019-10-25 13:40:50 -07:00
Matthew Macy	68a1b1589a	Remove sdt.h It's mostly a noop on ZoL and it conflicts with platforms that support dtrace. Remove this header to resolve the conflict. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9497	2019-10-25 13:38:37 -07:00
Brian Behlendorf	10fa254539	Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #9346 Closes #9403	2019-10-24 10:17:33 -07:00
Serapheim Dimitropoulos	9f3c72a2a8	Name anonymous enum of KMC_BIT constants Giving a name to this enum makes it discoverable from debugging tools like DRGN and SDB. For example, with the name proposed on this patch we can iterate over these values in DRGN: ``` >>> prog.type('enum kmc_bit').enumerators (('KMC_BIT_NOTOUCH', 0), ('KMC_BIT_NODEBUG', 1), ('KMC_BIT_NOMAGAZINE', 2), ('KMC_BIT_NOHASH', 3), ('KMC_BIT_QCACHE', 4), ('KMC_BIT_KMEM', 5), ('KMC_BIT_VMEM', 6), ('KMC_BIT_SLAB', 7), ... ``` This enables SDB to easily pretty-print the flags of the spl_kmem_caches in the system like this: ``` > spl_kmem_caches -o "name,flags,total_memory" name flags total_memory ------------------------ ----------------------- ------------ abd_t KMC_NOMAGAZINE\|KMC_SLAB 4.5MB arc_buf_hdr_t_full KMC_NOMAGAZINE\|KMC_SLAB 12.3MB ... <cropped> ... ddt_cache KMC_VMEM 583.7KB ddt_entry_cache KMC_NOMAGAZINE\|KMC_SLAB 0.0B ... <cropped> ... zio_buf_1048576 KMC_NODEBUG\|KMC_VMEM 0.0B ... <cropped> ... ``` Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #9478	2019-10-18 13:25:44 -04:00
Matthew Macy	c9c9c1e213	OpenZFS restructuring - ARC memory pressure Factor Linux specific memory pressure handling out of ARC. Each platform will have different available interfaces for managing memory pressure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9472	2019-10-18 13:23:19 -04:00
Matthew Macy	08f530c699	Make zfsdev_getminor signature cross platform Only pass the file descriptor to make zfsdev_get_miror() portable. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9466	2019-10-16 18:43:52 -07:00
Paul Dagnelie	511fce6b1f	Don't call sizeof on void We get the sizeof the appropriate type, and don't cast away const. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9455	2019-10-13 19:21:51 -07:00
Paul Dagnelie	516a83f886	Function name and comment updates Rename certain functions for more consistency when they share common features. Make comments clearer about what arguments should be passed to the insert and add functions. Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9441	2019-10-11 10:13:21 -07:00
Matthew Macy	af1698f59b	Expose dmu_buf_hold_array_by_dnode to platform code FreeBSD uses this in its pager ops routines Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9431	2019-10-11 10:06:18 -07:00
loli10K	715c996d3b	Fix pool creation with feature@allocation_classes disabled When "feature@allocation_classes" is not enabled on the pool no vdev with "special" or "dedup" allocation type should be allowed to exist in the vdev tree. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9427 Closes #9429	2019-10-10 16:39:41 -07:00
Matthew Macy	2516a87821	Move get_temporary_prop to platform code Temporary property handling at the VFS layer requires platform specific code. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9401	2019-10-10 15:59:34 -07:00
Matthew Macy	6501906280	Add kmem cache accessors Make the metaslab platform agnostic again by adding accessor functions which can be implemented by each platform. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9404	2019-10-10 15:45:52 -07:00
Matthew Macy	eedb3a62b9	Make `zil_async_to_sync` visible to platform code FreeBSD's zvol platform code requires access to the zil_async_to_sync() function. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9440	2019-10-10 15:39:44 -07:00
Matthew Macy	e4f5fa1229	Fix strdup conflict on other platforms In the FreeBSD kernel the strdup signature is: ``` char strdup(const char __restrict, struct malloc_type *); ``` It's unfortunate that the developers have chosen to change the signature of libc functions - but it's what I have to deal with. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9433	2019-10-10 09:47:06 -07:00
Paul Dagnelie	ca5777793e	Reduce loaded range tree memory usage This patch implements a new tree structure for ZFS, and uses it to store range trees more efficiently. The new structure is approximately a B-tree, though there are some small differences from the usual characterizations. The tree has core nodes and leaf nodes; each contain data elements, which the elements in the core nodes acting as separators between its children. The difference between core and leaf nodes is that the core nodes have an array of children, while leaf nodes don't. Every node in the tree may be only partially full; in most cases, they are all at least 50% full (in terms of element count) except for the root node, which can be less full. Underfull nodes will steal from their neighbors or merge to remain full enough, while overfull nodes will split in two. The data elements are contained in tree-controlled buffers; they are copied into these on insertion, and overwritten on deletion. This means that the elements are not independently allocated, which reduces overhead, but also means they can't be shared between trees (and also that pointers to them are only valid until a side-effectful tree operation occurs). The overhead varies based on how dense the tree is, but is usually on the order of about 50% of the element size; the per-node overheads are very small, and so don't make a significant difference. The trees can accept arbitrary records; they accept a size and a comparator to allow them to be used for a variety of purposes. The new trees replace the AVL trees used in the range trees today. Currently, the range_seg_t structure contains three 8 byte integers of payload and two 24 byte avl_tree_node_ts to handle its storage in both an offset-sorted tree and a size-sorted tree (total size: 64 bytes). In the new model, the range seg structures are usually two 4 byte integers, but a separate one needs to exist for the size-sorted and offset-sorted tree. Between the raw size, the 50% overhead, and the double storage, the new btrees are expected to use 81.52 = 24 bytes per record, or 33.3% as much memory as the AVL trees (this is for the purposes of storing metaslab range trees; for other purposes, like scrubs, they use ~50% as much memory). We reduced the size of the payload in the range segments by teaching range trees about starting offsets and shifts; since metaslabs have a fixed starting offset, and they all operate in terms of disk sectors, we can store the ranges using 4-byte integers as long as the size of the metaslab divided by the sector size is less than 2^32. For 512-byte sectors, this is a 2^41 (or 2TB) metaslab, which with the default settings corresponds to a 256PB disk. 4k sector disks can handle metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not anticipate disks of this size in the near future, there should be almost no cases where metaslabs need 64-byte integers to store their ranges. We do still have the capability to store 64-byte integer ranges to account for cases where we are storing per-vdev (or per-dnode) trees, which could reasonably go above the limits discussed. We also do not store fill information in the compact version of the node, since it is only used for sorted scrub. We also optimized the metaslab loading process in various other ways to offset some inefficiencies in the btree model. While individual operations (find, insert, remove_from) are faster for the btree than they are for the avl tree, remove usually requires a find operation, while in the AVL tree model the element itself suffices. Some clever changes actually caused an overall speedup in metaslab loading; we use approximately 40% less cpu to load metaslabs in our tests on Illumos. Another memory and performance optimization was achieved by changing what is stored in the size-sorted trees. When a disk is heavily fragmented, the df algorithm used by default in ZFS will almost always find a number of small regions in its initial cursor-based search; it will usually only fall back to the size-sorted tree to find larger regions. If we increase the size of the cursor-based search slightly, and don't store segments that are smaller than a tunable size floor in the size-sorted tree, we can further cut memory usage down to below 20% of what the AVL trees store. This also results in further reductions in CPU time spent loading metaslabs. The 16KiB size floor was chosen because it results in substantial memory usage reduction while not usually resulting in situations where we can't find an appropriate chunk with the cursor and are forced to use an oversized chunk from the size-sorted tree. In addition, even if we do have to use an oversized chunk from the size-sorted tree, the chunk would be too small to use for ZIL allocations, so it isn't as big of a loss as it might otherwise be. And often, more small allocations will follow the initial one, and the cursor search will now find the remainder of the chunk we didn't use all of and use it for subsequent allocations. Practical testing has shown little or no change in fragmentation as a result of this change. If the size-sorted tree becomes empty while the offset sorted one still has entries, it will load all the entries from the offset sorted tree and disregard the size floor until it is unloaded again. This operation occurs rarely with the default setting, only on incredibly thoroughly fragmented pools. There are some other small changes to zdb to teach it to handle btrees, but nothing major. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed by: Sebastien Roy seb@delphix.com Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9181	2019-10-09 10:36:03 -07:00
Matthew Macy	2cc479d049	Rename rangelock_ functions to zfs_rangelock_ A rangelock KPI already exists on FreeBSD. Add a zfs_ prefix as per our convention to prevent any conflict with existing symbols. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9402	2019-10-03 15:54:29 -07:00
Matthew Macy	73cdcc6323	OpenZFS restructuring - libzfs Factor Linux specific functionality out of libzfs. Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Macy <mmacy@FreeBSD.org> Closes #9377	2019-10-03 10:33:16 -07:00
Matthew Macy	7c5eff9400	OpenZFS restructuring - libzutil Factor Linux specific functionality out of libzutil. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9356	2019-10-03 10:20:44 -07:00
Matthew Macy	6360e2779e	Add inode accessors to common code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9389	2019-10-02 09:15:12 -07:00
Matthew Macy	13a4027a7c	OpenZFS restructuring - arc_stats Make arc_stats visible to platform code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9386	2019-10-01 16:35:05 -07:00
Matthew Macy	3283f137d7	OpenZFS restructuring - zpool Factor Linux specific functions out of the zpool command. Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9333	2019-09-30 12:16:06 -07:00
Matthew Macy	7bb0c29468	OpenZFS restructuring - zfs_ioctl Refactor the zfs ioctls in to platform dependent and independent bits. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Signed-off-by: Matthew Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9301	2019-09-27 10:46:28 -07:00
Tom Caputi	bb61cc3185	Fix encryption hierarchy issues with zfs recv -d Currently, the recv_fix_encryption_hierarchy() function accepts 'destsnap' as one of its parameters. Originally, this was intended to be the top-level dataset of a receive (whether or not the receive was recursive). Unfortunately, this parameter actually is simply the input that is passed in from the command line. When the user specifies 'zfs recv -d', this string is actually only the name of the receiving pool since the rest of the name is derived from the send stream. This causes the function to fail, leaving some datasets with an invalid encryption hierarchy. This patch resolves this problem by passing in the top_zfs variable instead. In order to make this work, this patch also includes some changes that ensure the value is always present when we need it. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9273 Closes #9309	2019-09-25 17:02:32 -07:00
Matthew Macy	5df7e9d85c	OpenZFS restructuring - zvol Refactor the zvol in to platform dependent and independent bits. Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9295	2019-09-25 09:20:30 -07:00
John Gallagher	e60e158eff	Add subcommand to wait for background zfs activity to complete Currently the best way to wait for the completion of a long-running operation in a pool, like a scrub or device removal, is to poll 'zpool status' and parse its output, which is neither efficient nor convenient. This change adds a 'wait' subcommand to the zpool command. When invoked, 'zpool wait' will block until a specified type of background activity completes. Currently, this subcommand can wait for any of the following: - Scrubs or resilvers to complete - Devices to initialized - Devices to be replaced - Devices to be removed - Checkpoints to be discarded - Background freeing to complete For example, a scrub that is in progress could be waited for by running zpool wait -t scrub <pool> This also adds a -w flag to the attach, checkpoint, initialize, replace, remove, and scrub subcommands. When used, this flag makes the operations kicked off by these subcommands synchronous instead of asynchronous. This functionality is implemented using a new ioctl. The type of activity to wait for is provided as input to the ioctl, and the ioctl blocks until all activity of that type has completed. An ioctl was used over other methods of kernel-userspace communiction primarily for the sake of portability. Porting Notes: This is ported from Delphix OS change DLPX-44432. The following changes were made while porting: - Added ZoL-style ioctl input declaration. - Reorganized error handling in zpool_initialize in libzfs to integrate better with changes made for TRIM support. - Fixed check for whether a checkpoint discard is in progress. Previously it also waited if the pool had a checkpoint, instead of just if a checkpoint was being discarded. - Exposed zfs_initialize_chunk_size as a ZoL-style tunable. - Updated more existing tests to make use of new 'zpool wait' functionality, tests that don't exist in Delphix OS. - Used existing ZoL tunable zfs_scan_suspend_progress, together with zinject, in place of a new tunable zfs_scan_max_blks_per_txg. - Added support for a non-integral interval argument to zpool wait. Future work: ZoL has support for trimming devices, which Delphix OS does not. In the future, 'zpool wait' could be extended to add the ability to wait for trim operations to complete. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Gallagher <john.gallagher@delphix.com> Closes #9162	2019-09-13 18:09:06 -07:00
Chengfei ZHu	7238cbd4d3	QAT related bug fixes 1. Fix issue: Kernel BUG with QAT during decompression #9276. Now it is uninterruptible for a specific given QAT request, but Ctrl-C interrupt still works in user-space process. 2. Copy the digest result to the buffer only when doing encryption, and vise-versa for decryption. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com> Closes #9276 Closes #9303	2019-09-12 13:33:44 -07:00
Matthew Macy	b01a6574ae	Move objnode handling to common code objnode is OS agnostic and used only by dmu_redact.c. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9315	2019-09-12 13:31:09 -07:00
Matthew Macy	74756182d2	Enable compiler to typecheck logging Annotate spa logging declarations with printflike Workaround gcc bug (non disable-able warning) by replacing "" with " " Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9316	2019-09-12 13:28:26 -07:00
Matthew Macy	d66620681d	OpenZFS restructuring - move linux tracing code to platform directories Move Linux specific tracing headers and source to platform directories and update the build system. Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9290	2019-09-11 14:25:53 -07:00
Brian Behlendorf	25f06d677a	Fix /etc/hostid on root pool deadlock Accidentally introduced by `dc04a8c` which now takes the SCL_VDEV lock as a reader in zfs_blkptr_verify(). A deadlock can occur if the /etc/hostid file resides on a dataset in the same pool. This is because reading the /etc/hostid file may occur while the caller is holding the SCL_VDEV lock as a writer. For example, to perform a `zpool attach` as shown in the abbreviated stack below. To resolve the issue we cache the system's hostid when initializing the spa_t, or when modifying the multihost property. The cached value is then relied upon for subsequent accesses. Call Trace: spa_config_enter+0x1e8/0x350 [zfs] zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock zio_read+0x6c/0x140 [zfs] ... vfs_read+0xfc/0x1e0 kernel_read+0x50/0x90 ... spa_get_hostid+0x1c/0x38 [zfs] spa_config_generate+0x1a0/0x610 [zfs] vdev_label_init+0xa0/0xc80 [zfs] vdev_create+0x98/0xe0 [zfs] spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9256 Closes #9285	2019-09-10 13:42:30 -07:00
Brian Behlendorf	b88ca2acf5	Enable SIMD for encryption When adding the SIMD compatibility code in `e5db313` the decryption of a dataset wrapping key was left in a user thread context. This was done intentionally since it's a relatively infrequent operation. However, this also meant that the encryption context templates were initialized using the generic operations. Therefore, subsequent encryption and decryption operations would use the generic implementation even when executed by an I/O pipeline thread. Resolve the issue by initializing the context templates in an I/O pipeline thread. And by updating zio_do_crypt_uio() to dispatch any encryption operations to a pipeline thread when called from the user context. For example, when performing a read from the ARC. Tested-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9215 Closes #9296	2019-09-10 10:45:46 -07:00
Matthew Macy	bced7e3aaa	OpenZFS restructuring - move platform specific sources Move platform specific Linux source under module/os/linux/ and update the build system accordingly. Additional code restructuring will follow to make the common code fully portable. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Macy <mmacy@FreeBSD.org> Closes #9206	2019-09-06 11:26:26 -07:00
Matthew Macy	03fdcb9adc	Make module tunables cross platform Adds ZFS_MODULE_PARAM to abstract module parameter setting to operating systems other than Linux. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9230	2019-09-05 14:49:49 -07:00
Matthew Macy	006e9a4088	OpenZFS restructuring - move platform specific headers Move platform specific Linux headers under include/os/linux/. Update the build system accordingly to detect the platform. This lays some of the initial groundwork to supporting building for other platforms. As part of this change it was necessary to create both a user and kernel space sys/simd.h header which can be included in either context. No functional change, the source has been refactored and the relevant #include's updated. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Matthew Macy <mmacy@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9198	2019-09-05 09:34:54 -07:00
Andrea Gelmini	cf7c5a030e	Fix typos in include/ Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #9238	2019-08-30 09:53:15 -07:00
Paul Dagnelie	eef0f4d84e	Keep more metaslabs loaded With the other metaslab changes loaded onto a system, we can significantly reduce the memory usage of each loaded metaslab and unload them on demand if there is memory pressure. However, none of those changes actually result in us keeping more metaslabs loaded. If we don't keep more metaslabs loaded, we will still have to wait for demand-loading to finish when no loaded metaslab can satisfy our allocation, which can cause ZIL performance issues. In addition, performance is traditionally measured by IOs per unit time, while unloading is currently done on a txg-count basis. Txgs can take a widely varying range of times, from tenths of a second to several seconds. This can result in confusing, hard to predict behavior. This change simply adds a time-based component to metaslab unloading. A metaslab will remain loaded for one minute and 8 txgs (by default) after it was last used, unless it is evicted due to memory pressure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-65016 External-issue: DLPX-65047 Closes #9197	2019-08-29 10:20:36 -07:00
Chunwei Chen	035e96118b	Fix zil replay panic when TX_REMOVE followed by TX_CREATE If TX_REMOVE is followed by TX_CREATE on the same object id, we need to make sure the object removal is completely finished before creation. The current implementation relies on dnode_hold_impl with DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work fine before, in current version it does not guarantee the object removal is completed. We fix this by checking if DNODE_MUST_BE_FREE returns successful instead. Also add test and remove dead code in dnode_hold_impl. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7151 Closes #8910 Closes #9123 Closes #9145	2019-08-28 10:42:02 -07:00
Tom Caputi	e7a2fa70c3	Fix deadlock in 'zfs rollback' Currently, the 'zfs rollback' code can end up deadlocked due to the way the kernel handles unreferenced inodes on a suspended fs. Essentially, the zfs_resume_fs() code path may cause zfs to spawn new threads as it reinstantiates the suspended fs's zil. When a new thread is spawned, the kernel may attempt to free memory for that thread by freeing some unreferenced inodes. If it happens to select inodes that are a a part of the suspended fs a deadlock will occur because freeing inodes requires holding the fs's z_teardown_inactive_lock which is still held from the suspend. This patch corrects this issue by adding an additional reference to all inodes that are still present when a suspend is initiated. This prevents them from being freed by the kernel for any reason. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9203	2019-08-27 09:55:51 -07:00
Matthew Ahrens	325d288c5d	Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from `zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting in poor performance because we are holding the dp_config_rwlock the entire time, blocking spa_sync() from continuing. With around ten thousand snapshots, we've seen up to 500 seconds in this ioctl, iterating over up to 50,000,000 bpobjs, ~99% of which are the empty bpobj. By creating a fast path for zfs_ioc_space_snaps() handling of the empty_bpobj, we can achieve a ~5x performance improvement of this ioctl (when there are many snapshots, and the deadlist is mostly empty_bpobj's). Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-58348 Closes #8744	2019-08-20 11:34:52 -07:00
Paul Dagnelie	f09fda5071	Cap metaslab memory usage On systems with large amounts of storage and high fragmentation, a huge amount of space can be used by storing metaslab range trees. Since metaslabs are only unloaded during a txg sync, and only if they have been inactive for 8 txgs, it is possible to get into a state where all of the system's memory is consumed by range trees and metaslabs, and txgs cannot sync. While ZFS knows how to evict ARC data when needed, it has no such mechanism for range tree data. This can result in boot hangs for some system configurations. First, we add the ability to unload metaslabs outside of syncing context. Second, we store a multilist of all loaded metaslabs, sorted by their selection txg, so we can quickly identify the oldest metaslabs. We use a multilist to reduce lock contention during heavy write workloads. Finally, we add logic that will unload a metaslab when we're loading a new metaslab, if we're using more than a certain fraction of the available memory on range trees. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9128	2019-08-16 09:08:21 -06:00
Paul Dagnelie	dc04a8c757	Prevent race in blkptr_verify against device removal When we check the vdev of the blkptr in zfs_blkptr_verify, we can run into a race condition where that vdev is temporarily unavailable. This happens when a device removal operation and the old vdev_t has been removed from the array, but the new indirect vdev has not yet been inserted. We hold the spa_config_lock while doing our sensitive verification. To ensure that we don't deadlock, we only grab the lock if we don't have config_writer held. In addition, I had to const the tags of the refcounts and the spa_config_lock arguments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9112	2019-08-13 21:24:43 -06:00
Chunwei Chen	8e556c5ebc	Fix out-of-order ZIL txtype lost on hardlinked files We should only call zil_remove_async when an object is removed. However, in current implementation, it is called whenever TX_REMOVE is called. In the case of hardlinked file, every unlink will generate TX_REMOVE and causing operations to be dropped even when the object is not removed. We fix this by only calling zil_remove_async when the file is fully unlinked. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #8769 Closes #9061	2019-08-13 21:21:27 -06:00
George Wilson	c8242a96ba	spa_load_verify() may consume too much memory When a pool is imported it will scan the pool to verify the integrity of the data and metadata. The amount it scans will depend on the import flags provided. On systems with small amounts of memory or when importing a pool from the crash kernel, it's possible for spa_load_verify to issue too many I/Os that it consumes all the memory of the system resulting in an OOM message or a hang. To prevent this, we limit the amount of memory that the initial pool scan can consume. This change will, by default, use 1/16th of the ARC for scan I/Os to prevent running the system out of memory during import. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: George Wilson george.wilson@delphix.com External-issue: DLPX-65237 External-issue: DLPX-65238 Closes #9146	2019-08-13 08:11:57 -06:00
Tomohiro Kusumi	a43570c5f3	Change boolean-like uint8_t fields in znode_t to boolean_t Given znode_t is an in-core structure, it's more readable to have them as boolean. Also co-locate existing boolean fields with them for space efficiency (expecting 8 booleans to be packed/aligned). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9092	2019-08-13 07:58:02 -06:00
Richard Yao	fccbd1d6e2	Drop KMC_NOEMERGENCY This is not implemented. If it were implemented, using it would risk deadlocks on pre-3.18 kernels. Lets just drop it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #9119	2019-08-13 07:46:12 -06:00
Paul Dagnelie	c81f1790e2	Metaslab max_size should be persisted while unloaded When we unload metaslabs today in ZFS, the cached max_size value is discarded. We instead use the histogram to determine whether or not we think we can satisfy an allocation from the metaslab. This can result in situations where, if we're doing I/Os of a size not aligned to a histogram bucket, a metaslab is loaded even though it cannot satisfy the allocation we think it can. For example, a metaslab with 16 entries in the 16k-32k bucket may have entirely 16kB entries. If we try to allocate a 24kB buffer, we will load that metaslab because we think it should be able to handle the allocation. Doing so is expensive in CPU time, disk reads, and average IO latency. This is exacerbated if the write being attempted is a sync write. This change makes ZFS cache the max_size after the metaslab is unloaded. If we ever get a free (or a coalesced group of frees) larger than the max_size, we will update it. Otherwise, we leave it as is. When attempting to allocate, we use the max_size as a lower bound, and respect it unless we are in try_hard. However, we do age the max_size out at some point, since we expect the actual max_size to increase as we do more frees. A more sophisticated algorithm here might be helpful, but this works reasonably well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9055	2019-08-05 14:34:27 -07:00
Sara Hartse	37f03da8ba	Fast Clone Deletion Deleting a clone requires finding blocks are clone-only, not shared with the snapshot. This was done by traversing the entire block tree which results in a large performance penalty for sparsely written clones. This is new method keeps track of clone blocks when they are modified in a "Livelist" so that, when it’s time to delete, the clone-specific blocks are already at hand. We see performance improvements because now deletion work is proportional to the number of clone-modified blocks, not the size of the original dataset. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Sara Hartse <sara.hartse@delphix.com> Closes #8416	2019-07-26 10:54:14 -07:00
Matthew Ahrens	1ff46825e2	Replace zf_rwlock with a mutex The rwlock implementation on linux does not perform as well as mutexes. We can realize a performance benefit by replacing the zf_rwlock with a mutex. Local microbenchmarks show ~50% improvement, and over NFS we see ~5% improvement on several of the ZFS Performance Tests cases, especially randwrite and seq_write. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #9062	2019-07-25 11:57:58 -07:00
Serapheim Dimitropoulos	43a8536260	Race condition between spa async threads and export In the past we've seen multiple race conditions that have to do with open-context threads async threads and concurrent calls to spa_export()/spa_destroy() (including the one referenced in issue #9015). This patch ensures that only one thread can execute the main body of spa_export_common() at a time, with subsequent threads returning with a new error code created just for this situation, eliminating this way any race condition bugs introduced by concurrent calls to this function. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #9015 Closes #9044	2019-07-18 13:02:33 -07:00
Brian Behlendorf	d64dd3b62a	Retire unused spl_{mutex,rwlock}_{init_fini} These functions are unused and can be removed along with the spl-mutex.c and spl-rwlock.c source files. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9029	2019-07-17 15:13:53 -07:00
Brian Behlendorf	e7a99dab2b	Linux 5.3 compat: retire rw_tryupgrade() The Linux kernel's rwsem's have never provided an interface to allow a reader to be upgraded to a writer. Historically, this functionality has been implemented by a SPL wrapper function. However, this approach depends on internal knowledge of the rw_semaphore and is therefore rather brittle. Since the ZFS code must always be able to fallback to rw_exit() and rw_enter() when an rw_tryupgrade() fails; this functionality isn't critical. Furthermore, the only potentially performance sensitive consumer is dmu_zfetch() and no decrease in performance was observed with this change applied. See the PR comments for additional testing details. Therefore, it is being retired to make the build more robust and to simplify the rwlock implementation. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9029	2019-07-17 15:08:56 -07:00
Brian Behlendorf	041205afee	Linux 5.3 compat: rw_semaphore owner Commit https://github.com/torvalds/linux/commit/94a9717b updated the rwsem's owner field to contain additional flags describing the rwsem's state. Rather then update the wrappers to mask out these bits, the code no longer relies on the owner stored by the kernel. This does increase the size of a krwlock_t but it makes the implementation less sensitive to future kernel changes. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9029	2019-07-17 15:07:46 -07:00
jdike	a649768a17	Fix lockdep recursive locking false positive in dbuf_destroy lockdep reports a possible recursive lock in dbuf_destroy. It is true that dbuf_destroy is acquiring the dn_dbufs_mtx on one dnode while holding it on another dnode. However, it is impossible for these to be the same dnode because, among other things,dbuf_destroy checks MUTEX_HELD before acquiring the mutex. This fix defines a class NESTED_SINGLE == 1 and changes that lock to call mutex_enter_nested with a subclass of NESTED_SINGLE. In order to make the userspace code compile, include/sys/zfs_context.h now defines mutex_enter_nested and NESTED_SINGLE. This is the lockdep report: [ 122.950921] ============================================ [ 122.950921] WARNING: possible recursive locking detected [ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O [ 122.950921] -------------------------------------------- [ 122.950921] dbu_evict/1457 is trying to acquire lock: [ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] but task is already holding lock: [ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] other info that might help us debug this: [ 122.950921] Possible unsafe locking scenario: [ 122.950921] CPU0 [ 122.950921] ---- [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] * DEADLOCK * [ 122.950921] May be due to missing lock nesting notation [ 122.950921] 1 lock held by dbu_evict/1457: [ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] stack backtrace: [ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1 [ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009 [ 122.950921] Call Trace: [ 122.950921] dump_stack+0x91/0xeb [ 122.950921] __lock_acquire+0x2ca7/0x4f10 [ 122.950921] lock_acquire+0x153/0x330 [ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs] [ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs] [ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs] [ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs] [ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs] [ 122.950921] taskq_thread+0x979/0x1480 [spl] [ 122.950921] kthread+0x2e7/0x3e0 [ 122.950921] ret_from_fork+0x27/0x50 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jeff Dike <jdike@akamai.com> Closes #8984	2019-07-17 09:18:24 -07:00
Brian Behlendorf	095b5412b3	Fix CONFIG_X86_DEBUG_FPU build failure When CONFIG_X86_DEBUG_FPU is defined the alternatives_patched symbol is pulled in as a dependency which results in a build failure. To prevent this undefine CONFIG_X86_DEBUG_FPU to disable the WARN_ON_FPU() macro and rely on WARN_ON_ONCE debugging checks which were previously added. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9041 Closes #9049	2019-07-17 09:14:36 -07:00
Brian Behlendorf	8062b7686a	Minor style cleanup Resolve an assortment of style inconsistencies including use of white space, typos, capitalization, and line wrapping. There is no functional change. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9030	2019-07-16 17:22:31 -07:00
Serapheim Dimitropoulos	93e28d661e	Log Spacemap Project = Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent `db587941c5` Update vdev_is_spacemap_addressable() for new spacemap encoding `419ba59145` Simplify spa_sync by breaking it up to smaller functions `8dc2197b7b` Factor metaslab_load_wait() in metaslab_load() `b194fab0fb` Rename range_tree_verify to range_tree_verify_not_present `df72b8bebe` Change target size of metaslabs from 256GB to 16GB `c853f382db` zdb -L should skip leak detection altogether `21e7cf5da8` vs_alloc can underflow in L2ARC vdevs `7558997d2f` Simplify log vdev removal code `6c926f426a` Get rid of space_map_update() for ms_synced_length `425d3237ee` Introduce auxiliary metaslab histograms `928e8ad47d` Error path in metaslab_load_impl() forgets to drop ms_sync_lock `8eef997679` = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8442	2019-07-16 10:11:49 -07:00
Tomohiro Kusumi	ff9630d1a8	Disable unused pathname::pn_path* (unneeded in Linux) struct pathname is originally from Solaris VFS, and it has been used in ZoL to merely call VOP from Linux VFS interface without API change, therefore pathname::pn_path* are unused and unneeded. Technically, struct pathname is a wrapper for C string in ZoL. Saves stack a bit on lookup and unlink. (#if0'd members instead of removing since comments refer to them.) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9025	2019-07-15 13:57:56 -07:00
Brian Behlendorf	e5db313494	Linux 5.0 compat: SIMD compatibility Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS, and 5.0 and newer kernels. This is accomplished by leveraging the fact that by definition dedicated kernel threads never need to concern themselves with saving and restoring the user FPU state. Therefore, they may use the FPU as long as we can guarantee user tasks always restore their FPU state before context switching back to user space. For the 5.0 and 5.1 kernels disabling preemption and local interrupts is sufficient to allow the FPU to be used. All non-kernel threads will restore the preserved user FPU state. For 5.2 and latter kernels the user FPU state restoration will be skipped if the kernel determines the registers have not changed. Therefore, for these kernels we need to perform the additional step of saving and restoring the FPU registers. Invalidating the per-cpu global tracking the FPU state would force a restore but that functionality is private to the core x86 FPU implementation and unavailable. In practice, restricting SIMD to kernel threads is not a major restriction for ZFS. The vast majority of SIMD operations are already performed by the IO pipeline. The remaining cases are relatively infrequent and can be handled by the generic code without significant impact. The two most noteworthy cases are: 1) Decrypting the wrapping key for an encrypted dataset, i.e. `zfs load-key`. All other encryption and decryption operations will use the SIMD optimized implementations. 2) Generating the payload checksums for a `zfs send` stream. In order to avoid making any changes to the higher layers of ZFS all of the `*_get_ops()` functions were updated to take in to consideration the calling context. This allows for the fastest implementation to be used as appropriate (see kfpu_allowed()). The only other notable instance of SIMD operations being used outside a kernel thread was at module load time. This code was moved in to a taskq in order to accommodate the new kernel thread restriction. Finally, a few other modifications were made in order to further harden this code and facilitate testing. They include updating each implementations operations structure to be declared as a constant. And allowing "cycle" to be set when selecting the preferred ops in the kernel as well as user space. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8754 Closes #8793 Closes #8965	2019-07-12 09:31:20 -07:00
Paul Dagnelie	f664f1ee7f	Decrease contention on dn_struct_rwlock Currently, sequential async write workloads spend a lot of time contending on the dn_struct_rwlock. This lock is responsible for protecting the entire block tree below it; this naturally results in some serialization during heavy write workloads. This can be resolved by having per-dbuf locking, which will allow multiple writers in the same object at the same time. We introduce a new rwlock, the db_rwlock. This lock is responsible for protecting the contents of the dbuf that it is a part of; when reading a block pointer from a dbuf, you hold the lock as a reader. When writing data to a dbuf, you hold it as a writer. This allows multiple threads to write to different parts of a file at the same time. Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens matt@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-52564 External-issue: DLPX-53085 External-issue: DLPX-57384 Closes #8946	2019-07-08 13:18:50 -07:00
Brad Lewis	cb70964221	8659 static dtrace probes unavailable on non-GPL modules ZFS tracing efforts are hampered by the inability to access zfs static probes(probes using DTRACE_PROBE macros). The probes are available via tracepoints for GPL modules only. The build could be modified to generate a function for each unique DTRACE_PROBE invocation. These could be then accessed via kprobes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Brad Lewis <brad.lewis@delphix.com> Closes #8659 Closes #8663	2019-07-08 11:20:53 -07:00
loli10K	3b5fe2c351	Fix zfs "redact" misc issues * zfs redact error messages do not end with newline character * `30af21b0` inadvertently removed some ZFS_PROP comments * man/zfs: zfs redact <redaction_snapshot> is not optional Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8988	2019-07-05 16:38:17 -07:00
Mike Gerdts	341166c843	OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocks When a volume is created in a pool with raidz vdevs and volblocksize != 128k, the volume can reference more space than is reserved with the automatically calculated refreservation. There are two deficiencies in vol_volsize_to_reservation that contribute to this: 1) Skip blocks may be added to keep each allocation a multiple of parity + 1. This is the dominating factor when volblocksize is close to 2^ashift. 2) raidz deflation for 128 KB blocks is different for most other block sizes. See "The theory of raidz space accounting" comment in libzfs_dataset.c for a full explanation. Authored by: Mike Gerdts <mike.gerdts@joyent.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Kody Kantor <kody.kantor@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Mike Gerdts <mike.gerdts@joyent.com> Porting Notes: * ZTS: wait for zvols to exist before writing * ZTS: use log_must_busy with {zpool\|zfs} destroy OpenZFS-issue: https://www.illumos.org/issues/9318 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0 Closes #8973	2019-07-05 15:35:15 -07:00
Alexander Motin	fc7546777b	Avoid extra taskq_dispatch() calls by DMU DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes and os_synced_dnodes. Since the number of sublists by default is equal to number of CPUs, it will dispatch equal, potentially large, number of tasks, waking up many CPUs to handle them, even if only one or few of sublists actually have any work to do. This change adds check for empty sublists to avoid this. Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8909	2019-06-25 12:03:38 -07:00
loli10K	746d4a451e	Fix bp_embedded_type enum definition With the addition of BP_EMBEDDED_TYPE_REDACTED in `30af21b0` a couple of codepaths make wrong assumptions and could potentially result in errors. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8951	2019-06-24 18:02:17 -07:00
Matthew Ahrens	59ec30a329	Remove code for zfs remap The "zfs remap" command was disabled by `6e91a72fe3`, because it has little utility and introduced some tricky bugs. This commit removes the code for it, the associated ZFS_IOC_REMAP ioctl, and tests. Note that the ioctl and property will remain, but have no functionality. This allows older software to fail gracefully if it attempts to use these, and avoids a backwards incompatibility that would be introduced if we renumbered the later ioctls/props. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8944	2019-06-24 16:44:01 -07:00
Don Brady	186898bbb5	OpenZFS 9425 - channel programs can be interrupted Problem Statement ================= ZFS Channel program scripts currently require a timeout, so that hung or long-running scripts return a timeout error instead of causing ZFS to get wedged. This limit can currently be set up to 100 million Lua instructions. Even with a limit in place, it would be desirable to have a sys admin (support engineer) be able to cancel a script that is taking a long time. Proposed Solution ================= Make it possible to abort a channel program by sending an interrupt signal.In the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to catch the signal. Once a signal is encountered, the dsl_sync_task function can install a Lua hook that will get called before the Lua interpreter executes a new line of code. The dsl_sync_task can resume with a standard txg_wait_sync call and wait for the txg to complete. Meanwhile, the hook will abort the script and indicate that the channel program was canceled. The kernel returns a EINTR to indicate that the channel program run was canceled. Porting notes: Added missing return value from cv_wait_sig() Authored by: Don Brady <don.brady@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/9425 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926 Closes #8904	2019-06-22 16:51:46 -07:00
Paul Dagnelie	2b09628b59	Fix comments on zfs_bookmark_phys Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8945	2019-06-22 16:32:26 -07:00
Tom Caputi	da68988708	Allow unencrypted children of encrypted datasets When encryption was first added to ZFS, we made a decision to prevent users from creating unencrypted children of encrypted datasets. The idea was to prevent users from inadvertently leaving some of their data unencrypted. However, since the release of 0.8.0, some legitimate reasons have been brought up for this behavior to be allowed. This patch simply removes this limitation from all code paths that had checks for it and updates the tests accordingly. Reviewed-by: Jason King <jason.king@joyent.com> Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8737 Closes #8870	2019-06-20 12:29:51 -07:00
Matthew Ahrens	050d720c43	Remove dedupditto functionality If dedup is in use, the `dedupditto` property can be set, causing ZFS to keep an extra copy of data that is referenced many times (>100x). The idea was that this data is more important than other data and thus we want to be really sure that it is not lost if the disk experiences a small amount of random corruption. ZFS (and system administrators) rely on the pool-level redundancy to protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin doesn't have control over what data will be offered extra redundancy by dedupditto, this extra redundancy is not very useful. The bulk of the data is still vulnerable to loss based on the pool-level redundancy. For example, if particle strikes corrupt 0.1% of blocks, you will either be saved by mirror/raidz, or you will be sad. This is true even if dedupditto saved another 0.01% of blocks from being corrupted. Therefore, the dedupditto functionality is rarely enabled (i.e. the property is rarely set), and it fulfills its promise of increased redundancy even more rarely. Additionally, this feature does not work as advertised (on existing releases), because scrub/resilver did not repair the extra (dedupditto) copy (see https://github.com/zfsonlinux/zfs/pull/8270). In summary, this seldom-used feature doesn't work, and even if it did it wouldn't provide useful data protection. It has a non-trivial maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270). We should remove the dedupditto functionality. For backwards compatibility with the existing CLI, "zpool set dedupditto" will still "succeed" (exit code zero), but won't have any effect. For backwards compatibility with existing pools that had dedupditto enabled at some point, the code will still be able to understand dedupditto blocks and free them when appropriate. However, ZFS won't write any new dedupditto blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Issue #8270 Closes #8310	2019-06-19 14:54:02 -07:00
Paul Dagnelie	30af21b025	Implement Redacted Send/Receive Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Chris Williamson <chris.williamson@delphix.com> Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #7958	2019-06-19 09:48:12 -07:00
Matthew Ahrens	7218b29e4b	lz4_decompress_abd declared but not defined `lz4_decompress_abd` is declared in zio_compress.h but it is not defined anywhere. The declaration should be removed. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-47477 Closes #8894	2019-06-13 13:14:34 -07:00
Matthew Ahrens	53dce5acc6	panic in removal_remap test on 4K devices If the zfs_remove_max_segment tunable is changed to be not a multiple of the sector size, then the device removal code will malfunction and try to create mappings that are smaller than one sector, leading to a panic. On debug bits this assertion will fail in spa_vdev_copy_segment(): ASSERT3U(DVA_GET_ASIZE(&dst), ==, size); On nondebug, the system panics with a stack like: metaslab_free_concrete() metaslab_free_impl() metaslab_free_impl_cb() vdev_indirect_remap() free_from_removing_vdev() metaslab_free_impl() metaslab_free_dva() metaslab_free() Fortunately, the default for zfs_remove_max_segment is 1MB, so this can't occur by default. We hit it during this test because removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on 4KB-sector disks, we hit the bug. This change makes the zfs_remove_max_segment tunable more robust, automatically rounding it up to a multiple of the sector size. We also turn some key assertions into VERIFY's so that similar bugs would be caught before they are encoded on disk (and thus avoid a panic-reboot-loop). Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-61342 Closes #8893	2019-06-13 13:12:39 -07:00
Tulsi Jain	9c7da9a95a	Restrict filesystem creation if name referred either '.' or '..' This change restricts filesystem creation if the given name contains either '.' or '..' Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: TulsiJain <tulsi.jain@delphix.com> Closes #8842 Closes #8564	2019-06-13 08:56:15 -07:00
Matthew Ahrens	d9b4bf0665	fat zap should prefetch when iterating When iterating over a ZAP object, we're almost always certain to iterate over the entire object. If there are multiple leaf blocks, we can realize a performance win by issuing reads for all the leaf blocks in parallel when the iteration begins. For example, if we have 10,000 snapshots, "zfs destroy -nv pool/fs@1%9999" can take 30 minutes when the cache is cold. This change provides a >3x performance improvement, by issuing the reads for all ~64 blocks of each ZAP object in parallel. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-58347 Closes #8862	2019-06-12 13:13:09 -07:00
Matthew Ahrens	5662fd5794	single-chunk scatter ABDs can be treated as linear Scatter ABD's are allocated from a number of pages. In contrast to linear ABD's, these pages are disjoint in the kernel's virtual address space, so they can't be accessed as a contiguous buffer. Therefore routines that need a linear buffer (e.g. abd_borrow_buf() and friends) must allocate a separate linear buffer (with zio_buf_alloc()), and copy the contents of the pages to/from the linear buffer. This can have a measurable performance overhead on some workloads. https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e ("abd_alloc should use scatter for >1K allocations") increased the use of scatter ABD's, specifically switching 1.5K through 4K (inclusive) buffers from linear to scatter. For workloads that access blocks whose compressed sizes are in this range, that commit introduced an additional copy into the read code path. For example, the sequential_reads_arc_cached tests in the test suite were reduced by around 5% (this is doing reads of 8K-logical blocks, compressed to 3K, which are cached in the ARC). This commit treats single-chunk scattered buffers as linear buffers, because they are contiguous in the kernel's virtual address space. All single-page (4K) ABD's can be represented this way. Some multi-page ABD's can also be represented this way, if we were able to allocate a single "chunk" (higher-order "page" which represents a power-of-2 series of physically-contiguous pages). This is often the case for 2-page (8K) ABD's. Representing a single-entry scatter ABD as a linear ABD has the performance advantage of avoiding the copy (and allocation) in abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of around 5% has been observed for ARC-cached reads (of small blocks which can take advantage of this), fixing the regression introduced by `87c25d567`. Note that this optimization is only possible because all physical memory is always mapped into the kernel's address space. This is not the case for HIGHMEM pages, so the optimization can not be made on 32-bit systems. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8580	2019-06-11 09:02:31 -07:00
Matthew Ahrens	b8738257c2	make zil max block size tunable We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8865	2019-06-10 11:48:42 -07:00
Paul Dagnelie	893a6d62c1	Allow metaslab to be unloaded even when not freed from On large systems, the memory used by loaded metaslabs can become a concern. While range trees are a fairly efficient data structure, on heavily fragmented pools they can still consume a significant amount of memory. This problem is amplified when we fail to unload metaslabs that we aren't using. Currently, we only unload a metaslab during metaslab_sync_done; in order for that function to be called on a given metaslab in a given txg, we have to have dirtied that metaslab in that txg. If the dirtying was the result of an allocation, we wouldn't be unloading it (since it wouldn't be 8 txgs since it was selected), so in effect we only unload a metaslab during txgs where it's being freed from. We move the unload logic from sync_done to a new function, and call that function on all metaslabs in a given vdev during vdev_sync_done(). Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8837	2019-06-06 19:10:43 -07:00
Tomohiro Kusumi	fe0c9f409a	Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod) Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading zfs.ko. The rest of initialization functions being called here after cwd set to / don't depend on cwd of the process except for spa_config_load(). spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when `rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /, so just unconditionally use the absolute path without "./", so that `vn_set_pwd("/")` as well as the entire functions can be removed. This is also what FreeBSD does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8826	2019-05-29 16:18:14 -07:00
loli10K	d28b492ab3	VERIFY3P() message is missing a space character This commit just reintroduces a [space] character inadvertently removed in `a887d653`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8786	2019-05-24 14:06:53 -07:00
Tomohiro Kusumi	8708fd888f	Linux 2.6.39 compat: Test if kstrtoul() exists kstrtoul() exists only after torvalds/linux@33ee3b2e2e in 2.6.39. Use strict_strtoul() if kstrtoul() doesn't exist. Note that strict_strtoul() has existed as an alias for kstrtoul() for a while, but removed in torvalds/linux@3db2e9cdc0. It looks like RHEL6 (2.6.32 based) has backported kstrtoul(), and this caused build CI to pass compilation test. It should fail on vanilla < 2.6.39 kernels or distro kernels without backport as reported in #8760. -- # grep "kstrtoul(" /lib/modules/2.6.32-754.12.1.el6.x86_64/build/ \ include/linux/kernel.h >/dev/null # echo $? 0 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8760 Closes #8761	2019-05-24 12:26:18 -07:00
Rafael Kitover	8b8b44d06f	kernel timer API rework In `config/kernel-timer.m4` refactor slightly to check more generally for the new `timer_setup()` APIs, but also check the callback signature because some kernels (notably 4.14) have the new `timer_setup()` API but use the old callback signature. Also add a check for a `flags` member in `struct timer_list`, which was added in 4.1-rc8. Add compatibility shims to `include/spl/sys/timer.h` to allow using the new timer APIs with the only two caveats being that the callback argument type must be declared as `spl_timer_list_t` and an explicit assignment is required to get the timer variable for the `timer_of()` macro. So the callback would look like this: ```c __cv_wakeup(spl_timer_list_t t) { struct timer_list tmr = (struct timer_list )t; struct thing parent = from_timer(parent, tmr, parent_timer_field); ... / do stuff with parent */ ``` Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the new timer APIs instead of conditional code. Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rafael Kitover <rkitover@gmail.com> Closes #8647	2019-05-23 14:40:28 -07:00
Brian Behlendorf	bff2361aeb	Linux 5.2 compat: rw_tryupgrade() Commit torvalds/linux@46ad0840b has removed the architecture specific rwsem source and headers leaving only the generic version. As part of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS macros were moved to the private kernel/locking/rwsem.h header. This results in a build failure because these macros were required to implement the rw_tryupgrade() compatibility function. In practice, this isn't a major problem because there are only a few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade should be written to retry using rw_enter(RW_WRITER). After auditing all of the callers only dmu_zfetch() was determined not to perform a retry. It has been updated in this commit to resolve this issue. That said, the rw_tryupgrade() functionality should be considered for possible removal in a future release due to the difficultly in supporting the interface. Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8730	2019-05-23 13:46:33 -07:00
Olaf Faaland	ca95f70dff	zpool import progress kstat When an import requires a long MMP activity check, or when the user requests pool recovery, the import make take a long time. The user may not know why, or be able to tell whether the import is progressing or is hung. Add a kstat which lists all imports currently being processed by the kernel (currently only one at a time is possible, but the kstat allows for more than one). The kstat is /proc/spl/kstat/zfs/import_progress. The kstat contents are as follows: pool_guid load_state multihost_secs max_txg pool_name 16667015954387398 3 15 0 tank3 load_state: the value of spa_load_state multihost_secs: seconds until the end of the multihost activity check; if over, or none required, this is 0 max_txg: current spa_load_max_txg, if rewind is occurring This could be used by outside tools, such as a pacemaker resource agent, to report import progress, or as a part of manual troubleshooting. The zpool import subcommand could also be modified to report this information. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #8696	2019-05-09 10:08:05 -07:00
Brian Behlendorf	caf9dd209f	Fix send/recv lost spill block When receiving a DRR_OBJECT record the receive_object() function needs to determine how to handle a spill block associated with the object. It may need to be removed or kept depending on how the object was modified at the source. This determination is currently accomplished using a heuristic which takes in to account the DRR_OBJECT record and the existing object properties. This is a problem because there isn't quite enough information available to do the right thing under all circumstances. For example, when only the block size changes the spill block is removed when it should be kept. What's needed to resolve this is an additional flag in the DRR_OBJECT which indicates if the object being received references a spill block. The DRR_OBJECT_SPILL flag was added for this purpose. When set then the object references a spill block and it must be kept. Either it is update to date, or it will be replaced by a subsequent DRR_SPILL record. Conversely, if the object being received doesn't reference a spill block then any existing spill block should always be removed. Since previous versions of ZFS do not understand this new flag additional DRR_SPILL records will be inserted in to the stream. This has the advantage of being fully backward compatible. Existing ZFS systems receiving this stream will recreate the spill block if it was incorrectly removed. Updated ZFS versions will correctly ignore the additional spill blocks which can be identified by checking for the DRR_SPILL_UNMODIFIED flag. The small downside to this approach is that is may increase the size of the stream and of the received snapshot on previous versions of ZFS. Additionally, when receiving streams generated by previous unpatched versions of ZFS spill blocks may still be lost. OpenZFS-issue: https://www.illumos.org/issues/9952 FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277 Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8668	2019-05-07 15:18:44 -07:00
Tomohiro Kusumi	9c53e51616	Fix `zfs set atime\|relatime=off\|on` behavior on inherited datasets `zfs set atime\|relatime=off\|on` doesn't disable or enable the property on read for datasets whose property was inherited from parent, until a dataset is once unmounted and mounted again. (The properties start to work properly if a dataset is once unmounted and mounted again. The difference comes from regular mount process, e.g. via zpool import, uses mount options based on properties read from ondisk layout for each dataset, whereas `zfs set atime\|relatime=off\|on` just remounts a specified dataset.) -- # zpool create p1 <device> # zfs create p1/f1 # zfs set atime=off p1 # echo test > /p1/f1/test # sync # zfs list NAME USED AVAIL REFER MOUNTPOINT p1 176K 18.9G 25.5K /p1 p1/f1 26K 18.9G 26K /p1/f1 # zfs get atime NAME PROPERTY VALUE SOURCE p1 atime off local p1/f1 atime off inherited from p1 # stat /p1/f1/test \| grep Access \| tail -1 Access: 2019-04-26 23:32:33.741205192 +0900 # cat /p1/f1/test test # stat /p1/f1/test \| grep Access \| tail -1 Access: 2019-04-26 23:32:50.173231861 +0900 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2) -- The problem is that zfsvfs::z_atime which was probably intended to keep incore atime state just gets updated by a callback function of "atime" property change, atime_changed_cb(), and never used for anything else. Since now that all file read and atime update use a common function zpl_iter_read_common() -> file_accessed(), and whether to update atime via ->dirty_inode() is determined by atime_needs_update(), atime_needs_update() needs to return false once atime is turned off. It currently continues to return true on `zfs set atime=off`. Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super block depending on a new atime value, so that atime_needs_update() works as expected after property change. The same problem applies to "relatime" except that a self contained relatime test is needed. This is because relatime_need_update() is based on a mount option flag MNT_RELATIME, which doesn't exist in datasets with inherited "relatime" property via `zfs set relatime=...`, hence it needs its own relatime test zfs_relatime_need_update(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8674 Closes #8675	2019-05-07 10:06:30 -07:00
Tomohiro Kusumi	75346937de	Linux 5.1 compat: Drop ULLONG_MAX and LLONG_MAX definitions Linux kernel commit 54d50897d544c874562253e2a8f70dfcad22afe8 "linux/kernel.h: split _MAX and _MIN macros into <linux/limits.h>" which first appeared in 5.1 has moved several macros from <linux/kernel.h> to <linux/limits.h>. This broke compilation due to header inclusion order against the local header include/spl/sys/types.h which also defines ULLONG_MAX and LLONG_MAX if undefined. It looks like local ULLONG_MAX and LLONG_MAX were never needed (or after spl integration ?) as <linux/kernel.h> has had the same definitions since an upstream commit 111ebb6e6f7bd7de6d722c5848e95621f43700d9 in 2.6.18, so drop them. -- linux/include/linux/limits.h:17: error: "LLONG_MAX" redefined [-Werror] #define LLONG_MAX ((long long)(~0ULL >> 1)) zfs/include/spl/sys/types.h:35: note: this is the location of the previous definition #define LLONG_MAX ((long long)(~0ULL>>1)) Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8714	2019-05-07 09:55:40 -07:00
Tomohiro Kusumi	de3e0b914b	Linux 5.0 compat: Use totalhigh_pages() Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839 "mm: convert totalram_pages and totalhigh_pages variables to atomic" replaced `totalhigh_pages` with an inline function `totalhigh_pages()`. This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages` on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8677 Closes #8701	2019-05-04 16:40:48 -07:00
Tom Caputi	fa24166074	Add feature check for 'zpool resilver' command The 'zpool resilver' command requires that the resilver_defer feature is active on the pool. Unfortunately, the check for this was left out of the original patch. This commit simply corrects this so that the command properly returns an error in this case. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8700	2019-05-02 16:42:31 -07:00
Matthew Ahrens	6bdefad311	Remove incorrect (and inappropriate) comment in dprintf_dnode This comment seems to misunderstand the ## preprocessor token, which does token concatenation. It is not needed here, since we are concatenating string literals, which is performed by putting the literals next to each other. Additionally, the comment uses offensive language. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8698 Closes #8699	2019-05-01 17:32:54 -07:00
TerraTech	50478c6dad	Add option [-V\|--version] to emit version string Add the 'zfs version' and 'zpool version' subcommands to display the version of the user space utilities and loaded zfs kernel module. For example: $ zfs version zfs-0.8.0-rc3_169_g67e0366b88 zfs-kmod-0.8.0-rc3_169_g67e0366b88 The '-V' and '--version' aliases were added to support the common convention of using 'zfs --version` to obtain the version information. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: TerraTech <1118433+TerraTech@users.noreply.github.com> Closes #2501 Closes #8567	2019-04-16 12:24:06 -07:00
Richard Laager	8dda07b33c	Reference zfeature.c in a SPA_VERSION comment Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:02:19 -07:00
Richard Laager	7698c4eca9	Remove zfs.h comments about GRUB Nobody is going to be bumping SPA_VERSION again, as OpenZFS has moved on to feature flags. Also, there is no requirement to keep GRUB up-to-date, nor has that been happening. The ZPL_VERSION could be bumped, but that would likely be handled in a similar way, by adding filesystem feature flags. In any event, we do not need this comment, and we certainly don't need a reference to the GRUB 0.97 source code in a Solaris tree. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:02:14 -07:00
Tomohiro Kusumi	96e51d2773	Sync reserved Illumos ioctl comment with actual number It's 81 now. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8598	2019-04-14 11:12:07 -07:00
Tomohiro Kusumi	9a65234c8b	Unbreak build on Linux kernel < 3.10 d12614521a("Fixes for procfs files backed by linked lists") uses PDE_DATA(), but since PDE_DATA() (public interface which replaced old public interface PDE()) first appeared in upstream kernel 3.10, it lacks visible local definition for kernel < 3.10. Move the local PDE_DATA() definition to a ZoL header, to unbreak build on kernel < 3.10. -- module/spl/spl-procfs-list.c: In function 'procfs_list_open': module/spl/spl-procfs-list.c:166: error: implicit declaration of function 'PDE_DATA' module/spl/spl-procfs-list.c:166: warning: assignment makes pointer from integer without a cast Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8599	2019-04-08 14:59:24 -07:00
Sara Hartse	a887d653b3	Restrict kstats and print real pointers There are several places where we use zfs_dbgmsg and %p to print pointers. In the Linux kernel, these values obfuscated to prevent information leaks which means the pointers aren't very useful for debugging crash dumps. We decided to restrict the permissions of dbgmsg (and some other kstats while we were at it) and print pointers with %px in zfs_dbgmsg as well as spl_dumpstack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: sara hartse <sara.hartse@delphix.com> Closes #8467 Closes #8476	2019-04-04 18:57:06 -07:00
Brian Behlendorf	1b939560be	Add TRIM support UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com> Contributions-by: Tim Chase <tim@chase2k.com> Contributions-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8419 Closes #598	2019-03-29 09:13:20 -07:00
George Wilson	2efea7c82c	ZFS Reads may result in unneccesary calls to zil_commit ZFS supports O_RSYNC for read operations and when specified will ensure the same level of data integrity that O_DSYNC and O_SYNC provides for writes. O_RSYNC by itself has no effect so it must be combined with either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect and causes unnecessary calls to zil_commit. Only platforms which support O_RSYNC should implement the zil_commit functionality in the read code path. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #8523	2019-03-22 13:09:11 -07:00
Olaf Faaland	060f0226e6	MMP interval and fail_intervals in uberblock When Multihost is enabled, and a pool is imported, uberblock writes include ub_mmp_delay to allow an importing node to calculate the duration of an activity test. This value, is not enough information. If zfs_multihost_fail_intervals > 0 on the node with the pool imported, the safe minimum duration of the activity test is well defined, but does not depend on ub_mmp_delay: zfs_multihost_fail_intervals * zfs_multihost_interval and if zfs_multihost_fail_intervals == 0 on that node, there is no such well defined safe duration, but the importing host cannot tell whether mmp_delay is high due to I/O delays, or due to a very large zfs_multihost_interval setting on the host which last imported the pool. As a result, it may use a far longer period for the activity test than is necessary. This patch renames ub_mmp_sequence to ub_mmp_config and uses it to record the zfs_multihost_interval and zfs_multihost_fail_intervals values, as well as the mmp sequence. This allows a shorter activity test duration to be calculated by the importing host in most situations. These values are also added to the multihost_history kstat records. It calculates the activity test duration differently depending on whether the new fields are present or not; for importing pools with only ub_mmp_delay, it uses (zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals Which results in an activity test duration less sensitive to the leaf count. In addition, it makes a few other improvements: * It updates the "sequence" part of ub_mmp_config when MMP writes in between syncs occur. This allows an importing host to detect MMP on the remote host sooner, when the pool is idle, as it is not limited to the granularity of ub_timestamp (1 second). * It issues writes immediately when zfs_multihost_interval is changed so remote hosts see the updated value as soon as possible. * It fixes a bug where setting zfs_multihost_fail_intervals = 1 results in immediate pool suspension. * Update tests to verify activity check duration is based on recorded tunable values, not tunable values on importing host. * Update tests to verify the expected number of uberblocks have valid MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number), that sequence number is incrementing, and that uberblock values match tunable settings. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7842	2019-03-21 12:47:57 -07:00
Tom Caputi	ab7615d92c	Multiple DVA Scrubbing Fix Currently, there is an issue in the sequential scrub code which prevents self healing from working in some cases. The scrub code will split up all DVA copies of a bp and issue each of them separately. The problem is that, since each of the DVAs is no longer associated with the others, the self healing code doesn't have the opportunity to repair problems that show up in one of the DVAs with the data from the others. This patch fixes this issue by ensuring that all IOs issued by the sequential scrub code include all DVAs. Initially, only the first DVA of each is attempted. If an issue arises, the IO is retried with all available copies, giving the self healing code a chance to correct the issue. To test this change, this patch also adds the ability for zinject to specify individual DVAs to inject read errors into. We then add a new test case that utilizes this functionality to ensure scrubs and self-healing reads can handle and transparently fix issues with individual copies of blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8453	2019-03-15 14:14:31 -07:00
Tom Caputi	f00ab3f22c	Detect and prevent mixed raw and non-raw sends Currently, there is an issue in the raw receive code where raw receives are allowed to happen on top of previously non-raw received datasets. This is a problem because the source-side dataset doesn't know about how the blocks on the destination were encrypted. As a result, any MAC in the objset's checksum-of-MACs tree that is a parent of both blocks encrypted on the source and blocks encrypted by the destination will be incorrect. This will result in authentication errors when we decrypt the dataset. This patch fixes this issue by adding a new check to the raw receive code. The code now maintains an "IVset guid", which acts as an identifier for the set of IVs used to encrypt a given snapshot. When a snapshot is raw received, the destination snapshot will take this value from the DRR_BEGIN payload. Non-raw receives and normal "zfs snap" operations will cause ZFS to generate a new IVset guid. When a raw incremental stream is received, ZFS will check that the "from" IVset guid in the stream matches that of the "from" destination snapshot. If they do not match, the code will error out the receive, preventing the problem. This patch requires an on-disk format change to add the IVset guids to snapshots and bookmarks. As a result, this patch has errata handling and a tunable to help affected users resolve the issue with as little interruption as possible. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8308	2019-03-13 11:00:43 -07:00
Tom Caputi	579ce7c5ae	Add bookmark v2 on-disk feature This patch adds the bookmark v2 feature to the on-disk format. This feature will be needed for the upcoming redacted sends and for an upcoming fix that for raw receives. The feature is not currently used by any code and thus this change is a no-op, aside from the fact that the user can now enable the feature. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #8308	2019-03-13 10:58:39 -07:00
Tom Caputi	369aa501d1	Fix handling of maxblkid for raw sends Currently, the receive code can create an unreadable dataset from a correct raw send stream. This is because it is currently impossible to set maxblkid to a lower value without freeing the associated object. This means truncating files on the send side to a non-0 size could result in corruption. This patch solves this issue by adding a new 'force' flag to dnode_new_blkid() which will allow the raw receive code to force the DMU to accept the provided maxblkid even if it is a lower value than the existing one. For testing purposes the send_encrypted_files.ksh test has been extended to include a variety of truncated files and multiple snapshots. It also now leverages the xattrtest command to help ensure raw receives correctly handle xattrs. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8168 Closes #8487	2019-03-13 10:52:01 -07:00
Olaf Faaland	db2af93d72	Increase default zfs_multihost_fail_intervals and import_intervals By default, when multihost is enabled for a pool, the pool is suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms pass without a successful MMP write. This is the recommended configuration. The default value for zfs_multihost_fail_intervals has been 5, and the default value for zfs_multihost_interval has been 1000, so pool suspension occurred at 5 seconds. There have been multiple cases where a single misbehaving device in a pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This in turn caused MMP to suspend the pool. In the cases observed, the rest of the devices were healthy and the pool was otherwise correctly performing I/O. The reset was handled correctly by ZFS, and by suspending the pool MMP made replacing the device more difficult as well as forcing the host to be rebooted. Increase the default value of zfs_multihost_fail_intervals to 10, so that MMP tolerates up to 10 seconds of failed MMP writes before suspending the pool. Increase the default value of zfs_multihost_import_intervals to 20, to maintain the 2:1 safety factor. This results in a force import taking approximately 20 seconds when MMP is enabled, with default values. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7709 Closes #8495	2019-03-13 09:50:48 -07:00
Alek P	4c0883fb4a	Avoid retrieving unused snapshot props This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8077	2019-03-12 13:13:22 -07:00
Olaf Faaland	3d31aad83e	MMP writes rotate over leaves Instead of choosing a leaf vdev quasi-randomly, by starting at the root vdev and randomly choosing children, rotate over leaves to issue MMP writes. This fixes an issue in a pool whose top-level vdevs have different numbers of leaves. The issue is that the frequency at which individual leaves are chosen for MMP writes is based not on the total number of leaves but based on how many siblings the leaves have. For example, in a pool like this: root-vdev +------+---------------+ vdev1 vdev2 \| \| \| +------+-----+-----+----+ disk1 disk2 disk3 disk4 disk5 disk6 vdev1 and vdev2 will each be chosen 50% of the time. Every time vdev1 is chosen, disk1 will be chosen. However, every time vdev2 is chosen, disk2 is chosen 20% of the time. As a result, disk1 will be sent 5x as many MMP writes as disk2. This may create wear issues in the case of SSDs. It also reduces the effectiveness of MMP as it depends on the writes being evenly distributed for the case where some devices fail or are partitioned. The new code maintains a list of leaf vdevs in the pool. MMP records the last leaf used for an MMP write in mmp->mmp_last_leaf. To choose the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list, continuing from the head if the tail is reached. It stops when a suitable leaf is found or all leaves have been examined. Added a test to verify MMP write distribution is even. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7953	2019-03-12 10:37:06 -07:00
Lorenz Brun	bf90948daf	Reorder ZFS ioctls to fix cross-version compatibility Reorder ZFS ioctls to fix cross-version compatibility. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Lorenz Brun <lorenz@dolansoft.org> Closes #8484	2019-03-09 13:39:31 -08:00
Tony Hutter	becdcec7b9	kernel_fpu fixes This patch fixes a few issues when detecting which kernel_fpu functions are available. - Use kernel_fpu_begin() if it's exported on newer kernels. - Use ZFS_LINUX_TRY_COMPILE_SYMBOL() to choose the right kernel_fpu function when using --enable-linux-builtin. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8259 Closes #8363	2019-03-06 16:03:03 -08:00
Paul Zuchowski	a73e8fdb93	Stack overflow in recursive bpobj_iterate_impl The function bpobj_iterate_impl overflows the stack when bpobjs are deeply nested. Rewrite the function to eliminate the recursion. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7674 Closes #7675 Closes #7908	2019-03-06 09:50:55 -08:00
lidongyang	8d9e51c084	Fix dnode_hold_impl() soft lockup Soft lockups could happen when multiple threads trying to get zrl on the same dnode handle in order to allocate and initialize the dnode marked as DN_SLOT_ALLOCATED. Don't loop from beginning when we can't get zrl, otherwise we would increase the zrl refcount and nobody can actually lock it. Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyangli@ddn.com> Closes #8433	2019-02-22 09:48:37 -08:00
Serapheim Dimitropoulos	928e8ad47d	Introduce auxiliary metaslab histograms This patch introduces 3 new histograms per metaslab. These histograms track segments that have made it to the metaslab's space map histogram (and are part of the spacemap) but have not yet reached the ms_allocatable tree on loaded metaslab's because these metaslab's are currently syncing and haven't gone through metaslab_sync_done() yet. The histograms help when we decide whether to load an unloaded metaslab in-order to allocate from it. When calculating the weight of an unloaded metaslab traditionally, we look at the highest bucket of its spacemap's histogram. The problem is that we are not guaranteed to be able to allocated that segment when we load the metaslab because it may still be at the freeing, freed, or defer trees. The new histograms are used when we try to calculate an unloaded metaslab's weight to deal with this issue by removing segments that have would not be in the allocatable tree at runtime. Note, that this method of dealing with this is not completely accurate as adjacent segments are not always consolidated in the space map histogram of a metaslab. In addition and to make things deterministic, we always reset the weight of unloaded metaslabs based on their space map weight (instead of doing that on a need basis). Thus, every time a metaslab is loaded and its weight is reset again (from the weight based on its space map to the one based on its allocatable range tree) we expect (and assert) that this change in weight can only get better if it doesn't stay the same. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8358	2019-02-20 09:59:56 -08:00
Paul Zuchowski	9c5e88b1de	zfs should optionally send holds Add -h switch to zfs send command to send dataset holds. If holds are present in the stream, zfs receive will create them on the target dataset, unless the zfs receive -h option is used to skip receive of holds. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7513	2019-02-15 12:41:38 -08:00
Tony Hutter	e73ab1b38c	Linux 4.20 compat: Fix VERIFY(RW_READ_HELD(&hash->mh_contents)) The 4.20 kernel changed the meaning of the rw_semaphore.owner bits, causing an assertion when loading the module under the 4.20 kernel. This patch fixes the issue. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8360 Closes #8389	2019-02-15 12:37:20 -08:00
Alek P	dcec0a12c8	port async unlinked drain from illumos-nexenta This patch is an async implementation of the existing sync zfs_unlinked_drain() function. This function is called at mount time and is responsible for freeing znodes that we didn't get to freeing before. We don't have to hold mounting of the dataset until the unlinked list is fully drained as is done now. Since we can process the unlinked set asynchronously this results in a better user experience when mounting a dataset with entries in the unlinked set. Reviewed by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8142	2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos	425d3237ee	Get rid of space_map_update() for ms_synced_length Initially, metaslabs and space maps used to be the same thing in ZFS. Later, we started differentiating them by referring to the space map as the on-disk state of the metaslab, making the metaslab a higher-level concept that is metadata that deals with space accounting. Today we've managed to split that code furthermore, with the space map being its own on-disk data structure used in areas of ZFS besides metaslabs (e.g. the vdev-wide space maps used for zpool checkpoint or vdev removal features). This patch refactors the space map code to further split the space map code from the metaslab code. It does so by getting rid of the idea that the space map can have a different in-core and on-disk length (sm_length vs smp_length) which is something that is only used for the metaslab code, and other consumers of space maps just have to deal with. Instead, this patch introduces changes that move the old in-core length of the metaslab's space map to the metaslab structure itself (see ms_synced_length field) while making the space map code only care about the actual space map's length on-disk. The result of this is that space map consumers no longer have to deal with syncing two different lengths for the same structure (e.g. space_map_update() goes away) while metaslab specific behavior stays within the metaslab code. Specifically, the ms_synced_length field keeps track of the amount of data metaslab_load() can read from the metaslab's space map while working concurrently with metaslab_sync() that may be appending to that same space map. As a side note, the patch also adds a few comments around the metaslab code documenting some assumptions and expected behavior. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8328	2019-02-12 10:38:11 -08:00
loli10K	d8d418ff0c	ZVOLs should not be allowed to have children zfs create, receive and rename can bypass this hierarchy rule. Update both userland and kernel module to prevent this issue and use pyzfs unit tests to exercise the ioctls directly. Note: this commit slightly changes zfs_ioc_create() ABI. This allow to differentiate a generic error (EINVAL) from the specific case where we tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com>	2019-02-08 15:44:15 -08:00
Tony Hutter	0c593296e9	Linux 5.0 compat: Disable vector instructions on 5.0+ kernels The 5.0 kernel no longer exports the functions we need to do vector (SSE/SSE2/SSE3/AVX...) instructions. Disable vector-based checksum algorithms when building against those kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8259	2019-01-28 10:11:45 -08:00
Tony Hutter	031cea17a3	Linux 5.0 compat: Use totalram_pages() totalram_pages() was converted to an atomic variable in 5.0: https://patchwork.kernel.org/patch/10652795/ Its value should now be read though the totalram_pages() helper function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8263	2019-01-28 10:11:14 -08:00
Tony Hutter	77e50c3070	Linux 5.0 compat: access_ok() drops 'type' parameter access_ok no longer needs a 'type' parameter in the 5.0 kernel. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8261	2019-01-28 10:11:10 -08:00
Tony Hutter	5cb46f6a66	Linux 4.18 compat: Use ktime_get_coarse_real_ts64() Newer kernels remove current_kernel_time64(). Use ktime_get_coarse_real_ts64() in its place. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #8258	2019-01-28 10:11:03 -08:00
Serapheim Dimitropoulos	df72b8bebe	Rename range_tree_verify to range_tree_verify_not_present The range_tree_verify function looks for a segment in a range tree and panics if the segment is present on the tree. This patch gives the function a more descriptive name. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8327	2019-01-25 09:51:24 -08:00
Serapheim Dimitropoulos	b194fab0fb	Factor metaslab_load_wait() in metaslab_load() Most callers that need to operate on a loaded metaslab, always call metaslab_load_wait() before loading the metaslab just in case someone else is already doing the work. Factoring metaslab_load_wait() within metaslab_load() makes the later more robust, as callers won't have to do the load-wait check explicitly every time they need to load a metaslab. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8290	2019-01-18 11:10:32 -08:00
Serapheim Dimitropoulos	1a759200e5	Document guidelines for usage of zfs_dbgmsg Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8299	2019-01-18 10:16:56 -08:00
Tom Caputi	305781da4b	Fix error handling incallers of dbuf_hold_level() Currently, the functions dbuf_prefetch_indirect_done() and dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot fail. In the event of an error the former will cause a NULL pointer dereference and the later will trigger a VERIFY. This patch adds error handling to these functions and their callers where necessary. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8291	2019-01-17 15:47:08 -08:00
Serapheim Dimitropoulos	75058f3303	Remove unused vdev_t fields The following fields from the vdev_t struct are not used anywhere. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8285	2019-01-17 15:41:12 -08:00
Serapheim Dimitropoulos	61c3391acc	Serialize ZTHR operations to eliminate races Adds a new lock for serializing operations on zthrs. The commit also includes some code cleanup and refactoring. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8229	2019-01-13 10:09:46 -08:00

1 2 3 4 5 ...

1626 Commits