mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-06-02 20:34:08 +03:00

Author	SHA1	Message	Date
Matthew Ahrens	3442c2a02d	Revise ARC shrinker algorithm The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths: 1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path. 2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback. In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, it's measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages). Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts: 1. ZFS doesn't use RAM to cache data effectively. 2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC. 3. Even with the improvements made in `67c0f0dedc` ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`. To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of arc_is_overflowing() on ZFS read/write operations. With this commit: 1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page. 2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`). 3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC. 4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free` is increased. 5. Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the arc to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10600	2020-07-31 21:10:52 -07:00
Serapheim Dimitropoulos	843e9ca2e1	Introduce names for ZTHRs When debugging issues or generally analyzing the runtime of a system it would be nice to be able to tell the different ZTHRs running by name rather than having to analyze their stack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #10630	2020-07-29 09:43:33 -07:00
Matthew Macy	5678d3f593	Prefix zfs internal endian checks with _ZFS FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN LITTLE_ENDIAN on every architecture. Trying to do cross builds whilst hiding this from ZFS has proven extremely cumbersome. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10621	2020-07-28 13:02:49 -07:00
Matthew Ahrens	4fbdb10c7b	remove kmem_cache module parameter KMC_EXPIRE_AGE By default, `spl_kmem_cache_expire` is `KMC_EXPIRE_MEM`, meaning that objects will be removed from kmem cache magazines by `spl_kmem_cache_reap_now()`. There is also a module parameter to change this to `KMC_EXPIRE_AGE`, which establishes a maximum lifetime for objects to stay in the magazine. This setting has rarely, if ever, been used, and is not regularly tested. This commit removes the code for `KMC_EXPIRE_AGE`, and associated module parameters. Additionally, the unused module parameter `spl_kmem_cache_obj_per_slab_min` is removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10608	2020-07-24 09:39:26 -07:00
Matthew Ahrens	026e529cb3	Remove skc_reclaim, hdr_recl, kmem_cache shrinker The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`, whereby individual caches can register a callback to be invoked when there is memory pressure. This mechanism is used in only one place: the ARC registers the `hdr_recl()` reclaim function. This function wakes up the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and `arc_reduce_target_size()`. The `skc_reclaim` callbacks are invoked only by shrinker callbacks and `arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When called from shrinker callbacks, we are already aware of memory pressure and responding to it. Therefore there is little benefit to ever calling the `hdr_recl()` `skc_reclaim` callback. The `arc_reap_zthr` also wakes once a second, and if memory is low when allocating an ARC buffer. Therefore, additionally waking it from the shrinker calbacks has little benefit. The shrinker callbacks can be invoked very frequently, e.g. 10,000 times per second. Additionally, for invocation of the shrinker callback, skc_reclaim is invoked many times. Therefore, this mechanism consumes significant amounts of CPU time. The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which, in addition to invoking `skc_reclaim()`, does two things to attempt to free pages for use by the system: 1. Return free objects from the magazine layer to the slab layer 2. Return entirely-free slabs to the page layer (i.e. free pages) These actions apply only to caches implemented by the SPL, not those that use the underlying kernel SLAB/SLUB caches. The SPL caches are used for objects >=32KB, which are primarily linear ABD's cached in the DBUF cache. These actions (freeing objects from the magazine layer and returning entirely-free slabs) are also taken whenever a `kmem_cache_free()` call finds a full magazine. So there would typically be zero entirely-free slabs, and the number of objects in magazines is limited (typically no more than 64 objects per magazine, and there's one magazine per CPU). Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is modest. We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when memory pressure is detected. Therefore, calling `spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed. This commit removes the `skc_reclaim` mechanism, its only callback `hdr_recl()`, and the kmem_cache shrinker callback. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10576	2020-07-19 09:58:30 -07:00
Brian Atkinson	e4d3d77684	Fixing gang ABD child removal race condition On linux the list debug code has been setting off a failure when checking that the node->next->prev value is pointing back at the node. At times this check evaluates to 0xdead. When removing a child from a gang ABD we must acquire the child's abd_mtx to make sure that the same ABD is not being added to another gang ABD while it is being removed from a gang ABD. This fixes a race condition when checking if an ABDs link is already active and part of another gang ABD before adding it to a gang. Added additional debug code for the gang ABD in abd_verify() to make sure each child ABD has active links. Also check to make sure another gang ABD is not added to a gang ABD. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10511	2020-07-14 11:04:35 -07:00
Matthew Ahrens	e59a377a8f	filesystem_limit/snapshot_limit is incorrectly enforced against root The filesystem_limit and snapshot_limit properties limit the number of filesystems or snapshots that can be created below this dataset. According to the manpage, "The limit is not enforced if the user is allowed to change the limit." Two types of users are allowed to change the limit: 1. Those that have been delegated the `filesystem_limit` or `snapshot_limit` permission, e.g. with `zfs allow USER filesystem_limit DATASET`. This works properly. 2. A user with elevated system privileges (e.g. root). This does not work - the root user will incorrectly get an error when trying to create a snapshot/filesystem, if it exceeds the `_limit` property. The problem is that `priv_policy_ns()` does not work if the `cred_t` is not that of the current process. This happens when `dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a sync task's check func) to determine the permissions of the corresponding user process. This commit fixes the issue by passing the `task_struct` (typedef'ed as a `proc_t`) to syncing context, and then using `has_capability()` to determine if that process is privileged. Note that we still need to pass the `cred_t` to syncing context so that we can check if the user was delegated this permission with `zfs allow`. This problem only impacts Linux. Wrappers are added to FreeBSD but it continues to use `priv_check_cred()`, which works on arbitrary `cred_t`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8226 Closes #10545	2020-07-11 17:18:02 -07:00
Matthew Ahrens	3c42c9ed84	Clean up OS-specific ARC and kmem code OS-specific code (e.g. under `module/os/linux`) does not need to share its code structure with any other operating systems. In particular, the ARC and kmem code need not be similar to the code in illumos, because we won't be syncing this OS-specific code between operating systems. For example, if/when illumos support is added to the common repo, we would add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of this code. Therefore, we can simplify the code in the OS-specific ARC and kmem routines. These changes do not impact system behavior, they are purely code cleanup. The changes are: Arenas are not used on Linux or FreeBSD (they are always `NULL`), so `heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along with code that uses them. In `arc_available_memory()`: * `desfree` is unused, remove it * rename `freemem` to avoid conflict with pre-existing `#define` * remove checks related to arenas * use units of bytes, rather than converting from bytes to pages and then back to bytes `SPL_KMEM_CACHE_REAP` is unused, remove it. `skc_reap` is unused, remove it. The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove it. `vmem_size()` and associated type and macros are unused, remove them. In `arc_memory_throttle()`, use a less confusing variable name to store the result of `arc_free_memory()`. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10499	2020-06-29 09:01:07 -07:00
Arvind Sankar	4d8e68c42f	Avoid installing kernel headers on FreeBSD The kernel headers are installed for DKMS on linux, so don't install them unless we're building on linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10506	2020-06-27 17:40:14 -07:00
Brian Behlendorf	67b1362f04	Style fixes * Fix cstyle issue in shrinker.h which exceeded 80 columns. * Silence shellcheck warning in zpool.d/smart script. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2020-06-27 17:38:55 -07:00
Matthew Ahrens	270ece24b6	Revise SPL wrapper for shrinker callbacks The SPL provides a wrapper for the kernel's shrinker callbacks, which enables the ZFS code to interface with multiple versions of the shrinker API's from different kernel versions. Specifically, Linux kernels 3.0 - 3.11 has a single "combined" callback, and Linux kernels 3.12 and later have two "split" callbacks. The SPL provides a wrapper function so that the ZFS code only needs to implement one version of the callbacks. Currently the SPL's wrappers are designed such that the ZFS code implements the older, "combined" callback. There are a few downsides to this approach: * The general design within ZFS is for the latest Linux kernel to be considered the "first class" API. * The newer, "split" callback API is easier to understand, because each callback has one purpose. * The current wrappers do not completely abstract out the differing API's, so ZFS code needs `#ifdef` code to handle the differing return values required for different kernel versions. This commit addresses these drawbacks by having the ZFS code provide the latest, "split" callbacks, and the SPL provides a wrapping function for the older, "combined" API. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10502	2020-06-27 10:27:02 -07:00
Serapheim Dimitropoulos	ec1fea4516	Use percpu_counter for obj_alloc counter of Linux-backed caches A previous commit enabled the tracking of object allocations in Linux-backed caches from the SPL layer for debuggability. The commit is: 9a170fc6fe54f1e852b6c39630fe5ef2bbd97c16 Unfortunately, it also introduced minor performance regressions that were highlighted by the ZFS perf test-suite. Within Delphix we found that the regression would be from -1%, all the way up to -8% for some workloads. This commit brings performance back up to par by creating a separate counter for those caches and making it a percpu in order to avoid lock-contention. The initial performance testing was done by myself, and the final round was conducted by @tonynguien who was also the one that discovered the regression and highlighted the culprit. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #10397	2020-06-26 18:06:50 -07:00
Arvind Sankar	6b99fc0620	Fixes for make dist Reduce the usage of EXTRA_DIST. If files are conditionally included in _SOURCES, _HEADERS etc, automake is smart enough to dist all files that could possibly be included, but this does not apply to EXTRA_DIST, resulting in make dist depending on the configuration. Add some files that were missing altogether in various Makefile's. The changes to disted files in this commit (excluding deleted files): +./cmd/zed/agents/README.md +./etc/init.d/README.md +./lib/libspl/os/freebsd/getexecname.c +./lib/libspl/os/freebsd/gethostid.c +./lib/libspl/os/freebsd/getmntany.c +./lib/libspl/os/freebsd/mnttab.c -./lib/libzfs/libzfs_core.pc -./lib/libzfs/libzfs.pc +./lib/libzfs/os/freebsd/libzfs_compat.c +./lib/libzfs/os/freebsd/libzfs_fsshare.c +./lib/libzfs/os/freebsd/libzfs_ioctl_compat.c +./lib/libzfs/os/freebsd/libzfs_zmount.c +./lib/libzutil/os/freebsd/zutil_compat.c +./lib/libzutil/os/freebsd/zutil_device_path_os.c +./lib/libzutil/os/freebsd/zutil_import_os.c +./module/lua/README.zfs +./module/os/linux/spl/README.md +./tests/README.md +./tests/zfs-tests/tests/functional/cli_root/zfs_clone/zfs_clone_rm_nested.ksh +./tests/zfs-tests/tests/functional/cli_root/zfs_send/zfs_send_encrypted_unloaded.ksh +./tests/zfs-tests/tests/functional/inheritance/README.config +./tests/zfs-tests/tests/functional/inheritance/README.state +./tests/zfs-tests/tests/functional/rsend/rsend_016_neg.ksh +./tests/zfs-tests/tests/perf/fio/sequential_readwrite.fio Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10501	2020-06-26 14:20:02 -07:00
Arvind Sankar	7513807320	Drop unnecessary srcdir paths There's no need to specify the srcdir explicitly in _HEADERS and EXTRA_DIST. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10493	2020-06-24 18:20:18 -07:00
Arvind Sankar	0ce2de637b	Add prototypes Add prototypes/move prototypes to header files. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:32 -07:00
Arvind Sankar	65c7cc49bf	Mark functions as static Mark functions used only in the same translation unit as static. This only includes functions that do not have a prototype in a header file either. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:20:38 -07:00
adilger	f734301d22	linux: add basic fallocate(mode=0/2) compatibility Implement semi-compatible functionality for mode=0 (preallocation) and mode=FALLOC_FL_KEEP_SIZE (preallocation beyond EOF) for ZPL. Since ZFS does COW and snapshots, preallocating blocks for a file cannot guarantee that writes to the file will not run out of space. Even if the first overwrite was guaranteed, it would not handle any later overwrite of blocks due to COW, so strict compliance is futile. Instead, make a best-effort check that at least enough free space is currently available in the pool (with a bit of margin), then create a sparse file of the requested size and continue on with life. This does not handle all cases (e.g. several fallocate() calls before writing into the files when the filesystem is nearly full), which would require a more complex mechanism to be implemented, probably based on a modified version of dmu_prealloc(), but is usable as-is. A new module option zfs_fallocate_reserve_percent is used to control the reserve margin for any single fallocate call. By default, this is 110% of the requested preallocation size, so an additional 10% of available space is reserved for overhead to allow the application a good chance of finishing the write when the fallocate() succeeds. If the heuristics of this basic fallocate implementation are not desirable, the old non-functional behavior of returning EOPNOTSUPP for calls can be restored by setting zfs_fallocate_reserve_percent=0. The parameter of zfs_statvfs() is changed to take an inode instead of a dentry, since no dentry is available in zfs_fallocate_common(). A few tests from @behlendorf cover basic fallocate functionality. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Arshad Hussain <arshad.super@gmail.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andreas Dilger <adilger@dilger.ca> Issue #326 Closes #10408	2020-06-18 11:22:11 -07:00
Matthew Macy	8056a75672	Disambiguate condvar API contract On Illumos callers of cv_timedwait and cv_timedwait_hires can't distinguish between whether or not the cv was signaled or the call timed out. Illumos handles this (for some definition of handles) by calling cv_signal in the return path if we were signaled but the return value indicates instead that we timed out. This would make sense if it were possible to query the the cv for its net signal disposition. However, this isn't possible and, in spite of the fact that there are places in the code that clearly take a different and incompatible path if a timeout value is indicated, this distinction appears to be rather subtle to most developers. This problem is further compounded by the fact that on Linux, calling cv_signal in the return path wouldn't even do the right thing unless there are other waiters. Since it is possible for the caller to independently determine how much time is remaining but it is not possible to query if the cv was in fact signaled, prioritizing signalling over timeout seems like a cleaner solution. In addition, judging from usage patterns within the code itself, it is also less error prone. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10471	2020-06-18 10:17:50 -07:00
Jorgen Lundman	883a40fff4	Add convenience wrappers for common uio usage The macOS uio struct is opaque and the API must be used, this makes the smallest changes to the code for all platforms. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10412	2020-06-14 10:09:55 -07:00
Andrea Gelmini	dd4bc569b9	Fix typos Correct various typos in the comments and tests. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #10423	2020-06-09 21:24:09 -07:00
Michael Niewöhner	32f26eaa70	Move GFP flags kernel compatibility code Move the GFP flags kernel compat code from c file to kmem header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #10424	2020-06-08 16:33:46 -07:00
Michael Niewöhner	080102a1b6	Linux 5.8 compat: __vmalloc() The `pgprot` argument has been removed from `__vmalloc` in Linux 5.8, being `PAGE_KERNEL` always now [1]. Detect this during configure and define a wrapper for older kernels. [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/mm/vmalloc.c?h=next-20200605&id=88dca4ca5a93d2c09e5bbc6a62fbfc3af83c4fca Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Co-authored-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #10422	2020-06-08 16:32:02 -07:00
наб	6059f3a1f6	Correctly handle the x32 ABI __x86_64__ && _ILP32 => don't forcibly define _LP64 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@gmail.com> Closes #10357 Closes #844	2020-05-28 10:28:20 -07:00
Paul B. Henson	a1af567bb6	OpenZFS 742 - Resurrect the ZFS "aclmode" property OpenZFS 664 - Umask masking "deny" ACL entries OpenZFS 279 - Bug in the new ACL (post-PSARC/2010/029) semantics Porting notes: * Updated zfs_acl_chmod to take 'boolean_t isdir' as first parameter rather than 'zfsvfs_t zfsvfs' zfs man pages changes mixed between zfs and new zfsprops man pages Reviewed by: Aram Hvrneanu <aram@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Robert Gordon <rbg@openrbg.com> Reviewed by: Mark.Maybee@oracle.com Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@nexenta.com> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/742 OpenZFS-issue: https://www.illumos.org/issues/664 OpenZFS-issue: https://www.illumos.org/issues/279 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c49ce110 Closes #10266	2020-04-30 11:22:45 -07:00
Matthew Ahrens	1f043c8be1	Fix zfs send progress reporting The progress of a send is supposed to be reported by `zfs send -v`, but it is not. This works by creating a new user thread (with pthread_create()) which does ZFS_IOC_SEND_PROGRESS ioctls to check how much progress has been made. This IOCTL finds the specified send (since there may be multiple concurrent sends in the system). The IOCTL also checks that the specified send was started by the current process. On Linux, different threads of the same process are represented as different `struct task_struct`s (and, confusingly, have different PID's). To check if if two threads are in the same process, we need to check if they have the same `struct task_struct:group_leader`. We used to to this correctly, but it was inadvertently changed by `30af21b025` (Redacted Send) to simply check if the current `struct task_struct` is the one that started the send. This commit changes the code back to checking if the send was started by a `struct task_struct` with the same `group_leader` as the calling thread. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Chris Wedgwood <cw@f00f.org> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10215 Closes #10216	2020-04-20 10:12:48 -07:00
Ryan Moeller	a7929f3137	Update FreeBSD tunables Remove some obsolete legacy compat, rename some misnamed, and add some missing tunables for FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10203	2020-04-15 11:14:47 -07:00
Brian Behlendorf	68dde63d13	Linux 5.7 compat: blk_alloc_queue() Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified the blk_alloc_queue() interface by updating it to take the request queue as an argument. Add a wrapper function which accepts the new arguments and internally uses the available interfaces. Other minor changes include increasing the Linux-Maximum to 5.6 now that 5.6 has been released. It was not bumped to 5.7 because this release has not yet been finalized and is still subject to change. Added local 'struct zvol_state_os *zso' variable to zvol_alloc. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10181 Closes #10187	2020-04-09 09:16:46 -07:00
Ryan Moeller	7e3df9db12	Finish refactoring for ZFS_MODULE_PARAM_CALL Linux and FreeBSD have different parameters for tunable proc handler. This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL macro. To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS. With the declarations wired up we discovered an incorrect scope prefix for spa_slop_shift, so this is now fixed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10179	2020-04-07 10:06:22 -07:00
Matthew Ahrens	b3212d2fa6	Improve performance of zio_taskq_member __zio_execute() calls zio_taskq_member() to determine if we are running in a zio interrupt taskq, in which case we may need to switch to processing this zio in a zio issue taskq. The call to zio_taskq_member() can become a performance bottleneck when we are processing a high rate of zio's. zio_taskq_member() calls taskq_member() on each of the zio interrupt taskqs, of which there are 21. This is slow because each call to taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively slow. This commit improves the performance of zio_taskq_member() by having it cache the value of tsd_get(taskq_tsd), reducing the number of those calls to 1/21th of the current behavior. In a test case running `zfs send -c >/dev/null` of a filesystem with small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of one CPU, and with this change it is reduced to 1.3%. Overall time to perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000 blocks/sec). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10070	2020-03-03 10:29:38 -08:00
Brian Behlendorf	2c3a83701d	Linux 5.6 compat: time_t As part of the Linux kernel's y2038 changes the time_t type has been fully retired. Callers are now required to use the time64_t type. Rather than move to the new type, I've removed the few remaining places where a time_t is used in the kernel code. They've been replaced with a uint64_t which is already how ZFS internally handled these values. Going forward we should work towards updating the remaining user space time_t consumers to the 64-bit interfaces. Reviewed-by: Matthew Macy <mmacy@freebsd.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10052 Closes #10064	2020-02-27 09:31:02 -08:00
Brian Behlendorf	ff5587d651	Linux 5.6 compat: ktime_get_raw_ts64() The getrawmonotonic() and getrawmonotonic64() interfaces have been fully retired. Update gethrtime() to use the replacement interface ktime_get_raw_ts64() which was introduced in the 4.18 kernel. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10052 Closes #10064	2020-02-27 09:30:45 -08:00
Dirkjan Bussink	327000ce04	Remove zfs_getattr and convoff dead code The `convoff` function is called only in one code path in `zfs_space`. Each caller of `zfs_space` is called with a `flock64_t` that has `l_whence` set to `SEEK_SET`. This means that `convoff` always results in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and `int whence` is `SEEK_SET` as well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Closes #10006	2020-02-24 15:38:22 -08:00
Ryan Moeller	5f087dda78	Enable zpool events tunables and tests on FreeBSD We have have made the necessary changes in our module code to expose zevents through both devd and the zpool events ioctl. Now the tunables can be exposed and zpool events tests can be enabled on both platforms. A few minor tweaks to the tests were needed to accommodate the way wc formats output on FreeBSD. zed remains to be ported. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10008	2020-02-18 11:22:56 -08:00
Matthew Macy	8b3547a481	Factor out some dbuf subroutines and add state change tracing Create dedicated dbuf_read_hole and dbuf_read_bonus. Additionally, add a dtrace probe to allow state change tracing. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Will Andrews <wca@FreeBSD.org> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Authored-by: Will Andrews <wca@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9923	2020-02-18 11:21:37 -08:00
Attila Fülöp	31b160f0a6	ICP: Improve AES-GCM performance Currently SIMD accelerated AES-GCM performance is limited by two factors: a. The need to disable preemption and interrupts and save the FPU state before using it and to do the reverse when done. Due to the way the code is organized (see (b) below) we have to pay this price twice for each 16 byte GCM block processed. b. Most processing is done in C, operating on single GCM blocks. The use of SIMD instructions is limited to the AES encryption of the counter block (AES-NI) and the Galois multiplication (PCLMULQDQ). This leads to the FPU not being fully utilized for crypto operations. To solve (a) we do crypto processing in larger chunks while owning the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced to make this chunk size tweakable. It defaults to 32 KiB. This step alone roughly doubles performance. (b) is tackled by porting and using the highly optimized openssl AES-GCM assembler routines, which do all the processing (CTR, AES, GMULT) in a single routine. Both steps together result in up to 32x reduction of the time spend in the en/decryption routines, leading up to approximately 12x throughput increase for large (128 KiB) blocks. Lastly, this commit changes the default encryption algorithm from AES-CCM to AES-GCM when setting the `encryption=on` property. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-By: Jason King <jason.king@joyent.com> Reviewed-By: Tom Caputi <tcaputi@datto.com> Reviewed-By: Richard Laager <rlaager@wiktel.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #9749	2020-02-10 12:59:50 -08:00
Brian Behlendorf	795699a6cc	Linux 5.6 compat: timestamp_truncate() The timestamp_truncate() function was added, it replaces the existing timespec64_trunc() function. This change renames our wrapper function to be consistent with the upstream name and updates the compatibility code for older kernels accordingly. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9956 Closes #9961	2020-02-07 11:04:32 -08:00
Brian Behlendorf	0dd7364853	Linux 5.6 compat: struct proc_ops The proc_ops structure was introduced to replace the use of of the file_operations structure when registering proc handlers. This change creates a new kstat_proc_op_t typedef for compatibility which can be used to pass around the correct structure. This change additionally adds the 'const' keyword to all of the existing proc operations structures. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9961	2020-02-07 11:03:53 -08:00
Romain Dolbeau	35b07497c6	Add AltiVec RAID-Z Implements the RAID-Z function using AltiVec SIMD. This is basically the NEON code translated to AltiVec. Note that the 'fletcher' algorithm requires 64-bits operations, and the initial implementations of AltiVec (PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to 32-bits operations, so no 'fletcher'. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu> Closes #9539	2020-01-23 11:01:24 -08:00
Matthew Macy	13a9a6f5e8	Make zfs_replay.c work on FreeBSD FreeBSD's vfs currently doesn't permit file systems to do their own locking. To avoid having to have duplicate zfs functions with and without locking add locking here. With luck these changes can be removed in the future. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9715	2019-12-13 07:54:10 -08:00
Ryan Moeller	957c7aa23c	Relocate common quota functions to shared code The quota functions are common to all implementations and can be moved to common code. As a simplification they were moved to the Linux platform code in the initial refactoring. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes #9710	2019-12-11 12:12:08 -08:00
Matthew Macy	657ce25357	Eliminate Linux specific inode usage from common code Change many of the znops routines to take a znode rather than an inode so that zfs_replay code can be largely shared and in the future the much of the znops code may be shared. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9708	2019-12-11 11:53:57 -08:00
Fabian-Gruenbichler	b119e2c6f1	SIMD: Use alloc_pages_node to force alignment fxsave and xsave require the target address to be 16-/64-byte aligned. kmalloc(_node) does not (yet) offer such fine-grained control over alignment[0,1], even though it does "the right thing" most of the time for power-of-2 sizes. unfortunately, alignment is completely off when using certain debugging or hardening features/configs, such as KASAN, slub_debug=Z or the not-yet-upstream SLAB_CANARY. Use alloc_pages_node() instead which allows us to allocate page-aligned memory. Since fpregs_state is padded to a full page anyway, and this code is only relevant for x86 which has 4k pages, this approach should not allocate any unnecessary memory but still guarantee the needed alignment. 0: https://lwn.net/Articles/787740/ 1: https://lore.kernel.org/linux-block/20190826111627.7505-1-vbabka@suse.cz/ Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9608 Closes #9674	2019-12-10 12:53:25 -08:00
Matthew Macy	1f654753ba	Remove stale ASSERTV comment Followup for #9671, remove stale comment. Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Issue #9671 Closes #9682	2019-12-06 09:33:27 -08:00
Matthew Macy	2a8ba608d3	Replace ASSERTV macro with compiler annotation Remove the ASSERTV macro and handle suppressing unused compiler warnings for variables only in ASSERTs using the __attribute__((unused)) compiler annotation. The annotation is understood by both gcc and clang. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9671	2019-12-05 12:37:00 -08:00
Matthew Macy	be627fc847	Refactor zfs_context.h to build on FreeBSD - on Linux move Linux specific headers to zfs_context_os.h - on FreeBSD move FreeBSD specific definitions to zfs_context_os.h - remove duplicate tsd_ definitions - remove unused AT_TYPE Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9668	2019-12-04 13:12:57 -08:00
Matthew Macy	74d1d74959	Move linux qsort def to platform header Moving qsort to the platform header allows each platform to provide an appropriate sorting implementation. Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9663	2019-12-03 09:49:40 -08:00
Matthew Macy	d6f67df63c	Minor diff reduction with ZoF in include/sys - move linux/ includes to platform headers - add void * io_bio to zio for tracking the underlying bio - add freebsd specific fields to abd_scatter Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9615	2019-11-27 11:11:03 -08:00
Brian Behlendorf	9e17e6f254	Remove zfs_vdev_elevator module option As described in commit `f81d5ef6` the zfs_vdev_elevator module option is being removed. Users who require this functionality should update their systems to set the disk scheduler using a udev rule. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8664 Closes #9417 Closes #9609	2019-11-27 10:35:49 -08:00
Matthew Macy	da92d5cbb3	Add zfs_file_* interface, remove vnodes Provide a common zfs_file_* interface which can be implemented on all platforms to perform normal file access from either the kernel module or the libzpool library. This allows all non-portable vnode_t usage in the common code to be replaced by the new portable zfs_file_t. The associated vnode and kobj compatibility functions, types, and macros have been removed from the SPL. Moving forward, vnodes should only be used in platform specific code when provided by the native operating system. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9556	2019-11-21 09:32:57 -08:00
Brian Behlendorf	7ae3f8dc8f	Partially revert `5a6ac4c` Reinstate the zpl_revalidate() functionality to resolve a regression where dentries for open files during a rollback are not invalidated. The unrelated functionality for automatically unmounting .zfs/snapshots was not reverted. Nor was the addition of shrink_dcache_sb() to the zfs_resume_fs() function. This issue was not immediately caught by the CI because the test case intended to catch it was included in the list of ZTS tests which may occasionally fail for unrelated reasons. Remove all of the rollback tests from this list to help identify the frequency of any spurious failures. The rollback_003_pos.ksh test case exposes a real issue with the long standing code which needs to be investigated. Regardless, it has been enable with a small workaround in the test case itself. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9587 Closes #9592	2019-11-18 13:05:56 -08:00

1 2

75 Commits