mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-23 10:54:35 +03:00

Author	SHA1	Message	Date
Matthew Macy	cd23154a8c	FreeBSD: Fix spurious failure in zvol_geom_open In zvol_geom_open on first open we need to guarantee that the namespace lock is held to avoid spurious failures in zvol_first_open. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10841	2020-08-30 14:11:33 -07:00
Matthew Macy	d436de6389	FreeBSD: add support for KSTAT_TYPE_RAW A few kstats use KSTAT_TYPE_RAW to provide a string generated on demand. Implementing these as sysctls was punted until now. Reviewed by: Toomas Soome <tsoome@me.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10836	2020-08-29 20:59:50 -07:00
Brian Behlendorf	3e29e1971b	Linux 5.9 compat: NR_SLAB_RECLAIMABLE Commit `dcdc12e` added compatibility code to treat NR_SLAB_RECLAIMABLE_B as if it were the same as NR_SLAB_RECLAIMABLE. However, the new value is in bytes while the old value was in pages which means they are not interchangeable. The only place the reclaimable slab size is used is as a component of the calculation done by arc_free_memory(). This function returns the amount of memory the ARC considers to be free or reclaimable at little cost. Rather than switch to a new interface to get this value it has been removed it from the calculation. It is normally a minor component compared to the number of inactive or free pages, and removing it aligns the behavior with the FreeBSD version of arc_free_memory(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10834	2020-08-29 20:57:45 -07:00
Toomas Soome	ec0c480c14	Remove pragma ident lines The #pragma ident is a historical relic and not needed any more, this pragma is actually unknown for common compilers and is only causing trouble. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #10810	2020-08-26 10:35:50 -07:00
youzhongyang	b900799768	Fix inability to destroy snapshot used over NFS The cache of struct svc_export and struct svc_expkey by nfsd and rpc.mountd for the snapshot holds references to the mount point. We need to flush them out before unmounting, otherwise umount would fail with EBUSY. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #6000 Closes #10783	2020-08-24 17:33:02 -07:00
Andrew	a741b386d3	Prevent zfs_acl_chmod() if aclmode restricted and ACL inherited In absence of inheriting entry for owner@, group@, or everyone@, zfs_acl_chmod() is called to set these. This can cause confusion for Samba admins who do not expect these entries to appear on newly created files and directories once they have been stripped from from the parent directory. When aclmode is set to "restricted", chmod is prevented on non-trivial ACLs. It is not a stretch to assume that in this case the administrator does not want ZFS to add the missing special entries. Add check for this aclmode, and if an inherited entry is present skip zfs_acl_chmod(). Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #10748	2020-08-22 21:49:25 -07:00
Ryan Moeller	6fe3498ca3	Import vdev ashift optimization from FreeBSD Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices. Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks: 1. Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device. 2. The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit. Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matthew Macy <mmacy@freebsd.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10619	2020-08-21 12:53:17 -07:00
Mariusz Zaborski	f2c027bd6a	FreeBSD: Add option to rewind checkpoint while importing root pool This option is used by FreeBSD boot loader. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org> Closes #10738	2020-08-19 17:19:42 -07:00
Matthew Macy	716b53d0a1	FreeBSD: Fix UNIX permissions checking Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10727	2020-08-18 09:57:07 -07:00
Ryan Moeller	009cc8e884	Make zc_nvlist_src_size limit tunable We limit the size of nvlists passed to the kernel so a user cannot make the kernel do an unreasonably large allocation. On FreeBSD this limit was 128 kiB, which turns out to be a bit too small when doing some operations involving a large number of datasets or snapshots, for example replication. Make this limit tunable, with a platform-specific auto default. Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the system limit on user wired memory, which allows it to scale depending on system configuration. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Issue #6572 Closes #10706	2020-08-18 09:33:55 -07:00
Matthew Ahrens	85ec5cbae2	Include scatter_chunk_waste in arc_size The ARC caches data in scatter ABD's, which are collections of pages, which are typically 4K. Therefore, the space used to cache each block is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is not aware of the memory used by this round-up, it only accounts for the size that it requested from the ABD subsystem. Therefore, the ARC is effectively using more memory than it is aware of, due to the `scatter_chunk_waste`. This impacts observability, e.g. `arcstat` will show that the ARC is using less memory than it effectively is. It also impacts how the ARC responds to memory pressure. As the amount of `scatter_chunk_waste` changes, it appears to the ARC as memory pressure, so it needs to resize `arc_c`. If the sector size (`1<<ashift`) is the same as the page size (or larger), there won't be any waste. If the (compressed) block size is relatively large compared to the page size, the amount of `scatter_chunk_waste` will be small, so the problematic effects are minimal. However, if using 512B sectors (`ashift=9`), and the (compressed) block size is small (e.g. `compression=on` with the default `volblocksize=8k` or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be very large. On a production system, with `arc_size` at a constant 50% of memory, `scatter_chunk_waste` has been been observed to be 10-30% of memory. This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new `waste` field to `arcstat`. As a result, the ARC's memory usage is more observable, and `arc_c` does not need to be adjusted as frequently. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10701	2020-08-17 20:04:04 -07:00
Matthew Ahrens	994de7e4b7	Remove KMC_KMEM and KMC_VMEM `KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented caches are `KMC_KVMEM`. KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we don't use kmalloc to back the SPL caches, instead we use kvmalloc (KMC_KVMEM). The flag, module parameter, /proc entries, and associated code are removed. KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to vmalloc(). The flag, /proc entries, and associated code are removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10673	2020-08-17 16:04:28 -07:00
Matthew Macy	cfdc432e64	FreeBSD: fix merge error in zfs_acl_ids_create Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10721	2020-08-17 15:28:03 -07:00
Ryan Moeller	3c3d7c8a57	FreeBSD: Create taskq threads in appropriate proc Stepping stone toward re-enabling spa_thread on FreeBSD. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10715	2020-08-17 11:01:19 -07:00
Jorgen Lundman	faa296c73c	Release onexit/events with any missed zfsdev_state Linux and FreeBSD will most likely never see this issue. On macOS when kext is unloaded, but zed is still connected, zed will be issued ENODEV. As the cdevsw is released, the kernel will not have zfsdev_release() called to release minor/onexit/events, and it "leaks". This ensures it is cleaned up before unload. Changed the for loop from zsprev, to zsnext style, for less code duplication. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10700	2020-08-13 15:03:23 -07:00
Coleman Kane	d817c17100	Linux 5.9 compat: make_request_fn replaced with submit_bio interface The make_request_fn and associated API was replaced recently in a Linux 5.9 merge, to replace its functionality with a new submit_bio member in struct block_device_operations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #10696	2020-08-11 13:37:33 -07:00
Ryan Moeller	ed726fb063	Move ZVOL_DIR back to zfs.h This was previously moved because nothing else in-tree uses it, but evidently DilOS uses it out of tree. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Ryan Moeller <freqlabs@freebsd.org> Closes #10361 Closes #10685	2020-08-11 13:12:12 -07:00
Matthew Macy	0f95ddcc0c	FreeBSD: update vaccess signature on most recent HEAD Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10682	2020-08-07 14:16:01 -07:00
Matthew Ahrens	0cab7970f9	Remove commented-out code Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:28:18 -07:00
Matthew Ahrens	c6f2b942be	Remove KMC_NOMAGAZINE Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:28:07 -07:00
Matthew Ahrens	3c09f6949a	Remove KMC_QCACHE Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:28:01 -07:00
Matthew Ahrens	d519c10575	Remove KMC_NOHASH Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:27:56 -07:00
Matthew Ahrens	f68af67a0c	Remove KMC_NOTOUCH Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:27:46 -07:00
Matthew Ahrens	492db125dc	Remove KMC_OFFSLAB Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:25:37 -07:00
Matthew Macy	1b376d176e	FreeBSD: Add support for lockless lookup Authored-by: mjg <mjg@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10657	2020-08-05 10:19:51 -07:00
Matthew Macy	fe628bc21d	Fix page fault in zfsctl_snapdir_getattr Must acquire the z_teardown_lock before accessing the zfsvfs_t object. I can't reproduce this panic on demand, but this looks like the correct solution. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: asomers <asomers@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10656	2020-08-01 08:42:55 -07:00
Matthew Macy	47ed79ff60	Changes to make openzfs build within FreeBSD buildworld A collection of header changes to enable FreeBSD to build with vendored OpenZFS. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10635	2020-07-31 21:30:31 -07:00
Ryan Moeller	0cc3454821	Convert Linux-isms to FreeBSD-isms in platform zfs_debug.c Change some comments copied from the Linux code to describe the appropriate methods on FreeBSD. Convert some tunables to ZFS_MODULE_PARAM so they get created on FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10647	2020-07-31 21:25:35 -07:00
Matthew Ahrens	3442c2a02d	Revise ARC shrinker algorithm The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths: 1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path. 2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback. In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, it's measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages). Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts: 1. ZFS doesn't use RAM to cache data effectively. 2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC. 3. Even with the improvements made in `67c0f0dedc` ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`. To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of arc_is_overflowing() on ZFS read/write operations. With this commit: 1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page. 2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`). 3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC. 4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free` is increased. 5. Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the arc to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10600	2020-07-31 21:10:52 -07:00
Matthew Macy	27d96d2254	Rename refcount.h to zfs_refcount.h Renamed to avoid conflicting with refcount.h when a different implementation is already provided by the platform. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10620	2020-07-29 16:35:33 -07:00
Matthew Macy	e64cc4954c	Refactor ccompile.h to not include system headers This is a step toward being able to vendor the OpenZFS code in FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10625	2020-07-25 20:09:50 -07:00
Matthew Macy	6d8da84106	Make use of ZFS_DEBUG consistent within kmod sources Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10623	2020-07-25 20:07:44 -07:00
Ryan Moeller	d364de7a89	FreeBSD: Remove accidental ARC size limiter i386 has some additional memory reservation logic that limits the size of the reported available memory. This was accidentally being used on all arches due to a missing header. Include machine/vmparam.h in freebsd/zfs/arc_os.c to pull in the missing UMA_MD_SMALL_ALLOC definition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10616	2020-07-25 10:49:49 -07:00
Ryan Moeller	cb18d88060	FreeBSD: Implement arc_free_memory This is only used for the kstat, but something other than 0 is nice. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10626	2020-07-25 10:47:18 -07:00
Matthew Ahrens	4fbdb10c7b	remove kmem_cache module parameter KMC_EXPIRE_AGE By default, `spl_kmem_cache_expire` is `KMC_EXPIRE_MEM`, meaning that objects will be removed from kmem cache magazines by `spl_kmem_cache_reap_now()`. There is also a module parameter to change this to `KMC_EXPIRE_AGE`, which establishes a maximum lifetime for objects to stay in the magazine. This setting has rarely, if ever, been used, and is not regularly tested. This commit removes the code for `KMC_EXPIRE_AGE`, and associated module parameters. Additionally, the unused module parameter `spl_kmem_cache_obj_per_slab_min` is removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10608	2020-07-24 09:39:26 -07:00
Ryan Moeller	f7a68f99d0	FreeBSD: Remove some code duplication in sysctl_os.c Drop unnecessary redefinition's of several arcstat values. Put missing extern declaration of arc_no_grow_shift in arc_impl.h. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10609	2020-07-23 17:35:34 -07:00
Matthew Ahrens	5dd92909c6	Adjust ARC terminology The process of evicting data from the ARC is referred to as `arc_adjust`. This commit changes the term to `arc_evict`, which is more specific. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10592	2020-07-22 09:51:47 -07:00
Ryan Moeller	0421f257b2	FreeBSD: Add legacy arc_min and arc_max These tunables were renamed from vfs.zfs.arc_min and vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max. Add legacy compat tunables for the old names. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10579	2020-07-19 10:15:34 -07:00
Matthew Ahrens	026e529cb3	Remove skc_reclaim, hdr_recl, kmem_cache shrinker The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`, whereby individual caches can register a callback to be invoked when there is memory pressure. This mechanism is used in only one place: the ARC registers the `hdr_recl()` reclaim function. This function wakes up the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and `arc_reduce_target_size()`. The `skc_reclaim` callbacks are invoked only by shrinker callbacks and `arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When called from shrinker callbacks, we are already aware of memory pressure and responding to it. Therefore there is little benefit to ever calling the `hdr_recl()` `skc_reclaim` callback. The `arc_reap_zthr` also wakes once a second, and if memory is low when allocating an ARC buffer. Therefore, additionally waking it from the shrinker calbacks has little benefit. The shrinker callbacks can be invoked very frequently, e.g. 10,000 times per second. Additionally, for invocation of the shrinker callback, skc_reclaim is invoked many times. Therefore, this mechanism consumes significant amounts of CPU time. The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which, in addition to invoking `skc_reclaim()`, does two things to attempt to free pages for use by the system: 1. Return free objects from the magazine layer to the slab layer 2. Return entirely-free slabs to the page layer (i.e. free pages) These actions apply only to caches implemented by the SPL, not those that use the underlying kernel SLAB/SLUB caches. The SPL caches are used for objects >=32KB, which are primarily linear ABD's cached in the DBUF cache. These actions (freeing objects from the magazine layer and returning entirely-free slabs) are also taken whenever a `kmem_cache_free()` call finds a full magazine. So there would typically be zero entirely-free slabs, and the number of objects in magazines is limited (typically no more than 64 objects per magazine, and there's one magazine per CPU). Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is modest. We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when memory pressure is detected. Therefore, calling `spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed. This commit removes the `skc_reclaim` mechanism, its only callback `hdr_recl()`, and the kmem_cache shrinker callback. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10576	2020-07-19 09:58:30 -07:00
Brian Behlendorf	e862b7ecfc	Linux 4.10 compat: has_capability() Stock kernels older than 4.10 do not export the has_capability() function which is required by commit `e59a377`. To avoid breaking the build on older kernels revert to the safe legacy behavior and return EACCES when privileges cannot be checked. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10565 Closes #10573	2020-07-19 09:56:21 -07:00
Matthew Ahrens	8fbf432ae2	anon_pages are not free/evictable `arc_free_memory()` returns the amount of memory that the ARC considers to be free. This includes pages that are not actually free, but can be evicted with essentially zero cost (without doing any i/o), for example the page cache. The ARC can "squeeze out" any pages included in this calculation, leaving only `arc_sys_free` (1/64th of RAM) for these free/evictable pages. Included in the count of free/evictable pages is `nr_inactive_anon_pages()`, which is described as "Anonymous memory that has not been used recently and can be swapped out". These pages would have to be written out to disk (swap) in order to evict them, and they are not included in `/proc/meminfo`'s `MemAvailable`. Therefore it is not appropriate for `nr_inactive_anon_pages()` to be included in the free/evictable memory returned by `arc_free_memory()`, because the ARC shouldn't (intentionally) make the system swap. This commit removes `nr_inactive_anon_pages()` from the memory returned by `arc_free_memory()`. This is a step towards enabling the ARC to manage free memory by monitoring it and reducing the ARC size as we notice that there is insufficient free memory (in the `arc_reap_zthr`), rather than the current method of relying on the `arc_shrinker` callback. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10575	2020-07-16 10:11:26 -07:00
Matthew Macy	23c871671c	FreeBSD: zfs commands backward compatibility Update the zfs commands such that they're backwards compatible with the version of ZFS is the base FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10542	2020-07-15 21:32:50 -07:00
Romain Dolbeau	01a4852ecb	Fix early include of <linux/percpu_compat.h> Move/add include of <linux/percpu_compat.h> to satisfy missing requirements. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain@dolbeau.org> Closes #10568 Closes #10569	2020-07-15 15:58:15 -07:00
Brian Atkinson	e4d3d77684	Fixing gang ABD child removal race condition On linux the list debug code has been setting off a failure when checking that the node->next->prev value is pointing back at the node. At times this check evaluates to 0xdead. When removing a child from a gang ABD we must acquire the child's abd_mtx to make sure that the same ABD is not being added to another gang ABD while it is being removed from a gang ABD. This fixes a race condition when checking if an ABDs link is already active and part of another gang ABD before adding it to a gang. Added additional debug code for the gang ABD in abd_verify() to make sure each child ABD has active links. Also check to make sure another gang ABD is not added to a gang ABD. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10511	2020-07-14 11:04:35 -07:00
Matthew Ahrens	e59a377a8f	filesystem_limit/snapshot_limit is incorrectly enforced against root The filesystem_limit and snapshot_limit properties limit the number of filesystems or snapshots that can be created below this dataset. According to the manpage, "The limit is not enforced if the user is allowed to change the limit." Two types of users are allowed to change the limit: 1. Those that have been delegated the `filesystem_limit` or `snapshot_limit` permission, e.g. with `zfs allow USER filesystem_limit DATASET`. This works properly. 2. A user with elevated system privileges (e.g. root). This does not work - the root user will incorrectly get an error when trying to create a snapshot/filesystem, if it exceeds the `_limit` property. The problem is that `priv_policy_ns()` does not work if the `cred_t` is not that of the current process. This happens when `dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a sync task's check func) to determine the permissions of the corresponding user process. This commit fixes the issue by passing the `task_struct` (typedef'ed as a `proc_t`) to syncing context, and then using `has_capability()` to determine if that process is privileged. Note that we still need to pass the `cred_t` to syncing context so that we can check if the user was delegated this permission with `zfs allow`. This problem only impacts Linux. Wrappers are added to FreeBSD but it continues to use `priv_check_cred()`, which works on arbitrary `cred_t`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8226 Closes #10545	2020-07-11 17:18:02 -07:00
Matthew Macy	3933305eac	FreeBSD: Use a hash table for taskqid lookups Previously a tqent could be recycled prematurely, update the code to use a hash table for lookups to resolve this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10529	2020-07-11 17:13:45 -07:00
Mark Johnston	cd32b4f5b7	Fix a deadlock in the FreeBSD getpages VOP FreeBSD has a per-page "busy" lock which is held when handling a page fault on a mapped file. This lock is also acquired when copying data from the DMU to the page cache in zfs_write(). File range locks are also acquired in both of these paths, in the opposite order with respect to the busy lock. In the getpages VOP, the range lock is only used to determine the extent of optional read-ahead and read-behind operations. To resolve the lock order reversal, modify the getpages VOP to avoid blocking on the range lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #10519	2020-07-06 11:53:57 -07:00
Ryan Moeller	8a3d9186ba	Update zfs_freebsd_need_inactive to fix mmapped writes `zfs_freebsd_need_inactive` appears to been based on an unfinished version of https://reviews.freebsd.org/D22130 which had a bug where files written via mmap wouldn't actually persist. Update the function to match the final version committed to FreeBSD. Authored-by: Mateusz Guzik <mjg@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10527 Closes #10528	2020-07-03 11:30:04 -07:00
Matthew Macy	7ddb753d17	freebsd: changes necessary to coexist with dtrace in tree Fix header conflicts when building zfs with openzfs as a vendor import. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10497	2020-07-01 09:10:08 -07:00
Matthew Ahrens	3c42c9ed84	Clean up OS-specific ARC and kmem code OS-specific code (e.g. under `module/os/linux`) does not need to share its code structure with any other operating systems. In particular, the ARC and kmem code need not be similar to the code in illumos, because we won't be syncing this OS-specific code between operating systems. For example, if/when illumos support is added to the common repo, we would add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of this code. Therefore, we can simplify the code in the OS-specific ARC and kmem routines. These changes do not impact system behavior, they are purely code cleanup. The changes are: Arenas are not used on Linux or FreeBSD (they are always `NULL`), so `heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along with code that uses them. In `arc_available_memory()`: * `desfree` is unused, remove it * rename `freemem` to avoid conflict with pre-existing `#define` * remove checks related to arenas * use units of bytes, rather than converting from bytes to pages and then back to bytes `SPL_KMEM_CACHE_REAP` is unused, remove it. `skc_reap` is unused, remove it. The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove it. `vmem_size()` and associated type and macros are unused, remove them. In `arc_memory_throttle()`, use a less confusing variable name to store the result of `arc_free_memory()`. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10499	2020-06-29 09:01:07 -07:00

1 2 3 4

194 Commits