mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-22 08:51:30 +03:00

Author	SHA1	Message	Date
Ryan Moeller	c0bd2e0fe2	Drop references when skipping dmu_send due to EXDEV When an invalid incremental send is requested where the "to" ds is before the "from" ds, make sure to drop the reference to the pool and the dataset before returning the error. Add an assert on FreeBSD to make sure we don't hold any locks after returning from an ioctl. Add some test coverage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10919	2020-09-30 13:19:49 -07:00
Brian Behlendorf	2e407941a2	Fix PREEMPTION=y and BLK_CGROUP=y config on arm64 With PREEMPTION=y and BLK_CGROUP=y preempt_schedule_notrace() is being used on arm64 which is a GPL-only function and hence the build of the DKMS kernel module fails. Fix that by redefining preempt_schedule_notrace() to preempt_schedule() which should be safe as long as tracing is not used. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Juerg Haefliger <juergh@canonical.com> Closes #8545 Closes #9948 Closes #10416 Closes #10973	2020-09-25 13:28:35 -07:00
Mateusz Guzik	f6bb7c029c	FreeBSD: update cache_purgevfs usage after 1300117 version bump Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Nick Wolff <darkfiberiru@gmail.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #10970	2020-09-25 13:23:43 -07:00
Ryan Moeller	fa1912e80f	FreeBSD: Code cleanup in zio_crypt Address some unused value and control flow issues flagged by Coverity. Unreachable code is pruned and unused values are avoided. Some scattered sections are reordered for coherence. We can assume kmem_alloc(n, KM_SLEEP) doesn't fail, so there is no need to check if it returned NULL. The allocated memory doesn't need to be zeroed, other than the last iovec (the MAC). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10884	2020-09-25 13:12:35 -07:00
Matthew Macy	7b8363d7f0	FreeBSD: Add support for procfs_list The procfs_list interface is required by several kstats. Implement this functionality for FreeBSD to provide access to these kstats. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10890	2020-09-23 16:43:51 -07:00
Matthew Macy	3dad29fb4b	FreeBSD: Don't save user FPU context in kernel threads Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10899	2020-09-23 11:09:48 -07:00
Mark Johnston	0daa0320e9	Fix a logic bug in the FreeBSD getpages VOP In commit `cd32b4f5b7` ("Fix a deadlock in the FreeBSD getpages VOP") I introduced a bug while porting the patch originally committed to FreeBSD: the rangelock pointer may be NULL if the try operation failed, so we must avoid calling zfs_rangelock_unlock() in that case. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reported-by: Steve Wills <swills@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #10519 Closes #10960	2020-09-22 16:05:52 -07:00
Mark Johnston	6bdb09510b	Annontate FreeBSD sysctls with CTLFLAG_MPSAFE Without this, the sysctl system calls will acquire a global lock before invoking the handler. This is noticeable in some situations when running top(1). The global lock is mostly vestigal but continues to see some use and so contention is still a problem; until the default sense of the MPSAFE flag changes, we have to annotate each and every handler. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #10836	2020-09-21 09:49:50 -07:00
Mark Johnston	c50f3c902f	Fix switch statement indentation in the FreeBSD kstat code This is in preparation for some functional changes. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #10950	2020-09-21 09:49:12 -07:00
George Wilson	c494aa7f57	vdev_ashift should only be set once == Motivation and Context The new vdev ashift optimization prevents the removal of devices when a zfs configuration is comprised of disks which have different logical and physical block sizes. This is caused because we set 'spa_min_ashift' in vdev_open and then later call 'vdev_ashift_optimize'. This would result in an inconsistency between spa's ashift calculations and that of the top-level vdev. In addition, the optimization logical ignores the overridden ashift value that would be provided by '-o ashift=<val>'. == Description This change reworks the vdev ashift optimization so that it's only set the first time the device is configured. It still allows the physical and logical ahsift values to be set every time the device is opened but those values are only consulted on first open. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Cedric Berger <cedric@precidata.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-Issue: DLPX-71831 Closes #10932	2020-09-18 12:13:47 -07:00
George Wilson	8e82ffba7b	pool may become suspended during device expansion When expanding a device zfs needs to rescan the partition table to get the correct size. This can only happen when we're in the kernel and requires the device to be closed. As part of the rescan, udev is notified and the device links are removed and recreated. This leave a window where the vdev code may try to reopen the device before udev has recreated the link. If that happens, then the pool may end up in a suspended state. To correct this, we leverage the BLKPG_RESIZE_PARTITION ioctl which allows the partition information to be modified even while it's in use. This ioctl also does not remove the device link associated with the zfs data partition so it eliminates the race condition that can occur in the kernel. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #10897	2020-09-17 20:03:10 -07:00
Ryan Moeller	3c7566cb0d	FreeBSD: Do not copy vp into f_data for DTYPE_VNODE files https://reviews.freebsd.org/D26346 Do not copy vp into f_data for DTYPE_VNODE files. The vnode pointer is already stored in f_vnode. Use that so f_data can be reused. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10929	2020-09-17 10:54:14 -07:00
John Poduska	5bed68bdc4	Need a long hold in zpl_mount_impl In zpl_mount_impl, there is: dmu_objset_hold ; returns with pool & ds held dsl_pool_rele sget dsl_dataset_rele As spelled out in the "DSL Pool Configuration Lock" in dsl_pool.c, this requires a long hold. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #10936	2020-09-17 10:53:02 -07:00
Ryan Moeller	7ead2be3d2	Rename acltype=posixacl to acltype=posix Prefer acltype=off\|posix, retaining the old names as aliases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10918	2020-09-16 12:26:06 -07:00
Ryan Moeller	37325e4749	Linux: Prevent destruction while showing mount devname Use ZFS_ENTER and ZFS_EXIT to protect datasets while their mount devname is being retrieved. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10892 Closes #10927	2020-09-15 15:40:03 -07:00
Mateusz Guzik	8e7fe49b25	FreeBSD: convert teardown inactive lock to a read-mostly sleepable lock The lock is taken all the time and as a regular read-write lock avoidably serves as a mount point-wide contention point. This forward ports FreeBSD revision r357322. To quote aforementioned commit: Sample result doing an incremental -j 40 build: before: 173.30s user 458.97s system 2595% cpu 24.358 total after: 168.58s user 254.92s system 2211% cpu 19.147 total Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #10896	2020-09-09 10:15:52 -07:00
Ryan Moeller	cba6d86638	FreeBSD: drop dependency on cryptodev module We only need the kernel interfaces in crypto, not the device node in cryptodev. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10901	2020-09-09 10:10:32 -07:00
Matthew Macy	ac6e5fb202	Replace cv_{timed}wait_sig with cv_{timed}wait_idle where appropriate There are a number of places where cv_?_sig is used simply for accounting purposes but the surrounding code has no ability to cope with actually receiving a signal. On FreeBSD it is possible to send signals to individual kernel threads so this could enable undesirable behavior. This patch adds routines on Linux that will do the same idle accounting as _sig without making the task interruptible. On FreeBSD cv__idle are all aliases for cv_ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10843	2020-09-03 20:04:09 -07:00
Spencer Kinny	6fffc88bf3	Links in Source Files Added comments in following files with links to Illumos manual pages: ./module/avl/avl.c ./module/nvpair/nvpair.c ./module/os/linux/spl/spl-kstat.c ./module/os/freebsd/spl/spl_kstat.c Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Spencer Kinny <spencerkinny1995@gmail.com> Closes #5113 Closes #10859	2020-09-02 09:42:12 -07:00
Toomas Soome	0db1b84a6d	zvol: unsigned off can not be less than zero Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #10867	2020-09-02 09:30:29 -07:00
Matthew Macy	e84e49218f	FreeBSD: Fix up after spa_stats.c move Moving spa_stats added the additional burden of supporting KSTAT_TYPE_IO. spa_state_addr will always return a valid value regardless of the value of 'n'. On FreeBSD this will cause an infinite loop as it relies on the raw ops addr routine to indicate that there is no more data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10860	2020-09-01 16:16:56 -07:00
Ryan Moeller	7b4e27232d	Add 'zfs rename -u' to rename without remounting Allow to rename file systems without remounting if it is possible. It is possible for file systems with 'mountpoint' property set to 'legacy' or 'none' - we don't have to change mount directory for them. Currently such file systems are unmounted on rename and not even mounted back. This introduces layering violation, as we need to update 'f_mntfromname' field in statfs structure related to mountpoint (for the dataset we are renaming and all its children). In my opinion it is worth it, as it allow to update FreeBSD in even cleaner way - in ZFS-only configuration root file system is ZFS file system with 'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs), update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs and rename it back to system/rootfs while it is mounted as /. Before it was not possible, because unmounting / was not possible. Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: Matt Macy <mmacy@freebsd.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10839	2020-09-01 16:14:16 -07:00
Ryan Moeller	88d19d7cc2	FreeBSD: Remove unused SECLABEL code SECLABEL is undefined on FreeBSD and should be pruned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10847	2020-08-31 19:52:46 -07:00
Ryan Moeller	eff621071f	FreeBSD: Simplify INGLOBALZONE FreeBSD's previous ZFS implemented INGLOBALZONE(thread) as (!jailed((thread)->td_ucred)) and passed curthread to INGLOBALZONE. We pass curproc instead of curthread, so we can achieve the same effect with (!jailed((proc)->p_ucred)). The implementation is trivial enough to fit on a single line in a define. We don't really need a whole separate function for something that's already macros all the way down. Eliminate in_globalzone. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10851	2020-08-31 19:43:08 -07:00
Matthew Macy	7bb18b94c7	Move spa_stats.c to common code Initially it was considered simplest to stub out all of the functions on FreeBSD. Now that FreeBSD supports KSTAT_TYPE_RAW at least some of the functionality should be made available. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10842	2020-08-30 14:12:46 -07:00
Matthew Macy	cd23154a8c	FreeBSD: Fix spurious failure in zvol_geom_open In zvol_geom_open on first open we need to guarantee that the namespace lock is held to avoid spurious failures in zvol_first_open. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10841	2020-08-30 14:11:33 -07:00
Matthew Macy	d436de6389	FreeBSD: add support for KSTAT_TYPE_RAW A few kstats use KSTAT_TYPE_RAW to provide a string generated on demand. Implementing these as sysctls was punted until now. Reviewed by: Toomas Soome <tsoome@me.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10836	2020-08-29 20:59:50 -07:00
Brian Behlendorf	3e29e1971b	Linux 5.9 compat: NR_SLAB_RECLAIMABLE Commit `dcdc12e` added compatibility code to treat NR_SLAB_RECLAIMABLE_B as if it were the same as NR_SLAB_RECLAIMABLE. However, the new value is in bytes while the old value was in pages which means they are not interchangeable. The only place the reclaimable slab size is used is as a component of the calculation done by arc_free_memory(). This function returns the amount of memory the ARC considers to be free or reclaimable at little cost. Rather than switch to a new interface to get this value it has been removed it from the calculation. It is normally a minor component compared to the number of inactive or free pages, and removing it aligns the behavior with the FreeBSD version of arc_free_memory(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10834	2020-08-29 20:57:45 -07:00
Toomas Soome	ec0c480c14	Remove pragma ident lines The #pragma ident is a historical relic and not needed any more, this pragma is actually unknown for common compilers and is only causing trouble. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #10810	2020-08-26 10:35:50 -07:00
youzhongyang	b900799768	Fix inability to destroy snapshot used over NFS The cache of struct svc_export and struct svc_expkey by nfsd and rpc.mountd for the snapshot holds references to the mount point. We need to flush them out before unmounting, otherwise umount would fail with EBUSY. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Youzhong Yang <yyang@mathworks.com> Closes #6000 Closes #10783	2020-08-24 17:33:02 -07:00
Andrew	a741b386d3	Prevent zfs_acl_chmod() if aclmode restricted and ACL inherited In absence of inheriting entry for owner@, group@, or everyone@, zfs_acl_chmod() is called to set these. This can cause confusion for Samba admins who do not expect these entries to appear on newly created files and directories once they have been stripped from from the parent directory. When aclmode is set to "restricted", chmod is prevented on non-trivial ACLs. It is not a stretch to assume that in this case the administrator does not want ZFS to add the missing special entries. Add check for this aclmode, and if an inherited entry is present skip zfs_acl_chmod(). Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Walker <awalker@ixsystems.com> Closes #10748	2020-08-22 21:49:25 -07:00
Ryan Moeller	6fe3498ca3	Import vdev ashift optimization from FreeBSD Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices. Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks: 1. Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device. 2. The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit. Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matthew Macy <mmacy@freebsd.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10619	2020-08-21 12:53:17 -07:00
Mariusz Zaborski	f2c027bd6a	FreeBSD: Add option to rewind checkpoint while importing root pool This option is used by FreeBSD boot loader. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <oshogbo@vexillium.org> Closes #10738	2020-08-19 17:19:42 -07:00
Matthew Macy	716b53d0a1	FreeBSD: Fix UNIX permissions checking Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10727	2020-08-18 09:57:07 -07:00
Ryan Moeller	009cc8e884	Make zc_nvlist_src_size limit tunable We limit the size of nvlists passed to the kernel so a user cannot make the kernel do an unreasonably large allocation. On FreeBSD this limit was 128 kiB, which turns out to be a bit too small when doing some operations involving a large number of datasets or snapshots, for example replication. Make this limit tunable, with a platform-specific auto default. Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the system limit on user wired memory, which allows it to scale depending on system configuration. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Issue #6572 Closes #10706	2020-08-18 09:33:55 -07:00
Matthew Ahrens	85ec5cbae2	Include scatter_chunk_waste in arc_size The ARC caches data in scatter ABD's, which are collections of pages, which are typically 4K. Therefore, the space used to cache each block is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is not aware of the memory used by this round-up, it only accounts for the size that it requested from the ABD subsystem. Therefore, the ARC is effectively using more memory than it is aware of, due to the `scatter_chunk_waste`. This impacts observability, e.g. `arcstat` will show that the ARC is using less memory than it effectively is. It also impacts how the ARC responds to memory pressure. As the amount of `scatter_chunk_waste` changes, it appears to the ARC as memory pressure, so it needs to resize `arc_c`. If the sector size (`1<<ashift`) is the same as the page size (or larger), there won't be any waste. If the (compressed) block size is relatively large compared to the page size, the amount of `scatter_chunk_waste` will be small, so the problematic effects are minimal. However, if using 512B sectors (`ashift=9`), and the (compressed) block size is small (e.g. `compression=on` with the default `volblocksize=8k` or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be very large. On a production system, with `arc_size` at a constant 50% of memory, `scatter_chunk_waste` has been been observed to be 10-30% of memory. This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new `waste` field to `arcstat`. As a result, the ARC's memory usage is more observable, and `arc_c` does not need to be adjusted as frequently. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10701	2020-08-17 20:04:04 -07:00
Matthew Ahrens	994de7e4b7	Remove KMC_KMEM and KMC_VMEM `KMC_KMEM` and `KMC_VMEM` are now unused since all SPL-implemented caches are `KMC_KVMEM`. KMC_KMEM: Given the default value of `spl_kmem_cache_kmem_limit`, we don't use kmalloc to back the SPL caches, instead we use kvmalloc (KMC_KVMEM). The flag, module parameter, /proc entries, and associated code are removed. KMC_VMEM: This flag is not used, and kvmalloc() is always preferable to vmalloc(). The flag, /proc entries, and associated code are removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10673	2020-08-17 16:04:28 -07:00
Matthew Macy	cfdc432e64	FreeBSD: fix merge error in zfs_acl_ids_create Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10721	2020-08-17 15:28:03 -07:00
Ryan Moeller	3c3d7c8a57	FreeBSD: Create taskq threads in appropriate proc Stepping stone toward re-enabling spa_thread on FreeBSD. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10715	2020-08-17 11:01:19 -07:00
Jorgen Lundman	faa296c73c	Release onexit/events with any missed zfsdev_state Linux and FreeBSD will most likely never see this issue. On macOS when kext is unloaded, but zed is still connected, zed will be issued ENODEV. As the cdevsw is released, the kernel will not have zfsdev_release() called to release minor/onexit/events, and it "leaks". This ensures it is cleaned up before unload. Changed the for loop from zsprev, to zsnext style, for less code duplication. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10700	2020-08-13 15:03:23 -07:00
Coleman Kane	d817c17100	Linux 5.9 compat: make_request_fn replaced with submit_bio interface The make_request_fn and associated API was replaced recently in a Linux 5.9 merge, to replace its functionality with a new submit_bio member in struct block_device_operations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #10696	2020-08-11 13:37:33 -07:00
Ryan Moeller	ed726fb063	Move ZVOL_DIR back to zfs.h This was previously moved because nothing else in-tree uses it, but evidently DilOS uses it out of tree. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Ryan Moeller <freqlabs@freebsd.org> Closes #10361 Closes #10685	2020-08-11 13:12:12 -07:00
Matthew Macy	0f95ddcc0c	FreeBSD: update vaccess signature on most recent HEAD Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10682	2020-08-07 14:16:01 -07:00
Matthew Ahrens	0cab7970f9	Remove commented-out code Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:28:18 -07:00
Matthew Ahrens	c6f2b942be	Remove KMC_NOMAGAZINE Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:28:07 -07:00
Matthew Ahrens	3c09f6949a	Remove KMC_QCACHE Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:28:01 -07:00
Matthew Ahrens	d519c10575	Remove KMC_NOHASH Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:27:56 -07:00
Matthew Ahrens	f68af67a0c	Remove KMC_NOTOUCH Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:27:46 -07:00
Matthew Ahrens	492db125dc	Remove KMC_OFFSLAB Remove dead code to make the implementation easier to understand. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Ahrens <matt@delphix.com> Closes #10650	2020-08-05 10:25:37 -07:00
Matthew Macy	1b376d176e	FreeBSD: Add support for lockless lookup Authored-by: mjg <mjg@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10657	2020-08-05 10:19:51 -07:00
Matthew Macy	fe628bc21d	Fix page fault in zfsctl_snapdir_getattr Must acquire the z_teardown_lock before accessing the zfsvfs_t object. I can't reproduce this panic on demand, but this looks like the correct solution. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: asomers <asomers@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10656	2020-08-01 08:42:55 -07:00
Matthew Macy	47ed79ff60	Changes to make openzfs build within FreeBSD buildworld A collection of header changes to enable FreeBSD to build with vendored OpenZFS. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10635	2020-07-31 21:30:31 -07:00
Ryan Moeller	0cc3454821	Convert Linux-isms to FreeBSD-isms in platform zfs_debug.c Change some comments copied from the Linux code to describe the appropriate methods on FreeBSD. Convert some tunables to ZFS_MODULE_PARAM so they get created on FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10647	2020-07-31 21:25:35 -07:00
Matthew Ahrens	3442c2a02d	Revise ARC shrinker algorithm The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths: 1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path. 2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback. In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, it's measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages). Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts: 1. ZFS doesn't use RAM to cache data effectively. 2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC. 3. Even with the improvements made in `67c0f0dedc` ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`. To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of arc_is_overflowing() on ZFS read/write operations. With this commit: 1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page. 2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`). 3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC. 4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free` is increased. 5. Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the arc to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10600	2020-07-31 21:10:52 -07:00
Matthew Macy	27d96d2254	Rename refcount.h to zfs_refcount.h Renamed to avoid conflicting with refcount.h when a different implementation is already provided by the platform. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10620	2020-07-29 16:35:33 -07:00
Matthew Macy	e64cc4954c	Refactor ccompile.h to not include system headers This is a step toward being able to vendor the OpenZFS code in FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10625	2020-07-25 20:09:50 -07:00
Matthew Macy	6d8da84106	Make use of ZFS_DEBUG consistent within kmod sources Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10623	2020-07-25 20:07:44 -07:00
Ryan Moeller	d364de7a89	FreeBSD: Remove accidental ARC size limiter i386 has some additional memory reservation logic that limits the size of the reported available memory. This was accidentally being used on all arches due to a missing header. Include machine/vmparam.h in freebsd/zfs/arc_os.c to pull in the missing UMA_MD_SMALL_ALLOC definition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10616	2020-07-25 10:49:49 -07:00
Ryan Moeller	cb18d88060	FreeBSD: Implement arc_free_memory This is only used for the kstat, but something other than 0 is nice. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10626	2020-07-25 10:47:18 -07:00
Matthew Ahrens	4fbdb10c7b	remove kmem_cache module parameter KMC_EXPIRE_AGE By default, `spl_kmem_cache_expire` is `KMC_EXPIRE_MEM`, meaning that objects will be removed from kmem cache magazines by `spl_kmem_cache_reap_now()`. There is also a module parameter to change this to `KMC_EXPIRE_AGE`, which establishes a maximum lifetime for objects to stay in the magazine. This setting has rarely, if ever, been used, and is not regularly tested. This commit removes the code for `KMC_EXPIRE_AGE`, and associated module parameters. Additionally, the unused module parameter `spl_kmem_cache_obj_per_slab_min` is removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10608	2020-07-24 09:39:26 -07:00
Ryan Moeller	f7a68f99d0	FreeBSD: Remove some code duplication in sysctl_os.c Drop unnecessary redefinition's of several arcstat values. Put missing extern declaration of arc_no_grow_shift in arc_impl.h. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10609	2020-07-23 17:35:34 -07:00
Matthew Ahrens	5dd92909c6	Adjust ARC terminology The process of evicting data from the ARC is referred to as `arc_adjust`. This commit changes the term to `arc_evict`, which is more specific. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10592	2020-07-22 09:51:47 -07:00
Ryan Moeller	0421f257b2	FreeBSD: Add legacy arc_min and arc_max These tunables were renamed from vfs.zfs.arc_min and vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max. Add legacy compat tunables for the old names. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10579	2020-07-19 10:15:34 -07:00
Matthew Ahrens	026e529cb3	Remove skc_reclaim, hdr_recl, kmem_cache shrinker The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`, whereby individual caches can register a callback to be invoked when there is memory pressure. This mechanism is used in only one place: the ARC registers the `hdr_recl()` reclaim function. This function wakes up the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and `arc_reduce_target_size()`. The `skc_reclaim` callbacks are invoked only by shrinker callbacks and `arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When called from shrinker callbacks, we are already aware of memory pressure and responding to it. Therefore there is little benefit to ever calling the `hdr_recl()` `skc_reclaim` callback. The `arc_reap_zthr` also wakes once a second, and if memory is low when allocating an ARC buffer. Therefore, additionally waking it from the shrinker calbacks has little benefit. The shrinker callbacks can be invoked very frequently, e.g. 10,000 times per second. Additionally, for invocation of the shrinker callback, skc_reclaim is invoked many times. Therefore, this mechanism consumes significant amounts of CPU time. The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which, in addition to invoking `skc_reclaim()`, does two things to attempt to free pages for use by the system: 1. Return free objects from the magazine layer to the slab layer 2. Return entirely-free slabs to the page layer (i.e. free pages) These actions apply only to caches implemented by the SPL, not those that use the underlying kernel SLAB/SLUB caches. The SPL caches are used for objects >=32KB, which are primarily linear ABD's cached in the DBUF cache. These actions (freeing objects from the magazine layer and returning entirely-free slabs) are also taken whenever a `kmem_cache_free()` call finds a full magazine. So there would typically be zero entirely-free slabs, and the number of objects in magazines is limited (typically no more than 64 objects per magazine, and there's one magazine per CPU). Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is modest. We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when memory pressure is detected. Therefore, calling `spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed. This commit removes the `skc_reclaim` mechanism, its only callback `hdr_recl()`, and the kmem_cache shrinker callback. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10576	2020-07-19 09:58:30 -07:00
Brian Behlendorf	e862b7ecfc	Linux 4.10 compat: has_capability() Stock kernels older than 4.10 do not export the has_capability() function which is required by commit `e59a377`. To avoid breaking the build on older kernels revert to the safe legacy behavior and return EACCES when privileges cannot be checked. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10565 Closes #10573	2020-07-19 09:56:21 -07:00
Matthew Ahrens	8fbf432ae2	anon_pages are not free/evictable `arc_free_memory()` returns the amount of memory that the ARC considers to be free. This includes pages that are not actually free, but can be evicted with essentially zero cost (without doing any i/o), for example the page cache. The ARC can "squeeze out" any pages included in this calculation, leaving only `arc_sys_free` (1/64th of RAM) for these free/evictable pages. Included in the count of free/evictable pages is `nr_inactive_anon_pages()`, which is described as "Anonymous memory that has not been used recently and can be swapped out". These pages would have to be written out to disk (swap) in order to evict them, and they are not included in `/proc/meminfo`'s `MemAvailable`. Therefore it is not appropriate for `nr_inactive_anon_pages()` to be included in the free/evictable memory returned by `arc_free_memory()`, because the ARC shouldn't (intentionally) make the system swap. This commit removes `nr_inactive_anon_pages()` from the memory returned by `arc_free_memory()`. This is a step towards enabling the ARC to manage free memory by monitoring it and reducing the ARC size as we notice that there is insufficient free memory (in the `arc_reap_zthr`), rather than the current method of relying on the `arc_shrinker` callback. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10575	2020-07-16 10:11:26 -07:00
Matthew Macy	23c871671c	FreeBSD: zfs commands backward compatibility Update the zfs commands such that they're backwards compatible with the version of ZFS is the base FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10542	2020-07-15 21:32:50 -07:00
Romain Dolbeau	01a4852ecb	Fix early include of <linux/percpu_compat.h> Move/add include of <linux/percpu_compat.h> to satisfy missing requirements. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain@dolbeau.org> Closes #10568 Closes #10569	2020-07-15 15:58:15 -07:00
Brian Atkinson	e4d3d77684	Fixing gang ABD child removal race condition On linux the list debug code has been setting off a failure when checking that the node->next->prev value is pointing back at the node. At times this check evaluates to 0xdead. When removing a child from a gang ABD we must acquire the child's abd_mtx to make sure that the same ABD is not being added to another gang ABD while it is being removed from a gang ABD. This fixes a race condition when checking if an ABDs link is already active and part of another gang ABD before adding it to a gang. Added additional debug code for the gang ABD in abd_verify() to make sure each child ABD has active links. Also check to make sure another gang ABD is not added to a gang ABD. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10511	2020-07-14 11:04:35 -07:00
Matthew Ahrens	e59a377a8f	filesystem_limit/snapshot_limit is incorrectly enforced against root The filesystem_limit and snapshot_limit properties limit the number of filesystems or snapshots that can be created below this dataset. According to the manpage, "The limit is not enforced if the user is allowed to change the limit." Two types of users are allowed to change the limit: 1. Those that have been delegated the `filesystem_limit` or `snapshot_limit` permission, e.g. with `zfs allow USER filesystem_limit DATASET`. This works properly. 2. A user with elevated system privileges (e.g. root). This does not work - the root user will incorrectly get an error when trying to create a snapshot/filesystem, if it exceeds the `_limit` property. The problem is that `priv_policy_ns()` does not work if the `cred_t` is not that of the current process. This happens when `dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a sync task's check func) to determine the permissions of the corresponding user process. This commit fixes the issue by passing the `task_struct` (typedef'ed as a `proc_t`) to syncing context, and then using `has_capability()` to determine if that process is privileged. Note that we still need to pass the `cred_t` to syncing context so that we can check if the user was delegated this permission with `zfs allow`. This problem only impacts Linux. Wrappers are added to FreeBSD but it continues to use `priv_check_cred()`, which works on arbitrary `cred_t`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8226 Closes #10545	2020-07-11 17:18:02 -07:00
Matthew Macy	3933305eac	FreeBSD: Use a hash table for taskqid lookups Previously a tqent could be recycled prematurely, update the code to use a hash table for lookups to resolve this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10529	2020-07-11 17:13:45 -07:00
Mark Johnston	cd32b4f5b7	Fix a deadlock in the FreeBSD getpages VOP FreeBSD has a per-page "busy" lock which is held when handling a page fault on a mapped file. This lock is also acquired when copying data from the DMU to the page cache in zfs_write(). File range locks are also acquired in both of these paths, in the opposite order with respect to the busy lock. In the getpages VOP, the range lock is only used to determine the extent of optional read-ahead and read-behind operations. To resolve the lock order reversal, modify the getpages VOP to avoid blocking on the range lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #10519	2020-07-06 11:53:57 -07:00
Ryan Moeller	8a3d9186ba	Update zfs_freebsd_need_inactive to fix mmapped writes `zfs_freebsd_need_inactive` appears to been based on an unfinished version of https://reviews.freebsd.org/D22130 which had a bug where files written via mmap wouldn't actually persist. Update the function to match the final version committed to FreeBSD. Authored-by: Mateusz Guzik <mjg@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10527 Closes #10528	2020-07-03 11:30:04 -07:00
Matthew Macy	7ddb753d17	freebsd: changes necessary to coexist with dtrace in tree Fix header conflicts when building zfs with openzfs as a vendor import. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10497	2020-07-01 09:10:08 -07:00
Matthew Ahrens	3c42c9ed84	Clean up OS-specific ARC and kmem code OS-specific code (e.g. under `module/os/linux`) does not need to share its code structure with any other operating systems. In particular, the ARC and kmem code need not be similar to the code in illumos, because we won't be syncing this OS-specific code between operating systems. For example, if/when illumos support is added to the common repo, we would add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of this code. Therefore, we can simplify the code in the OS-specific ARC and kmem routines. These changes do not impact system behavior, they are purely code cleanup. The changes are: Arenas are not used on Linux or FreeBSD (they are always `NULL`), so `heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along with code that uses them. In `arc_available_memory()`: * `desfree` is unused, remove it * rename `freemem` to avoid conflict with pre-existing `#define` * remove checks related to arenas * use units of bytes, rather than converting from bytes to pages and then back to bytes `SPL_KMEM_CACHE_REAP` is unused, remove it. `skc_reap` is unused, remove it. The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove it. `vmem_size()` and associated type and macros are unused, remove them. In `arc_memory_throttle()`, use a less confusing variable name to store the result of `arc_free_memory()`. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10499	2020-06-29 09:01:07 -07:00
Matthew Ahrens	270ece24b6	Revise SPL wrapper for shrinker callbacks The SPL provides a wrapper for the kernel's shrinker callbacks, which enables the ZFS code to interface with multiple versions of the shrinker API's from different kernel versions. Specifically, Linux kernels 3.0 - 3.11 has a single "combined" callback, and Linux kernels 3.12 and later have two "split" callbacks. The SPL provides a wrapper function so that the ZFS code only needs to implement one version of the callbacks. Currently the SPL's wrappers are designed such that the ZFS code implements the older, "combined" callback. There are a few downsides to this approach: * The general design within ZFS is for the latest Linux kernel to be considered the "first class" API. * The newer, "split" callback API is easier to understand, because each callback has one purpose. * The current wrappers do not completely abstract out the differing API's, so ZFS code needs `#ifdef` code to handle the differing return values required for different kernel versions. This commit addresses these drawbacks by having the ZFS code provide the latest, "split" callbacks, and the SPL provides a wrapping function for the older, "combined" API. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10502	2020-06-27 10:27:02 -07:00
Serapheim Dimitropoulos	ec1fea4516	Use percpu_counter for obj_alloc counter of Linux-backed caches A previous commit enabled the tracking of object allocations in Linux-backed caches from the SPL layer for debuggability. The commit is: 9a170fc6fe54f1e852b6c39630fe5ef2bbd97c16 Unfortunately, it also introduced minor performance regressions that were highlighted by the ZFS perf test-suite. Within Delphix we found that the regression would be from -1%, all the way up to -8% for some workloads. This commit brings performance back up to par by creating a separate counter for those caches and making it a percpu in order to avoid lock-contention. The initial performance testing was done by myself, and the final round was conducted by @tonynguien who was also the one that discovered the regression and highlighted the culprit. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #10397	2020-06-26 18:06:50 -07:00
Matthew Ahrens	67c0f0dedc	ARC shrinking blocks reads/writes ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to allow the ARC to shrink when the kernel experiences memory pressure. The ARC shrinker changes `arc_c` via a call to `arc_reduce_target_size()`. Before commit `3ec34e5527`, the ARC shrinker would also evict data from the ARC to bring `arc_size` down to the new `arc_c`. However, that commit (seemingly inadvertently) made it so that the ARC shrinker no longer evicts any data or waits for eviction to complete. Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often all the way to `arc_c_min`. Since it doesn't wait for the actual eviction of data from the ARC, this creates a situation where `arc_size` is more than `arc_c` for the several seconds/minutes it takes for `arc_adjust_zthr` to evict data from the ARC. During this time, arc_get_data_impl() will block, so ZFS can't process read/write requests (e.g. from iSCSI, NFS, or read/write syscalls). To ensure that `arc_c` doesn't shrink faster than the adjust thread can keep up, this commit makes the ARC shrinker wait for the eviction to complete, resulting in similar behavior to what we had before commit `3ec34e5527`. Note: commit `3ec34e5527` is `OpenZFS 9284 - arc_reclaim_thread has 2 jobs` and was integrated in December 2018, and is part of ZoL 0.8.x but not 0.7.x. Additionally, when the ARC size is reduced drastically, the `arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any threads that are bound to the same CPU that arc_adjust_zthr is running on will not able to run for a long time. To ensure that CPU-bound threads can make progress, this commit changes `arc_evict_state_impl()` make a voluntary preemption call, `cond_resched()`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-70703 Closes #10496	2020-06-26 10:42:27 -07:00
Ryan Moeller	9192f27c1d	Add zfs_multihost_interval tunable handler for FreeBSD This tunable required a handler to be implemented for ZFS_MODULE_PARAM_CALL. Add the handler so the tunable can be declared in common code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10490	2020-06-23 13:32:42 -07:00
Matthew Ahrens	540493ba4f	Clarify comments in config/.m4, vdev_geom.c, zfs_allow_.ksh Rephrase comments to be more clear. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10481	2020-06-22 09:46:37 -07:00
Ryan Moeller	1c08fa8b5b	Fix copy-paste error breaking FreeBSD head Resolve the FreeBSD head build failure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10480	2020-06-19 15:12:34 -07:00
Ryan Moeller	2e6af52b2e	Match new vfs_checkexp KPI in FreeBSD head KPI changed in FreeBSD, update accordingly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10475	2020-06-18 13:45:36 -07:00
Arvind Sankar	c0673571d0	Switch off -Wmissing-prototypes for libgcc math functions spl-generic.c defines some of the libgcc integer library functions on 32-bit. Don't bother checking -Wmissing-prototypes since nothing should directly call these functions from C code. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:46 -07:00
Arvind Sankar	0ce2de637b	Add prototypes Add prototypes/move prototypes to header files. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:32 -07:00
Arvind Sankar	60356b1a21	Add include files for prototypes Include the header with prototypes in the file that provides definitions as well, to catch any mismatch between prototype and definition. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:25 -07:00
Arvind Sankar	c3fe42aabd	Remove dead code Delete unused functions. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:18 -07:00
Arvind Sankar	65c7cc49bf	Mark functions as static Mark functions used only in the same translation unit as static. This only includes functions that do not have a prototype in a header file either. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:20:38 -07:00
adilger	f734301d22	linux: add basic fallocate(mode=0/2) compatibility Implement semi-compatible functionality for mode=0 (preallocation) and mode=FALLOC_FL_KEEP_SIZE (preallocation beyond EOF) for ZPL. Since ZFS does COW and snapshots, preallocating blocks for a file cannot guarantee that writes to the file will not run out of space. Even if the first overwrite was guaranteed, it would not handle any later overwrite of blocks due to COW, so strict compliance is futile. Instead, make a best-effort check that at least enough free space is currently available in the pool (with a bit of margin), then create a sparse file of the requested size and continue on with life. This does not handle all cases (e.g. several fallocate() calls before writing into the files when the filesystem is nearly full), which would require a more complex mechanism to be implemented, probably based on a modified version of dmu_prealloc(), but is usable as-is. A new module option zfs_fallocate_reserve_percent is used to control the reserve margin for any single fallocate call. By default, this is 110% of the requested preallocation size, so an additional 10% of available space is reserved for overhead to allow the application a good chance of finishing the write when the fallocate() succeeds. If the heuristics of this basic fallocate implementation are not desirable, the old non-functional behavior of returning EOPNOTSUPP for calls can be restored by setting zfs_fallocate_reserve_percent=0. The parameter of zfs_statvfs() is changed to take an inode instead of a dentry, since no dentry is available in zfs_fallocate_common(). A few tests from @behlendorf cover basic fallocate functionality. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Arshad Hussain <arshad.super@gmail.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andreas Dilger <adilger@dilger.ca> Issue #326 Closes #10408	2020-06-18 11:22:11 -07:00
Matthew Macy	8056a75672	Disambiguate condvar API contract On Illumos callers of cv_timedwait and cv_timedwait_hires can't distinguish between whether or not the cv was signaled or the call timed out. Illumos handles this (for some definition of handles) by calling cv_signal in the return path if we were signaled but the return value indicates instead that we timed out. This would make sense if it were possible to query the the cv for its net signal disposition. However, this isn't possible and, in spite of the fact that there are places in the code that clearly take a different and incompatible path if a timeout value is indicated, this distinction appears to be rather subtle to most developers. This problem is further compounded by the fact that on Linux, calling cv_signal in the return path wouldn't even do the right thing unless there are other waiters. Since it is possible for the caller to independently determine how much time is remaining but it is not possible to query if the cv was in fact signaled, prioritizing signalling over timeout seems like a cleaner solution. In addition, judging from usage patterns within the code itself, it is also less error prone. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10471	2020-06-18 10:17:50 -07:00
Matthew Macy	7564073ed6	Add abd_cache_reap_now for abd_chunk_cache users Apparently missed in the initial port integration was the need to reap the abd_chunk_cache on FreeBSD. This change addresses that oversight. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10474	2020-06-17 21:44:13 -07:00
Ryan Moeller	86a0f49483	FreeBSD: Kernel module should depend on xdr not krpc after 1300092 Since https://reviews.freebsd.org/D24408 FreeBSD provides XDR functions in the xdr module instead of krpc. For FreeBSD 13, the MODULE_DEPEND should be changed to xdr Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10442 Closes #10443	2020-06-16 11:47:04 -07:00
Jorgen Lundman	d366c8fd7a	Make struct vdev_disk_t be platform private Linux defines different vdev_disk_t members to macOS, but they are only used in vdev_disk.c so move the declaration there. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10452	2020-06-16 11:43:33 -07:00
Brian Atkinson	0a03495e3e	Fixing ABD struct allocation for FreeBSD In the event we are allocating a gang ABD in FreeBSD we are passing 0 to abd_alloc_struct(); however, this led to an allocation of ABD scatter with 0 chunks. This left the gang ABD allocation 24 bytes smaller than it should have been. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10431	2020-06-16 10:05:22 -07:00
Brian Atkinson	e08b993396	Removing ZERO_PAGE abd_alloc_zero_scatter For MIPS architectures on Linux the ZERO_PAGE macro references empty_zero_page, which is exported as a GPL symbol. The call to ZERO_PAGE in abd_alloc_zero_scatter has been removed and a single zero'd page is now allocated for each of the pages in abd_zero_scatter in the kernel ABD code path. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10428	2020-06-10 17:54:11 -07:00
Ryan Moeller	feff3f69fc	Fixup "Avoid the GEOM topology lock recursion when autoexpanding a pool" The patch was applied to vdev_geom_open instead of vdev_geom_close by mistake. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10427	2020-06-10 11:05:15 -07:00
Arvind Sankar	71504277ae	Cleanup linux module kbuild files The linux module can be built either as an external module, or compiled into the kernel, using copy-builtin. The source and build directories are slightly different between the two cases, and currently, compiling into the kernel still refers to some files from the configured ZFS source tree, instead of the copies inside the kernel source tree. There is also duplication between copy-builtin, which creates a Kbuild file to build ZFS inside the kernel tree, and the top-level module/Makefile.in. Fix this by moving the list of modules and the CFLAGS settings into a new module/Kbuild.in, which will be used by the kernel kbuild infrastructure, and using KBUILD_EXTMOD to distinguish the two cases within the Makefiles, in order to choose appropriate include directories etc. Module CFLAGS setting is simplified by using subdir-ccflags-y (available since 2.6.30) to set them in the top-level Kbuild instead of each individual module. The disabling of -Wunused-but-set-variable is removed from the lua and zfs modules. The variable that the Makefile uses is actually not defined, so this has no effect; and the warning has long been disabled by the kernel Makefile itself. The target_cpu definition in module/{zfs,zcommon} is removed as it was replaced by use of CONFIG_SPARC64 in commit `70835c5b75` ("Unify target_cpu handling") os/linux/{spl,zfs} are removed from obj-m, as they are not modules in themselves, but are included by the Makefile in the spl and zfs module directories. The vestigial Makefiles in os and os/linux are removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10379 Closes #10421	2020-06-10 09:24:15 -07:00
Andrea Gelmini	dd4bc569b9	Fix typos Correct various typos in the comments and tests. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #10423	2020-06-09 21:24:09 -07:00
Matthew Ahrens	7bcb7f0840	File incorrectly zeroed when receiving incremental stream that toggles -L Background: By increasing the recordsize property above the default of 128KB, a filesystem may have "large" blocks. By default, a send stream of such a filesystem does not contain large WRITE records, instead it decreases objects' block sizes to 128KB and splits the large blocks into 128KB blocks, allowing the large-block filesystem to be received by a system that does not support the `large_blocks` feature. A send stream generated by `zfs send -L` (or `--large-block`) preserves the large block size on the receiving system, by using large WRITE records. When receiving an incremental send stream for a filesystem with large blocks, if the send stream's -L flag was toggled, a bug is encountered in which the file's contents are incorrectly zeroed out. The contents of any blocks that were not modified by this send stream will be lost. "Toggled" means that the previous send used `-L`, but this incremental does not use `-L` (-L to no-L); or that the previous send did not use `-L`, but this incremental does use `-L` (no-L to -L). Changes: This commit addresses the problem with several changes to the semantics of zfs send/receive: 1. "-L to no-L" incrementals are rejected. If the previous send used `-L`, but this incremental does not use `-L`, the `zfs receive` will fail with this error message: incremental send stream requires -L (--large-block), to match previous receive. 2. "no-L to -L" incrementals are handled correctly, preserving the smaller (128KB) block size of any already-received files that used large blocks on the sending system but were split by `zfs send` without the `-L` flag. 3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`. This feature indicates that we can correctly handle "no-L to -L" incrementals. This flag is currently not set on any send streams. In the future, we intend for incremental send streams of snapshots that have large blocks to use `-L` by default, and these streams will also have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams from the default use of `zfs send` won't encounter the bug mentioned above, because they can't be received by software with the bug. Implementation notes: To facilitate accessing the ZPL's generation number, `zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and restructured to fill in a struct with ZPL-specific info including owner and generation. In the "no-L to -L" case, if this is a compressed send stream (from `zfs send -cL`), large WRITE records that are being written to small (128KB) blocksize files need to be decompressed so that they can be written split up into multiple blocks. The zio pipeline will recompress each smaller block individually. A new test case, `send-L_toggle`, is added, which tests the "no-L to -L" case and verifies that we get an error for the "-L to no-L" case. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #6224 Closes #10383	2020-06-09 10:41:01 -07:00
George Amanakis	b7654bd794	Trim L2ARC The l2arc_evict() function is responsible for evicting buffers which reference the next bytes of the L2ARC device to be overwritten. Teach this function to additionally TRIM that vdev space before it is overwritten if the device has been filled with data. This is done by vdev_trim_simple() which trims by issuing a new type of TRIM, TRIM_TYPE_SIMPLE. We also implement a "Trim Ahead" feature. It is a zfs module parameter, expressed in % of the current write size. This trims ahead of the current write size. A minimum of 64MB will be trimmed. The default is 0 which disables TRIM on L2ARC as it can put significant stress to underlying storage devices. To enable TRIM on L2ARC we set l2arc_trim_ahead > 0. We also implement TRIM of the whole cache device upon addition to a pool, pool creation or when the header of the device is invalid upon importing a pool or onlining a cache device. This is dependent on l2arc_trim_ahead > 0. TRIM of the whole device is done with TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t. We save the TRIM state for the whole device and the time of completion on-disk in the header, and restore these upon L2ARC rebuild so that zpool status -t can correctly report them. Whole device TRIM is done asynchronously so that the user can export of the pool or remove the cache device while it is trimming (ie if it is too slow). We do not TRIM the whole device if persistent L2ARC has been disabled by l2arc_rebuild_enabled = 0 because we may not want to lose all cached buffers (eg we may want to import the pool with l2arc_rebuild_enabled = 0 only once because of memory pressure). If persistent L2ARC has been disabled by setting the module parameter l2arc_rebuild_blocks_min_l2size to a value greater than the size of the cache device then the whole device is trimmed upon creation or import of a pool if l2arc_trim_ahead > 0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam D. Moss <c@yotes.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9713 Closes #9789 Closes #10224	2020-06-09 10:15:08 -07:00
Michael Niewöhner	32f26eaa70	Move GFP flags kernel compatibility code Move the GFP flags kernel compat code from c file to kmem header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #10424	2020-06-08 16:33:46 -07:00
Michael Niewöhner	080102a1b6	Linux 5.8 compat: __vmalloc() The `pgprot` argument has been removed from `__vmalloc` in Linux 5.8, being `PAGE_KERNEL` always now [1]. Detect this during configure and define a wrapper for older kernels. [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/mm/vmalloc.c?h=next-20200605&id=88dca4ca5a93d2c09e5bbc6a62fbfc3af83c4fca Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Co-authored-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #10422	2020-06-08 16:32:02 -07:00
Pawel Jakub Dawidek	529246df96	Restore support for in-kernel ZFS ioctls In Illumos it is possible to call ioctl functions from within the kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support that, but it doesn't hurt to keep it around, as all the code is there. Before this commit it was a dead code and zc_iflags was always zero. Restore this functionality by allowing to pass a flag to the zfsdev_ioctl_common() function. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #10417	2020-06-08 13:57:22 -07:00
Pawel Jakub Dawidek	77b998fa70	Remove redundant includes By removing excessive includes it takes us a small step close to compiling this file in userland. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #10415	2020-06-08 09:57:36 -07:00
Jorgen Lundman	c9e319faae	Replace sprintf()->snprintf() and strcpy()->strlcpy() The strcpy() and sprintf() functions are deprecated on some platforms. Care is needed to ensure correct size is used. If some platforms miss snprintf, we can add a #define to sprintf, likewise strlcpy(). The biggest change is adding a size parameter to zfs_id_to_fuidstr(). The various *_impl_get() functions are only used on linux and have not yet been updated. Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10400	2020-06-07 11:42:12 -07:00
Ryan Moeller	60265072e0	Improve compatibility with C++ consumers C++ is a little picky about not using keywords for names, or string constness. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10409	2020-06-06 12:54:04 -07:00
Brian Behlendorf	63761a8f1a	zfsvfs_setup(): zap_stats_t may have undefined content when accessed (#10398 ) Signed-off-by: Allan Jude <allanjude@klarasystems.com> Co-authored-by: Allan Jude <allanjude@klarasystems.com>	2020-06-05 17:21:04 -07:00
Allan Jude	4547fc4e07	Connect dataset_kstats for FreeBSD Expand the FreeBSD spl for kstats to support all current types Move the dataset_kstats_t back to zvol_state_t from zfs_state_os_t now that it is common once again ``` kstat.zfs/mypool.dataset.objset-0x10b.nunlinked: 0 kstat.zfs/mypool.dataset.objset-0x10b.nunlinks: 0 kstat.zfs/mypool.dataset.objset-0x10b.nread: 150528 kstat.zfs/mypool.dataset.objset-0x10b.reads: 48 kstat.zfs/mypool.dataset.objset-0x10b.nwritten: 134217728 kstat.zfs/mypool.dataset.objset-0x10b.writes: 1024 kstat.zfs/mypool.dataset.objset-0x10b.dataset_name: mypool/datasetname ``` Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #10386	2020-06-05 17:17:02 -07:00
Allan Jude	7da304bbce	zfsvfs_setup(): zap_stats_t may have undefined content when accessed Signed-off-by: Allan Jude <allanjude@klarasystems.com>	2020-06-03 22:20:27 +00:00
Ryan Moeller	c0eb5c35ef	FreeBSD: Simplify zvol and fix locking zvol_geom_bio_strategy should handle its own use of the zvol suspend reader lock and ensure the zilog exists when needed. A few other places using the zvol zilog should use the suspend reader lock as well. Simplify consumers of zvol_geom_bio_strategy, fix the locking, and while in here, use the boolean_t constants with doread. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10381	2020-06-03 10:45:12 -07:00
Matthew Macy	3bf3b164ee	Fix crypto build on FreeBSD HEAD Update API usage to reflect recent change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10384	2020-05-30 12:54:57 -07:00
Brian Atkinson	fb822260b1	Gang ABD Type Adding the gang ABD type, which allows for linear and scatter ABDs to be chained together into a single ABD. This can be used to avoid doing memory copies to/from ABDs. An example of this can be found in vdev_queue.c in the vdev_queue_aggregate() function. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian <bwa@clemson.edu> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10069	2020-05-20 18:06:09 -07:00
Ryan Moeller	7b0e39030c	freebsd: Correct the order of arguments to copyin() for Q_SETQUOTA Sponsored by: DARPA External-issue: https://reviews.freebsd.org/D24656 FreeBSD-commit: freebsd/freebsd@a431c095d3 Authored by: jhb <jhb@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10344	2020-05-19 16:45:25 -07:00
Kyle Evans	5b090d57d4	freebsd: return EISDIR for read(2) on directories This is arguably a change for internal consistency within OpenZFS, as the Linux implementation will reject read(2) on directories with EISDIR. It's not unreasonable for read(2) to do something here on FreeBSD, but we don't currently copy out anything useful anyways so start rejecting it with the appropriate error. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Kyle Evans <kevans@FreeBSD.org> Closes #10338	2020-05-16 10:12:01 -07:00
Ryan Moeller	d2782af461	Fix ZVOL_DIR We only use ZVOL_DIR on FreeBSD, and on FreeBSD it isn't correct. Move the definition to the file where it is needed, and define it as /dev/zvol/. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10337	2020-05-16 10:10:38 -07:00
yparitcher	cdcce2f019	Fix VN_OPEN_INVFS typo The VN_OPEN_INVFS literal is in the wrong field. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: yparitcher <y@paritcher.com> Closes #10322	2020-05-14 20:47:14 -07:00
Brian Behlendorf	2ade659eb4	Fix abd_enter/exit_critical wrappers Commit `fc551d7` introduced the wrappers abd_enter_critical() and abd_exit_critical() to mark critical sections. On Linux these are implemented with the local_irq_save() and local_irq_restore() macros which set the 'flags' argument when saving. By wrapping them with a function the local variable is no longer set by the macro and is no longer properly restored. Convert abd_enter_critical() and abd_exit_critical() to macros to resolve this issue and ensure the flags are properly restored. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10332	2020-05-14 20:45:16 -07:00
Brian Atkinson	fc551d7efb	Combine OS-independent ABD Code into Common Source File Reorganizing ABD code base so OS-independent ABD code has been placed into a common abd.c file. OS-dependent ABD code has been left in each OS's ABD source files, and these source files have been renamed to abd_os. The OS-independent ABD code is now under: module/zfs/abd.c With the OS-dependent code in: module/os/linux/zfs/abd_os.c module/os/freebsd/zfs/abd_os.c Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10293	2020-05-10 12:23:52 -07:00
Paul Dagnelie	108a454a46	Add support for boot environment data to be stored in the label Modern bootloaders leverage data stored in the root filesystem to enable some of their powerful features. GRUB specifically has a grubenv file which can store large amounts of configuration data that can be read and written at boot time and during normal operation. This allows sysadmins to configure useful features like automated failover after failed boot attempts. Unfortunately, due to the Copy-on-Write nature of ZFS, the standard behavior of these tools cannot handle writing to ZFS files safely at boot time. We need an alternative way to store data that allows the bootloader to make changes to the data. This work is very similar to work that was done on Illumos to enable similar functionality in the FreeBSD bootloader. This patch is different in that the data being stored is a raw grubenv file; this file can store arbitrary variables and values, and the scripting provided by grub is powerful enough that special structures are not required to implement advanced behavior. We repurpose the second padding area in each label to store the grubenv file, protected by an embedded checksum. We add two ioctls to get and set this data, and libzfs_core and libzfs functions to access them more easily. There are no direct command line interfaces to these functions; these will be added directly to the bootloader utilities. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10009	2020-05-07 09:36:33 -07:00
Ryan Moeller	ddc7a2dd3b	taskq: Don't leak system_delay_taskq on FreeBSD Adds a missing taskq_destroy() call. Reported by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10292	2020-05-05 09:36:41 -07:00
Ryan Moeller	6f3e1a4828	Avoid the GEOM topology lock recursion when autoexpanding a pool The steps to reproduce the problem: mdconfig -a -t swap -s 3g -u 0 gpart create -s GPT md0 gpart add -t freebsd-zfs -s 1g md0 zpool create -o autoexpand=on foo md0p1 gpart resize -i 1 -s 2g md0 Authored by: pjd <pjd@FreeBSD.org> FreeBSD-commit: freebsd/freebsd@bccd2db598 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Ported-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10270	2020-05-04 15:10:41 -07:00
Ryan Moeller	639dfeb831	Update FreeBSD SPL atomics Sync up with the following changes from FreeBSD: ZFS: add emulation of atomic_swap_64 and atomic_load_64 Some 32-bit platforms do not provide 64-bit atomic operations that ZFS requires, either in userland or at all. We emulate those operations for those platforms using a mutex. That is not entirely correct and it's very efficient. Besides, the loads are plain loads, so torn values are possible. Nevertheless, the emulation seems to work for some definition of work. This change adds atomic_swap_64, which is already used in ZFS code, and atomic_load_64 that can be used to prevent torn reads. Authored by: avg <avg@FreeBSD.org> FreeBSD-commit: freebsd/freebsd@3458e5d1e6 cleanup of illumos compatibility atomics atomic_cas_32 is implemented using atomic_fcmpset_32 on all platforms. Ditto for atomic_cas_64 and atomic_fcmpset_64 on platforms that have it. The only exception is sparc64 that provides MD atomic_cas_32 and atomic_cas_64. This is slightly inefficient as fcmpset reports whether the operation updated the target and that information is not needed for cas. Nevertheless, there is less code to maintain and to add for new platforms. Also, the operations are done inline now as opposed to function calls before. atomic_add_64_nv is implemented using atomic_fetchadd_64 on platforms that provide it. casptr, cas32, atomic_or_8, atomic_or_8_nv are completely removed as they have no users. atomic_mtx that is used to emulate 64-bit atomics on platforms that lack them is defined only on those platforms. As a result, platform specific opensolaris_atomic.S files have lost most of their code. The only exception is i386 where the compat+contrib code provides 64-bit atomics for userland use. That code assumes availability of cmpxchg8b instruction. FreeBSD does not have that assumption for i386 userland and does not provide 64-bit atomics. Hopefully, this can and will be fixed. Authored by: avg <avg@FreeBSD.org> FreeBSD-commit: freebsd/freebsd@e9642c209b emulate illumos membar_producer with atomic_thread_fence_rel membar_producer is supposed to be a store-store barrier. Also, in the code that FreeBSD has ported from illumos membar_producer is used only with regular stores to regular memory (with respect to caching). We do not have an MI primitive for the store-store barrier, so atomic_thread_fence_rel is the closest we have as it provides (load \| store) -> store barrier. Previously, membar_producer was an empty function call on all 32-bit arm-s, 32-bit powerpc, riscv and all mips variants. I think that it was inadequate. On other platforms, such as amd64, arm64, i386, powerpc64, sparc64, membar_producer was implemented using stronger primitives than required for a store-store barrier with respect to regular memory access. For example, it used sfence on amd64 and lock-ed nop in i386 (despite TSO). On powerpc64 we now use recommended lwsync instead of eieio. On sparc64 FreeBSD uses TSO mode. On arm64/aarch64 we now use dmb sy instead of dmb ish. Not sure if this is an improvement, actually. After this change we can drop opensolaris_atomic.S for aarch64, amd64, powerpc64 and sparc64 as all required atomic operations have either direct or light-weight mapping to FreeBSD native atomic operations. Discussed with: kib Authored by: avg <avg@FreeBSD.org> FreeBSD-commit: freebsd/freebsd@50cdda62fc fix up r353340, don't assume that fcmpset has strong semantics fcmpset can have two kinds of semantics, weak and strong. For practical purposes, strong semantics means that if fcmpset fails then the reported current value is always different from the expected value. Weak semantics means that the reported current value may be the same as the expected value even though fcmpset failed. That's a so called "sporadic" failure. I originally implemented atomic_cas expecting strong semantics, but many platforms actually have weak one. Reported by: pkubaj (not confirmed if same issue) Discussed with: kib, mjg Authored by: avg <avg@FreeBSD.org> FreeBSD-commit: freebsd/freebsd@238787c74e [PowerPC] [MIPS] Implement 32-bit kernel emulation of atomic64 operations This is a lock-based emulation of 64-bit atomics for kernel use, split off from an earlier patch by jhibbits. This is needed to unblock future improvements that reduce the need for locking on 64-bit platforms by using atomic updates. The implementation allows for future integration with userland atomic64, but as that implies going through sysarch for every use, the current status quo of userland doing its own locking may be for the best. Submitted by: jhibbits (original patch), kevans (mips bits) Reviewed by: jhibbits, jeff, kevans Authored by: bdragon <bdragon@FreeBSD.org> Differential Revision: https://reviews.freebsd.org/D22976 FreeBSD-commit: freebsd/freebsd@db39dab3a8 Remove sparc64 kernel support Remove all sparc64 specific files Remove all sparc64 ifdefs Removee indireeect sparc64 ifdefs Authored by: imp <imp@FreeBSD.org> FreeBSD-commit: freebsd/freebsd@48b94864c5 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Ported-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10250	2020-05-04 15:07:04 -07:00
Paul B. Henson	0aeb0bed6f	OpenZFS 6765 - zfs_zaccess_delete() comments do not accurately reflect delete permissions for ACLs Authored by: Kevin Crowe <kevin.crowe@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Paul B. Henson <henson@acm.org> Porting Notes: * Only comments are updated OpenZFS-issue: https://www.illumos.org/issues/6765 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/da412744bc Closes #10266	2020-04-30 11:24:55 -07:00
Paul B. Henson	235a856576	OpenZFS 6762 - POSIX write should imply DELETE_CHILD on directories - and some additional considerations Authored by: Kevin Crowe <kevin.crowe@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/6762 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1eb4e906ec Closes #10266	2020-04-30 11:24:38 -07:00
Paul B. Henson	99495ba6ab	OpenZFS 8984 - fix for 6764 breaks ACL inheritance Authored by: Dominik Hassler <hadfl@omniosce.org> Reviewed by: Sam Zaydel <szaydel@racktopsystems.com> Reviewed by: Paul B. Henson <henson@acm.org> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Matthew Ahrens <mahrens@delphix.com> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/8984 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e9bacc6d1a Closes #10266	2020-04-30 11:24:27 -07:00
Paul B. Henson	5a2f527d4b	OpenZFS 6764 - zfs issues with inheritance flags during chmod(2) with aclmode=passthrough Authored by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/6764 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de0f1ddb59 Closes #10266	2020-04-30 11:24:14 -07:00
Paul B. Henson	7bf3e1fa0f	OpenZFS 3254 - add support in zfs for aclmode=restricted Authored-by: Paul B. Henson <henson@acm.org> Reviewed by: Albert Lee <trisk@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/3254 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/71dbfc287c Closes #10266	2020-04-30 11:23:59 -07:00
Paul B. Henson	a1af567bb6	OpenZFS 742 - Resurrect the ZFS "aclmode" property OpenZFS 664 - Umask masking "deny" ACL entries OpenZFS 279 - Bug in the new ACL (post-PSARC/2010/029) semantics Porting notes: * Updated zfs_acl_chmod to take 'boolean_t isdir' as first parameter rather than 'zfsvfs_t zfsvfs' zfs man pages changes mixed between zfs and new zfsprops man pages Reviewed by: Aram Hvrneanu <aram@nexenta.com> Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Robert Gordon <rbg@openrbg.com> Reviewed by: Mark.Maybee@oracle.com Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@nexenta.com> Ported-by: Paul B. Henson <henson@acm.org> OpenZFS-issue: https://www.illumos.org/issues/742 OpenZFS-issue: https://www.illumos.org/issues/664 OpenZFS-issue: https://www.illumos.org/issues/279 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c49ce110 Closes #10266	2020-04-30 11:22:45 -07:00
Ryan Moeller	a8085184d6	Fix zlib leak on FreeBSD zlib_inflateEnd was accidentally a wrapper for inflateInit instead of inflateEnd, and hilarity ensues. Fix the typo so we free memory instead of allocating more. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10225 Closes #10252	2020-04-28 09:14:30 -07:00
Matthew Macy	c614fd6e12	Use new FreeBSD API to largely eliminate object locking Propagate changes in HEAD that mostly eliminate object locking. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10205	2020-04-17 09:30:26 -07:00
Ryan Moeller	a7929f3137	Update FreeBSD tunables Remove some obsolete legacy compat, rename some misnamed, and add some missing tunables for FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10203	2020-04-15 11:14:47 -07:00
Matthew Macy	9f0a21e641	Add FreeBSD support to OpenZFS Add the FreeBSD platform code to the OpenZFS repository. As of this commit the source can be compiled and tested on FreeBSD 11 and 12. Subsequent commits are now required to compile on FreeBSD and Linux. Additionally, they must pass the ZFS Test Suite on FreeBSD which is being run by the CI. As of this commit 1230 tests pass on FreeBSD and there are no unexpected failures. Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #898 Closes #8987	2020-04-14 11:36:28 -07:00
Matthew Ahrens	20f287855a	zvol_write() can use dmu_tx_hold_write_by_dnode() We can improve the performance of writes to zvols by using dmu_tx_hold_write_by_dnode() instead of dmu_tx_hold_write(). This reduces lock contention on the first block of the dnode object, and also reduces the amount of CPU needed. The benefit will be highest with multi-threaded async writes (i.e. writes that don't call zil_commit()). Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10184	2020-04-10 21:14:01 -07:00
George Amanakis	77f6826b83	Persistent L2ARC This commit makes the L2ARC persistent across reboots. We implement a light-weight persistent L2ARC metadata structure that allows L2ARC contents to be recovered after a reboot. This significantly eases the impact a reboot has on read performance on systems with large caches. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Saso Kiselkov <skiselkov@gmail.com> Co-authored-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: George Amanakis <gamanakis@gmail.com> Ported-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #925 Closes #1823 Closes #2672 Closes #3744 Closes #9582	2020-04-10 10:33:35 -07:00
Ryan Moeller	36a6e2335c	Don't ignore zfs_arc_max below allmem/32 Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower than the default allmem/32 zfs_arc_max can also be set lower. Add warning messages when tunables are being ignored. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10157 Closes #10158	2020-04-09 15:39:48 -07:00
Brian Behlendorf	68dde63d13	Linux 5.7 compat: blk_alloc_queue() Commit https://github.com/torvalds/linux/commit/3d745ea5 simplified the blk_alloc_queue() interface by updating it to take the request queue as an argument. Add a wrapper function which accepts the new arguments and internally uses the available interfaces. Other minor changes include increasing the Linux-Maximum to 5.6 now that 5.6 has been released. It was not bumped to 5.7 because this release has not yet been finalized and is still subject to change. Added local 'struct zvol_state_os *zso' variable to zvol_alloc. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10181 Closes #10187	2020-04-09 09:16:46 -07:00
Paul Dagnelie	5a42ef04fd	Add 'zfs wait' command Add a mechanism to wait for delete queue to drain. When doing redacted send/recv, many workflows involve deleting files that contain sensitive data. Because of the way zfs handles file deletions, snapshots taken quickly after a rm operation can sometimes still contain the file in question, especially if the file is very large. This can result in issues for redacted send/recv users who expect the deleted files to be redacted in the send streams, and not appear in their clones. This change duplicates much of the zpool wait related logic into a zfs wait command, which can be used to wait until the internal deleteq has been drained. Additional wait activities may be added in the future. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9707	2020-04-01 10:02:06 -07:00
Fabio Scaccabarozzi	c9e3efdb3a	Bugfix/fix uio partial copies In zfs_write(), the loop continues to the next iteration without accounting for partial copies occurring in uiomove_iov when copy_from_user/__copy_from_user_inatomic return a non-zero status. This results in "zfs: accessing past end of object..." in the kernel log, and the write failing. Account for partial copies and update uio struct before returning EFAULT, leave a comment explaining the reason why this is done. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: ilbsmart <wgqimut@gmail.com> Signed-off-by: Fabio Scaccabarozzi <fsvm88@gmail.com> Closes #8673 Closes #10148	2020-04-01 09:48:54 -07:00
Matthew Ahrens	0929c4de39	Improve ZVOL sync write performance by using a taskq == Summary == Prior to this change, sync writes to a zvol are processed serially. This commit makes zvols process concurrently outstanding sync writes in parallel, similar to how reads and async writes are already handled. The result is that the throughput of sync writes is tripled. == Background == When a write comes in for a zvol (e.g. over iscsi), it is processed by calling `zvol_request()` to initiate the operation. ZFS is expected to later call `BIO_END_IO()` when the operation completes (possibly from a different thread). There are a limited number of threads that are available to call `zvol_request()` - one one per iscsi client (unless using MC/S). Therefore, to ensure good performance, the latency of `zvol_request()` is important, so that many i/o operations to the zvol can be processed concurrently. In other words, if the client has multiple outstanding requests to the zvol, the zvol should have multiple outstanding requests to the storage hardware (i.e. issue multiple concurrent `zio_t`'s). For reads, and async writes (i.e. writes which can be acknowledged before the data reaches stable storage), `zvol_request()` achieves low latency by dispatching the bulk of the work (including waiting for i/o to disk) to a taskq. The taskq callback (`zvol_read()` or `zvol_write()`) blocks while waiting for the i/o to disk to complete. The `zvol_taskq` has 32 threads (by default), so we can have up to 32 concurrent i/os to disk in service of requests to zvols. However, for sync writes (i.e. writes which must be persisted to stable storage before they can be acknowledged, by calling `zil_commit()`), `zvol_request()` does not use `zvol_taskq`. Instead it blocks while waiting for the ZIL write to disk to complete. This has the effect of serializing sync writes to each zvol. In other words, each zvol will only process one sync write at a time, waiting for it to be written to the ZIL before accepting the next request. The same issue applies to FLUSH operations, for which `zvol_request()` calls `zil_commit()` directly. == Description of change == This commit changes `zvol_request()` to use `taskq_dispatch_ent(zvol_taskq)` for sync writes, and FLUSh operations. Therefore we can have up to 32 threads (the taskq threads) simultaneously calling `zil_commit()`, for a theoretical performance improvement of up to 32x. To avoid the locking issue described in the comment (which this commit removes), we acquire the rangelock from the taskq callback (e.g. `zvol_write()`) rather than from `zvol_request()`. This applies to all writes (sync and async), reads, and discard operations. This means that multiple simultaneously-outstanding i/o's which access the same block can complete in any order. This was previously thought to be incorrect, but a review of the block device interface requirements revealed that this is fine - the order is inherently not defined. The shorter hold time of the rangelock should also have a slight performance improvement. For an additional slight performance improvement, we use `taskq_dispatch_ent()` instead of `taskq_dispatch()`, which avoids a `kmem_alloc()` and eliminates a failure mode. This applies to all writes (sync and async), reads, and discard operations. == Performance results == We used a zvol as an iscsi target (server) for a Windows initiator (client), with a single connection (the default - i.e. not MC/S). We used `diskspd` to generate a workload with 4 threads, doing 1MB writes to random offsets in the zvol. Without this change we get 231MB/s, and with the change we get 728MB/s, which is 3.15x the original performance. We ran a real-world workload, restoring a MSSQL database, and saw throughput 2.5x the original. We saw more modest performance wins (typically 1.5x-2x) when using MC/S with 4 connections, and with different number of client threads (1, 8, 32). Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10163	2020-03-31 10:50:44 -07:00
Ryan Moeller	9a51738b60	Let default arc_c_max be platform dependent Linux changed the default max ARC size to 1/2 of physical memory to deal with shortcomings of the Linux SLUB allocator. Other platforms do not require the same logic. Implement an arc_default_max() function to determine a default max ARC size in platform code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10155	2020-03-27 09:14:46 -07:00
Brian Behlendorf	5351951274	Fix zfs_rmnode() unlink / rollback issue If a has rollback has occurred while a file is open and unlinked. Then when the file is closed post rollback it will not exist in the rolled back version of the unlinked object. Therefore, the call to zap_remove_int() may correctly return ENOENT and should be allowed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6812 Closes #9739	2020-03-18 11:47:07 -07:00
Matthew Ahrens	0fdd6106bb	dmu_objset_from_ds must be called with dp_config_rwlock held The normal lock order is that the dp_config_rwlock must be held before the ds_opening_lock. For example, dmu_objset_hold() does this. However, dmu_objset_open_impl() is called with the ds_opening_lock held, and if the dp_config_rwlock is not already held, it will attempt to acquire it. This may lead to deadlock, since the lock order is reversed. Looking at all the callers of dmu_objset_open_impl() (which is principally the callers of dmu_objset_from_ds()), almost all callers already have the dp_config_rwlock. However, there are a few places in the send and receive code paths that do not. For example: dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream, receive_write_byref, redact_traverse_thread. This commit resolves the problem by requiring all callers ot dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the code has been restructured such that we call dmu_objset_from_ds() earlier on in the send and receive processes, when we already have the dp_config_rwlock, and save the objset_t until we need it in the middle of the send or receive (similar to what we already do with the dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in many new places. I also cleaned up code in dmu_redact_snap() and send_traverse_thread(). Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #9662 Closes #10115	2020-03-12 10:55:02 -07:00
Matthew Ahrens	b3212d2fa6	Improve performance of zio_taskq_member __zio_execute() calls zio_taskq_member() to determine if we are running in a zio interrupt taskq, in which case we may need to switch to processing this zio in a zio issue taskq. The call to zio_taskq_member() can become a performance bottleneck when we are processing a high rate of zio's. zio_taskq_member() calls taskq_member() on each of the zio interrupt taskqs, of which there are 21. This is slow because each call to taskq_member() does tsd_get(taskq_tsd), which on Linux is relatively slow. This commit improves the performance of zio_taskq_member() by having it cache the value of tsd_get(taskq_tsd), reducing the number of those calls to 1/21th of the current behavior. In a test case running `zfs send -c >/dev/null` of a filesystem with small blocks (average 2.5KB/block), zio_taskq_member() was using 6.7% of one CPU, and with this change it is reduced to 1.3%. Overall time to perform the `zfs send` reduced by 10% (~150,000 block/sec to ~165,000 blocks/sec). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10070	2020-03-03 10:29:38 -08:00
Ryan Moeller	9bb907bc3f	Make spa_history_zone platform-dependent in kernel This function should only return "linux" on Linux. Move the kernel part of the function out of common code. Fix the tests for FreeBSD. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10079	2020-03-02 09:43:30 -08:00
Matthew Macy	ae9f92f6f3	Re-share zfsdev_getminor and zfs_onexit_fd_hold By adding a zfs_file_private accessor to the common interfaces and some extensions to FreeBSD platform code it is now possible to share the implementations for the aforementioned functions. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10073	2020-02-28 14:50:32 -08:00
Brian Behlendorf	bd0d24e09b	Linux 5.5 compat: blkg_tryget() Commit https://github.com/torvalds/linux/commit/9e8d42a0f accidentally converted the static inline function blkg_tryget() to GPL-only for kernels built with CONFIG_PREEMPT_RCU=y and CONFIG_BLK_CGROUP=y. Resolve the build issue by providing our own equivalent functionality when needed which uses rcu_read_lock_sched() internally as before. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #9745 Closes #10072	2020-02-28 08:58:39 -08:00
Brian Behlendorf	2c3a83701d	Linux 5.6 compat: time_t As part of the Linux kernel's y2038 changes the time_t type has been fully retired. Callers are now required to use the time64_t type. Rather than move to the new type, I've removed the few remaining places where a time_t is used in the kernel code. They've been replaced with a uint64_t which is already how ZFS internally handled these values. Going forward we should work towards updating the remaining user space time_t consumers to the 64-bit interfaces. Reviewed-by: Matthew Macy <mmacy@freebsd.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10052 Closes #10064	2020-02-27 09:31:02 -08:00
Matthew Macy	28caa74b19	Refactor dnode dirty context from dbuf_dirty * Add dedicated donde_set_dirtyctx routine. * Add empty dirty record on destroy assertion. * Make much more extensive use of the SET_ERROR macro. Reviewed-by: Will Andrews <wca@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #9924	2020-02-26 16:09:17 -08:00
Dirkjan Bussink	327000ce04	Remove zfs_getattr and convoff dead code The `convoff` function is called only in one code path in `zfs_space`. Each caller of `zfs_space` is called with a `flock64_t` that has `l_whence` set to `SEEK_SET`. This means that `convoff` always results in a no-op as the `bfp` parameter has `l_whence` set to `SEEK_SET` and `int whence` is `SEEK_SET` as well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Closes #10006	2020-02-24 15:38:22 -08:00
DeHackEd	d09dc5980c	Honour sync=disabled when relinking tpmfiles Unlinked files don't respect synchronous flush commands, but when they get relinked their state is unknown. Previously we force flushed all such files even when sync=disabled. Correct this case. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: DHE <git@dehacked.net> Closes #10005	2020-02-16 12:44:08 -08:00
Matthew Ahrens	2adc6b35ae	Missed wakeup when growing kmem cache When growing the size of a (VMEM or KVMEM) kmem cache, spl_cache_grow() always does taskq_dispatch(spl_cache_grow_work), and then waits for the KMC_BIT_GROWING to be cleared by the taskq thread. The taskq thread (spl_cache_grow_work()) does: 1. allocate new slab and add to list 2. wake_up_all(skc_waitq) 3. clear_bit(KMC_BIT_GROWING) Therefore, the waiting thread can wake up before GROWING has been cleared. It will see that the growing has not yet completed, and go back to sleep until it hits the 100ms timeout. This can have an extreme performance impact on workloads that alloc/free more than fits in the (statically-sized) magazines. These workloads allocate and free slabs with high frequency. The problem can be observed with `funclatency spl_cache_grow`, which on some workloads shows that 99.5% of the time it takes <64us to allocate slabs, but we spend ~70% of our time in outliers, waiting for the 100ms timeout. The fix is to do `clear_bit(KMC_BIT_GROWING)` before `wake_up_all(skc_waitq)`. A future investigation should evaluate if we still actually need to taskq_dispatch() at all, and if so on which kernel versions. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #9989	2020-02-13 11:23:02 -08:00

1 2 3 4 5 ...

319 Commits