mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Rich Ercolani	269b5dadcf	Enable edonr in FreeBSD The code is integrated, builds fine, runs fine, there's not really any reason not to. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12735	2021-11-16 12:40:10 -07:00
Martin Matuška	b8dcfb2c9f	FreeBSD: fix world build after `de198f2d9` The inline function vn_flush_cached_data() in vnode.h must not be compiled when building BASE. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #12743	2021-11-15 09:07:39 -07:00
George Amanakis	c9d62d1356	Introduce a tunable to exclude special class buffers from L2ARC Special allocation class or dedup vdevs may have roughly the same performance as L2ARC vdevs. Introduce a new tunable to exclude those buffers from being cacheable on L2ARC. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #11761 Closes #12285	2021-11-11 12:52:16 -08:00
Fedor Uporov	49d42425d6	Check l2cache vdevs pending list inside the vdev_inuse() The l2cache device could be added twice because vdev_inuse() does not check spa_l2cache for added devices. Make l2cache vdevs inuse checking logic more closer to spare vdevs. Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #9153 Closes #12689	2021-11-11 11:54:15 -08:00
Fedor Uporov	e39fe05b69	Skip spacemaps reading in case of pool readonly import The only zdb utility require to read metaslab-related data during read-only pool import because of spacemaps validation. Add global variable which will allow zdb read spacemaps in case of readonly import mode. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #9095 Closes #12687	2021-11-09 12:50:39 -08:00
Brian Behlendorf	1e7d634867	Linux 5.16 compat: linux/elevator.h Commit https://github.com/torvalds/linux/commit/2e9bc346 moved the elevator.h header under the block/ directory as part of some refactoring. This turns out not to be a problem since there's no longer anything we need from the header. This has been the case for some time, this change removes the elevator.h include and replaces it with a major.h include. Reviewed-by: Alexander Lobakin <alobakin@pm.me> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12725	2021-11-09 11:26:35 -08:00
Brian Behlendorf	de198f2d95	Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency When using lseek(2) to report data/holes memory mapped regions of the file were ignored. This could result in incorrect results. To handle this zfs_holey_common() was updated to asynchronously writeback any dirty mmap(2) regions prior to reporting holes. Additionally, while not strictly required, the dn_struct_rwlock is now held over the dirty check to prevent the dnode structure from changing. This ensures that a clean dnode can't be dirtied before the data/hole is located. The range lock is now also taken to ensure the call cannot race with zfs_write(). Furthermore, the code was refactored to provide a dnode_is_dirty() helper function which checks the dnode for any dirty records to determine its dirtiness. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11900 Closes #12724	2021-11-07 14:27:44 -07:00
Damian Szuberski	6d680e61ef	Update `checkstyle` workflow env to ubuntu-20.04 - `checkstyle` workflow uses ubuntu-20.04 environment - improved `mancheck.sh` readability Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12713	2021-11-02 14:02:57 -06:00
Fedor Uporov	d5a5ec4693	Remove unused function zvol_set_volblocksize() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #12688	2021-10-26 17:07:53 -07:00
Pawel Jakub Dawidek	afbc617921	Remove FreeBSD's local copy of the dmu_buf_hold_array() function Make the main dmu_buf_hold_array() function non-static. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #12628	2021-10-13 11:01:01 -07:00
Teodor Spæren	d785245857	zio: use unsigned values for enum cppcheck complains about the use of 1 << 31, because enums are signed ints which cannot represent this. As discussed in issue #12611, it appears that with C99, we can use an unsiged int for the enum, on most platforms. I've crafted this commit for just the include/sys/zio.h header, as it's the only one with a shift of 31. If this is something we want to adopt in the rest of the project, I will go through and apply it to the rest of the project. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Teodor Spæren <teodor@sparen.no> Closes #12611 Closes #12615	2021-10-11 10:58:06 -07:00
Rich Ercolani	9d1407e8f2	Correct refcount_add in dmu_zfetch refcount_add_many(foo,N) is not the same as for (i=0; i < N; i++) { refcount_add(foo); } Unfortunately, this is only actually true with debug kernels and reference_tracking_enable=1. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12589 Closes #12602	2021-10-08 11:10:34 -07:00
Brian Behlendorf	514498fef6	Simplify and document OpenZFS library dependencies For those not already familiar with the code base it can be a challenge to understand how the libraries are laid out. This has sometimes resulted in functionality being added in the wrong place. To help avoid that in the future this commit documents the high-level dependencies for easy reference in lib/Makefile.am. It also simplifies a few things. - Switched libzpool dependency on libzfs_core to libzutil. This change makes it clear libzpool should never depend on the ioctl() functionality provided by libzfs_core. - Moved zfs_ioctl_fd() from libzutil to libzfs_core and renamed it lzc_ioctl_fd(). Normal access to the kmods should all be funneled through the libzfs_core library. The sole exception is the pool_active() which was updated to not use lzc_ioctl_fd() to remove the libzfs_core dependency. - Removed libzfs_core dependency on libzutil. - Removed the lib/libzfs/os/freebsd/libzfs_ioctl_compat.c source file which was all dead code. - Removed libzfs_core dependency from mkbusy and ctime test utilities. It was only needed for some trivial wrapper functions and that code is easy to replicate to shed the unneeded dependency. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12602	2021-10-07 11:31:26 -06:00
Tony Hutter	2a8430a260	Rescan enclosure sysfs path on import When you create a pool, zfs writes vd->vdev_enc_sysfs_path with the enclosure sysfs path to the fault LEDs, like: vdev_enc_sysfs_path = /sys/class/enclosure/0:0:1:0/SLOT8 However, this enclosure path doesn't get updated on successive imports even if enclosure path to the disk changes. This patch fixes the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #11950 Closes #12095	2021-10-04 12:32:16 -07:00
Jorgen Lundman	1d901c3ee5	Upstream: unmount snapshots before destroying them on macOS Add function zfs_destroy_snaps_nvl_os() call. The main issue is that macOS needs to unmount any mounted snapshots before they can be destroyed. Other platforms can handle this in the kernel, but sending a storm of zed events to unmount seems undesirable when we can do it in userland to start with. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: ilovezfs <ilovezfs@icloud.com> Closes #12550	2021-09-20 09:29:59 -06:00
Brian Behlendorf	6954c22f35	Use fallthrough macro As of the Linux 5.9 kernel a fallthrough macro has been added which should be used to anotate all intentional fallthrough paths. Once all of the kernel code paths have been updated to use fallthrough the -Wimplicit-fallthrough option will because the default. To avoid warnings in the OpenZFS code base when this happens apply the fallthrough macro. Additional reading: https://lwn.net/Articles/794944/ Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12441	2021-09-14 10:17:54 -06:00
Jorgen Lundman	5a54a4e051	Upstream: Add snapshot and zvol events For kernel to send snapshot mount/unmount events to zed. For kernel to send symlink creates/removes on zvol plumbing. (/dev/run/dsk/zvol/$pool/$zvol -> /dev/diskX) If zed misses the ENODEV, all errors after are EINVAL. Treat any error as kernel module failure. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12416	2021-09-09 10:44:21 -07:00
Brian Behlendorf	2079111f42	Linux 5.15 compat: get_acl() Kernel commits 332f606b32b6 ovl: enable RCU'd ->get_acl() 0cad6246621b vfs: add rcu argument to ->get_acl() callback Added compatibility code to detect the new ->get_acl() interface and correctly handle the case where the new rcu argument is set. Reviewed-by: Coleman Kane <ckane@colemankane.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12548	2021-09-09 09:38:35 -07:00
Alexander	a3588c68f7	Linux 5.15 compat: standalone <linux/stdarg.h> Kernel commits 39f75da7bcc8 ("isystem: trim/fixup stdarg.h and other headers") c0891ac15f04 ("isystem: ship and use stdarg.h") 564f963eabd1 ("isystem: delete global -isystem compile option") (for now can be found in linux-next.git tree, will land into the Linus' tree during the ongoing 5.15 cycle with one of akpm merges) removed the -isystem flag and disallowed the inclusion of any compiler header files. They also introduced a minimal <linux/stdarg.h> as a replacement for <stdarg.h>. include/os/linux/spl/sys/cmn_err.h in the ZFS source tree includes <stdarg.h> unconditionally. Introduce a test for <linux/stdarg.h> and include it instead of the compiler's one to prevent module build breakage. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12531	2021-09-08 12:59:43 -07:00
Brian Behlendorf	f616605821	Linux 5.15 compat: block device readahead The 5.15 kernel moved the backing_dev_info structure out of the request queue structure which causes a build failure. Rather than look in the new location for the BDI we instead detect this upstream refactoring by the existance of either the blk_queue_update_readahead() or disk_update_readahead() functions. In either case, there's no longer any reason to manually set the ra_pages value since it will be overridden with a reasonable default (2x the block size) when blk_queue_io_opt() is called. Therefore, we update the compatibility wrapper to do nothing for 5.9 and newer kernels. While it's tempting to do the same for older kernels we want to keep the compatibility code to preserve the existing behavior. Removing it would effectively increase the default readahead to 128k. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12532	2021-09-08 09:03:13 -06:00
Jorgen Lundman	3e8d5e4ff3	Add zpool_disable_datasets_os() / zfs_unmount_os() zpool_disable_datasets_os(): macOS needs to do a bunch of work to kick everything off zvols. zfs_unmount_os(): This allows us to unmount any zvols that may be mounted. Like with zfs destroy foo/vol Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12436	2021-08-31 09:56:00 -06:00
Rich Ercolani	b1a1c64313	Fix cross-endian interoperability of zstd It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12008 Closes #12022	2021-08-30 14:13:46 -07:00
Trevor Bautista	00888c0898	Extend zpool-iostat to account for ZIO_PRIORITY_REBUILD (#12319 ) Previously, zpool-iostat did not display any data regarding rebuild I/Os in either the latency/size histograms (-w/-l/-r) or the queue data (-q). This fix essentially utilizes the existing infrastructure for tracking rebuild queue data and displays this data in the proper places within zpool-iostat's output. Signed-off-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Signed-off-by: Trevor Bautista <tbautista@lanl.gov> Co-authored-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2021-08-26 11:26:49 -07:00
Richard Yao	abbf0bd4eb	Linux 4.11 compat: statx support Linux 4.11 added a new statx system call that allows us to expose crtime as btime. We do this by caching crtime in the znode to match how atime, ctime and mtime are cached in the inode. statx also introduced a new way of reporting whether the immutable, append and nodump bits have been set. It adds support for reporting compression and encryption, but the semantics on other filesystems is not just to report compression/encryption, but to allow it to be turned on/off at the file level. We do not support that. We could implement semantics where we refuse to allow user modification of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc() to find out encryption/compression information. That would introduce locking that will have a minor (although unmeasured) performance cost. It also would be inferior to zdb, which reports far more detailed information. We therefore omit reporting of encryption/compression through statx in favor of recommending that users interested in such information use zdb. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #8507	2021-08-17 10:25:58 -07:00
Alexander Motin	6b88b4b501	Remove b_pabd/b_rabd allocation from arc_hdr_alloc() When a header is allocated for full overwrite it is a waste of time to allocate b_pabd/b_rabd for it, since arc_write() will free them without ever being touched. If it is a read or a partial overwrite then arc_read() and arc_hdr_decrypt() allocate them explicitly. Reduced memory allocation in user threads also reduces ARC eviction throttling there, proportionally increasing it in ZIO threads, that is not good. To minimize or even avoid it introduce ARC allocation reserve, allowing certain arc_get_data_abd() callers to allocate a bit longer in situations where user threads will already throttle. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12398	2021-08-17 10:15:54 -06:00
Alexander Motin	72f0521aba	Increase default volblocksize from 8KB to 16KB Many things has changed since previous default was set many years ago. Nowadays 8KB does not allow adequate compression or even decent space efficiency on many of pools due to 4KB disk physical block rounding, especially on RAIDZ and DRAID. It effectively limits write throughput to only 2-3GB/s (250-350K blocks/s) due to sync thread, allocation, vdev queue and other block rate bottlenecks. It keeps L2ARC expensive despite many optimizations and dedup just unrealistic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12406	2021-08-17 09:59:46 -06:00
Alexander Motin	cfe8e960f1	Fix/improve dbuf hits accounting Instead of clearing stats inside arc_buf_alloc_impl() do it inside arc_hdr_alloc() and arc_release(). It fixes statistics being wiped every time a new dbuf is filled from the ARC. Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits. Since the hits are accounted under hash lock, replace atomics with simple increments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12422	2021-08-17 09:50:31 -06:00
Alexander Motin	e829a865bf	Use more atomics in refcounts Use atomic_load_64() for zfs_refcount_count() to prevent torn reads on 32-bit platforms. On 64-bit ones it should not change anything. When built with ZFS_DEBUG but running without tracking enabled use atomics instead of mutexes same as for builds without ZFS_DEBUG. Since rc_tracked can't change live we can check it without lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12420	2021-08-17 09:44:34 -06:00
Allan Jude	e945e8d7f4	Restore FreeBSD sysctl processing for arc.min and arc.max Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max to a disallowed value would return an error. Since the switch, it instead only generates WARN_IF_TUNING_IGNORED Keep the ability to set the sysctl's specifically to 0, even though that is less than the minimum, because some tests depend on this. Also lost, was the ability to set vfs.zfs.arc_max to a value less than the default vfs.zfs.arc_min at boot time. Restore this as well. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #12161	2021-08-16 09:35:19 -06:00
Tony Nguyen	6bc61d22c4	Run arc_evict thread at higher priority Run arc_evict thread at higher priority, nice=0, to give it more CPU time which can improve performance for workload with high ARC evict activities. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #12397	2021-08-10 11:36:26 -06:00
Alexander Motin	7eebcd2be6	Avoid small buffer copying on write It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to decide whether to reallocate/copy the buffer, because the answer is OS-specific and depends on the buffer size. Instead of that use abd_size_alloc_linear(), moved into public header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12425	2021-07-27 16:05:47 -07:00
Brian Behlendorf	4bd99c11d7	Remove overlooked __sun_attr__ based macros The __NORETURN, __CONST, and __PURE macros in the FreeBSD platform code were based on the __sun_attr__ macro which was removed in commit `5dbf6c5a6`. This caused a build failure because the __NORETURN macro was still used in one place in kernel code. The __CONST and __PURE macros were entirely unused. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12435	2021-07-27 09:49:11 -07:00
наб	037af3e0d4	Remove NOTE(CONSTCOND) and note.h These were mostly used to annotate do {} while(0)s Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:53 -07:00
наб	5dbf6c5a66	Replace /PRINTFLIKEn/ with attribute(printf) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:15 -07:00
Alexander Motin	1b50749ce9	Optimize allocation throttling Remove mc_lock use from metaslab_class_throttle_(). The math there is based on refcounts and so atomic, so the only race possible there is between zfs_refcount_count() and zfs_refcount_add(). But in most cases metaslab_class_throttle_reserve() is called with the allocator lock held, which covers the race. In cases where the lock is not held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we do not use zfs_refcount_count(). And even if we assume some other non-existing scenario, the worst that may happen from this race is few more I/Os get to allocation earlier, that is not a problem. Move locks and data of different allocators into different cache lines to avoid false sharing. Group spa_alloc_ arrays together into single array of aligned struct spa_alloc spa_allocs. Align struct metaslab_class_allocator. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12314	2021-07-21 06:40:36 -06:00
Kevin Jin	a7bd20e309	Add Module Parameter Regarding Log Size Limit * Add Module Parameters Regarding Log Size Limit zfs_wrlog_data_max The upper limit of TX_WRITE log data. Once it is reached, write operation is blocked, until log data is cleared out after txg sync. It only counts TX_WRITE log with WR_COPIED or WR_NEED_COPY. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12284	2021-07-20 09:40:24 -06:00
Alexander Motin	8172df643b	Minor ARC optimizations Remove unneeded global, practically constant, state pointer variables (arc_anon, arc_mru, etc.), replacing them with macros of real state variables addresses (&ARC_anon, &ARC_mru, etc.). Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special handling in inner loop of ARC reclamation. Respectively change bytes argument of arc_evict_state() from int64_t to uint64_t. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12348	2021-07-20 08:13:21 -06:00
Alexander	23c13c7e80	A few fixes of callback typecasting (for the upcoming ClangCFI) * zio: avoid callback typecasting * zil: avoid zil_itxg_clean() callback typecasting * zpl: decouple zpl_readpage() into two separate callbacks * nvpair: explicitly declare callbacks for xdr_array() * linux/zfs_nvops: don't use external iput() as a callback * zcp_synctask: don't use fnvlist_free() as a callback * zvol: don't use ops->zv_free() as a callback for taskq_dispatch() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12260	2021-07-20 08:03:33 -06:00
Kevin Bowling	ca14e08cbf	Detect HAVE_LARGE_STACKS at compile time Move HAVE_LARGE_STACKS definitions to header and set when appropriate. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Kevin Bowling <kbowling@FreeBSD.org> Closes #12350	2021-07-16 14:28:55 -06:00
Alexander Motin	c1b5869bab	Introduce dsl_dir_diduse_transfer_space() Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling them one after another is a waste of time and even more contention. Doing that twice for each rewritten block within dbuf_write_done() via dsl_dataset_block_kill() and dsl_dataset_block_born() created one of the biggest CPU overheads in case of small blocks rewrite. dsl_dir_diduse_transfer_space() combines functionality of these two functions for cases where it is needed, but without double overhead, practically for the cost of dsl_dir_diduse_space() or even cheaper. While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang detects some aliasing there, repeating dd->dd_dbuf->db_data dereference multiple times, increasing dd_lock scope and contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Author: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12300	2021-07-16 13:39:24 -06:00
Alexander Motin	f7de776da2	Fix ARC ghost states eviction accounting arc_evict_hdr() returns number of evicted bytes in scope of specific state. For ghost states it does not mean the amount of really freed memory, but the logical buffer size. It is correct for the eviction process, but not for waking up threads waiting for ARC size reduction, as added in "Revise ARC shrinker algorithm" commit, causing premature wakeups while ARC is still overflowed, allowing even bigger overflow, plus processing overhead when next allocation will also get blocked, probably also for too short time. To fix that make arc_evict_hdr() also return the amount of really freed memory, which for the ghost states is only the header, and use it to update arc_evict_count instead. Originally I was thinking to not return it at all, since arc_get_data_impl() does not account for the headers, but decided that some slow allocation progress is better than long waits, reaching on my tests up to 100ms. To reduce negative latency effects of long time periods when reclaim thread can free little real memory, start reclamation process earlier, before we actually reached the overflow threshold, when we have to throttle new allocations. We can also do it without taking global arc_evict_lock, reducing the contention. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12279	2021-07-13 09:41:59 -06:00
George Wilson	958826be7a	file reference counts can get corrupted Callers of zfs_file_get and zfs_file_put can corrupt the reference counts for the file structure resulting in a panic or a soft lockup. When zfs send/recv runs, it will add a reference count to the open file, and begin to send or recv the stream. If the file descriptor is closed, then when dmu_recv_stream() or dmu_send() return we will call zfs_file_put to remove the reference we placed on the file structure. Unfortunately, because zfs_file_put() uses the file descriptor to lookup the file structure, it may end up finding that the file descriptor table no longer contains the file struct, thus leaking the file structure. Or it might end up finding a file descriptor for a different file and blindly updating its reference counts. Other failure modes probably exists. This change reworks the zfs_file_[get\|put] interface to not rely on the file descriptor but instead pass the zfs_file_t pointer around. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-76119 Closes #12299	2021-07-10 19:00:37 -06:00
Jorgen Lundman	03dba7ae31	dprintf_dnode: strcpy -> strlcpy Missed a couple of strcpy() in earlier commit, this is only used with --enable-debug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12311	2021-07-07 21:13:40 -06:00
Alexander Motin	bdd11cbb90	FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE It makes no sense to set it below PAGE_SIZE, since it increases all overheads and makes returning memory to OS problematic. It makes no sense to set it above PAGE_SIZE, since such allocations and especially frees are too expensive and cause KVA fragmentation to benefit from fewer chunks. After that it makes no sense to keep more complicated math here. What may have sense though is just a tunable border between linear and scatter ABDs, previously also controlled by this tunable. Retain that functionality by taking abd_scatter_min_size tunable from Linux, just with different default value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12328	2021-07-06 17:39:23 -07:00
Alexander Motin	b192a2c0a1	Remove avl_size field from struct avl_tree This field is used only by illumos mdb. On other platforms it only increases the struct size from 32 to 40 bytes. For struct vdev_queue including 13 instances of avl_tree_t size means active cache lines. Keep the padding in user-space for now to not break the ABI. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12290	2021-07-01 09:32:31 -06:00
Alexander Motin	490c845efe	Compact dbuf/buf hashes and lock arrays With default dbuf cache size of 1/32 of ARC, it makes no sense to have hash table of the same size (or even bigger on Linux). Reduce it to 1/8 of ARC's one, still leaving some slack, assuming higher I/O rate via dbuf cache than via ARC. Remove padding from ARC hash locks array. The idea behind padding is to avoid false sharing between locks. It would have sense if there would be a limited number of very busy locks. But since we have no limit on the number, using the same memory for more locks we can achieve even lower lock contention with the same false sharing, or we can use less memory for the same contention level. Reduce number of hash locks from 8192 to 2048. The number is still big enough to not cause contention, but reduced memory size improves cache hit rate for mutex_tryenter() in ARC eviction thread, saving about 1% of the thread time. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12289	2021-07-01 09:30:31 -06:00
Jorgen Lundman	c6d1112bf4	Fix abd leak, kmem_free correct size of abd_t Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory. Additionally: * Switch abd_cache arena to FIRSTFIT, which empirically improves perofrmance. * Make abd_chunk_cache more performant and debuggable. * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap. * Don't try to reap non-existent qcaches in abd_cache arena. * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: Sean Doran <smd@use.net> Closes #12295	2021-07-01 09:28:15 -06:00
Kevin Jin	50e09eddd0	Optimize txg_kick() process (#12274 ) Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in original code. Extra parameter "txg" is added for txg_kick(), thus it knows which txg to kick. Also txg_kick() call is moved from dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can know the txg number assigned for txg_kick(). Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12274	2021-07-01 09:20:27 -06:00
Alexander Motin	42afb12da7	Remove refcount from spa_config_() The only reason for spa_config_() to use refcount instead of simple non-atomic (thanks to scl_lock) variable for scl_count is tracking, hard disabled for the last 8 years. Switch to simple int scl_count reduces the lock hold time by avoiding atomic, plus makes structure fit into single cache line, reducing the locks contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12287	2021-07-01 09:16:54 -06:00
Martin Matuška	14d2841b53	FreeBSD: fix compilation of FreeBSD world after `29274c9f6` prng32_bounded() is available to kernel only on FreeBSD 13+. Call inline random_get_pseudo_bytes() with correct pointer type. To be consistent, apply to Linux as well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #12282	2021-06-25 10:28:51 -07:00

1 2 3 4 5 ...

1890 Commits