mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-15 02:37:32 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	34dbc618f5	Reduce dbuf_find() lock contention Holding a dbuf is a common operation which can become highly contended in dbuf_find() when acquiring the dbuf hash mutex. This is particularly true on Linux when reading/writing volumes since by default up to 32 threads from the zvol_taskq may be taking a hold of the same dbuf. This should also be observable on FreeBSD as long as there are enough processes accessing the volume concurrently. This is further aggregrated by the fact that only the block id will be unique when calculating the dbuf hash for a single volume. The objset id, object id, and level will be the same for data blocks. This has been observed to result in a somehwat less than uniform hash distribution and a longer than expected max hash chain depth (~20) on a large memory system (256 GB) using volumes. This commit improves the siutation by switching the hash mutex to an rwlock to allow concurrent lookups, and increasing DBUF_RWLOCKS from 2048 to 8192 to further reduce the odds of a hash collision. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13405	2022-05-04 11:17:29 -07:00
Shaan Nobee	411f4a018d	Speed up WB_SYNC_NONE when a WB_SYNC_ALL occurs simultaneously Page writebacks with WB_SYNC_NONE can take several seconds to complete since they wait for the transaction group to close before being committed. This is usually not a problem since the caller does not need to wait. However, if we're simultaneously doing a writeback with WB_SYNC_ALL (e.g via msync), the latter can block for several seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE writeback since it needs to wait for the transaction to complete and the PG_writeback bit to be cleared. This commit deals with 2 cases: - No page writeback is active. A WB_SYNC_ALL page writeback starts and even completes. But when it's about to check if the PG_writeback bit has been cleared, another writeback with WB_SYNC_NONE starts. The sync page writeback ends up waiting for the non-sync page writeback to complete. - A page writeback with WB_SYNC_NONE is already active when a WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up waiting for the WB_SYNC_NONE writeback. The fix works by carefully keeping track of active sync/non-sync writebacks and committing when beneficial. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shaan Nobee <sniper111@gmail.com> Closes #12662 Closes #12790	2022-05-03 13:23:26 -07:00
Pawel Jakub Dawidek	a64d757aa4	FreeBSD: Clean up the use of ioflags - Prefer O_* flags over F* flags that mostly mirror O_* flags anyway, but O_* flags seem to be preferred. - Simplify the code as all the F*SYNC flags were defined as FFSYNC flag. - Don't define FRSYNC flag, so we don't generate unnecessary ZIL commits. - Remove EXCL define, FreeBSD ignores the excl argument for zfs_create() anyway. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13400	2022-05-02 16:26:28 -07:00
Alexander Motin	600a02b884	Improve log spacemap load time Previous flushing algorithm limited only total number of log blocks to the minimum of 256K and 4x number of metaslabs in the pool. As result, system with 1500 disks with 1000 metaslabs each, touching several new metaslabs each TXG could grow spacemap log to huge size without much benefits. We've observed one of such systems importing pool for about 45 minutes. This patch improves the situation from five sides: - By limiting maximum period for each metaslab to be flushed to 1000 TXGs, that effectively limits maximum number of per-TXG spacemap logs to load to the same number. - By making flushing more smooth via accounting number of metaslabs that were touched after the last flush and actually need another flush, not just ms_unflushed_txg bump. - By applying zfs_unflushed_log_block_pct to the number of metaslabs that were touched after the last flush, not all metaslabs in the pool. - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in advance, making log spacemap load process for wide HDD pool CPU-bound, accelerating it by many times. - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing single-threaded by nature log processing time from ~10 to ~5 minutes. As further optimization we could skip bumping ms_unflushed_txg for metaslabs not touched since the last flush, but that would be an incompatible change, requiring new pool feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12789	2022-04-26 10:44:21 -07:00
George Amanakis	0409d33273	Improve zpool status output, list all affected datasets Currently, determining which datasets are affected by corruption is a manual process. The primary difficulty in reporting the list of affected snapshots is that since the error was initially found, the snapshot where the error originally occurred in, may have been deleted. To solve this issue, we add the ID of the head dataset of the original snapshot which the error was detected in, to the stored error report. Then any time a filesystem is deleted, the errors associated with it are deleted as well. Any time a clone promote occurs, we modify reports associated with the original head to refer to the new head. The stored error reports are identified by this head ID, the birth time of the block which the error occurred in, as well as some information about the error itself are also stored. Once this information is stored, we can find the set of datasets affected by an error by walking back the list of snapshots in the given head until we find one with the appropriate birth txg, and then traverse through the snapshots of the clone family, terminating a branch if the block was replaced in a given snapshot. Then we report this information back to libzfs, and to the zpool status command, where it is displayed as follows: pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3 08:27:57 2021 config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 sdb ONLINE 0 0 1.58K errors: Permanent errors have been detected in the following files: test@1:/test.0.0 /test/test.0.0 /test/1clone/test.0.0 A new feature flag is introduced to mark the presence of this change, as well as promotion and backwards compatibility logic. This is an updated version of #9175. Rebase required fixing the tests, updating the ABI of libzfs, updating the man pages, fixing bugs, fixing the error returns, and updating the old on-disk error logs to the new format when activating the feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain <tulsi.jain@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9175 Closes #12812	2022-04-25 17:25:42 -07:00
наб	ad9e767657	linux: module: weld all but spl.ko into zfs.ko Originally it was thought it would be useful to split up the kmods by functionality. This would allow external consumers to only load what was needed. However, in practice we've never had a case where this functionality would be needed, and conversely managing multiple kmods can be awkward. Therefore, this change merges all but the spl.ko kmod in to a single zfs.ko kmod. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13274	2022-04-20 13:28:24 -07:00
Mark Johnston	e9084d0712	FreeBSD: Parameterize ZFS_ENTER/ZFS_VERIFY_VP with an error code For legacy reasons, a couple of VOPs have to return error numbers that don't come from the usual errno namespace. To handle the cases where ZFS_ENTER or ZFS_VERIFY_ZP fail, we need to be able to override the default error return value of EIO. Extend the macros to permit this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #13311	2022-04-13 09:42:51 -07:00
Brian Behlendorf	460748d4ae	Switch from _Noreturn to __attribute__((noreturn)) Parts of the Linux kernel build system struggle with _Noreturn. This results in the following warnings when building on RHEL 8.5, and likely other environments. Switch to using the __attribute__((noreturn)). warning: objtool: dbuf_free_range()+0x2b8: return with modified stack frame warning: objtool: dbuf_free_range()+0x0: stack state mismatch: cfa1=7+40 cfa2=7+8 ... WARNING: EXPORT symbol "arc_buf_size" [zfs.ko] version generation failed, symbol will not be versioned. WARNING: EXPORT symbol "spa_open" [zfs.ko] version generation failed, symbol will not be versioned. ... Additionally, __thread_exit() has been renamed spl_thread_exit() and made a static inline function. This was needed because the kernel will generate a warning for symbols which are __attribute__((noreturn)) and then exported with EXPORT_SYMBOL. While we could continue to use _Noreturn in user space I've also switched it to __attribute__((noreturn)) purely for consistency throughout the code base. Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13238	2022-03-23 08:51:00 -07:00
Brian Behlendorf	6b444cb971	Linux 5.16 compat: restore FSR and FSAVE Commit `3b52ccd` introduced a flaw where FSR and FSAVE are not restored when using a Linux 5.16 kernel. These instructions are only used when XSAVE is not supported by the processor meaning only some systems will encounter this issue. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13210 Closes #13236	2022-03-19 12:46:33 -07:00
Ryan Moeller	d42979c6ef	Fix ACL checks for NFS kernel server This PR changes ZFS ACL checks to evaluate fsuid / fsgid rather than euid / egid to avoid accidentally granting elevated permissions to NFS clients. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Andrew Walker <awalker@ixsystems.com> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #13221	2022-03-18 06:47:57 -06:00
наб	a9e2b22efb	Integrate carcass of libspl/i/s/vtoc.h into i/s/efi_partition.h Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:54 -07:00
наб	d465fc5844	Forbid b{copy,zero,cmp}(). Don't include <strings.h> for <string.h> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:48 -07:00
наб	861166b027	Remove bcopy(), bzero(), bcmp() bcopy() has a confusing argument order and is actually a move, not a copy; they're all deprecated since POSIX.1-2001 and removed in -2008, and we shim them out to mem*() on Linux anyway Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:42 -07:00
наб	1d77d62f5a	libspl: include: sys/vtoc.h: reduce to absolute barest minimum Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12996	2022-03-15 15:13:36 -07:00
Akash B	1282274f33	Add physical device size to SIZE column in 'zpool list -v' Add physical device size/capacity only for physical devices in 'zpool list -v' instead of displaying "-" in the SIZE column. This would make it easier to see the individual device capacity and to determine which spares are large enough to replace which devices. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com> Signed-off-by: Akash B <akash-b@hpe.com> Closes #12561 Closes #13106	2022-03-08 16:20:41 -08:00
Attila Fülöp	ce7a5dbf4b	Linux x86 SIMD: factor out unneeded kernel dependencies Cleanup the kernel SIMD code by removing kernel dependencies. - Replace XSTATE_XSAVE with our own XSAVE implementation for all kernels not exporting kernel_fpu{begin,end}(), see #13059 - Replace union fpregs_state by a uint8_t * buffer and get the size of the buffer from the hardware via the CPUID instruction - Replace kernels xgetbv() by our own implementation which was already there for userspace. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13102	2022-03-08 16:19:15 -08:00
наб	3a909fe33e	libzfs, libzfs_core: send: always write to pipe By introducing lzc_send_wrapper() and routing all ZFS_IOC_SEND* users through it, we fix a Linux 5.10-introduced bug (see comment) This is all /transparent/ to the users API, ABI, and usage-wise, and disabled on FreeBSD and if the output is already a pipe, and transparently nestable (i.e. zfs_send_one() is wrapped, but so is lzc_send_redacted() it calls to ‒ this wouldn't be strictly necessary if ZFS_IOC_SEND_PROGRESS wasn't strictly denominational w.r.t. the descriptor the send is happening on) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Co-authored-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #11445 Closes #13133	2022-03-08 09:33:08 -08:00
Brian Behlendorf	6df43169b3	Fix ENOSPC when unlinking multiple files from full pool When unlinking multiple files from a pool at 100% capacity, it was possible for ENOSPC to be returned after the first unlink. e.g. rm -f /mnt/fs/test1.0.0 /mnt/fs/test1.1.0 /mnt/fs/test1.2.0 rm: cannot remove '/mnt/fs/test1.1.0': No space left on device rm: cannot remove '/mnt/fs/test1.2.0': No space left on device After waiting for the pending deferred frees from the first unlink to be processed the remaining files can then be unlinked. This is caused by the quota limit in dsl_dir_tempreserve_impl() being temporarily decreased to the allocatable pool capacity less any deferred free space. This is resolved using the existing mechanism of returning ERESTART when over quota as long as we know enough space will shortly be available after processing the pending deferred frees. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13172	2022-03-08 09:16:35 -08:00
Umer Saleem	39a4daf742	Expose additional file level attributes ZFS allows to update and retrieve additional file level attributes for FreeBSD. This commit allows additional file level attributes to be updated and retrieved for Linux. These include the flags stored in the upper half of z_pflags only. Two new IOCTLs have been added for this purpose. ZFS_IOC_GETDOSFLAGS can be used to retrieve the attributes, while ZFS_IOC_SETDOSFLAGS can be used to update the attributes. Attributes that are allowed to be updated include ZFS_IMMUTABLE, ZFS_APPENDONLY, ZFS_NOUNLINK, ZFS_ARCHIVE, ZFS_NODUMP, ZFS_SYSTEM, ZFS_HIDDEN, ZFS_READONLY, ZFS_REPARSE, ZFS_OFFLINE and ZFS_SPARSE. Flags can be or'd together while calling ZFS_IOC_SETDOSFLAGS. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes #13118	2022-03-07 17:52:03 -08:00
Windel Bouwman	9955b9ba2e	Handle aarch64 defines seperate from arm aarch64 is a different architecture than arm. Some compilers might choke when both __arm__ and __aarch64__ are defined. This change separates the checks for arm and for aarch64 in the isa_defs.h header files. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Windel Bouwman <windel@windel.nl> Closes #10335 Closes #13151	2022-03-07 17:49:34 -08:00
Alejandro Colomar	db7f1a91de	Use _Noreturn (C11; GNU89) properly A function that returns with no value is a different thing from a function that doesn't return at all. Those are two orthogonal concepts, commonly confused. pthread_create(3) expects a pointer to a start routine that has a very precise prototype: void (start_routine)(void ); However, other thread functions, such as kernel ones, expect: void (start_routine)(void ); Providing a different one is incorrect, and has only been working because the ABIs happen to produce a compatible function. We should use '_Noreturn void', since it's the natural type, and then provide a '_Noreturn void ' wrapper for pthread functions. For consistency, replace most cases of __NORETURN or __attribute__((noreturn)) by _Noreturn. _Noreturn is understood by -std=gnu89, so it should be safe to use everywhere. Ref: https://github.com/openzfs/zfs/pull/13110#discussion_r808450136 Ref: https://software.codidact.com/posts/285972 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com> Closes #13120	2022-03-04 16:25:22 -08:00
наб	be8e1d81bf	Flex non-pretty-printed properties and raw-/pretty-print remaining ones Before: nabijaczleweli@tarta:~/store/code/zfs$ /sbin/zpool list -Td -o name,size,alloc,free,ckpoint,expandsz,guid,load_guid,frag,cap,dedup,health,altroot,guid,dedupditto,load_guid,maxblocksize,maxdnodesize 2>/dev/null Sun 20 Feb 03:57:44 CET 2022 NAME SIZE ALLOC FREE CKPOINT EXPANDSZ GUID LOAD_GUID FRAG CAP DEDUP HEALTH ALTROOT GUID DEDUPDITTO LOAD_GUID MAXBLOCKSIZE MAXDNODESIZE filling 25.5T 6.52T 18.9T - 64M 11512889483096932869 11656109927366648364 1% 25% 1.00x ONLINE - 11512889483096932869 0 11656109927366648364 1048576 16384 tarta-boot 240M 50.6M 189M - - 2372068846917849656 7752280792179633787 12% 21% 1.00x ONLINE - 2372068846917849656 0 7752280792179633787 1048576 512 tarta-zoot 55.5G 6.42G 49.1G - - 12971868889665384604 8622632123393589527 17% 11% 1.00x ONLINE - 12971868889665384604 0 8622632123393589527 1048576 16384 nabijaczleweli@tarta:~/store/code/zfs$ /sbin/zfs list -o name,guid,keyguid,ivsetguid,createtxg,objsetid,pbkdf2iters,refratio -r tarta-zoot NAME GUID KEYGUID IVSETGUID CREATETXG OBJSETID PBKDF2ITERS REFRATIO tarta-zoot 1110930838977259561 659P - 1 54 0 1.03x tarta-zoot/PAGEFILE.SYS 2202570496672997800 3.20E - 2163 1539 0 1.07x tarta-zoot/dupa 16941280502417785695 9.81E - 2274707 1322 1000000000000 1.00x tarta-zoot/etc 17029963068508333530 12.9E - 3663 1087 0 1.52x tarta-zoot/home 3508163802370032575 8.50E - 3664 294 0 1.00x tarta-zoot/home/misio 7283672744014848555 13.0E - 3665 302 0 2.28x tarta-zoot/home/nabijaczleweli 12286744508078616303 5.15E - 3666 200 0 2.05x tarta-zoot/home/nabijaczleweli/tftp 13551632689932817643 5.16E - 3667 1095 0 1.00x tarta-zoot/home/root 5203106193060067946 15.4E - 3668 698 0 2.86x tarta-zoot/home/shared-config 8866040021005142194 14.5E - 3670 2069 0 1.20x tarta-zoot/home/tymek 9472751824283011822 4.56E - 3671 1202 0 1.32x tarta-zoot/oldboot 10460192444135730377 13.8E - 2268398 1232 0 1.01x tarta-zoot/opt 9945621324983170410 5.84E - 3672 1210 0 1.00x tarta-zoot/opt/icecc 13178238931846132425 9.04E - 3673 1103 0 2.83x tarta-zoot/opt/swtpm 10172962421514870859 4.13E - 825669 145132 0 1.87x tarta-zoot/srv 217179989022738337 3.90E - 3674 2469 0 1.00x tarta-zoot/usr 12214213243060765090 15.0E - 3675 2477 0 2.58x tarta-zoot/usr/local 7542700368693813134 941P - 3676 2484 0 2.33x tarta-zoot/var 13414177124447929530 10.2E - 3677 2492 0 1.57x tarta-zoot/var/lib 6969944550407159241 5.28E - 3678 2499 0 2.34x tarta-zoot/var/tmp 6399468088048343912 1.34E - 3679 1218 0 3.95x After: nabijaczleweli@tarta:~/store/code/zfs$ cmd/zpool/zpool list -Td -o name,size,alloc,free,ckpoint,expandsz,guid,load_guid,frag,cap,dedup,health,altroot,guid,dedupditto,load_guid,maxblocksize,maxdnodesize 2>/dev/null Sun 20 Feb 03:57:42 CET 2022 NAME SIZE ALLOC FREE CKPOINT EXPANDSZ GUID LOAD_GUID FRAG CAP DEDUP HEALTH ALTROOT GUID DEDUPDITTO LOAD_GUID MAXBLOCKSIZE MAXDNODESIZE filling 25.5T 6.52T 18.9T - 64M 11512889483096932869 11656109927366648364 1% 25% 1.00x ONLINE - 11512889483096932869 0 11656109927366648364 1M 16K tarta-boot 240M 50.6M 189M - - 2372068846917849656 7752280792179633787 12% 21% 1.00x ONLINE - 2372068846917849656 0 7752280792179633787 1M 512 tarta-zoot 55.5G 6.42G 49.1G - - 12971868889665384604 8622632123393589527 17% 11% 1.00x ONLINE - 12971868889665384604 0 8622632123393589527 1M 16K nabijaczleweli@tarta:~/store/code/zfs$ cmd/zfs/zfs list -o name,guid,keyguid,ivsetguid,createtxg,objsetid,pbkdf2iters,refratio -r tarta-zoot NAME GUID KEYGUID IVSETGUID CREATETXG OBJSETID PBKDF2ITERS REFRATIO tarta-zoot 1110930838977259561 741529699813639505 - 1 54 0 1.03x tarta-zoot/PAGEFILE.SYS 2202570496672997800 3689529982640017884 - 2163 1539 0 1.07x tarta-zoot/dupa 16941280502417785695 11312442953423259518 - 2274707 1322 1000000000000 1.00x tarta-zoot/etc 17029963068508333530 14852574366795347233 - 3663 1087 0 1.52x tarta-zoot/home 3508163802370032575 9802810070759776956 - 3664 294 0 1.00x tarta-zoot/home/misio 7283672744014848555 14983161489316798151 - 3665 302 0 2.28x tarta-zoot/home/nabijaczleweli 12286744508078616303 5937870537299886218 - 3666 200 0 2.05x tarta-zoot/home/nabijaczleweli/tftp 13551632689932817643 5950522828900813054 - 3667 1095 0 1.00x tarta-zoot/home/root 5203106193060067946 17718025091255443518 - 3668 698 0 2.86x tarta-zoot/home/shared-config 8866040021005142194 16716354482778968577 - 3670 2069 0 1.20x tarta-zoot/home/tymek 9472751824283011822 5251854710505749954 - 3671 1202 0 1.32x tarta-zoot/oldboot 10460192444135730377 15894065034622168157 - 2268398 1232 0 1.01x tarta-zoot/opt 9945621324983170410 6737735639539098405 - 3672 1210 0 1.00x tarta-zoot/opt/icecc 13178238931846132425 10425145983015238428 - 3673 1103 0 2.83x tarta-zoot/opt/swtpm 10172962421514870859 4764783754852521469 - 825669 145132 0 1.87x tarta-zoot/srv 217179989022738337 4492810461439647259 - 3674 2469 0 1.00x tarta-zoot/usr 12214213243060765090 17306702395865262834 - 3675 2477 0 2.58x tarta-zoot/usr/local 7542700368693813134 1059954157997659784 - 3676 2484 0 2.33x tarta-zoot/var 13414177124447929530 11764397504176937123 - 3677 2492 0 1.57x tarta-zoot/var/lib 6969944550407159241 6084753728494937404 - 3678 2499 0 2.34x tarta-zoot/var/tmp 6399468088048343912 1548692824635344277 - 3679 1218 0 3.95x Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13122 Closes #13125	2022-03-04 12:08:33 -08:00
Rich Ercolani	56fa4aa96e	Default to ON for compression A simple change, but so many tests break with it, and those are the majority of this. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #13078	2022-03-03 10:43:38 -08:00
Jitendra Patidar	361a7e8211	log xattr=sa create/remove/update to ZIL As such, there are no specific synchronous semantics defined for the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is done, if sync=always is set on dataset. This provides sync semantics for xattr=on with sync=always set on dataset. For the xattr=sa implementation, it doesn't log to ZIL, so, even with sync=always, xattrs are not guaranteed to be synced before xattr call returns to caller. So, xattr can be lost if system crash happens, before txg carrying xattr transaction is synced. This change adds xattr=sa logging to ZIL on xattr create/remove/update and xattrs are synced to ZIL (zil_commit() done) for sync=always. This makes xattr=sa behavior similar to xattr=on. Implementation notes: The actual logging is fairly straight-forward and does not warrant additional explanation. However, it has been 14 years since we last added new TX types to the ZIL [1], hence this is the first time we do it after the introduction of zpool features. Therefore, here is an overview of the feature activation and deactivation workflow: 1. The feature must be enabled. Otherwise, we don't log the new record type. This ensures compatibility with older software. 2. The feature is activated per-dataset, since the ZIL is per-dataset. 3. If the feature is enabled and dataset is not for zvol, any append to the ZIL chain will activate the feature for the dataset. Likewise for starting a new ZIL chain. 4. A dataset that doesn't have a ZIL chain has the feature deactivated. We ensure (3) by activating on the first zil_commit() after the feature was enabled. Since activating the features requires waiting for txg sync, the first zil_commit() after enabling the feature will be slower than usual. The downside is that this is really a conservative approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL chain, we pay the penalty for feature activation. The upside is that the user is in control of when we pay the penalty, i.e., upon enabling the feature. We ensure (4) by hooking into zil_sync(), where ZIL destroy actually happens. One more piece on feature activation, since it's spread across multiple functions: zil_commit() zil_process_commit_list() if lwb == NULL // first zil_commit since zil_open zil_create() if no log block pointer in ZIL header: if feature enabled and not active: // CASE 1 enable, COALESCE txg wait with dmu_tx that allocated the log block else // log block was allocated earlier than this zil_open if feature enabled and not active: // CASE 2 enable, EXPLICIT txg wait else // already have an in-DRAM LWB if feature enabled and not active: // this happens when we enable the feature after zil_create // CASE 3 enable, EXPLICIT txg wait [1] `da6c28aaf6` Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com> Closes #8768 Closes #9078	2022-02-22 13:06:43 -08:00
Damian Szuberski	806739f991	Correct compilation errors reported by GCC 10/11 New `zfs_type_t` value `ZFS_TYPE_INVALID` is introduced. Variable initialization is now possible to make GCC happy. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12167 Closes #13103	2022-02-20 19:20:00 -08:00
Ryan Moeller	5c0061345b	Cross-platform xattr user namespace compatibility ZFS on Linux originally implemented xattr namespaces in a way that is incompatible with other operating systems. On illumos, xattrs do not have namespaces. Every xattr name is visible. FreeBSD has two universally defined namespaces: EXTATTR_NAMESPACE_USER and EXTATTR_NAMESPACE_SYSTEM. The system namespace is used for protected FreeBSD-specific attributes such as MAC labels and pnfs state. These attributes have the namespace string "freebsd:system:" prefixed to the name in the encoding scheme used by ZFS. The user namespace is used for general purpose user attributes and obeys normal access control mechanisms. These attributes have no namespace string prefixed, so xattrs written on illumos are accessible in the user namespace on FreeBSD, and xattrs written to the user namespace on FreeBSD are accessible by the same name on illumos. Linux has several xattr namespaces. On Linux, ZFS encodes the namespace in the xattr name for every namespace, including the user namespace. As a consequence, an xattr in the user namespace with the name "foo" is stored by ZFS with the name "user.foo" and therefore appears on FreeBSD and illumos to have the name "user.foo" rather than "foo". Conversely, none of the xattrs written on FreeBSD or illumos are accessible on Linux unless the name happens to be prefixed with one of the Linux xattr namespaces, in which case the namespace is stripped from the name. This makes xattrs entirely incompatible between Linux and other platforms. We want to make the encoding of user namespace xattrs compatible across platforms. A critical requirement of this compatibility is for xattrs from existing pools from FreeBSD and illumos to be accessible by the same names in the user namespace on Linux. It is also necessary that existing pools with xattrs written by Linux retain access to those xattrs by the same names on Linux. Making user namespace xattrs from Linux accessible by the correct names on other platforms is important. The handling of other namespaces is not required to be consistent. Add a fallback mechanism for listing and getting xattrs to treat xattrs as being in the user namespace if they do not match a known prefix. Do not allow setting or getting xattrs with a name that is prefixed with one of the namespace names used by ZFS on supported platforms. Allow choosing between legacy illumos and FreeBSD compatibility and legacy Linux compatibility with a new tunable. This facilitates replication and migration of pools between hosts with different compatibility needs. The tunable controls whether or not to prefix the namespace to the name. If the xattr is already present with the alternate prefix, remove it so only the new version persists. By default the platform's existing convention is used. Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11919	2022-02-15 16:35:30 -08:00
наб	de0ec5e7df	module: icp: remove vestigia of crypto sessions Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:56 -08:00
наб	cf497e18df	module: icp: remove unused (and mostly faked) cm_{{min,max}_key_length,mech_flags} Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:52 -08:00
наб	df7b54f1d9	module: icp: rip out insane crypto_req_handle_t mechanism, inline KM_SLEEP Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:37 -08:00
наб	15ec086396	include: crypto: clean out api.h Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:32 -08:00
наб	1949be46c3	include: crypto: clean out unused SYSCALL32 and flags Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:24 -08:00
наб	f43748f6e1	module: icp: remove algorithm name defines used only in the default mechtab Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:21 -08:00
наб	d223af9bbc	include: crypto: remove unused algorithm name defines Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:17 -08:00
наб	64e82cea13	module: icp: remove set-but-unused cd_miscdata Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:13 -08:00
наб	739afd9475	module: icp: fold away all key formats except CRYPTO_KEY_RAW It's the only one actually used Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:07 -08:00
наб	1018e81e30	module: icp: remove unused CRYPTO_* error codes Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:25:03 -08:00
наб	d77702035a	module: icp: remove unused CRYPTO_{NOTIFY_OPDONE,SKIP_REQID,RESTRICTED} Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:24:24 -08:00
наб	eb1e09b7ec	module: icp: remove unused CRYPTO_ALWAYS_QUEUE Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:24:19 -08:00
наб	65a613b70d	module: icp: remove unused kcf_digest.c Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:24:14 -08:00
наб	bf3fffe70d	module: icp: remove unused kcf_mac operations Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:24:04 -08:00
наб	2c2f955aae	module: icp: remove unused kcf_cipher operations Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:23:59 -08:00
наб	710657f51d	module: icp: remove other provider types Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:23:53 -08:00
наб	464700ae02	module: icp: spi: crypto_ops_t: remove unused op types Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12901	2022-02-15 16:23:28 -08:00
Jorgen Lundman	4759342a5e	Add spa _os() hooks Add hooks for when spa is created, exported, activated and deactivated. Used by macOS to attach iokit, and lock kext as busy (to stop unloads). Userland, Linux, and, FreeBSD have empty stubs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12801	2022-02-15 15:54:25 -08:00
Jorgen Lundman	9a70e97fe1	Rename fallthrough to zfs_fallthrough Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #13097	2022-02-15 08:58:59 -08:00
Brian Behlendorf	10271af05c	Fix gcc warning in kfpu_begin() Observed when building on CentOS 8 Stream. Remove the `out` label at the end of the function and instead return. linux/simd_x86.h: In function 'kfpu_begin': linux/simd_x86.h:337:1: error: label at end of compound statement Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13089	2022-02-11 14:31:45 -08:00
Jorgen Lundman	c28d6ab08b	Rename EMPTY_TASKQ into taskq_empty To follow a change in illumos taskq Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12802	2022-02-09 15:04:26 -07:00
Attila Fülöp	8e94ac0e36	Linux 5.16 compat: don't use XSTATE_XSAVE to save FPU state Linux 5.16 moved XSTATE_XSAVE and XSTATE_XRESTORE out of our reach, so add our own XSAVE{,OPT,S} code and use it for Linux 5.16. Please note that this differs from previous behavior in that it won't handle exceptions created by XSAVE an XRSTOR. This is sensible for three reasons. - Exceptions during XSAVE and XRSTOR can only occur if the feature is not supported or enabled or the memory operand isn't aligned on a 64 byte boundary. If this happens something else went terribly wrong, and it may be better to stop execution. - Previously we just printed a warning and didn't handle the fault, this is arguable for the above reason. - All other *SAVE instruction also don't handle exceptions, so this at least aligns behavior. Finally add a test to catch such a regression in the future. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #13042 Closes #13059	2022-02-09 12:50:10 -08:00
Christian Schwarz	1dccfd7a38	zvol: make calls to platform ops static There's no need to make the platform ops dynamic dispatch. This change replaces the dynamic dispatch with static calls to the platform-specific functions. To avoid name collisions, prefix all platform-specific functions with `zvol_os_`. I actually find `zvol_..._os` slightly nicer to read in the calling code, but having it as a prefix is useful. Advantage: - easier jump-to-definition / grepping - potential benefits to static analysis - better legibility Future work: also prefix remaining `static` functions in zvol_os.c. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com> Closes #12965	2022-02-07 10:24:38 -08:00
Alexander Motin	f2c5bc150e	Add more control/visibility to spa_load_verify(). Use error thresholds from policy to control whether to scrub data and/or metadata. If threshold is set to UINT64_MAX, then caller probably does not care about result and we may skip that part. By default import neither set the data error threshold nor read the error counter, so skip the data scrub for faster import. Metadata are still scrubbed and fail if even single error found. While there just for symmetry return number of metadata errors in case threshold is not set to zero and we haven't reached it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #13022	2022-02-04 13:06:38 -08:00
Damian Szuberski	63652e1546	Add `--enable-asan` and `--enable-ubsan` switches `configure` now accepts `--enable-asan` and `--enable-ubsan` switches which results in passing `-fsanitize=address` and `-fsanitize=undefined`, respectively, to the compiler. Those flags are enabled in GitHub workflows for ZTS and zloop. Errors reported by both instrumentations are corrected, except for: - Memory leak reporting is (temporarily) suppressed. The cost of fixing them is relatively high compared to the gains. - Checksum computing functions in `module/zcommon/zfs_fletcher*` have UBSan errors suppressed. It is completely impractical to enforce 64-byte payload alignment there due to performance impact. - There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom memory allocator is used there rendering that measure unfeasible. - Memory leaks detection has to be suppressed for `cmd/zvol_id`. `zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is incompatible with memory leaks detection. Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12928	2022-02-03 14:35:38 -08:00
Ryan Moeller	15aa38690e	Simplify resume token generation * Improve naming. * Reduce indentation. * Avoid boilerplate logic duplication. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #12967	2022-02-01 17:04:08 -08:00
наб	0f7a0cc7c2	libzfs: const correctness Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12968	2022-02-01 16:56:18 -08:00
Mark Johnston	fdcb79b52e	spl: Don't check FreeBSD rwlocks for double initialization (#13019 ) This checking breaks KMSAN since it effectively loads from uninitialized memory to see if the lock is already initialized. This happens in dnode_cons() for example. This checking is not very useful, partly due to UMA's memory trashing, and is already disabled for mutexes. Make mutexes and rwlocks consistent: remove double-initialization checking for rwlocks, and pass SX_NEW to disable the same checking in lock_init(). No functional change intended, this affects only debug builds. As a side note, kmem cache constructors/destructors are implemented suboptimally on FreeBSD. FreeBSD's slab allocator, UMA, supports two pairs of constructors/destructors: ctor/dtor and init/fini. The former are called upon every allocation and free of an item, while the latter are called when an item is imported or released from a zone, respectively. That is, when a slab is allocated to a particular cache, it is subdivided into items, and init is called on each. fini is called when the slab is being prepared to be freed back to the system. The intent is for them to initialize static fields such as locks, which do not need to be initialized upon each allocation of an item. In illumos, kmem_cache constructors/destructors correspond to UMA's init/fini callbacks. However, in the SPL they are implemented as UMA ctor/dtors, meaning that they get called far more often than necessary. This may be difficult to fix, since new code may assume the kmem cache ctor/dtors are in fact called upon each allocation/free, and there doesn't seem to be a clear way to implement the intended semantics on Linux. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #13019	2022-01-31 10:58:45 -08:00
наб	c70bb2f610	Replace *CTASSERT() with _Static_assert() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12993	2022-01-26 11:38:52 -08:00
наб	7ada752a93	Clean up CSTYLEDs 69 CSTYLED BEGINs remain, appx. 30 of which can be removed if cstyle(1) had a useful policy regarding CALL(ARG1, ARG2, ARG3); above 2 lines. As it stands, it spits out both sysctl_os.c: 385: continuation line should be indented by 4 spaces sysctl_os.c: 385: indent by spaces instead of tabs which is very cool Another >10 could be fixed by removing "ulong" &al. handling. I don't foresee anyone actually using it intentionally (does it even exist in modern headers? why did it in the first place?). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12993	2022-01-26 11:38:52 -08:00
наб	a9e2788ffe	libspl: cast to uintptr_t instead of !!ing This led to these two warning types: debug.h:139:67: warning: the address of ‘ARC_anon’ will always evaluate as ‘true’ [-Waddress] 139 \| #define ASSERT3P(x, y, z) ((void) sizeof (!!(x)), (void) sizeof (!!(z))) \| ^ arc.c:1591:2: note: in expansion of macro ‘ASSERT3P’ 1591 \| ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); \| ^~~~~~~~ and arc.h:66:44: warning: ‘<<’ in boolean context, did you mean ‘<’? [-Wint-in-bool-context] 66 \| #define HDR_GET_LSIZE(hdr) ((hdr)->b_lsize << SPA_MINBLOCKSHIFT) debug.h:138:46: note: in definition of macro ‘ASSERT3U’ 138 \| #define ASSERT3U(x, y, z) ((void) sizeof (!!(x)), (void) sizeof (!!(z))) \| ^ arc.c:1760:12: note: in expansion of macro ‘HDR_GET_LSIZE’ 1760 \| ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0); \| ^~~~~~~~~~~~~ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #13009	2022-01-24 17:05:42 -08:00
Rich Ercolani	299fbf75ec	Linux 5.16 compat: Added mapping for iov_iter_fault_in_readable Linux decided to rename this for some reason. At some point, we should probably invert this mapping, but for now... Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12975	2022-01-24 12:59:09 -08:00
Mark Johnston	6e2a59181e	Avoid memory allocations in the ARC eviction thread When the eviction thread goes to shrink an ARC state, it allocates a set of marker buffers used to hold its place in the state's sublists. This can be problematic in low memory conditions, since 1) the allocation can be substantial, as we allocate NCPU markers; 2) on at least FreeBSD, page reclamation can block in arc_wait_for_eviction() In particular, in stress tests it's possible to hit a deadlock on FreeBSD when the number of free pages is very low, wherein the system is waiting for the page daemon to reclaim memory, the page daemon is waiting for the ARC eviction thread to finish, and the ARC eviction thread is blocked waiting for more memory. Try to reduce the likelihood of such deadlocks by pre-allocating markers for the eviction thread at ARC initialization time. When evicting buffers from an ARC state, check to see if the current thread is the ARC eviction thread, and use the pre-allocated markers for that purpose rather than dynamically allocating them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #12985	2022-01-21 10:28:13 -08:00
наб	bc40713a8f	libspl: ASSERT: !! for sizeof sizeof(bitfield.member) is invalid, and this shows up in some FreeBSD build configurations: work around this by !!ing ‒ this makes the sizeof target the ! result type (_Bool), instead Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Fixes: `42aaf0e` ("libspl: ASSERT: mark arguments as used") Closes #12984 Closes #12986	2022-01-21 10:20:11 -08:00
наб	e1c720de7d	libefi: remove efi_type() All it is right now is some #if 0ed Solaris code that returns ENOSYS, and is only applicable for the Solaris blockdev layer. In the Illumos gate, there's a single user: rmformat(1); I recommend a read of the manual as a blast from the past, but, well Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12844 Closes #12969	2022-01-18 14:40:43 -08:00
наб	18168da727	module/*.ko: prune .data, global .rodata Evaluated every variable that lives in .data (and globals in .rodata) in the kernel modules, and constified/eliminated/localised them appropriately. This means that all read-only data is now actually read-only data, and, if possible, at file scope. A lot of previously- global-symbols became inlinable (and inlined!) constants. Probably not in a big Wowee Performance Moment, but hey. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12899	2022-01-14 15:37:55 -08:00
Tino Reichardt	a798b485ae	Remove sha1 hashing from OpenZFS, it's not used anywhere. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12895 Closes #12902	2022-01-06 16:16:28 -08:00
наб	5c8389a8cb	module: icp: rip out the Solaris loadable module architecture After progressively folding away null cases, it turns out there's /literally/ nothing there, even if some things are part of the Solaris SPARC DDI/DKI or the seventeen module types (some doubled for 32-bit userland), or the entire modctl syscall definition. Nothing. Initialisation is handled in illumos-crypto.c, which calls all the initialisers directly Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Attila Fülöp <attila@fueloep.org> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12895 Closes #12902	2022-01-06 16:14:04 -08:00
Paul Dagnelie	399b98198a	Revert "zfs list: Allow more fields in ZFS_ITER_SIMPLE mode" This reverts commit `f6a0dac84a`. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12938	2022-01-06 11:12:53 -08:00
Brian Behlendorf	3c80e0742a	Verify dRAID empty sectors Verify that all empty sectors are zero filled before using them to calculate parity. Failure to do so can result in incorrect parity columns being generated and written to disk if the contents of an empty sector are non-zero. This was possible because the checksum only protects the data portions of the buffer, not the empty sector padding. This issue has been addressed by updating raidz_parity_verify() to check that all dRAID empty sectors are zero filled. Any sectors which are non-zero will be fixed, repair IO issued, and a checksum error logged. They can then be safely used to verify the parity. This specific type of damage is unlikely to occur since it requires a disk to have silently returned bad data, for an empty sector, while performing a scrub. However, if a pool were to have been damaged in this way, scrubbing the pool with this change applied will repair both the empty sector and parity columns as long as the data checksum is valid. Checksum errors will be reported in the `zpool status` output for any repairs which are made. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12857	2022-01-04 16:46:32 -08:00
наб	c2f94afa0e	include: dmu.h: fix unused, remove argsused Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:42:47 -08:00
наб	83719bd68c	include: sys/arc.h: shim out arc_referenced() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:36:26 -08:00
наб	29d033c6b5	include: qat.h: mark unused macro arguments as used Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:36:21 -08:00
наб	42aaf0e7c4	libspl: ASSERT*: mark arguments as used Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12844	2021-12-23 09:35:47 -08:00
наб	eb51a9d747	zcommon: pre-iterate over sysfs instead of statting every feature If sufficient memory (<2K, realistically) is available, libzfs_init() can be significantly shorted by iterating over the correct sysfs directory before registrations, we can turn 168 stats into 15/18 syscalls (3 opens (6 if built in), 3 fstats, 6 getdentses, and 3 closes), a tenfoldish reduction; this is probably a bit faster, too. The list is always optional, and registration functions (and one-off users) can simply pass NULL, which will fall back to the previous mechanism Also, don't allocate in zfs_mod_supported_impl, and use use access() instead of stat(), since existence is really what we care about Also, fix pre-prop-checking compat in fallback for built-in ZFS Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12089	2021-12-16 16:43:10 -08:00
Ryan Moeller	92a9e8c618	FreeBSD: Provide correct file generation number va_seq was actually a thin veil over va_gen, so z_gen is a more appropriate value than z_seq to populate the field with. Drop the unnecessary compat obfuscation and provide the correct file generation number. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@freebsd.org> Closes #12851	2021-12-16 13:22:15 -08:00
Allan Jude	f6a0dac84a	zfs list: Allow more fields in ZFS_ITER_SIMPLE mode If the fields to be listed and sorted by are constrained to those populated by dsl_dataset_fast_stat(), then zfs list is much faster, as it does not need to open each objset and reads its properties. A previous optimization by Pawel Dawidek (`0cee24064a`) took advantage of this to make listing snapshot names sorted only by name much faster. However, it was limited to `-o name -s name`, this work extends this optimization to work with: - name - guid - createtxg - numclones - inconsistent - redacted - origin and could be further extended to any other properties supported by dsl_dataset_fast_stat() or similar, that do not require extra locking or reading from disk. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #11080	2021-12-16 11:56:22 -08:00
наб	344bbc82e7	zfs, libzfs: diff: accept -h/ZFS_DIFF_NO_MANGLE, disabling path escaping Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12829	2021-12-13 15:49:40 -08:00
Paul Dagnelie	795075e638	Add `const` to nvlist functions to properly expose their real behavior Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12728	2021-12-06 18:19:13 -07:00
Allan Jude	2a673e76a9	Vdev Properties Feature Add properties, similar to pool properties, to each vdev. This makes use of the existing per-vdev ZAP that was added as part of device evacuation/removal. A large number of read-only properties are exposed, many of the members of struct vdev_t, that provide useful statistics. Adds support for read-only "removing" vdev property. Adds the "allocating" property that defaults to "on" and can be set to "off" to prevent future allocations from that top-level vdev. Supports user-defined vdev properties. Includes support for properties.vdev in SYSFS. Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #11711	2021-11-30 07:46:25 -07:00
Coleman Kane	75b309a938	Linux 5.16 compat: asm/fpu/xcr.h is new location for xgetbv/xsetbv Linux 5.16 moved these functions into this new header in commit 1b4fb8545f2b00f2844c4b7619d64d98440a477c. This change adds code to look for the presence of this header, and include it so that the code using xgetbv & xsetbv will compile again. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #12800	2021-11-29 10:49:33 -08:00
Rich Ercolani	269b5dadcf	Enable edonr in FreeBSD The code is integrated, builds fine, runs fine, there's not really any reason not to. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12735	2021-11-16 12:40:10 -07:00
Martin Matuška	b8dcfb2c9f	FreeBSD: fix world build after `de198f2d9` The inline function vn_flush_cached_data() in vnode.h must not be compiled when building BASE. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #12743	2021-11-15 09:07:39 -07:00
George Amanakis	c9d62d1356	Introduce a tunable to exclude special class buffers from L2ARC Special allocation class or dedup vdevs may have roughly the same performance as L2ARC vdevs. Introduce a new tunable to exclude those buffers from being cacheable on L2ARC. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #11761 Closes #12285	2021-11-11 12:52:16 -08:00
Fedor Uporov	49d42425d6	Check l2cache vdevs pending list inside the vdev_inuse() The l2cache device could be added twice because vdev_inuse() does not check spa_l2cache for added devices. Make l2cache vdevs inuse checking logic more closer to spare vdevs. Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #9153 Closes #12689	2021-11-11 11:54:15 -08:00
Fedor Uporov	e39fe05b69	Skip spacemaps reading in case of pool readonly import The only zdb utility require to read metaslab-related data during read-only pool import because of spacemaps validation. Add global variable which will allow zdb read spacemaps in case of readonly import mode. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #9095 Closes #12687	2021-11-09 12:50:39 -08:00
Brian Behlendorf	1e7d634867	Linux 5.16 compat: linux/elevator.h Commit https://github.com/torvalds/linux/commit/2e9bc346 moved the elevator.h header under the block/ directory as part of some refactoring. This turns out not to be a problem since there's no longer anything we need from the header. This has been the case for some time, this change removes the elevator.h include and replaces it with a major.h include. Reviewed-by: Alexander Lobakin <alobakin@pm.me> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12725	2021-11-09 11:26:35 -08:00
Brian Behlendorf	de198f2d95	Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency When using lseek(2) to report data/holes memory mapped regions of the file were ignored. This could result in incorrect results. To handle this zfs_holey_common() was updated to asynchronously writeback any dirty mmap(2) regions prior to reporting holes. Additionally, while not strictly required, the dn_struct_rwlock is now held over the dirty check to prevent the dnode structure from changing. This ensures that a clean dnode can't be dirtied before the data/hole is located. The range lock is now also taken to ensure the call cannot race with zfs_write(). Furthermore, the code was refactored to provide a dnode_is_dirty() helper function which checks the dnode for any dirty records to determine its dirtiness. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #11900 Closes #12724	2021-11-07 14:27:44 -07:00
Damian Szuberski	6d680e61ef	Update `checkstyle` workflow env to ubuntu-20.04 - `checkstyle` workflow uses ubuntu-20.04 environment - improved `mancheck.sh` readability Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes #12713	2021-11-02 14:02:57 -06:00
Fedor Uporov	d5a5ec4693	Remove unused function zvol_set_volblocksize() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #12688	2021-10-26 17:07:53 -07:00
Pawel Jakub Dawidek	afbc617921	Remove FreeBSD's local copy of the dmu_buf_hold_array() function Make the main dmu_buf_hold_array() function non-static. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #12628	2021-10-13 11:01:01 -07:00
Teodor Spæren	d785245857	zio: use unsigned values for enum cppcheck complains about the use of 1 << 31, because enums are signed ints which cannot represent this. As discussed in issue #12611, it appears that with C99, we can use an unsiged int for the enum, on most platforms. I've crafted this commit for just the include/sys/zio.h header, as it's the only one with a shift of 31. If this is something we want to adopt in the rest of the project, I will go through and apply it to the rest of the project. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Teodor Spæren <teodor@sparen.no> Closes #12611 Closes #12615	2021-10-11 10:58:06 -07:00
Rich Ercolani	9d1407e8f2	Correct refcount_add in dmu_zfetch refcount_add_many(foo,N) is not the same as for (i=0; i < N; i++) { refcount_add(foo); } Unfortunately, this is only actually true with debug kernels and reference_tracking_enable=1. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12589 Closes #12602	2021-10-08 11:10:34 -07:00
Brian Behlendorf	514498fef6	Simplify and document OpenZFS library dependencies For those not already familiar with the code base it can be a challenge to understand how the libraries are laid out. This has sometimes resulted in functionality being added in the wrong place. To help avoid that in the future this commit documents the high-level dependencies for easy reference in lib/Makefile.am. It also simplifies a few things. - Switched libzpool dependency on libzfs_core to libzutil. This change makes it clear libzpool should never depend on the ioctl() functionality provided by libzfs_core. - Moved zfs_ioctl_fd() from libzutil to libzfs_core and renamed it lzc_ioctl_fd(). Normal access to the kmods should all be funneled through the libzfs_core library. The sole exception is the pool_active() which was updated to not use lzc_ioctl_fd() to remove the libzfs_core dependency. - Removed libzfs_core dependency on libzutil. - Removed the lib/libzfs/os/freebsd/libzfs_ioctl_compat.c source file which was all dead code. - Removed libzfs_core dependency from mkbusy and ctime test utilities. It was only needed for some trivial wrapper functions and that code is easy to replicate to shed the unneeded dependency. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12602	2021-10-07 11:31:26 -06:00
Tony Hutter	2a8430a260	Rescan enclosure sysfs path on import When you create a pool, zfs writes vd->vdev_enc_sysfs_path with the enclosure sysfs path to the fault LEDs, like: vdev_enc_sysfs_path = /sys/class/enclosure/0:0:1:0/SLOT8 However, this enclosure path doesn't get updated on successive imports even if enclosure path to the disk changes. This patch fixes the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #11950 Closes #12095	2021-10-04 12:32:16 -07:00
Jorgen Lundman	1d901c3ee5	Upstream: unmount snapshots before destroying them on macOS Add function zfs_destroy_snaps_nvl_os() call. The main issue is that macOS needs to unmount any mounted snapshots before they can be destroyed. Other platforms can handle this in the kernel, but sending a storm of zed events to unmount seems undesirable when we can do it in userland to start with. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: ilovezfs <ilovezfs@icloud.com> Closes #12550	2021-09-20 09:29:59 -06:00
Brian Behlendorf	6954c22f35	Use fallthrough macro As of the Linux 5.9 kernel a fallthrough macro has been added which should be used to anotate all intentional fallthrough paths. Once all of the kernel code paths have been updated to use fallthrough the -Wimplicit-fallthrough option will because the default. To avoid warnings in the OpenZFS code base when this happens apply the fallthrough macro. Additional reading: https://lwn.net/Articles/794944/ Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12441	2021-09-14 10:17:54 -06:00
Jorgen Lundman	5a54a4e051	Upstream: Add snapshot and zvol events For kernel to send snapshot mount/unmount events to zed. For kernel to send symlink creates/removes on zvol plumbing. (/dev/run/dsk/zvol/$pool/$zvol -> /dev/diskX) If zed misses the ENODEV, all errors after are EINVAL. Treat any error as kernel module failure. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12416	2021-09-09 10:44:21 -07:00
Brian Behlendorf	2079111f42	Linux 5.15 compat: get_acl() Kernel commits 332f606b32b6 ovl: enable RCU'd ->get_acl() 0cad6246621b vfs: add rcu argument to ->get_acl() callback Added compatibility code to detect the new ->get_acl() interface and correctly handle the case where the new rcu argument is set. Reviewed-by: Coleman Kane <ckane@colemankane.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12548	2021-09-09 09:38:35 -07:00
Alexander	a3588c68f7	Linux 5.15 compat: standalone <linux/stdarg.h> Kernel commits 39f75da7bcc8 ("isystem: trim/fixup stdarg.h and other headers") c0891ac15f04 ("isystem: ship and use stdarg.h") 564f963eabd1 ("isystem: delete global -isystem compile option") (for now can be found in linux-next.git tree, will land into the Linus' tree during the ongoing 5.15 cycle with one of akpm merges) removed the -isystem flag and disallowed the inclusion of any compiler header files. They also introduced a minimal <linux/stdarg.h> as a replacement for <stdarg.h>. include/os/linux/spl/sys/cmn_err.h in the ZFS source tree includes <stdarg.h> unconditionally. Introduce a test for <linux/stdarg.h> and include it instead of the compiler's one to prevent module build breakage. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12531	2021-09-08 12:59:43 -07:00
Brian Behlendorf	f616605821	Linux 5.15 compat: block device readahead The 5.15 kernel moved the backing_dev_info structure out of the request queue structure which causes a build failure. Rather than look in the new location for the BDI we instead detect this upstream refactoring by the existance of either the blk_queue_update_readahead() or disk_update_readahead() functions. In either case, there's no longer any reason to manually set the ra_pages value since it will be overridden with a reasonable default (2x the block size) when blk_queue_io_opt() is called. Therefore, we update the compatibility wrapper to do nothing for 5.9 and newer kernels. While it's tempting to do the same for older kernels we want to keep the compatibility code to preserve the existing behavior. Removing it would effectively increase the default readahead to 128k. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12532	2021-09-08 09:03:13 -06:00
Jorgen Lundman	3e8d5e4ff3	Add zpool_disable_datasets_os() / zfs_unmount_os() zpool_disable_datasets_os(): macOS needs to do a bunch of work to kick everything off zvols. zfs_unmount_os(): This allows us to unmount any zvols that may be mounted. Like with zfs destroy foo/vol Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12436	2021-08-31 09:56:00 -06:00
Rich Ercolani	b1a1c64313	Fix cross-endian interoperability of zstd It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12008 Closes #12022	2021-08-30 14:13:46 -07:00
Trevor Bautista	00888c0898	Extend zpool-iostat to account for ZIO_PRIORITY_REBUILD (#12319 ) Previously, zpool-iostat did not display any data regarding rebuild I/Os in either the latency/size histograms (-w/-l/-r) or the queue data (-q). This fix essentially utilizes the existing infrastructure for tracking rebuild queue data and displays this data in the proper places within zpool-iostat's output. Signed-off-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Signed-off-by: Trevor Bautista <tbautista@lanl.gov> Co-authored-by: Trevor Bautista <tbautista@newmexicoconsortium.org> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2021-08-26 11:26:49 -07:00
Richard Yao	abbf0bd4eb	Linux 4.11 compat: statx support Linux 4.11 added a new statx system call that allows us to expose crtime as btime. We do this by caching crtime in the znode to match how atime, ctime and mtime are cached in the inode. statx also introduced a new way of reporting whether the immutable, append and nodump bits have been set. It adds support for reporting compression and encryption, but the semantics on other filesystems is not just to report compression/encryption, but to allow it to be turned on/off at the file level. We do not support that. We could implement semantics where we refuse to allow user modification of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc() to find out encryption/compression information. That would introduce locking that will have a minor (although unmeasured) performance cost. It also would be inferior to zdb, which reports far more detailed information. We therefore omit reporting of encryption/compression through statx in favor of recommending that users interested in such information use zdb. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #8507	2021-08-17 10:25:58 -07:00
Alexander Motin	6b88b4b501	Remove b_pabd/b_rabd allocation from arc_hdr_alloc() When a header is allocated for full overwrite it is a waste of time to allocate b_pabd/b_rabd for it, since arc_write() will free them without ever being touched. If it is a read or a partial overwrite then arc_read() and arc_hdr_decrypt() allocate them explicitly. Reduced memory allocation in user threads also reduces ARC eviction throttling there, proportionally increasing it in ZIO threads, that is not good. To minimize or even avoid it introduce ARC allocation reserve, allowing certain arc_get_data_abd() callers to allocate a bit longer in situations where user threads will already throttle. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12398	2021-08-17 10:15:54 -06:00
Alexander Motin	72f0521aba	Increase default volblocksize from 8KB to 16KB Many things has changed since previous default was set many years ago. Nowadays 8KB does not allow adequate compression or even decent space efficiency on many of pools due to 4KB disk physical block rounding, especially on RAIDZ and DRAID. It effectively limits write throughput to only 2-3GB/s (250-350K blocks/s) due to sync thread, allocation, vdev queue and other block rate bottlenecks. It keeps L2ARC expensive despite many optimizations and dedup just unrealistic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12406	2021-08-17 09:59:46 -06:00
Alexander Motin	cfe8e960f1	Fix/improve dbuf hits accounting Instead of clearing stats inside arc_buf_alloc_impl() do it inside arc_hdr_alloc() and arc_release(). It fixes statistics being wiped every time a new dbuf is filled from the ARC. Remove b_l1hdr.b_l2_hits. L2ARC hits are accounted at b_l2hdr.b_hits. Since the hits are accounted under hash lock, replace atomics with simple increments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12422	2021-08-17 09:50:31 -06:00
Alexander Motin	e829a865bf	Use more atomics in refcounts Use atomic_load_64() for zfs_refcount_count() to prevent torn reads on 32-bit platforms. On 64-bit ones it should not change anything. When built with ZFS_DEBUG but running without tracking enabled use atomics instead of mutexes same as for builds without ZFS_DEBUG. Since rc_tracked can't change live we can check it without lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12420	2021-08-17 09:44:34 -06:00
Allan Jude	e945e8d7f4	Restore FreeBSD sysctl processing for arc.min and arc.max Before OpenZFS 2.0, trying to set the FreeBSD sysctl vfs.zfs.arc_max to a disallowed value would return an error. Since the switch, it instead only generates WARN_IF_TUNING_IGNORED Keep the ability to set the sysctl's specifically to 0, even though that is less than the minimum, because some tests depend on this. Also lost, was the ability to set vfs.zfs.arc_max to a value less than the default vfs.zfs.arc_min at boot time. Restore this as well. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #12161	2021-08-16 09:35:19 -06:00
Tony Nguyen	6bc61d22c4	Run arc_evict thread at higher priority Run arc_evict thread at higher priority, nice=0, to give it more CPU time which can improve performance for workload with high ARC evict activities. On mixed read/write and sequential read workloads, I've seen between 10-40% better performance. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com> Closes #12397	2021-08-10 11:36:26 -06:00
Alexander Motin	7eebcd2be6	Avoid small buffer copying on write It is wrong for arc_write_ready() to use zfs_abd_scatter_enabled to decide whether to reallocate/copy the buffer, because the answer is OS-specific and depends on the buffer size. Instead of that use abd_size_alloc_linear(), moved into public header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12425	2021-07-27 16:05:47 -07:00
Brian Behlendorf	4bd99c11d7	Remove overlooked __sun_attr__ based macros The __NORETURN, __CONST, and __PURE macros in the FreeBSD platform code were based on the __sun_attr__ macro which was removed in commit `5dbf6c5a6`. This caused a build failure because the __NORETURN macro was still used in one place in kernel code. The __CONST and __PURE macros were entirely unused. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12435	2021-07-27 09:49:11 -07:00
наб	037af3e0d4	Remove NOTE(CONSTCOND) and note.h These were mostly used to annotate do {} while(0)s Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:53 -07:00
наб	5dbf6c5a66	Replace /PRINTFLIKEn/ with attribute(printf) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #12201	2021-07-26 12:07:15 -07:00
Alexander Motin	1b50749ce9	Optimize allocation throttling Remove mc_lock use from metaslab_class_throttle_(). The math there is based on refcounts and so atomic, so the only race possible there is between zfs_refcount_count() and zfs_refcount_add(). But in most cases metaslab_class_throttle_reserve() is called with the allocator lock held, which covers the race. In cases where the lock is not held, GANG_ALLOCATION() or METASLAB_MUST_RESERVE are set, and so we do not use zfs_refcount_count(). And even if we assume some other non-existing scenario, the worst that may happen from this race is few more I/Os get to allocation earlier, that is not a problem. Move locks and data of different allocators into different cache lines to avoid false sharing. Group spa_alloc_ arrays together into single array of aligned struct spa_alloc spa_allocs. Align struct metaslab_class_allocator. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12314	2021-07-21 06:40:36 -06:00
Kevin Jin	a7bd20e309	Add Module Parameter Regarding Log Size Limit * Add Module Parameters Regarding Log Size Limit zfs_wrlog_data_max The upper limit of TX_WRITE log data. Once it is reached, write operation is blocked, until log data is cleared out after txg sync. It only counts TX_WRITE log with WR_COPIED or WR_NEED_COPY. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12284	2021-07-20 09:40:24 -06:00
Alexander Motin	8172df643b	Minor ARC optimizations Remove unneeded global, practically constant, state pointer variables (arc_anon, arc_mru, etc.), replacing them with macros of real state variables addresses (&ARC_anon, &ARC_mru, etc.). Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, not requiring special handling in inner loop of ARC reclamation. Respectively change bytes argument of arc_evict_state() from int64_t to uint64_t. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12348	2021-07-20 08:13:21 -06:00
Alexander	23c13c7e80	A few fixes of callback typecasting (for the upcoming ClangCFI) * zio: avoid callback typecasting * zil: avoid zil_itxg_clean() callback typecasting * zpl: decouple zpl_readpage() into two separate callbacks * nvpair: explicitly declare callbacks for xdr_array() * linux/zfs_nvops: don't use external iput() as a callback * zcp_synctask: don't use fnvlist_free() as a callback * zvol: don't use ops->zv_free() as a callback for taskq_dispatch() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Closes #12260	2021-07-20 08:03:33 -06:00
Kevin Bowling	ca14e08cbf	Detect HAVE_LARGE_STACKS at compile time Move HAVE_LARGE_STACKS definitions to header and set when appropriate. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Kevin Bowling <kbowling@FreeBSD.org> Closes #12350	2021-07-16 14:28:55 -06:00
Alexander Motin	c1b5869bab	Introduce dsl_dir_diduse_transfer_space() Most of dsl_dir_diduse_space() and dsl_dir_transfer_space() CPU time is a dd_lock overhead and time spent in dmu_buf_will_dirty(). Calling them one after another is a waste of time and even more contention. Doing that twice for each rewritten block within dbuf_write_done() via dsl_dataset_block_kill() and dsl_dataset_block_born() created one of the biggest CPU overheads in case of small blocks rewrite. dsl_dir_diduse_transfer_space() combines functionality of these two functions for cases where it is needed, but without double overhead, practically for the cost of dsl_dir_diduse_space() or even cheaper. While there, optimize dsl_dir_phys() calls in dsl_dir_diduse_space() and dsl_dir_transfer_space(). It seems Clang detects some aliasing there, repeating dd->dd_dbuf->db_data dereference multiple times, increasing dd_lock scope and contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Author: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12300	2021-07-16 13:39:24 -06:00
Alexander Motin	f7de776da2	Fix ARC ghost states eviction accounting arc_evict_hdr() returns number of evicted bytes in scope of specific state. For ghost states it does not mean the amount of really freed memory, but the logical buffer size. It is correct for the eviction process, but not for waking up threads waiting for ARC size reduction, as added in "Revise ARC shrinker algorithm" commit, causing premature wakeups while ARC is still overflowed, allowing even bigger overflow, plus processing overhead when next allocation will also get blocked, probably also for too short time. To fix that make arc_evict_hdr() also return the amount of really freed memory, which for the ghost states is only the header, and use it to update arc_evict_count instead. Originally I was thinking to not return it at all, since arc_get_data_impl() does not account for the headers, but decided that some slow allocation progress is better than long waits, reaching on my tests up to 100ms. To reduce negative latency effects of long time periods when reclaim thread can free little real memory, start reclamation process earlier, before we actually reached the overflow threshold, when we have to throttle new allocations. We can also do it without taking global arc_evict_lock, reducing the contention. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12279	2021-07-13 09:41:59 -06:00
George Wilson	958826be7a	file reference counts can get corrupted Callers of zfs_file_get and zfs_file_put can corrupt the reference counts for the file structure resulting in a panic or a soft lockup. When zfs send/recv runs, it will add a reference count to the open file, and begin to send or recv the stream. If the file descriptor is closed, then when dmu_recv_stream() or dmu_send() return we will call zfs_file_put to remove the reference we placed on the file structure. Unfortunately, because zfs_file_put() uses the file descriptor to lookup the file structure, it may end up finding that the file descriptor table no longer contains the file struct, thus leaking the file structure. Or it might end up finding a file descriptor for a different file and blindly updating its reference counts. Other failure modes probably exists. This change reworks the zfs_file_[get\|put] interface to not rely on the file descriptor but instead pass the zfs_file_t pointer around. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Allan Jude <allan@klarasystems.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-issue: DLPX-76119 Closes #12299	2021-07-10 19:00:37 -06:00
Jorgen Lundman	03dba7ae31	dprintf_dnode: strcpy -> strlcpy Missed a couple of strcpy() in earlier commit, this is only used with --enable-debug. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #12311	2021-07-07 21:13:40 -06:00
Alexander Motin	bdd11cbb90	FreeBSD: Hardcode abd_chunk_size to PAGE_SIZE It makes no sense to set it below PAGE_SIZE, since it increases all overheads and makes returning memory to OS problematic. It makes no sense to set it above PAGE_SIZE, since such allocations and especially frees are too expensive and cause KVA fragmentation to benefit from fewer chunks. After that it makes no sense to keep more complicated math here. What may have sense though is just a tunable border between linear and scatter ABDs, previously also controlled by this tunable. Retain that functionality by taking abd_scatter_min_size tunable from Linux, just with different default value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #12328	2021-07-06 17:39:23 -07:00
Alexander Motin	b192a2c0a1	Remove avl_size field from struct avl_tree This field is used only by illumos mdb. On other platforms it only increases the struct size from 32 to 40 bytes. For struct vdev_queue including 13 instances of avl_tree_t size means active cache lines. Keep the padding in user-space for now to not break the ABI. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12290	2021-07-01 09:32:31 -06:00
Alexander Motin	490c845efe	Compact dbuf/buf hashes and lock arrays With default dbuf cache size of 1/32 of ARC, it makes no sense to have hash table of the same size (or even bigger on Linux). Reduce it to 1/8 of ARC's one, still leaving some slack, assuming higher I/O rate via dbuf cache than via ARC. Remove padding from ARC hash locks array. The idea behind padding is to avoid false sharing between locks. It would have sense if there would be a limited number of very busy locks. But since we have no limit on the number, using the same memory for more locks we can achieve even lower lock contention with the same false sharing, or we can use less memory for the same contention level. Reduce number of hash locks from 8192 to 2048. The number is still big enough to not cause contention, but reduced memory size improves cache hit rate for mutex_tryenter() in ARC eviction thread, saving about 1% of the thread time. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12289	2021-07-01 09:30:31 -06:00
Jorgen Lundman	c6d1112bf4	Fix abd leak, kmem_free correct size of abd_t Fix a leak of abd_t that manifested mostly when using raidzN with at least as many columns as N (e.g. a four-disk raidz2 but not a three-disk raidz2). Sufficiently heavy raidz use would eventually run a system out of memory. Additionally: * Switch abd_cache arena to FIRSTFIT, which empirically improves perofrmance. * Make abd_chunk_cache more performant and debuggable. * Allocate the abd_zero_buf from abd_chunk_cache rather than the heap. * Don't try to reap non-existent qcaches in abd_cache arena. * KM_PUSHPAGE->KM_SLEEP when allocating chunks from their own arena Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: Sean Doran <smd@use.net> Closes #12295	2021-07-01 09:28:15 -06:00
Kevin Jin	50e09eddd0	Optimize txg_kick() process (#12274 ) Use dp_dirty_pertxg[] for txg_kick(), instead of dp_dirty_total in original code. Extra parameter "txg" is added for txg_kick(), thus it knows which txg to kick. Also txg_kick() call is moved from dsl_pool_need_dirty_delay() to dsl_pool_dirty_space() so that we can know the txg number assigned for txg_kick(). Some unnecessary code regarding dp_dirty_total in txg_sync_thread() is also cleaned up. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: jxdking <lostking2008@hotmail.com> Closes #12274	2021-07-01 09:20:27 -06:00
Alexander Motin	42afb12da7	Remove refcount from spa_config_() The only reason for spa_config_() to use refcount instead of simple non-atomic (thanks to scl_lock) variable for scl_count is tracking, hard disabled for the last 8 years. Switch to simple int scl_count reduces the lock hold time by avoiding atomic, plus makes structure fit into single cache line, reducing the locks contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12287	2021-07-01 09:16:54 -06:00
Martin Matuška	14d2841b53	FreeBSD: fix compilation of FreeBSD world after `29274c9f6` prng32_bounded() is available to kernel only on FreeBSD 13+. Call inline random_get_pseudo_bytes() with correct pointer type. To be consistent, apply to Linux as well. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes #12282	2021-06-25 10:28:51 -07:00
Attila Fülöp	1b610ae45f	gcc 11 cleanup Compiling with gcc 11.1.0 produces three new warnings. Change the code slightly to avoid them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #12130 Closes #12188 Closes #12237	2021-06-23 17:57:06 -06:00
Rich Ercolani	8e739b2c9f	Annotated dprintf as printf-like ZFS loves using %llu for uint64_t, but that requires a cast to not be noisy - which is even done in many, though not all, places. Also a couple places used %u for uint64_t, which were promoted to %llu. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12233	2021-06-22 21:53:45 -07:00
Alexander Motin	29274c9f6d	Optimize small random numbers generation In all places except two spa_get_random() is used for small values, and the consumers do not require well seeded high quality values. Switch those two exceptions directly to random_get_pseudo_bytes() and optimize spa_get_random(), renaming it to random_in_range(), since it is not related to SPA or ZFS in general. On FreeBSD directly map random_in_range() to new prng32_bounded() KPI added in FreeBSD 13. On Linux and in user-space just reduce the type used to uint32_t to avoid more expensive 64bit division. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12183	2021-06-22 17:35:23 -06:00
Alexander Motin	c4c162c1e8	Use wmsum for arc, abd, dbuf and zfetch statistics. (#12172 ) wmsum was designed exactly for cases like these with many updates and rare reads. It allows to completely avoid atomic operations on congested global variables. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12172	2021-06-16 18:19:34 -06:00
наб	0854d4c186	libzutil: add zfs_{base,dir}name() Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12105	2021-06-11 09:10:05 -07:00
Alexander Motin	ffdf019cb3	Re-embed multilist_t storage This commit partially reverts changes to multilists in PR 7968 (multi-threaded spa-sync()) and adds some cache line alignments to separate read-only multilists and heavily modified refcount's to different cache lines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-by: iXsystems, Inc. Closes #12158	2021-06-10 10:42:31 -06:00
Alexander Motin	371f88d96f	Remove pool io kstats (#12212 ) This mostly reverts "3537 want pool io kstats" commit of 8 years ago. From one side this code using pool-wide locks became pretty bad for performance, creating significant lock contention in I/O pipeline. From another, there are more efficient ways now to obtain detailed statistics, while this statistics is illumos-specific and much less usable on Linux and FreeBSD, reported only via procfs/sysctls. This commit does not remove KSTAT_TYPE_IO implementation, that may be removed later together with already unused KSTAT_TYPE_INTR and KSTAT_TYPE_TIMER. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12212	2021-06-10 08:27:33 -07:00
наб	327c904615	lib{efi,avl,share,tpool,zfs_core,zfsbootenv,zutil}: -fvisibility=hidden No symbols affected in libavl No symbols affected by libtpool, but pre-ANSI declarations got purged No symbols affected by libzfs_core No symbols affected by libzfs_bootenv libefi got cleaned, gained efi_debug documentation in efi_partition.h, and removes one undocumented and unused symbol from libzfs_core: D default_vtoc_map libnvpair saw removal of these symbols: D nv_alloc_nosleep_def D nv_alloc_sleep D nv_alloc_sleep_def D nv_fixed_ops_def D nvlist_hashtable_init_size D nvpair_max_recursion libshare saw removal of these symbols from libzfs: T libshare_nfs_init T libshare_smb_init T register_fstype B smb_shares libzutil saw removal of these internal symbols from libzfs_core: T label_paths T slice_cache_compare T zpool_find_import_blkid T zpool_open_func T zutil_alloc T zutil_strdup Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12191	2021-06-09 17:04:32 -07:00
наб	d406a695c6	libefi: remove efi_auto_sense() It's present (but undocumented) in the illumos gate and used exclusively by rmformat(1) (which I recommend as a nice blast from the past), and also the math assumes 512B sectors and is therefore wrong Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12191	2021-06-09 17:03:42 -07:00
Alan Somers	75b4cbf625	libzfs: On FreeBSD, use MNT_NOWAIT with getfsstat `getfsstat(2)` is used to retrieve the list of mounted file systems, which libzfs uses when fetching properties like mountpoint, atime, setuid, etc. The `mode` parameter may be `MNT_NOWAIT`, which uses information in the VFS's cache, or `MNT_WAIT`, which effectively does a `statfs` on every single mounted file system in order to fetch the most up-to-date information. As far as I can tell, the only fields that libzfs cares about are the filesystem's name, mountpoint, fstypename, and mount flags. Those things are always updated on mount and unmount, so they will always be accurate in the VFS's mount cache except in two circumstances: 1) When a file system is busy unmounting 2) When a ZFS file system changes the value of a mount-overridable property like atime or setuid, but doesn't remount the file system. Right now that only happens when the property is changed by an unprivileged user who has delegated authority to change the property but not to mount the dataset. But perhaps libzfs could choose to do it for other reasons in the future. Switching to `MNT_NOWAIT` will greatly improve speed with no downside, as long as we explicitly update the mount cache whenever we change a mount-overridable property. For comparison, Illumos gets this information using the native `getmntany` and `getmntent` functions, which also use cached information. The illumos function that would refresh the cache, `resetmnttab`, is never called by libzfs. And on GNU/Linux, `getmntany` and `getmntent` don't even communicate with the kernel directly. They simply parse the file they are given, which is usually /etc/mtab or /proc/mounts. Perhaps the implementation of /proc/mounts is synchronous, ala MNT_WAIT; I don't know. Sponsored-by: Axcient Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes: #12091	2021-06-08 07:36:43 -06:00
Alexander Motin	ea400129c3	More aggsum optimizations - Avoid atomic_add() when updating as_lower_bound/as_upper_bound. Previous code was excessively strong on 64bit systems while not strong enough on 32bit ones. Instead introduce and use real atomic_load() and atomic_store() operations, just an assignments on 64bit machines, but using proper atomics on 32bit ones to avoid torn reads/writes. - Reduce number of buckets on large systems. Extra buckets not as much improve add speed, as hurt reads. Unlike wmsum for aggsum reads are still important. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12145	2021-06-07 09:02:47 -07:00
наб	739cfb965b	libzfs: convert to -fvisibility=hidden Also mark all printf-like funxions in libzfs_impl.h as printf-like and add --no-show-locs to storeabi, in hopes diffs will make more sense in future This removes these symbols from libzfs: D nfs_only T SHA256Init T SHA2Final T SHA2Init T SHA2Update T SHA384Init T SHA512Init D share_all_proto D smb_only T zfs_is_shared_proto W zpool_mount_datasets W zpool_unmount_datasets Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12048	2021-06-03 13:17:55 -07:00
наб	eefaa55f64	libzfs: don't distribute libzfs_impl.h Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12048	2021-06-03 13:17:35 -07:00
наб	94f942c658	libspl: staticify buf and pagesize, rename aok to libspl_assert_ok Exporting names this short can easily cause nasty collisions with user code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12050	2021-06-03 11:04:13 -06:00
наб	757df52928	libzfs: add zfs_get_underlying_type. Stop including libzfs_impl.h in cmd Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12116	2021-05-29 14:26:38 -07:00
наб	e618e4a4ff	include: move SPA_MINBLOCKSHIFT and zio_encrypt to sys/fs/zfs.h These are used by userspace, so should live in a public header Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12116	2021-05-29 14:26:32 -07:00
наб	c4e5a07fd0	libzfs: expose zfs_mount_delegation_check Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12116	2021-05-29 14:25:48 -07:00
Alexander Motin	86706441a8	Introduce write-mostly sums wmsum counters are a reduced version of aggsum counters, optimized for write-mostly scenarios. They do not provide optimized read functions, but instead allow much cheaper add function. The primary usage is infrequently read statistic counters, not requiring exact precision. The Linux implementation is directly mapped into percpu_counter KPI. The FreeBSD implementation is directly mapped into counter(9) KPI. In user-space due to lack of better implementation mapped to aggsum. Unfortunately neither Linux percpu_counter nor FreeBSD counter(9) provide sufficient functionality to completelly replace aggsum, so it still remains to be used for several hot counters. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #12114	2021-05-27 14:27:29 -06:00
Rich Ercolani	ba646e3e89	Bend zpl_set_acl to permit the new userns* parameter Just like #12087, the set_acl signature changed with all the bolted-on *userns parameters, which disabled set_acl usage, and caused #12076. Turn zpl_set_acl into zpl_set_acl and zpl_set_acl_impl, and add a new configure test for the new version. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #12076 Closes #12093	2021-05-27 08:55:49 -07:00
наб	69cbd0a360	Various Linux kABI cosmetics Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes #12103	2021-05-26 15:26:06 -07:00
Brian Behlendorf	8fb577ae6d	Fix dRAID sequential resilver silent damage handling This change addresses two distinct scenarios which are possible when performing a sequential resilver to a dRAID pool with vdevs that contain silent unknown damage. Which in this circumstance took the form of the devices being intentionally overwritten with zeros. However, it could also result from a device returning incorrect data while a sequential resilver was in progress. Scenario 1) A sequential resilver is performed while all of the dRAID vdevs are ONLINE and there is silent damage present on the vdev being resilvered. In this case, nothing will be repaired by vdev_raidz_io_done_reconstruct_known_missing() because rc->rc_error isn't set on any of the raid columns. To address this vdev_draid_io_start_read() has been updated to always mark the resilvering column as ESTALE for sequential resilver IO. Scenario 2) Multiple columns contain silent damage for the same block and a sequential resilver is performed. In this case it's impossible to generate the correct data from parity unless all of the damaged columns are being sequentially resilvered (and thus only good data is used to generate parity). This is as expected and there's nothing which can be done about it. However, we need to be careful not to make to situation worse. Since we can't verify the data is actually good without a checksum, we must only repair the devices which are being sequentially resilvered. Otherwise, an incorrect repair to a device which previously contained good data could effectively lock in the damage and make reconstruction impossible. A check for this was added to vdev_raidz_io_done_verified() along with a new test case. Lastly, this change updates the redundancy_draid_spare1 and redundancy_draid_spare3 test cases to be more representative of normal dRAID replacement operation. Specifically, what we care about is that the scrub run after a sequential resilver does not find additional blocks which need repair. This would indicate the sequential resilver failed to rebuild a section of one of the devices. Note also the tests were switched to using the verify_pool() function which still checks for checksum errors. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #12061	2021-05-20 15:05:26 -07:00
наб	37086897b0	libzfs: add keylocation=https://, backed by fetch(3) or libcurl Add support for http and https to the keylocation properly to allow encryption keys to be fetched from the specified URL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Issue #9543 Closes #9947 Closes #11956	2021-05-12 21:21:35 -07:00
Coleman Kane	48c7b0e444	linux 5.13 compat: bdevops->revalidate_disk() removed Linux kernel commit 0f00b82e5413571ed225ddbccad6882d7ea60bc7 removes the revalidate_disk() handler from struct block_device_operations. This caused a regression, and this commit eliminates the call to it and the assignment in the block_device_operations static handler assignment code, when configure identifies that the kernel doesn't support that API handler. Reviewed-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #11967 Closes #11977	2021-05-11 19:53:02 -07:00

1 2 3 4 5 ...

2067 Commits