mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-16 00:14:53 +03:00

Author	SHA1	Message	Date
Brian Atkinson	c0801bf35a	Cleaning up uio headers Making uio_impl.h the common header interface between Linux and FreeBSD so both OS's can share a common header file. This also helps reduce code duplication for zfs_uio_t for each OS. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11622	2021-02-20 20:16:50 -08:00
Ryan Moeller	64e0fe14ff	Restore FreeBSD resource usage accounting Add zfs_racct_* interfaces for platform-dependent read/write accounting. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11613	2021-02-19 22:34:33 -08:00
Don Brady	03e02e5b56	Checksum errors may not be counted Fix regression seen in issue #11545 where checksum errors where not being counted or showing up in a zpool event. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #11609	2021-02-19 22:33:15 -08:00
Colm	658fb8020f	Add "compatibility" property for zpool feature sets Property to allow sets of features to be specified; for compatibility with specific versions / releases / external systems. Influences the behavior of 'zpool upgrade' and 'zpool create'. Initial man page changes and test cases included. Brief synopsis: zpool create -o compatibility=off\|legacy\|file[,file...] pool vdev... compatibility = off : disable compatibility mode (enable all features) compatibility = legacy : request that no features be enabled compatibility = file[,file...] : read features from specified files. Only features present in all files will be enabled on the resulting pool. Filenames may be absolute, or relative to /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d (/etc checked first). Only affects zpool create, zpool upgrade and zpool status. ABI changes in libzfs: * New function "zpool_load_compat" to load and parse compat sets. * Add "zpool_compat_status_t" typedef for compatibility parse status. * Add ZPOOL_PROP_COMPATIBILITY to the pool properties enum * Add ZPOOL_STATUS_COMPATIBILITY_ERR to the pool status enum An initial set of base compatibility sets are included in cmd/zpool/compatibility.d, and the Makefile for cmd/zpool is modified to install these in $pkgdatadir/compatibility.d and to create symbolic links to a reasonable set of aliases. Reviewed-by: ericloewe Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colm Buckley <colm@tuatha.org> Closes #11468	2021-02-17 21:30:45 -08:00
khng300	fc273894d2	Rename zfs_inode_update to zfs_znode_update_vfs zfs_znode_update_vfs is a more platform-agnostic name than zfs_inode_update. Besides that, the function's prototype is moved to include/sys/zfs_znode.h as the function is also used in common code. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ka Ho Ng <khng300@gmail.com> Sponsored by: The FreeBSD Foundation Closes #11580	2021-02-09 11:17:29 -08:00
Antonio Russo	f8ce8aed0c	Set file mode during zfs_write `3d40b65` refactored zfs_vnops.c, which shared much code verbatim between Linux and BSD. After a successful write, the suid/sgid bits are reset, and the mode to be written is stored in newmode. On Linux, this was propagated to both the in-memory inode and znode, which is then updated with sa_update. `3d40b65` accidentally removed the initialization of newmode, which happened to occur on the same line as the inode update (which has been moved out of the function). The uninitialized newmode can be saved to disk, leading to a crash on stat() of that file, in addition to a merely incorrect file mode. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <aerusso@aerusso.net> Closes #11474 Closes #11576	2021-02-08 09:15:05 -08:00
Christian Schwarz	84268b099b	Document monotonicity of dmu_tx_assign() and txg_hold_open() Expand the comments to make it clear exactly what is guaranteed by dmu_tx_assign() and txg_hold_open(). Additionally, update the comment which refers to txg_exit() when it should reference txg_rele_to_sync(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #11521	2021-02-02 10:11:37 -08:00
Matthew Ahrens	2d4bbd14fc	The abd child/parent relationship does not need to be tracked ABD's currently track their parent/child relationship. This applies to `abd_get_offset()` and `abd_borrow_buf()`. However, nothing depends on knowing this relationship, it's only used for consistency checks to verify that we are not destroying an ABD that's still in use. When we are creating/destroying ABD's frequently, the performance impact of maintaining these data structures (in particular the atomic increment/decrement operations) can be measurable. This commit removes this verification code on production builds, but keeps it when ZFS_DEBUG is set. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11535	2021-01-30 10:04:42 -08:00
Brian Atkinson	2993698eb3	Fixing gang ABD when adding another gang I originally applied a fix in #11539 to fix a parent's child references when a gang ABD is free'd. However, I did not take into account abd_gang_add_gang(). We still need to make sure to update the child references in this function as well. In order to resolve this I removed decreasing the gang ABD's size in abd_free_gang() as well as moved back the original placeent of zfs_refcount_remove_many() in abd_free(). Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11542	2021-01-28 16:54:12 -08:00
George Amanakis	0ae184a6ba	Avoid updating the L2ARC device header unnecessarily If we do not write any buffers to the cache device and the evict hand has not advanced do not update the cache device header. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #11522 Closes #11537	2021-01-28 09:20:03 -08:00
Brian Atkinson	416015ef54	Removing ABD Parent Child Reference Before Freeing ABD Moving the call to zfs_refcount_remove_many() in abd_free() to be called before any of the ABD free variants are called. This is necessary because abd_free_gang() adjusts the abd_size for the gang ABD. If the parent's child references are removed after free'ing the gang ABD the refcount is not adjusted correctly for the parent's children. I also removed some stray abd_put() in comments and changed abd_free_gang_abd() -> abd_free_gang(). Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11539	2021-01-28 09:15:17 -08:00
Mark Maybee	b2c5904a78	Revert special case code from pre-hashtable nvlist era Before a hash table was added on top of the nvlist code, there were cases where the nvlist allocation was changed from fnvlist_alloc() to nvlist_alloc() to avoid expensive NV_UNIQUE_NAME checks. Now this is no longer necessary. These changes should be reverted to be consistent with other code. There are some cases where this change will also reduce the number of iterations. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Maybee <mark.maybee@delphix.com> Closes #11464	2021-01-27 21:31:51 -08:00
Alan Somers	cf0977ad72	Parallelize vdev_validate The runtime of vdev_validate is dominated by the disk accesses in vdev_label_read_config. Speed it up by validating all vdevs in parallel using a taskq. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #11470	2021-01-26 19:36:51 -08:00
Alan Somers	67874d5487	Read all disk labels concurrently in vdev_label_read_config This is similar to what we already do in vdev_geom_read_config. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #11470	2021-01-26 19:36:02 -08:00
Alan Somers	a0e01997ec	Parallelize vdev_load metaslab_init is the slowest part of importing a mature pool, and it must be repeated hundreds of times for each top-level vdev. But its speed is dominated by a few serialized disk accesses. That can lead to import times of > 1 hour for pools with many top-level vdevs on spinny disks. Speed up the import by using a taskqueue to parallelize vdev_load across all top-level vdevs. This also requires adding mutex protection to metaslab_class_t.mc_historgram. The mc_histogram fields were unprotected when that code was first written in "Illumos 4976-4984 - metaslab improvements" (OpenZFS `f3a7f6610f`). The lock wasn't added until `3dfb57a35e`, though it's unclear exactly which fields it's supposed to protect. In any case, it wasn't until vdev_load was parallelized that any code attempted concurrent access to those fields. Sponsored by: Axcient Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #11470	2021-01-26 19:35:59 -08:00
Matthew Ahrens	62d4287f27	RAIDZ2/3 fails to heal silently corrupted parity w/2+ bad disks When scrubbing, (non-sequential) resilvering, or correcting a checksum error using RAIDZ parity, ZFS should heal any incorrect RAIDZ parity by overwriting it. For example, if P disks are silently corrupted (P being the number of failures tolerated; e.g. RAIDZ2 has P=2), `zpool scrub` should detect and heal all the bad state on these disks, including parity. This way if there is a subsequent failure we are fully protected. With RAIDZ2 or RAIDZ3, a block can have silent damage to a parity sector, and also damage (silent or known) to a data sector. In this case the parity should be healed but it is not. The problem can be noticed by scrubbing the pool twice. Assuming there was no damage concurrent with the scrubs, the first scrub should fix all silent damage, and the second scrub should be "clean" (`zpool status` should not report checksum errors on any disks). If the bug is encountered, then the second scrub will repair the silently-damaged parity that the first scrub failed to repair, and these checksum errors will be reported after the second scrub. Since the first scrub repaired all the damaged data, the bug can not be encountered during the second scrub, so subsequent scrubs (more than two) are not necessary. The root cause of the problem is some code that was inadvertently added to `raidz_parity_verify()` by the DRAID changes. The incorrect code causes the parity healing to be aborted if there is damaged data (`rc_error != 0`) or the data disk is not present (`!rc_tried`). These checks are not necessary, because we only call `raidz_parity_verify()` if we have the correct data (which may have been reconstructed using parity, and which was verified by the checksum). This commit fixes the problem by removing the incorrect checks in `raidz_parity_verify()`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11489 Closes #11510	2021-01-26 16:05:05 -08:00
Will Andrews	f4f50a7048	spa_export_common: refactor common exit points Create a common exit point for spa_export_common (a very long function), which avoids missing steps on failure. This work is helpful for the planned forced pool export changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Will Andrews <will@firepipe.net> Closes #11514	2021-01-25 15:04:11 -08:00
Colm	4a90d4d6fc	Fix two minor lint errors (cppcheck) Fix two minor errors reported by cppcheck: In module/zfs/abd.c (abd_get_offset_impl), add non-NULL assertion to prevent NULL dereference warning. In module/zfs/arc.c (l2arc_write_buffers), change 'try' variable to 'pass' to avoid C++ reserved word. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Colm Buckley <colm@tuatha.org> Closes #11507	2021-01-23 15:49:32 -08:00
Alexander Motin	5aa69a57da	Relax special_small_blocks assertion. Follow up for commit `624222a`, value asserted <= SPA_OLD_MAXBLOCKSIZE instead of SPA_MAXBLOCKSIZE as it should be after the previous change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #11501	2021-01-23 15:45:27 -08:00
Matthew Ahrens	aa755b3549	Set aside a metaslab for ZIL blocks Mixing ZIL and normal allocations has several problems: 1. The ZIL allocations are allocated, written to disk, and then a few seconds later freed. This leaves behind holes (free segments) where the ZIL blocks used to be, which increases fragmentation, which negatively impacts performance. 2. When under moderate load, ZIL allocations are of 128KB. If the pool is fairly fragmented, there may not be many free chunks of that size. This causes ZFS to load more metaslabs to locate free segments of 128KB or more. The loading happens synchronously (from zil_commit()), and can take around a second even if the metaslab's spacemap is cached in the ARC. All concurrent synchronous operations on this filesystem must wait while the metaslab is loading. This can cause a significant performance impact. 3. If the pool is very fragmented, there may be zero free chunks of 128KB or more. In this case, the ZIL falls back to txg_wait_synced(), which has an enormous performance impact. These problems can be eliminated by using a dedicated log device ("slog"), even one with the same performance characteristics as the normal devices. This change sets aside one metaslab from each top-level vdev that is preferentially used for ZIL allocations (vdev_log_mg, spa_embedded_log_class). From an allocation perspective, this is similar to having a dedicated log device, and it eliminates the above-mentioned performance problems. Log (ZIL) blocks can be allocated from the following locations. Each one is tried in order until the allocation succeeds: 1. dedicated log vdevs, aka "slog" (spa_log_class) 2. embedded slog metaslabs (spa_embedded_log_class) 3. other metaslabs in normal vdevs (spa_normal_class) The space required for the embedded slog metaslabs is usually between 0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop" space that is not available for user data. On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity, and recordsize=8k, testing shows a ~50% performance increase on random 8k sync writes. On even more fragmented systems (which hit problem #3 above and call txg_wait_synced()), the performance improvement can be arbitrarily large (>100x). Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11389	2021-01-21 15:12:54 -08:00
Brian Atkinson	d0cd9a5cc6	Extending FreeBSD UIO Struct In FreeBSD the struct uio was just a typedef to uio_t. In order to extend this struct, outside of the definition for the struct uio, the struct uio has been embedded inside of a uio_t struct. Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear this is a ZFS interface. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #11438	2021-01-20 21:27:30 -08:00
Matthew Ahrens	e2af2acce3	allow callers to allocate and provide the abd_t struct The `abd_get_offset_()` routines create an abd_t that references another abd_t, and doesn't allocate any pages/buffers of its own. In some workloads, these routines may be called frequently, to create many abd_t's representing small pieces of a single large abd_t. In particular, the upcoming RAIDZ Expansion project makes heavy use of these routines. This commit adds the ability for the caller to allocate and provide the abd_t struct to a variant of `abd_get_offset_()`. This eliminates the cost of allocating the abd_t and performing the accounting associated with it (`abdstat_struct_size`). The RAIDZ/DRAID code uses this for the `rc_abd`, which references the zio's abd. The upcoming RAIDZ Expansion project will leverage this infrastructure to increase performance of reads post-expansion by around 50%. Additionally, some of the interfaces around creating and destroying abd_t's are cleaned up. Most significantly, the distinction between `abd_put()` and `abd_free()` is eliminated; all types of abd_t's are now disposed of with `abd_free()`. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue #8853 Closes #11439	2021-01-20 11:24:37 -08:00
Matthew Ahrens	2ac90457f5	record ioctl elapsed time in zpool history Each zfs ioctl that changes on-disk state (e.g. set property, create snapshot, destroy filesystem) is recorded in the zpool history, and is printed by `zpool history -i`. For performance diagnostic purposes, it would be useful to know how long each of these ioctls took to run. This commit adds that functionality, with a new `ZPOOL_HIST_ELAPSED_NS` member of the history nvlist. Additionally, the time recorded in this history log is currently the time that the history record is written to disk. But in many cases (CLI args logging and ioctl logging), this happens asynchronously, potentially many seconds after the operation completed. This commit changes the timestamp to reflect when the history event was created, rather than when it was written to disk. Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11440	2021-01-11 09:29:25 -08:00
Matthew Ahrens	dc303dcf5b	assertion failed in arc_wait_for_eviction() If the system is very low on memory (specifically, `arc_free_memory() < arc_sys_free/2`, i.e. less than 1/16th of RAM free), `arc_evict_state_impl()` will defer wakups. In this case, the arc_evict_waiter_t's remain on the list, even though `arc_evict_count` has been incremented past their `aew_count`. The problem is that `arc_wait_for_eviction()` assumes that if there are waiters on the list, the count they are waiting for has not yet been reached. However, the deferred wakeups may violate this, causing `ASSERT(last->aew_count > arc_evict_count)` to fail. This commit resolves the issue by having new waiters use the greater of `arc_evict_count` and the last `aew_count`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11285 Closes #11397	2021-01-07 20:06:32 -08:00
Toomas Soome	40ab927ae8	implicit conversion from 'boolean_t' to 'ds_hold_flags_t' Build error on illumos with gcc 10 did reveal: In function 'dmu_objset_refresh_ownership': ../../common/fs/zfs/dmu_objset.c:857:25: error: implicit conversion from 'boolean_t' to 'ds_hold_flags_t' {aka 'enum ds_hold_flags'} [-Werror=enum-conversion] 857 \| dsl_dataset_disown(ds, decrypt, tag); \| ^~~~~~~ cc1: all warnings being treated as errors libzfs_input_check.c: In function 'zfs_ioc_input_tests': libzfs_input_check.c:754:28: error: implicit conversion from 'enum dmu_objset_type' to 'enum lzc_dataset_type' [-Werror=enum-conversion] 754 \| err = lzc_create(dataset, DMU_OST_ZFS, NULL, NULL, 0); \| ^~~~~~~~~~~ cc1: all warnings being treated as errors The same issue is present in openzfs, and also the same issue about ds_hold_flags_t, which currently defines exactly one valid value. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #11406	2020-12-27 16:31:02 -08:00
Brian Behlendorf	0c763f76b1	Remove unused check from dmu_tx_count_write() Individual transactions may not be larger than DMU_MAX_ACCESS. This is enforced by the assertions in dmu_tx_hold_write() and dmu_tx_hold_write_by_dnode(). There's an additional check in dmu_tx_count_write() however it has no effect and only sets a local err variable. We could enable this check, however since it's already enforced by ASSERTs elsewhere I opted to remove it instead. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3731 Closes #11384	2020-12-21 20:17:13 -08:00
Andy Fiddaman	39372fa25b	Dangling reference from dmu_objset_upgrade After porting the fix for https://github.com/openzfs/zfs/issues/5295 over to illumos, we started hitting an assertion failure when running the testsuite: assertion failed: rc->rc_count == number, file: .../refcount.c and the unexpected hold has this stack: dsl_dataset_long_hold+0x59 dmu_objset_upgrade+0x73 dmu_objset_id_quota_upgrade+0x15 dmu_objset_own+0x14f The simplest reproducer for this in illumos is zpool create -f -O version=1 testpool c3t0d0; zpool destroy testpool which is run as part of the zpool_create_tempname test, but I can't get this to trigger on FreeBSD. This appears to be because of the call to txg_wait_synced() in dmu_objset_upgrade_stop() (which was missing in illumos), slows down dmu_objset_disown() enough to avoid the condition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andy Fiddaman <andy@omnios.org> Closes #11368	2020-12-21 10:13:23 -08:00
Christian Schwarz	49c482fde3	dsl_pool: extend comment on DSL Pool Configuration Lock Based on a conversation with Matt on the OpenZFS Slack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #11370	2020-12-19 18:04:05 -08:00
Brian Behlendorf	1c2358c12a	Linux 5.10 compat: use iov_iter in uio structure As of the 5.10 kernel the generic splice compatibility code has been removed. All filesystems are now responsible for registering a ->splice_read and ->splice_write callback to support this operation. The good news is the VFS provided generic_file_splice_read() and iter_file_splice_write() callbacks can be used provided the ->iter_read and ->iter_write callback support pipes. However, this is currently not the case and only iovecs and bvecs (not pipes) are ever attached to the uio structure. This commit changes that by allowing full iov_iter structures to be attached to uios. Ever since the 4.9 kernel the iov_iter structure has supported iovecs, kvecs, bvevs, and pipes so it's desirable to pass the entire thing when possible. In conjunction with this the uio helper functions (i.e uiomove(), uiocopy(), etc) have been updated to understand the new UIO_ITER type. Note that using the kernel provided uio_iter interfaces allowed the existing Linux specific uio handling code to be simplified. When there's no longer a need to support kernel's older than 4.9, then it will be possible to remove the iovec and bvec members from the uio structure and always use a uio_iter. Until then we need to maintain all of the existing types for older kernels. Some additional refactoring and cleanup was included in this change: - Added checks to configure to detect available iov_iter interfaces. Some are available all the way back to the 3.10 kernel and are used when available. In particular, uio_prefaultpages() now always uses iov_iter_fault_in_readable() which is available for all supported kernels. - The unused UIO_USERISPACE type has been removed. It is no longer needed now that the uio_seg enum is platform specific. - Moved zfs_uio.c from the zcommon.ko module to the Linux specific platform code for the zfs.ko module. This gets it out of libzfs where it was never needed and keeps this Linux specific code out of the common sources. - Removed unnecessary O_APPEND handling from zfs_iter_write(), this is redundant and O_APPEND is already handled in zfs_write(); Reviewed-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11351	2020-12-18 08:48:26 -08:00
Matthew Ahrens	71e4ce0e52	special device removal space accounting fixes The space in special devices is not included in spa_dspace (or dsl_pool_adjustedsize(), or the zfs `available` property). Therefore there is always at least as much free space in the normal class, as there is allocated in the special class(es). And therefore, there is always enough free space to remove a special device. However, the checks for free space when removing special devices did not take this into account. This commit corrects that. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11329	2020-12-17 12:11:56 -08:00
Ryan Moeller	1531506d23	Avoid extra work updating ARC kstats and tunables After `e357046` it should not be necessary to periodically update ARC kstats and tunables. Tunable updates are applied when modified, and kstats are updated on demand. Update kstats in `arc_evict_cb_check()` for `ZFS_DEBUG` builds only. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11237	2020-12-17 11:16:42 -08:00
Matthew Ahrens	be5c6d9653	Only examine best metaslabs on each vdev On a system with very high fragmentation, we may need to do lots of gang allocations (e.g. most indirect block allocations (~50KB) may need to gang). Before failing a "normal" allocation and resorting to ganging, we try every metaslab. This has the impact of loading every metaslab (not a huge deal since we now typically keep all metaslabs loaded), and also iterating over every metaslab for every failing allocation. If there are many metaslabs (more than the typical ~200, e.g. due to vdev expansion or very large vdevs), the CPU cost of this iteration can be very impactful. This iteration is done with the mg_lock held, creating long hold times and high lock contention for concurrent allocations, ultimately causing long txg sync times and poor application performance. To address this, this commit changes the behavior of "normal" (not try_hard, not ZIL) allocations. These will now only examine the 100 best metaslabs (as determined by their ms_weight). If none of these have a large enough free segment, then the allocation will fail and we'll fall back on ganging. To accomplish this, we will now (normally) gang before doing a `try_hard` allocation. Non-try_hard allocations will only examine the 100 best metaslabs of each vdev. In summary, we will first try normal allocation. If that fails then we will do a gang allocation. If that fails then we will do a "try hard" gang allocation. If that fails then we will have a multi-layer gang block. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11327	2020-12-16 14:40:05 -08:00
Alexander Motin	f8020c9363	Make metaslab class rotor and aliquot per-allocator. Metaslab rotor and aliquot are used to distribute workload between vdevs while keeping some locality for logically adjacent blocks. Once multiple allocators were introduced to separate allocation of different objects it does not make much sense for different allocators to write into different metaslabs of the same metaslab group (vdev) same time, competing for its resources. This change makes each allocator choose metaslab group independently, colliding with others only sporadically. Test including simultaneous write into 4 files with recordsize of 4KB on a striped pool of 30 disks on a system with 40 logical cores show reduction of vdev queue lock contention from 54 to 27% due to better load distribution. Unfortunately it won't help much ZVOLs yet since only one dataset/ZVOL is synced at a time, and so for the most part only one allocator is used, but it may improve later. While there, to reduce the number of pointer dereferences change per-allocator storage for metaslab classes and groups from several separate malloc()'s to variable length arrays at the ends of the original class and group structures. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #11288	2020-12-15 10:55:44 -08:00
Matthew Macy	923d730329	dmu_zfetch: fix memory leak The last change caused the read completion callback to not be called if the IO was still in progress. This change restores allocation of the arc buf callback, but in the callback path checks the new acb_nobuf field to know to skip buffer allocation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11324	2020-12-12 16:00:00 -08:00
George Amanakis	c76a40bfda	Fix reporting of CKSUM errors in indirect vdevs When removing and subsequently reattaching a vdev, CKSUM errors may occur as vdev_indirect_read_all() reads from all children of a mirror in case of a resilver. Fix this by checking whether a child is missing the data and setting a flag (ic_error) which is then checked in vdev_indirect_repair() and suppresses incrementing the checksum counter. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #11277	2020-12-11 12:15:37 -08:00
Matthew Ahrens	ba67d82142	Improve zfs receive performance with lightweight write The performance of `zfs receive` can be bottlenecked on the CPU consumed by the `receive_writer` thread, especially when receiving streams with small compressed block sizes. Much of the CPU is spent creating and destroying dbuf's and arc buf's, one for each `WRITE` record in the send stream. This commit introduces the concept of "lightweight writes", which allows `zfs receive` to write to the DMU by providing an ABD, and instantiating only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this "dirty leaf block" are not instantiated. Because there is no dbuf with the dirty data, this mechanism doesn't support reading from "lightweight-dirty" blocks (they would see the on-disk state rather than the dirty data). Since the dedup-receive code has been removed, `zfs receive` is write-only, so this works fine. Because there are no arc bufs for the received data, the received data is no longer cached in the ARC. Testing a receive of a stream with average compressed block size of 4KB, this commit improves performance by 50%, while also reducing CPU usage by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer() and dbuf_evict() is now 1/7th (14%) of what it was. Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35% New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0% The code is also restructured in a few ways: Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies some existing code that no longer needs `DB_DNODE_ENTER()` and related routines. The new field is needed by the lightweight-type dirty record. To ensure that the `dr_dnode` field remains valid until the dirty record is freed, we have to ensure that the `dnode_move()` doesn't relocate the dnode_t. To do this we keep a hold on the dnode until it's zio's have completed. This is already done by the user-accounting code (`userquota_updates_task()`), this commit extends that so that it always keeps the dnode hold until zio completion (see `dnode_rele_task()`). `dn_dirty_txg` was previously zeroed when the dnode was synced. This was not necessary, since its meaning can be "when was this dnode last dirtied". This change simplifies the new `dnode_rele_task()` code. Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11105	2020-12-11 10:26:02 -08:00
Paul Dagnelie	7d4b365ce3	Fix kernel panic induced by redacted send In the redaction list traversal code, there is a bug in the binary search logic when looking for the resume point. Maxbufid can be decremented to -1, causing us to read the last possible block of the object instead of the one we wanted. This can cause incorrect resume behavior, or possibly even a hang in some cases. In addition, when examining non-last blocks, we can treat the block as being the same size as the last block, causing us to miss entries in the redaction list when determining where to resume. Finally, we were ignoring the case where the resume point was found in the buffer being searched, and resuming from minbufid. All these issues have been corrected, and the code has been significantly simplified to make future issues less likely. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #11297	2020-12-11 10:22:29 -08:00
Paul Dagnelie	60a4c7d2a2	Implement memory and CPU hotplug ZFS currently doesn't react to hotplugging cpu or memory into the system in any way. This patch changes that by adding logic to the ARC that allows the system to take advantage of new memory that is added for caching purposes. It also adds logic to the taskq infrastructure to support dynamically expanding the number of threads allocated to a taskq. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #11212	2020-12-10 14:09:23 -08:00
Matthew Macy	1e4732cbda	Decouple arc_read_done callback from arc buf instantiation Add ARC_FLAG_NO_BUF to indicate that a buffer need not be instantiated. This fixes a ~20% performance regression on cached reads due to zfetch changes. Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11220 Closes #11232	2020-12-09 15:05:06 -08:00
Brian Behlendorf	edb20ff3ba	Fix optional "force" arg handing in zfs_ioc_pool_sync() The fnvlist_lookup_boolean_value() function should not be used to check the force argument since it's optional. It may not be provided or may have been created with the wrong flags. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11281 Closes #11284	2020-12-09 14:52:45 -08:00
Brian Behlendorf	83b698dc42	Reduce fletcher4 and raidz benchmark times During module load time all of the available fetcher4 and raidz implementations are benchmarked for a fixed amount of time to determine the fastest available. Manual testing has shown that this time can be significantly reduced with negligible effect on the final results. This commit changes the benchmark time to 1ms which can reduce the module load time by over a second on x86_64. On an x86_64 system with sse3, ssse3, and avx2 instructions the benchmark times are: Fletcher4 603ms -> 15ms RAIDZ 1,322ms -> 64ms Reviewed-by: Matthew Macy <mmacy@freebsd.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11282	2020-12-06 09:57:20 -08:00
Alexander Motin	8136b9d73b	Avoid some spa_has_pending_synctask() calls. Since `8c4fb36a24` (PR #7795) spa_has_pending_synctask() started to take two more locks per write inside txg_all_lists_empty(). I am surprised those pool-wide locks are not contended, but still their operations are visible in CPU profiles under contended vdev lock. This commit slightly changes vdev_queue_max_async_writes() flow to not call the function if we are going to return max_active any way due to high amount of dirty data. It allows to save some CPU time exactly when the pool is busy. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-By: Tom Caputi <caputit1@tcnj.edu> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #11280	2020-12-06 09:55:02 -08:00
George Amanakis	d1d47691c2	Fix raw sends on encrypted datasets when copying back snapshots When sending raw encrypted datasets the user space accounting is present when it's not expected to be. This leads to the subsequent mount failure due a checksum error when verifying the local mac. Fix this by clearing the OBJSET_FLAG_USERACCOUNTING_COMPLETE and reset the local mac. This allows the user accounting to be correctly updated on first mount using the normal upgrade process. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-By: Tom Caputi <caputit1@tcnj.edu> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10523 Closes #11221	2020-12-04 14:34:29 -08:00
Alexander Motin	dcf7044522	Fix for "Reduce latency effects of non-interactive I/O" It was found that setting min_active tunables for non-interactive I/Os makes them stuck. It is caused by zfs_vdev_nia_delay, that can never be reached if we never issue any I/Os due to min_active set to zero. Fix this by issuing at least one non-interactive I/O at a time when there are no interactive I/Os. When there are interactive I/Os, zero min_active allows to completely block any non-interactive I/O. It may min_active starvation in some scenarios, but who we are to deny foot shooting? Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #11261	2020-12-03 10:02:39 -08:00
Ryan Moeller	0aacde2e9a	FreeBSD: notify userspace when a vdev is removed This is needed for zfsd to autoreplace vdevs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11260	2020-12-02 10:20:02 -08:00
Finix1979	ec50cd24ba	Avoid unneccessary zio allocation and wait In function dmu_buf_hold_array_by_dnode, the usage of zio is only for the reading operation. Only create the zio and wait it in the reading scenario as a performance optimization. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Finix Yan <yancw@info2soft.com> Closes #11251 Closes #11256	2020-12-02 09:28:55 -08:00
Brian Behlendorf	04a82e043d	Remove incorrect assertion Commit `85703f6` added a new ASSERT to zfs_write() as part of the cleanup which isn't correct in the case where multiple processes are concurrently extending a file. The `zp->z_size` is updated atomically while holding a range lock on only a portion of the file. Therefore, it's possible for the file size to increase after a same check is performed earlier in the loop causing this ASSERT to fail. The code itself handles this case correctly so only the invalid ASSERT needs to be removed. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11235	2020-11-24 09:28:42 -08:00
Alexander Motin	6f5aac3ca0	Reduce latency effects of non-interactive I/O Investigating influence of scrub (especially sequential) on random read latency I've noticed that on some HDDs single 4KB read may take up to 4 seconds! Deeper investigation shown that many HDDs heavily prioritize sequential reads even when those are submitted with queue depth of 1. This patch addresses the latency from two sides: - by using _min_active queue depths for non-interactive requests while the interactive request(s) are active and few requests after; - by throttling it further if no interactive requests has completed while configured amount of non-interactive did. While there, I've also modified vdev_queue_class_to_issue() to give more chances to schedule at least _min_active requests to the lowest priorities. It should reduce starvation if several non-interactive processes are running same time with some interactive and I think should make possible setting of zfs_vdev_max_active to as low as 1. I've benchmarked this change with 4KB random reads from ZVOL with 16KB block size on newly written non-fragmented pool. On fragmented pool I also saw improvements, but not so dramatic. Below are log2 histograms of the random read latency in milliseconds for different devices: 4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before: 0, 0, 2, 1, 12, 21, 19, 18, 10, 15, 17, 21 after: 0, 0, 0, 24, 101, 195, 419, 250, 47, 4, 0, 0 , that means maximum latency reduction from 2s to 500ms. 4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before: 0, 0, 2, 31, 38, 28, 18, 12, 17, 20, 24, 10, 3 after: 0, 0, 55, 247, 455, 470, 412, 181, 36, 0, 0, 0, 0 , i.e. from 4s to 250ms. 1 SAS HDD SEAGATE ST14000NM0048 before: 0, 0, 29, 70, 107, 45, 27, 1, 0, 0, 1, 4, 19 after: 1, 29, 681, 1261, 676, 1633, 67, 1, 0, 0, 0, 0, 0 , i.e. from 4s to 125ms. 1 SAS SSD SEAGATE XS3840TE70014 before (microseconds): 0, 0, 0, 0, 0, 0, 0, 0, 70, 18343, 82548, 618 after: 0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844, 90 I've also measured scrub time during the test and on idle pools. On idle fragmented pool I've measured scrub getting few percent faster due to use of QD3 instead of QD2 before. On idle non-fragmented pool I've measured no difference. On busy non-fragmented pool I've measured scrub time increase about 1.5-1.7x, while IOPS increase reached 5-9x. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11166	2020-11-24 09:26:42 -08:00
Brian Behlendorf	4d0ba94113	Correct missing zil_claim() DTL updates Commit `a1d477c2` accidentally disabled DTL updates for the zil_claim() case described at the end of vdev_stat_update() by unconditionally disabling all DTL updates when loading. This was done to avoid a deadlock on the vd_dtl_lock when loading the DTLs from disk. vdev_dtl_contains <--- Takes vd->vd_dtl_lock vdev_mirror_child_missing vdev_mirror_io_start zio_vdev_io_start __zio_execute arc_read dbuf_issue_final_prefetch dbuf_prefetch_impl dbuf_prefetch dmu_prefetch space_map_iterate space_map_load_length space_map_load vdev_dtl_load <--- Takes vd->vd_dtl_lock vdev_load spa_ld_load_vdev_metadata spa_tryimport The missing DTL updates can be restored by moving the space_map_load() call outside the vd_dtl_lock. A private range tree is populated by reading the space map and then merged in to the DTL_MISSING tree under the lock. Furthermore, the SPA_LOAD_NONE check in vdev_dtl_contains() leads to an additional problem. Any resilvering which occurs before SPA_LOAD_NONE is set will incorrectly determine that there's nothing to repair. This can result in full redundancy not being restored for some blocks. Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11218	2020-11-20 13:14:45 -08:00
Ryan Moeller	85703f616d	Reduce confusion in zfs_write Is this block when abuf != NULL ever reached? Yes, it is. Add asserts and comments to prove that when we get here, we have a full block write at an aligned offset extending past EOF. Simplify by removing the check that tx_bytes == max_blksz, since we can assert that it is always true. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11191	2020-11-18 15:06:59 -08:00
Matthew Macy	0ca45cb310	Fix problems in zvol_set_volmode_impl - Don't leave fstrans set when passed a snapshot - Don't remove minor if volmode already matches new value - (FreeBSD) Wait for GEOM ops to complete before trying remove (at create time GEOM will be "tasting" in parallel) - (FreeBSD) Don't leak zvol_state_lock on open if zv == NULL - (FreeBSD) Don't try to unlock zv->zv_state lock if zv == NULL Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11199	2020-11-17 09:50:52 -08:00
loli10K	4072f465bc	Fix 'zfs userspace' for received datasets in encrypted root For encrypted receives, where user accounting is initially disabled on creation, both 'zfs userspace' and 'zfs groupspace' fails with EOPNOTSUPP: this is because dmu_objset_id_quota_upgrade_cb() forgets to set OBJSET_FLAG_USERACCOUNTING_COMPLETE on the objset flags after a successful dmu_objset_space_upgrade(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #9501 Closes #9596	2020-11-16 09:10:29 -08:00
George Amanakis	2c210f6818	Fix ASSERT logic in l2arc_evict() In case of cache device removal it is possible that at the end of l2arc_evict() we have l2ad_hand = l2ad_evict. This can lead to the following panic in case of a debug build: VERIFY3(dev->l2ad_hand < dev->l2ad_evict) failed (321920512 < 321920512) Call Trace: dump_stack+0x66/0x90 spl_panic+0xef/0x117 [spl] l2arc_remove_vdev+0x11d/0x290 [zfs] spa_load_l2cache+0x275/0x5b0 [zfs] spa_vdev_remove+0x4a5/0x6e0 [zfs] zfs_ioc_vdev_remove+0x59/0xa0 [zfs] zfsdev_ioctl_common+0x5b3/0x630 [zfs] zfsdev_ioctl+0x53/0xe0 [zfs] do_vfs_ioctl+0x42e/0x6b0 ksys_ioctl+0x5e/0x90 do_syscall_64+0x5b/0x1a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In case of cache device removal it also possible that l2ad_hand + distance > l2ad_end since we do not iterate l2arc_evict() and l2ad_hand is not reset. This has no functional consequence however as the cache device is about to be removed. Fix this by omitting the ASSERT in case of device removal. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #11205	2020-11-16 09:08:11 -08:00
Matthew Ahrens	d66aab7c08	Assertion failure when logging large output of channel program The output of ZFS channel programs is logged on-disk in the zpool history, and printed by `zpool history -i`. Channel programs can use 10MB of memory by default, and up to 100MB by using the `zfs program -m` flag. Therefore their output can be up to some fraction of 100MB. In addition to being somewhat wasteful of the limited space reserved for the pool history (which for large pools is 1GB), in extreme cases this can result in a failure of `ASSERT(length <= DMU_MAX_ACCESS);` in `dmu_buf_hold_array_by_dnode()`. This commit limits the output size that will be logged to 1MB. Larger outputs will not be logged, instead a entry will be logged indicating the size of the omitted output. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11194	2020-11-14 10:17:16 -08:00
Ryan Moeller	7e3617de35	Return EFAULT at the end of zfs_write() when set FreeBSD's VFS expects EFAULT from zfs_write() if we didn't complete the full write so it can retry the operation. Add some missing SET_ERRORs in zfs_write(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11193	2020-11-14 10:16:26 -08:00
Brian Behlendorf	b2255edcc0	Distributed Spare (dRAID) Feature This patch adds a new top-level vdev type called dRAID, which stands for Distributed parity RAID. This pool configuration allows all dRAID vdevs to participate when rebuilding to a distributed hot spare device. This can substantially reduce the total time required to restore full parity to pool with a failed device. A dRAID pool can be created using the new top-level `draid` type. Like `raidz`, the desired redundancy is specified after the type: `draid[1,2,3]`. No additional information is required to create the pool and reasonable default values will be chosen based on the number of child vdevs in the dRAID vdev. zpool create <pool> draid[1,2,3] <vdevs...> Unlike raidz, additional optional dRAID configuration values can be provided as part of the draid type as colon separated values. This allows administrators to fully specify a layout for either performance or capacity reasons. The supported options include: zpool create <pool> \ draid[<parity>][:<data>d][:<children>c][:<spares>s] \ <vdevs...> - draid[parity] - Parity level (default 1) - draid[:<data>d] - Data devices per group (default 8) - draid[:<children>c] - Expected number of child vdevs - draid[:<spares>s] - Distributed hot spares (default 0) Abbreviated example `zpool status` output for a 68 disk dRAID pool with two distributed spares using special allocation classes. ``` pool: tank state: ONLINE config: NAME STATE READ WRITE CKSUM slag7 ONLINE 0 0 0 draid2:8d:68c:2s-0 ONLINE 0 0 0 L0 ONLINE 0 0 0 L1 ONLINE 0 0 0 ... U25 ONLINE 0 0 0 U26 ONLINE 0 0 0 spare-53 ONLINE 0 0 0 U27 ONLINE 0 0 0 draid2-0-0 ONLINE 0 0 0 U28 ONLINE 0 0 0 U29 ONLINE 0 0 0 ... U42 ONLINE 0 0 0 U43 ONLINE 0 0 0 special mirror-1 ONLINE 0 0 0 L5 ONLINE 0 0 0 U5 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 L6 ONLINE 0 0 0 U6 ONLINE 0 0 0 spares draid2-0-0 INUSE currently in use draid2-0-1 AVAIL ``` When adding test coverage for the new dRAID vdev type the following options were added to the ztest command. These options are leverages by zloop.sh to test a wide range of dRAID configurations. -K draid\|raidz\|random - kind of RAID to test -D <value> - dRAID data drives per group -S <value> - dRAID distributed hot spares -R <value> - RAID parity (raidz or dRAID) The zpool_create, zpool_import, redundancy, replacement and fault test groups have all been updated provide test coverage for the dRAID feature. Co-authored-by: Isaac Huang <he.huang@intel.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Co-authored-by: Don Brady <don.brady@delphix.com> Co-authored-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10102	2020-11-13 13:51:51 -08:00
Matthew Ahrens	a724db0374	Channel program may spuriously fail with "memory limit exhausted" ZFS channel programs (invoked by `zfs program`) are executed in a LUA sandbox with a limit on the amount of memory they can consume. The limit is 10MB by default, and can be raised to 100MB with the `-m` flag. If the memory limit is exceeded, the LUA program exits and the command fails with a message like `Channel program execution failed: Memory limit exhausted.` The LUA sandbox allocates memory with `vmem_alloc(KM_NOSLEEP)`, which will fail if the requested memory is not immediately available. In this case, the program fails with the same message, `Memory limit exhausted`. However, in this case the specified memory limit has not been reached, and the memory may only be temporarily unavailable. This commit changes the LUA memory allocator `zcp_lua_alloc()` to use `vmem_alloc(KM_SLEEP)`, so that we won't spuriously fail when memory is temporarily low. Instead, we rely on the system to be able to free up memory (e.g. by evicting from the ARC), and we assume that even at the highest memory limit of 100MB, the channel program will not truly exhaust the system's memory. External-issue: DLPX-71924 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #11190	2020-11-11 17:16:15 -08:00
Mateusz Guzik	18ca574f0a	G/C data_alloc_arena It is a leftover from illumos always set to NULL and introducing a spurious difference between zio_buf and zio_data_buf. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11188	2020-11-11 17:11:32 -08:00
Ryan Moeller	d1dd72a2c5	Simplify offset and length limit in zfs_write Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11176	2020-11-10 10:58:59 -08:00
Ryan Moeller	9a764716fc	Const some unchanging variables in zfs_write Show that these values will not be changing later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11176	2020-11-10 10:58:59 -08:00
Ryan Moeller	8a9634e2f3	Remove redundant oid parameter to update_pages The oid comes from the znode we are already passing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11176	2020-11-10 10:54:30 -08:00
Ryan Moeller	eec6646ea9	Factor uid, gid, and projid out of loop in zfs_write Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11176	2020-11-10 10:53:19 -08:00
Alexander Motin	daabddaac1	Fix dmu_tx_dirty_throttle after arc_c reduction After initial arc_c was reduced to arc_c_min it became possible that on datasets with primarycache=metadata or none dirty data make up most of ARC capacity and easily more than configured 50% of initial arc_c, that causes forced txg commits by arc_tempreserve_space() and periodic very long write delays. This patch makes arc_tempreserve_space() to use arc_c only after ARC warmed up once and arc_c really means something, but use arc_c_max before that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #11178	2020-11-10 10:39:26 -08:00
Matthew Macy	570d7038d0	Fix dnode refcount tracking Fix a couple of places where the wrong tag is passed to dnode_{hold, rele} Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11184	2020-11-10 10:37:10 -08:00
Christian Schwarz	ab8c935ea6	zfs_vnops: make zfs_get_data OS-independent Move zfs_get_data() in to platform-independent code. The only platform-specific aspect of it is the way we release an inode (Linux) / vnode_t (FreeBSD). I am not aware of a platform that could be supported by ZFS that couldn't implement zfs_rele_async itself. It's sibling zvol_get_data already is platform-independent. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #10979	2020-11-02 12:07:07 -08:00
Mateusz Guzik	09eb36ce3d	Introduce CPU_SEQID_UNSTABLE Current CPU_SEQID users don't care about possibly changing CPU ID, but enclose it within kpreempt disable/enable in order to fend off warnings from Linux's CONFIG_DEBUG_PREEMPT. There is no need to do it. The expected way to get CPU ID while allowing for migration is to use raw_smp_processor_id. In order to make this future-proof this patch keeps CPU_SEQID as is and introduces CPU_SEQID_UNSTABLE instead, to make it clear that consumers explicitly want this behavior. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Closes #11142	2020-11-02 11:51:12 -08:00
Matthew Macy	8583540c6e	Consolidate zfs_holey and zfs_access The zfs_holey() and zfs_access() functions can be made common to both FreeBSD and Linux. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11125	2020-10-31 09:40:08 -07:00
Matthew Macy	5fa356ea44	Remove UIO_ZEROCOPY functions structures The original xuio zero copy functionality has always been unused on Linux and FreeBSD. Remove this disabled code to avoid any confusion and improve readability. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11124	2020-10-30 10:00:33 -07:00
Alexander Motin	1199c3e8fb	Yield periodically when rebuilding L2ARC L2ARC devices of several terabytes filled with 4KB blocks may take 15 minutes to rebuild. Due to the way L2ARC log reading is implemented it is quite likely that for all that time rebuild thread will never sleep. At least on FreeBSD kernel threads have absolute priority and can not be preempted by threads with lower priorities. If some thread is also bound to that specific CPU it may not get any CPU time for all the 15 minutes. Reviewed-by: Cedric Berger <cedric@precidata.com> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: George Amanakis <gamanakis@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #11116	2020-10-30 08:57:54 -07:00
Ryan Moeller	76d04993a6	Update references to nonexistent man pages in code Refer to the correct section or alternative for FreeBSD and Linux. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11132	2020-10-30 08:55:59 -07:00
Ryan Moeller	eb02a4c6fb	Add missing zfs_arc_evict_batch_limit tunable It's even documented already. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11094	2020-10-22 10:18:26 -07:00
Matthew Macy	e53d678d4a	Share zfs_fsync, zfs_read, zfs_write, et al between Linux and FreeBSD The zfs_fsync, zfs_read, and zfs_write function are almost identical between Linux and FreeBSD. With a little refactoring they can be moved to the common code which is what is done by this commit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11078	2020-10-21 14:08:06 -07:00
Adam D. Moss	666aa69f32	Non-l2arc pool reads shouldn't be l2arc misses The current l2_misses accounting behavior treats all reads to pools without a configured l2arc as an l2arc miss, IFF there is at least one other pool on the system which does have an l2arc configured. This makes it extremely hard to tune for an improved l2arc hit/miss ratio because this ratio will be modulated by reads from pools which do not (and should not) have l2arc devices; its upper limit will depend on the ratio of reads from l2arc'd pools and non-l2arc'd pools. This PR prevents ARC reads affecting l2arc stats (n.b. l2_misses is the only relevant one) where the target spa doesn't have an l2arc. Includes new test - l2arc_l2miss_pos.ksh Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Adam Moss <c@yotes.com> Closes #10921	2020-10-20 11:39:52 -07:00
Don Brady	dff71c7936	Ignore special vdev ashift for spa ashift min/max The removal of a vdev in the normal class would fail if there was a special or deup vdev that had a different ashift than the vdevs in the normal class. Moved the initialization of spa_min_ashift / spa_max_ashift from vdev_open so that it occurs after the vdev allocation bias was initialized (i.e. after vdev_load). Caveat -- In order to remove a special/dedup vdev it must have the same ashift as the normal pool vdevs. This could perhaps be lifted in the future (i.e. for the case where there is ample space in any surviving special class vdevs) Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #9363 Closes #9364 Closes #11053	2020-10-15 14:45:16 -07:00
Christian Schwarz	15a4ca4620	Fix crash caused by invalid snapshot names in redactnvl This is a follow up fix for commit `0fdd6106bb`. The VERIFY is only true when we haven't hit an error code path. See added test case for a reproducer. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #11048	2020-10-14 14:04:19 -07:00
Paul Dagnelie	6a60ef80e2	Fix incorrect deletion order in range_tree_add_impl gap case After a side-effectful call like add or remove, references to range segs stored in btrees can no longer be used safely. We move the remove call to just before the reinsertion call so that the seg remains valid for as long as we need it. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #11044 Closes #11056	2020-10-14 08:59:54 -07:00
Matthew Macy	57dc5d42b1	dmu_zfetch: don't leak unreferenced stream when zfetch is freed Currently streams are only freed when: - They have no referencing zfetch and and their I/O references go to zero. - They are more than 2s old and a new I/O request comes in on the same zfetch. This means that we will leak unreferenced streams when their zfetch structure is freed. This change checks the reference count on a stream at zfetch free time. If it is zero we free it immediately. If it has remaining references we allow the prefetch callback to free it at I/O completion time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #11052	2020-10-13 21:03:36 -07:00
Ryan Moeller	7dfc56d866	Expose zfetch_max_idistance tunable FreeBSD had this value tunable before the switch to the new OpenZFS. The tunable name has changed, breaking legacy compat. Restore legacy compat for this tunable, properly expose the tunable with the new name on all platforms, and document it in zfs-module-parameters(5). While here, clean up the documentation for zfetch_max_distance a bit. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #11038	2020-10-13 09:32:34 -07:00
Christian Schwarz	61868bb14d	zil_parse: make callback parameters const Code cleanup, a follow up commit to `4d55ea81`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Co-authored-by: Ryan Moeller <ryan@freqlabs.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #11020	2020-10-09 09:34:54 -07:00
Brian Behlendorf	d0249a4bd0	Replace ZFS on Linux references with OpenZFS This change updates the documentation to refer to the project as OpenZFS instead ZFS on Linux. Web links have been updated to refer to https://github.com/openzfs/zfs. The extraneous zfsonlinux.org web links in the ZED and SPL sources have been dropped. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #11007	2020-10-08 20:10:13 -07:00
Chuck Tuffli	a8fc1b8743	Fix ubsan: shift exponent is too large When running libzpool with the Undefined Behavior Sanitizer (ubsan) enabled, a zpool create causes a run-time error: module/zfs/vdev_label.c:600:14: runtime error: shift exponent 64 is too large for 64-bit type 'long long unsigned int'` in vdev_config_generate() Fix is to convert vdev_removal_max_span to its base-2 logarithm, using highbit64(), and then compare the "shifts". Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Chuck Tuffli <ctuffli@gmail.com> Closes #9744 Closes #11024	2020-10-08 16:37:27 -07:00
George Amanakis	a76e4e6761	Make L2ARC tests more robust Instead of relying on arbitrary timers after pool export/import or cache device off/online rely on arcstats. This makes the L2ARC tests more robust. Also cleanup some functions related to persistent L2ARC. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam Moss <c@yotes.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10983	2020-10-05 15:29:05 -07:00
Ryan Moeller	4d55ea811d	Throw const on some strings In C, const indicates to the reader that mutation will not occur. It can also serve as a hint about ownership. Add const in a few places where it makes sense. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10997	2020-10-02 17:44:10 -07:00
John Poduska	5b525165e9	Mismatched nvlist names in zfs_keys_send_space This causes "zfs send -vt ..." to fail with: cannot resume send: Unknown error 1030 It turns out that some of the name/value pairs in the verification list for zfs_ioc_send_space(), zfs_keys_send_space, had the wrong name, so the ioctl got kicked out in zfs_check_input_nvpairs(). Update the names accordingly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #10978	2020-10-02 17:40:46 -07:00
Matthew Macy	1cb8202b1b	Eliminate gratuitous bzeroing in dbuf_stats_hash_table_data `dbuf_stats_hash_table_data` can take much longer than it needs to by repeatedly bzeroing its buffer when in fact the buffer only needs to be NULL terminated. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10993	2020-09-30 13:24:38 -07:00
Sebastian Gottschall	8a171ccd92	do a cyclic seek for unused memory objects in pool In non regular use cases allocated memory might stay persistent in memory pool. This small patch checks every minute if there are old objects which can be released from memory pool. Right now with regular use, the pool is checked for old objects on each allocation attempt from this pool. so basically polling by its use. Now consider what happens if someone writes a lot of files and stops use of the volume or even unmounts it. So the code will no longer check if objects can be released from the pool. Already allocated objects will still stay in pool cache. this is no big issue for common use. But someone discovered this issue while doing tests. personally i know this behavior and I'm aware of it. Its no big issue. just a enhancement Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Closes #10938 Closes #10969	2020-09-30 13:22:34 -07:00
Ryan Moeller	c0bd2e0fe2	Drop references when skipping dmu_send due to EXDEV When an invalid incremental send is requested where the "to" ds is before the "from" ds, make sure to drop the reference to the pool and the dataset before returning the error. Add an assert on FreeBSD to make sure we don't hold any locks after returning from an ioctl. Add some test coverage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10919	2020-09-30 13:19:49 -07:00
Matthew Macy	af20b97078	zfetch: Don't issue new streams when old have not completed The current dmu_zfetch code implicitly assumes that I/Os complete within min_sec_reap seconds. With async dmu and a readonly workload (and thus no exponential backoff in operations from the "write throttle") such as L2ARC rebuild it is possible to saturate the drives with I/O requests. These are then effectively compounded with prefetch requests. This change reference counts streams and prevents them from being recycled after their min_sec_reap timeout if they still have outstanding I/Os. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10900	2020-09-27 17:08:38 -07:00
Adam D. Moss	acfd2d4641	Add DB_RF_NOPREFETCH to dbuf_read()s in dnode.c Prefetching of dnodes in dbuf_read() can cause significant mutex contention for some workloads and isn't very helpful. This is because we already get 32 dnodes for each block read, and when iterating over a directory we prefetch the dnodes in the directory. Disable this prefetching to prevent the lock contention. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Submitted-by: Adam Moss <c@yotes.com> Submitted-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Adam Moss <c@yotes.com> Closes #10877 Closes #10953	2020-09-25 13:49:22 -07:00
Ryan Moeller	863e38453e	Prune dead branch reported by Coverity wkey is NULL at every `goto error;`. dcp is never NULL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10884	2020-09-25 13:11:53 -07:00
Christian Schwarz	a5c77dc4d5	zfs_log_write: simplify data copying code for WR_COPIED records lr_write_t records that are WR_COPIED have the record data directly appended to them (see lr_write_t type definition). The data is copied from the debuf using dmu_read_by_dnode. This function was called, only for WR_COPIED records, as part of a short-circuiting if-statement's if-expression. I found this side-effectful call to dmu_read_by_dnode pretty hard to spot. This patch improves readability by moving the call to its own line. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Christian Schwarz <me@cschwarz.com> Closes #10956	2020-09-25 13:06:34 -07:00
Matthew Macy	7b8363d7f0	FreeBSD: Add support for procfs_list The procfs_list interface is required by several kstats. Implement this functionality for FreeBSD to provide access to these kstats. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10890	2020-09-23 16:43:51 -07:00
Paul Dagnelie	20dfe8cd3b	Don't set numobjs to UINT64_MAX or near it Resolves an issue with `zfs send` streams from 0.8.4 which prevents them from being received by versions < 0.7. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10911 Closes #10916	2020-09-22 16:16:07 -07:00
George Amanakis	c6f5e9d92f	Restore clearing of L2CACHE flag in arc_read_done() Commit 45152dc removed clearing of L2CACHE flag in arc_read_done() and moved related code in l2arc_write_eligible(). After careful code inspection arc_read_done() is not bypassed in the case of prefetches. Thus restore the old behavior. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: adam moss <c@yotes.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10951	2020-09-22 16:08:05 -07:00
George Wilson	c494aa7f57	vdev_ashift should only be set once == Motivation and Context The new vdev ashift optimization prevents the removal of devices when a zfs configuration is comprised of disks which have different logical and physical block sizes. This is caused because we set 'spa_min_ashift' in vdev_open and then later call 'vdev_ashift_optimize'. This would result in an inconsistency between spa's ashift calculations and that of the top-level vdev. In addition, the optimization logical ignores the overridden ashift value that would be provided by '-o ashift=<val>'. == Description This change reworks the vdev ashift optimization so that it's only set the first time the device is configured. It still allows the physical and logical ahsift values to be set every time the device is opened but those values are only consulted on first open. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Cedric Berger <cedric@precidata.com> Signed-off-by: George Wilson <gwilson@delphix.com> External-Issue: DLPX-71831 Closes #10932	2020-09-18 12:13:47 -07:00
Pavel Snajdr	9569c31161	Fix stack frame size: dnode_dirty_l1range() Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #10879	2020-09-15 15:55:55 -07:00
Pavel Snajdr	a1c5578ce0	dmu_redact_snap: fix possible memleak Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #10879	2020-09-15 15:55:45 -07:00
Pavel Snajdr	8c0b16e6e9	Fix stack frame size: dmu_redact_snap() Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #10879	2020-09-15 15:55:35 -07:00
Pavel Snajdr	c95625769d	Fix stack frame size: spa_livelist_delete_cb() Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Closes #10879	2020-09-15 15:55:03 -07:00
Toomas Soome	1db9e6e4e4	zfs label bootenv should store data as nvlist nvlist does allow us to support different data types and systems. To encapsulate user data to/from nvlist, the libzfsbootenv library is provided. Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #10774	2020-09-15 15:42:27 -07:00
George Amanakis	085321621e	Add L2ARC arcstats for MFU/MRU buffers and buffer content type Currently the ARC state (MFU/MRU) of cached L2ARC buffer and their content type is unknown. Knowing this information may prove beneficial in adjusting the L2ARC caching policy. This commit adds L2ARC arcstats that display the aligned size (in bytes) of L2ARC buffers according to their content type (data/metadata) and according to their ARC state (MRU/MFU or prefetch). It also expands the existing evict_l2_eligible arcstat to differentiate between MFU and MRU buffers. L2ARC caches buffers from the MRU and MFU lists of ARC. Upon caching a buffer, its ARC state (MRU/MFU) is stored in the L2 header (b_arcs_state). The l2_m{f,r}u_asize arcstats reflect the aligned size (in bytes) of L2ARC buffers according to their ARC state (based on b_arcs_state). We also account for the case where an L2ARC and ARC cached MRU or MRU_ghost buffer transitions to MFU. The l2_prefetch_asize reflects the alinged size (in bytes) of L2ARC buffers that were cached while they had the prefetch flag set in ARC. This is dynamically updated as the prefetch flag of L2ARC buffers changes. When buffers are evicted from ARC, if they are determined to be L2ARC eligible then their logical size is recorded in evict_l2_eligible_m{r,f}u arcstats according to their ARC state upon eviction. Persistent L2ARC: When committing an L2ARC buffer to a log block (L2ARC metadata) its b_arcs_state and prefetch flag is also stored. If the buffer changes its arcstate or prefetch flag this is reflected in the above arcstats. However, the L2ARC metadata cannot currently be updated to reflect this change. Example: L2ARC caches an MRU buffer. L2ARC metadata and arcstats count this as an MRU buffer. The buffer transitions to MFU. The arcstats are updated to reflect this. Upon pool re-import or on/offlining the L2ARC device the arcstats are cleared and the buffer will now be counted as an MRU buffer, as the L2ARC metadata were not updated. Bug fix: - If l2arc_noprefetch is set, arc_read_done clears the L2CACHE flag of an ARC buffer. However, prefetches may be issued in a way that arc_read_done() is bypassed. Instead, move the related code in l2arc_write_eligible() to account for those cases too. Also add a test and update manpages for l2arc_mfuonly module parameter, and update the manpages and code comments for l2arc_noprefetch. Move persist_l2arc tests to l2arc. Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10743	2020-09-14 10:10:44 -07:00
Olaf Faaland	a74259cea0	Initialize mmp_last_write when the mmp thread starts A great deal of time may go by between when mmp_init() is called and the MMP thread starts, particularly if there are bad devices, because there is I/O checking configs etc. If this time is too long, (gethrtime() - mmp_last_write) > mmp_fail_ns at the time the MMP thread starts. If MMP is configured to suspend the pool, the pool will be suspended immediately. This can be seen in issue #10838 The value of mmp_last_write doesn't matter before the mmp thread starts. To give the MMP thread time to issue and land MMP writes, initialize mmp_last_write when the MMP thread starts. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #10873	2020-09-09 10:12:54 -07:00
George Amanakis	feb3a7eef1	Introduce ZFS module parameter l2arc_mfuonly In certain workloads it may be beneficial to reduce wear of L2ARC devices by not caching MRU metadata and data into L2ARC. This commit introduces a new tunable l2arc_mfuonly for this purpose. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10710	2020-09-08 11:44:37 -07:00
Toomas Soome	189272f78a	dnode_special_open() error: unchecked function return 'zrl_tryenter' Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #10876	2020-09-08 11:36:52 -07:00
Matthew Macy	7432d29760	FreeBSD: reduce priority of ZIO_TASKQ_ISSUE writes by a larger value On FreeBSD, if priorities divided by four (RQ_PPQ) are equal then a difference between them is insignificant. In other words, incrementing pri by only one as on Linux is insufficient. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10872	2020-09-04 11:13:27 -07:00
Brian Behlendorf	dce63135ae	Sequential scrub and resilver updated comments Commit `d4a72f2` which introduced multi-phase scrubs and resilvers continued the work presented by Nexenta at the 2016 ZFS developer summit. Update the source to reflect their contribution. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2020-09-04 10:51:51 -07:00
Don Brady	4f07282786	Avoid posting duplicate zpool events Duplicate io and checksum ereport events can misrepresent that things are worse than they seem. Ideally the zpool events and the corresponding vdev stat error counts in a zpool status should be for unique errors -- not the same error being counted over and over. This can be demonstrated in a simple example. With a single bad block in a datafile and just 5 reads of the file we end up with a degraded vdev, even though there is only one unique error in the pool. The proposed solution to the above issue, is to eliminate duplicates when posting events and when updating vdev error stats. We now save recent error events of interest when posting events so that we can easily check for duplicates when posting an error. Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #10861	2020-09-04 10:34:28 -07:00
Matthew Ahrens	3808032489	nowait synctask must succeed If a `zfs_space_check_t` other than `ZFS_SPACE_CHECK_NONE` is used with `dsl_sync_task_nowait()`, the sync task may fail due to ENOSPC. However, there is no way to notice or communicate this failure, so it's extremely difficult to use this functionality correctly, and in fact almost all callers use `ZFS_SPACE_CHECK_NONE`. This commit removes the `zfs_space_check_t` argument from `dsl_sync_task_nowait()`, and always uses `ZFS_SPACE_CHECK_NONE`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10855	2020-09-04 10:29:39 -07:00
Ryan Moeller	cd80273909	Retain thread name when resuming a zthr When created, a zthr is given a name to identify it by. This name is lost when a cancelled zthr is resumed. Retain the name of a zthr so it can be used when resuming. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10881	2020-09-03 20:09:52 -07:00
Matthew Macy	ac6e5fb202	Replace cv_{timed}wait_sig with cv_{timed}wait_idle where appropriate There are a number of places where cv_?_sig is used simply for accounting purposes but the surrounding code has no ability to cope with actually receiving a signal. On FreeBSD it is possible to send signals to individual kernel threads so this could enable undesirable behavior. This patch adds routines on Linux that will do the same idle accounting as _sig without making the task interruptible. On FreeBSD cv__idle are all aliases for cv_ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10843	2020-09-03 20:04:09 -07:00
Ryan Moeller	964791acdc	Make spa_stats.c tunables visible on FreeBSD Use ZFS_MODULE_PARAM for cross-platform tunables in spa_stats.c, and add update tunables.cfg in tests for the newly supported ones. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10858	2020-09-01 16:19:19 -07:00
Matthew Macy	e84e49218f	FreeBSD: Fix up after spa_stats.c move Moving spa_stats added the additional burden of supporting KSTAT_TYPE_IO. spa_state_addr will always return a valid value regardless of the value of 'n'. On FreeBSD this will cause an infinite loop as it relies on the raw ops addr routine to indicate that there is no more data. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10860	2020-09-01 16:16:56 -07:00
Ryan Moeller	7b4e27232d	Add 'zfs rename -u' to rename without remounting Allow to rename file systems without remounting if it is possible. It is possible for file systems with 'mountpoint' property set to 'legacy' or 'none' - we don't have to change mount directory for them. Currently such file systems are unmounted on rename and not even mounted back. This introduces layering violation, as we need to update 'f_mntfromname' field in statfs structure related to mountpoint (for the dataset we are renaming and all its children). In my opinion it is worth it, as it allow to update FreeBSD in even cleaner way - in ZFS-only configuration root file system is ZFS file system with 'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs, we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs), update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs and rename it back to system/rootfs while it is mounted as /. Before it was not possible, because unmounting / was not possible. Authored by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: Matt Macy <mmacy@freebsd.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10839	2020-09-01 16:14:16 -07:00
Toomas Soome	1144586b57	zio_ereport_post() and zio_ereport_start() return values are ignored use (void) to silence analyzers. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Toomas Soome <tsoome@me.com> Closes #10857	2020-08-31 19:35:11 -07:00
Matthew Macy	7bb18b94c7	Move spa_stats.c to common code Initially it was considered simplest to stub out all of the functions on FreeBSD. Now that FreeBSD supports KSTAT_TYPE_RAW at least some of the functionality should be made available. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10842	2020-08-30 14:12:46 -07:00
Patrick Mooney	8d42c98d95	dnode_sync is careless with range tree Because dnode_sync_free_range() must drop dn_mtx during its processing, using it as a callback to range_tree_vacate() is not safe. No other operations (besides destroy) are allowed once range_tree_vacate() has begun, and dropping dn_mtx would leave a window open for another thread to observe that invalid (and unsafe) state via dnode_block_freed(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Patrick Mooney <pmooney@oxide.computer> Closes #10708 Closes #10823	2020-08-26 21:48:29 -07:00
Ryan Moeller	a2f944a140	zpool: Change base URL for ZFS messages to openzfs-docs Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10820	2020-08-26 21:43:06 -07:00
Brian Behlendorf	03f5d2fd6a	Remove duplicate dnode.h include The zfs/sa.c source file accidentally includes sys/dnode.h twice. Remove the second occurrence. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10816 Closes #10819	2020-08-26 21:41:09 -07:00
Paul Dagnelie	4aa3b3bd47	Always track temporary fses and snapshots for accounting The root cause of the issue is that we only occasionally do as the comments in the code suggest and actually ignore the %recv dataset when it comes to filesystem limit tracking. Specifically, the only time we ignore it is when initializing the filesystem and snapshot limit values; when creating a new %recv dataset or deleting one, we always update the bookkeeping. This causes a problem if you init the fs count on a filesystem that already has a %recv dataset, since the bookmarking will be decremented but not incremented. This is resolved in this patch by simply always tracking the %recv dataset as a child. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10791	2020-08-26 21:38:27 -07:00
Matthew Macy	2dbad44710	FreeBSD: disable neon usage The neon support code does not build on FreeBSD, ifdef out references to fix linker issues on arm64. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10809	2020-08-26 09:54:37 -07:00
Alexander Motin	523e1295fe	Introduce limit on size of L2ARC headers Since L2ARC buffers are not evicted on memory pressure, too large amount of headers on system with irrationally large L2ARC can render it slow or even unusable. This change limits L2ARC writes and rebuild if unevictable L2ARC-only headers reach dangerous level. While there, call arc_adapt() on L2ARC rebuild, so that it could properly grow arc_c, reflecting potentially significant ARC size increase and avoiding slow growth with hopeless eviction attempts later when "overflow" is detected. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #10765	2020-08-25 14:33:36 -07:00
Brian Behlendorf	94dac3e880	Export dmu_offset_next() symbol Export the dmu_offset_next() symbol for use by Lustre. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10796	2020-08-25 08:34:41 -07:00
Sebastian Gottschall	184df27eef	Avoid symbol collision with in-kernel zstdlib For Linux, when zfs is compiled as an in kernel static variant and the in kernel zstd library is compiled statically into the kernel a symbol collision will occur. This wrapper header renames all of the relevant zstd functions to avoid this problem. Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Closes #10775	2020-08-24 12:20:41 -07:00
Ryan Moeller	6fe3498ca3	Import vdev ashift optimization from FreeBSD Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices. Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks: 1. Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device. 2. The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit. Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matthew Macy <mmacy@freebsd.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10619	2020-08-21 12:53:17 -07:00
Matthew Ahrens	3dc18995bd	Fix indentation in dnode_free_range() Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10744	2020-08-20 11:45:20 -07:00
Matthew Macy	1c2725a157	FreeBSD: 11.x arc_stats compatibility Removing other_size from arc_stats breaks top in 11.x jails running on HEAD. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10745	2020-08-20 10:55:02 -07:00
Michael Niewöhner	10b3c7f5e4	Add zstd support to zfs This PR adds two new compression types, based on ZStandard: - zstd: A basic ZStandard compression algorithm Available compression. Levels for zstd are zstd-1 through zstd-19, where the compression increases with every level, but speed decreases. - zstd-fast: A faster version of the ZStandard compression algorithm zstd-fast is basically a "negative" level of zstd. The compression decreases with every level, but speed increases. Available compression levels for zstd-fast: - zstd-fast-1 through zstd-fast-10 - zstd-fast-20 through zstd-fast-100 (in increments of 10) - zstd-fast-500 and zstd-fast-1000 For more information check the man page. Implementation details: Rather than treat each level of zstd as a different algorithm (as was done historically with gzip), the block pointer `enum zio_compress` value is simply zstd for all levels, including zstd-fast, since they all use the same decompression function. The compress= property (a 64bit unsigned integer) uses the lower 7 bits to store the compression algorithm (matching the number of bits used in a block pointer, as the 8th bit was borrowed for embedded block pointers). The upper bits are used to store the compression level. It is necessary to be able to determine what compression level was used when later reading a block back, so the concept used in LZ4, where the first 32bits of the on-disk value are the size of the compressed data (since the allocation is rounded up to the nearest ashift), was extended, and we store the version of ZSTD and the level as well as the compressed size. This value is returned when decompressing a block, so that if the block needs to be recompressed (L2ARC, nop-write, etc), that the same parameters will be used to result in the matching checksum. All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`, `zio_prop_t`, etc.) uses the separated _compress and _complevel variables. Only the properties ZAP contains the combined/bit-shifted value. The combined value is split when the compression_changed_cb() callback is called, and sets both objset members (os_compress and os_complevel). The userspace tools all use the combined/bit-shifted value. Additional notes: zdb can now also decode the ZSTD compression header (flag -Z) and inspect the size, version and compression level saved in that header. For each record, if it is ZSTD compressed, the parameters of the decoded compression header get printed. ZSTD is included with all current tests and new tests are added as-needed. Per-dataset feature flags now get activated when the property is set. If a compression algorithm requires a feature flag, zfs activates the feature when the property is set, rather than waiting for the first block to be born. This is currently only used by zstd but can be extended as needed. Portions-Sponsored-By: The FreeBSD Foundation Co-authored-by: Allan Jude <allanjude@freebsd.org> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Co-authored-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #6247 Closes #9024 Closes #10277 Closes #10278	2020-08-20 10:30:06 -07:00
Brian Behlendorf	cfd59f904b	Fix ARC aggsum access after arc_state_fini() Commit `85ec5cbae` updated abd_update_scatter_stats() such that it calls arc_space_consume() and arc_space_return() when updating the scatter stats. This requires that the global aggsum value for the ARC be initialized. Normally this is not an issue, however during module unload the l2arc_do_free_on_write() function was called in l2arc_cleanup() after arc_state_fini() destroyed the aggsum values. We can resolve this issue by performing l2arc_do_free_on_write() slightly earlier in arc_fini(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10739	2020-08-18 22:11:34 -07:00
Matthew Macy	716b53d0a1	FreeBSD: Fix UNIX permissions checking Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10727	2020-08-18 09:57:07 -07:00
Ryan Moeller	009cc8e884	Make zc_nvlist_src_size limit tunable We limit the size of nvlists passed to the kernel so a user cannot make the kernel do an unreasonably large allocation. On FreeBSD this limit was 128 kiB, which turns out to be a bit too small when doing some operations involving a large number of datasets or snapshots, for example replication. Make this limit tunable, with a platform-specific auto default. Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the system limit on user wired memory, which allows it to scale depending on system configuration. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Issue #6572 Closes #10706	2020-08-18 09:33:55 -07:00
Richard Laager	eaa25f1a8e	Remove GRUB restrictions The GRUB restrictions are based around the pool's bootfs property. Given the current situation where GRUB is not staying current with OpenZFS pool features, having either a non-ZFS /boot or a separate pool with limited features are pretty much the only long-term answers for GRUB support. Only the second case matters in this context. For the restrictions to be useful, the bootfs property would have to be set on the boot pool, because that is where we need the restrictions, as that is the pool that GRUB reads from. The documentation for bootfs describes it as pointing to the root pool. That's also how it's used in the initramfs. ZFS does not allow setting bootfs to point to a dataset in another pool. (If it did, it'd be difficult-to-impossible to enforce these restrictions cross-pool). Accordingly, bootfs is pretty much useless for GRUB scenarios moving forward. Even for users who have only one pool, the existing restrictions for GRUB are incomplete. They don't prevent you from enabling the unsupported checksums, for example. For that reason, I have ripped out all the GRUB restrictions. A little longer-term, I think extending the proposed features=portable system to define a features=grub is a much more useful approach. The user could set that on the boot pool at creation, and things would Just Work. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8627	2020-08-17 23:12:39 -07:00
Matthew Ahrens	85ec5cbae2	Include scatter_chunk_waste in arc_size The ARC caches data in scatter ABD's, which are collections of pages, which are typically 4K. Therefore, the space used to cache each block is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is not aware of the memory used by this round-up, it only accounts for the size that it requested from the ABD subsystem. Therefore, the ARC is effectively using more memory than it is aware of, due to the `scatter_chunk_waste`. This impacts observability, e.g. `arcstat` will show that the ARC is using less memory than it effectively is. It also impacts how the ARC responds to memory pressure. As the amount of `scatter_chunk_waste` changes, it appears to the ARC as memory pressure, so it needs to resize `arc_c`. If the sector size (`1<<ashift`) is the same as the page size (or larger), there won't be any waste. If the (compressed) block size is relatively large compared to the page size, the amount of `scatter_chunk_waste` will be small, so the problematic effects are minimal. However, if using 512B sectors (`ashift=9`), and the (compressed) block size is small (e.g. `compression=on` with the default `volblocksize=8k` or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be very large. On a production system, with `arc_size` at a constant 50% of memory, `scatter_chunk_waste` has been been observed to be 10-30% of memory. This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new `waste` field to `arcstat`. As a result, the ARC's memory usage is more observable, and `arc_c` does not need to be adjusted as frequently. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10701	2020-08-17 20:04:04 -07:00
Ryan Moeller	3df0c2fa32	FreeBSD: fix the build with Clang 11 * Cast void * to uintptr_t before casting to boolean_t. * Avoid clashing definition of __asm when not on Linux to prevent duplicate __volatile__. This was already done in some places but not all. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10723	2020-08-17 15:40:17 -07:00
Serapheim Dimitropoulos	b0099072df	Fix typo in btree.c Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #10725	2020-08-17 15:25:37 -07:00
Matthew Macy	5f1984f2f8	FreeBSD: fallback to /boot/ to look for zpool.cache Up until now zpool.cache has always lived in /boot on FreeBSD. For the sake of compatibility fallback to /boot if zpool.cache isn't found in /etc/zfs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10720	2020-08-17 14:43:47 -07:00
Ryan Moeller	3eaf76a8d2	Fix l2arc_dev_rebuild_start thread name `thread_create` on FreeBSD stringifies the argument passed as the thread function to create a name for the thread. The thread name for `l2arc_dev_rebuild_start` ended up with `(void ()(void ))` in it. Change the type signature so the function does not need to be cast when creating the thread. Rename the function to `l2arc_dev_rebuild_thread` for clarity and consistency, as well. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10716	2020-08-17 11:02:32 -07:00
Allan Jude	fc34dfba8e	Fix L2ARC reads when compressed ARC disabled When reading compressed blocks from the L2ARC, with compressed ARC disabled, arc_hdr_size() returns LSIZE rather than PSIZE, but the actual read is PSIZE. This causes l2arc_read_done() to compare the checksum against the wrong size, resulting in checksum failure. This manifests as an increase in the kstat l2_cksum_bad and the read being retried from the main pool, making the L2ARC ineffective. Add new L2ARC tests with Compressed ARC enabled/disabled Blocks are handled differently depending on the state of the zfs_compressed_arc_enabled tunable. If a block is compressed on-disk, and compressed_arc is enabled: - the block is read from disk - It is NOT decompressed - It is added to the ARC in its compressed form - l2arc_write_buffers() may write it to the L2ARC (as is) - l2arc_read_done() compares the checksum to the BP (compressed) However, if compressed_arc is disabled: - the block is read from disk - It is decompressed - It is added to the ARC (uncompressed) - l2arc_write_buffers() will use l2arc_apply_transforms() to recompress the block, before writing it to the L2ARC - l2arc_read_done() compares the checksum to the BP (compressed) - l2arc_read_done() will use l2arc_untransform() to uncompress it This test writes out a test file to a pool consisting of one disk and one cache device, then randomly reads from it. Since the arc_max in the tests is low, this will feed the L2ARC, and result in reads from the L2ARC. We compare the value of the kstat l2_cksum_bad before and after to determine if any blocks failed to survive the trip through the L2ARC. Sponsored-by: The FreeBSD Foundation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #10693	2020-08-13 23:31:20 -07:00
Jorgen Lundman	faa296c73c	Release onexit/events with any missed zfsdev_state Linux and FreeBSD will most likely never see this issue. On macOS when kext is unloaded, but zed is still connected, zed will be issued ENODEV. As the cdevsw is released, the kernel will not have zfsdev_release() called to release minor/onexit/events, and it "leaks". This ensures it is cleaned up before unload. Changed the for loop from zsprev, to zsnext style, for less code duplication. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10700	2020-08-13 15:03:23 -07:00
Matthew Ahrens	d64c6a2eee	Use zfs_dbgmsg to log metaslab_load/unload Metaslabs are now (usually) loaded and unloaded infrequently, but when that is not the case, it is useful to have a log of when and why these events happened. This commit enables the zfs_dbgmsg() in metaslab_load(), and adds a zfs_dbgmsg() in metaslab_unload(). Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10683	2020-08-12 10:10:50 -07:00
Matthew Macy	e111c80247	Restore ARC MFU/MRU pressure The arc_adapt() function tunes LRU/MLU balance according to 4 types of cache hits (which is passed as state agrument): ghost LRU, LRU, MRU, ghost MRU. If this function is called with wrong cache hit (state), adaptation will be sub-optimal and performance will suffer. Some time ago upstream received this commit: 6950 ARC should cache compressed data) in arc_read() do next sequence (access to ghost buffer) Before this commit, hit to any ghost list was passed arc_adapt() before call to arc_access() which revive element in cache and change state from ghost to real hit. After this commit, the order of calls was reverted and arc_adapt() is now called only with «real» hits even if hit was in one of two ghost lists, which renders ghost lists useless and breaks the ARC algorithm. FreeBSD fixed this problem locally in Change D19094 / Commit r348772. This change is an adaptation of the above commit to the current arc code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10548 Closes #10618	2020-08-12 10:03:24 -07:00
Allan Jude	9777044f1c	Fix typo Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #10694	2020-08-11 13:16:57 -07:00
Paul Dagnelie	12045d0278	Clarify error message when a range-tree double-add occurs In various other pieces of logic have resulted in situations where we double-free space in ZFS. This in turn results in a double-add to the range trees. These issues have been much more difficult to diagnose than they should have been, because the error handling around this case is much weaker than around the double remove case. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10654	2020-08-07 14:13:13 -07:00
Matthew Ahrens	d87676a9fa	Fix i/o error handling of livelists and zap iteration Pool-wide metadata is stored in the MOS (Meta Object Set). This metadata is stored in triplicate, in addition to any pool-level reduncancy (e.g. RAIDZ). However, if all 3+ copies of this metadata are not available, we can still get EIO/ECKSUM when reading from the MOS. If we encounter such an error in syncing context, we have typically already committed to making a change that we now can't do because of the corrupt/missing metadata. We typically "handle" this with a `VERIFY()` or `zfs_panic_recover()`. This prevents the system from continuing on in an undefined state, while minimizing the amount of error-handling code. However, there are some code paths that ignore these i/o errors, or `ASSERT()` that they don't happen. Since assertions are disabled on non-debug builds, they effectively ignore them as well. This can lead to ZFS continuing on in an incorrect state, potentially leading to on-disk inconsistencies. This commit adds handling for these i/o errors on MOS metadata, typically with a `VERIFY()`: * Handle error return from `zap_cursor_retrieve()` in 4 places in `dsl_deadlist.c`. * Handle error return from `zap_contains()` in `dsl_dir_hold_obj()`. Turns out this call isn't necessary because we can always call `zap_lookup()`. * Handle error return from `zap_lookup()` in `dsl_fs_ss_limit_check()`. * Handle error return from `zap_remove()` in `dsl_dir_rename_sync()`. * Handle error return from `zap_lookup()` in `dsl_dir_remove_livelist()`. * Handle error return from `dsl_process_sub_livelist()` in `spa_livelist_delete_cb()`. Additionally: * Augment the internal history log message for `zfs destroy` to note which method is used (e.g. bptree, livelist, or, synchronous) and the mintxg. * Correct a comment in `dbuf_init()`. * Correct indentation in `dsl_dir_remove_livelist()`. Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10643	2020-08-05 10:22:09 -07:00
Matthew Macy	22dcf89181	Add missed thread_exit() to vdev_{autotrim,rebuild}_thread Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10668	2020-08-05 10:17:07 -07:00
George Amanakis	da60484db5	Fix logging in l2arc_rebuild() In case the L2ARC rebuild was canceled, do not log to spa history log as the pool may be in the process of being removed and a panic may occur: BUG: kernel NULL pointer dereference, address: 0000000000000018 RIP: 0010:spa_history_log_internal+0xb1/0x120 [zfs] Call Trace: l2arc_rebuild+0x464/0x7c0 [zfs] l2arc_dev_rebuild_start+0x2d/0x130 [zfs] ? l2arc_rebuild+0x7c0/0x7c0 [zfs] thread_generic_wrapper+0x78/0xb0 [spl] kthread+0xfb/0x130 ? IS_ERR+0x10/0x10 [spl] ? kthread_park+0x90/0x90 ret_from_fork+0x35/0x40 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10659	2020-08-01 11:17:18 -07:00
Allan Jude	8fb79fdddb	Change the error handling for invalid property values ZFS recv should return a useful error message when an invalid index property value is provided in the send stream properties nvlist With a compression= property outside of the understood range: Before: ``` receiving full stream of zof/zstd_send@send2 into testpool/recv@send2 internal error: Invalid argument Aborted (core dumped) ``` Note: the recv completes successfully, the abort() is likely just to make it easier to track the unexpected error code. After: ``` receiving full stream of zof/zstd_send@send2 into testpool/recv@send2 cannot receive compression property on testpool/recv: invalid property value received 28.9M stream in 1 seconds (28.9M/sec) ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #10631	2020-08-01 08:41:31 -07:00
Matthew Macy	47ed79ff60	Changes to make openzfs build within FreeBSD buildworld A collection of header changes to enable FreeBSD to build with vendored OpenZFS. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10635	2020-07-31 21:30:31 -07:00
Matthew Ahrens	3442c2a02d	Revise ARC shrinker algorithm The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths: 1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path. 2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback. In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, it's measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages). Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts: 1. ZFS doesn't use RAM to cache data effectively. 2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC. 3. Even with the improvements made in `67c0f0dedc` ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`. To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of arc_is_overflowing() on ZFS read/write operations. With this commit: 1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. This limits the amount of time it takes to allocate a single page. 2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`). 3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC. 4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free` is increased. 5. Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the arc to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10600	2020-07-31 21:10:52 -07:00
Allan Jude	eabf270b2c	Remove duplicate include of sys/zfeature.h in dmu_objset.c Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #10636	2020-07-31 09:04:45 -07:00
Matthew Ahrens	948423a3d1	zfs promote does not delete livelist of origin When a clone is promoted, its livelist is no longer accurate, so it is discarded. If the clone's origin is also a clone (i.e. we are promoting a clone of a clone), then the origin's livelist is also no longer accurate, so it should be discarded, but the code doesn't actually do that. Consider a pool with: * Filesystem A * Clone B, a clone of A * Clone C, a clone of B If we promote C, it discards C's livelist. It should discard B's livelist, but that is not happening. The impact is that when B is destroyed, we use the livelist to find the blocks to free, but the livelist is no longer correct so we end up freeing blocks that are still in use by C. The incorrectly-freed blocks can be reallocated causing checksum errors. And when C is destroyed it can double-free the incorrectly-freed blocks. The problem is that we remove the livelist of `origin_ds->ds_dir`, but the origin snapshot has already been moved to the promoted dsl_dir. So this is actually trying to remove the livelist of the promoted dsl_dir, which was already removed. As explained in a comment in the beginning of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the origin's dsl_dir. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10652	2020-07-31 08:59:00 -07:00
Matthew Ahrens	3a92552f75	Fix error handling of vdev_top_zap In `vdev_load()`, we look up several entries in the `vdev_top_zap` object. In most cases, if we encounter an i/o error, it will be returned to the caller. However, when handling `VDEV_TOP_ZAP_ALLOCATION_BIAS`, if we get an i/o error, we may continue on, which in theory could cause us to not realize that a vdev should be used only for `special` allocations. In practice, if we encountered an i/o error while looking for `VDEV_TOP_ZAP_ALLOCATION_BIAS` in the `vdev_top_zap`, we'd also get an i/o error while looking for other entries in the same object, and thus the zpool open/import would fail. Therefore the impact of this problem is negligible. This commit adds error handling for i/o errors while accessing the `vdev_top_zap`, so that we aren't relying on unrelated code to fail for us. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10637	2020-07-29 17:04:34 -07:00
Matthew Macy	27d96d2254	Rename refcount.h to zfs_refcount.h Renamed to avoid conflicting with refcount.h when a different implementation is already provided by the platform. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10620	2020-07-29 16:35:33 -07:00
Serapheim Dimitropoulos	843e9ca2e1	Introduce names for ZTHRs When debugging issues or generally analyzing the runtime of a system it would be nice to be able to tell the different ZTHRs running by name rather than having to analyze their stack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #10630	2020-07-29 09:43:33 -07:00
Matthew Macy	5678d3f593	Prefix zfs internal endian checks with _ZFS FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN LITTLE_ENDIAN on every architecture. Trying to do cross builds whilst hiding this from ZFS has proven extremely cumbersome. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10621	2020-07-28 13:02:49 -07:00
Matthew Macy	e64cc4954c	Refactor ccompile.h to not include system headers This is a step toward being able to vendor the OpenZFS code in FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10625	2020-07-25 20:09:50 -07:00
Matthew Macy	6d8da84106	Make use of ZFS_DEBUG consistent within kmod sources Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10623	2020-07-25 20:07:44 -07:00
Matthew Macy	f5b189f937	FreeBSD: Fixes required to build ZFS on PowerPC Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10622	2020-07-25 11:00:23 -07:00
Brian Atkinson	6fba7bfd0e	Add gang ABD child to parent gang ABD By design a gang ABD can not have another gang ABD as a child. This is to make sure the logical offset in a gang ABD is consistent with the individual ABDS it contains as children. If a gang ABD is added as a child of a gang ABD we will add the individual children of the gang ABD to the parent gang ABD. This allows for a consistent view of offsets within the parent gang ABD. Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10430	2020-07-24 21:09:20 -07:00
Ryan Moeller	8348fac30c	Limit dbuf cache sizes based only on ARC target size by default Set the initial max sizes to ULONG_MAX to allow the caches to grow with the ARC. Recalculate the metadata cache size on demand so it can adapt, too. Update descriptions in zfs-module-parameters(5). Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10563 Closes #10610	2020-07-24 20:38:48 -07:00
Matthew Ahrens	5dd92909c6	Adjust ARC terminology The process of evicting data from the ARC is referred to as `arc_adjust`. This commit changes the term to `arc_evict`, which is more specific. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10592	2020-07-22 09:51:47 -07:00
Matthew Ahrens	026e529cb3	Remove skc_reclaim, hdr_recl, kmem_cache shrinker The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`, whereby individual caches can register a callback to be invoked when there is memory pressure. This mechanism is used in only one place: the ARC registers the `hdr_recl()` reclaim function. This function wakes up the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and `arc_reduce_target_size()`. The `skc_reclaim` callbacks are invoked only by shrinker callbacks and `arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When called from shrinker callbacks, we are already aware of memory pressure and responding to it. Therefore there is little benefit to ever calling the `hdr_recl()` `skc_reclaim` callback. The `arc_reap_zthr` also wakes once a second, and if memory is low when allocating an ARC buffer. Therefore, additionally waking it from the shrinker calbacks has little benefit. The shrinker callbacks can be invoked very frequently, e.g. 10,000 times per second. Additionally, for invocation of the shrinker callback, skc_reclaim is invoked many times. Therefore, this mechanism consumes significant amounts of CPU time. The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which, in addition to invoking `skc_reclaim()`, does two things to attempt to free pages for use by the system: 1. Return free objects from the magazine layer to the slab layer 2. Return entirely-free slabs to the page layer (i.e. free pages) These actions apply only to caches implemented by the SPL, not those that use the underlying kernel SLAB/SLUB caches. The SPL caches are used for objects >=32KB, which are primarily linear ABD's cached in the DBUF cache. These actions (freeing objects from the magazine layer and returning entirely-free slabs) are also taken whenever a `kmem_cache_free()` call finds a full magazine. So there would typically be zero entirely-free slabs, and the number of objects in magazines is limited (typically no more than 64 objects per magazine, and there's one magazine per CPU). Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is modest. We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when memory pressure is detected. Therefore, calling `spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed. This commit removes the `skc_reclaim` mechanism, its only callback `hdr_recl()`, and the kmem_cache shrinker callback. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10576	2020-07-19 09:58:30 -07:00
Matthew Ahrens	6774931dfa	Extend zdb to print inconsistencies in livelists and metaslabs Livelists and spacemaps are data structures that are logs of allocations and frees. Livelists entries are block pointers (blkptr_t). Spacemaps entries are ranges of numbers, most often used as to track allocated/freed regions of metaslabs/vdevs. These data structures can become self-inconsistent, for example if a block or range can be "double allocated" (two allocation records without an intervening free) or "double freed" (two free records without an intervening allocation). ZDB (as well as zfs running in the kernel) can detect these inconsistencies when loading livelists and metaslab. However, it generally halts processing when the error is detected. When analyzing an on-disk problem, we often want to know the entire set of inconsistencies, which is not possible with the current behavior. This commit adds a new flag, `zdb -y`, which analyzes the livelist and metaslab data structures and displays all of their inconsistencies. Note that this is different from the leak detection performed by `zdb -b`, which checks for inconsistencies between the spacemaps and the tree of block pointers, but assumes the spacemaps are self-consistent. The specific checks added are: Verify livelists by iterating through each sublivelists and: - report leftover FREEs - report double ALLOCs and double FREEs - record leftover ALLOCs together with their TXG [see Cross Check] Verify spacemaps by iterating over each metaslab and: - iterate over spacemap and then the metaslab's entries in the spacemap log, then report any double FREEs and double ALLOCs Verify that livelists are consistenet with spacemaps. The space referenced by livelists (after using the FREE's to cancel out corresponding ALLOCs) should be allocated, according to the spacemaps. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-66031 Closes #10515	2020-07-14 17:51:05 -07:00
Alexander Motin	1743c737f5	Fix LOR between dp_config_rwlock and spa_props_lock Our QE team during automated API testing hit deadlock in ZFS, caused by lock order reversal. From one side dsl_sync_task_sync() locks dp_config_rwlock as writer and calls spa_sync_props(), which waits for spa_props_lock. From another spa_prop_get() locks spa_props_lock and then calls dsl_pool_config_enter(), trying to lock dp_config_rwlock as reader. This patch makes spa_prop_get() lock dp_config_rwlock before spa_props_lock, making the order consistent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #10553	2020-07-14 12:21:57 -07:00
Brian Atkinson	e4d3d77684	Fixing gang ABD child removal race condition On linux the list debug code has been setting off a failure when checking that the node->next->prev value is pointing back at the node. At times this check evaluates to 0xdead. When removing a child from a gang ABD we must acquire the child's abd_mtx to make sure that the same ABD is not being added to another gang ABD while it is being removed from a gang ABD. This fixes a race condition when checking if an ABDs link is already active and part of another gang ABD before adding it to a gang. Added additional debug code for the gang ABD in abd_verify() to make sure each child ABD has active links. Also check to make sure another gang ABD is not added to a gang ABD. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10511	2020-07-14 11:04:35 -07:00
Matthew Ahrens	e59a377a8f	filesystem_limit/snapshot_limit is incorrectly enforced against root The filesystem_limit and snapshot_limit properties limit the number of filesystems or snapshots that can be created below this dataset. According to the manpage, "The limit is not enforced if the user is allowed to change the limit." Two types of users are allowed to change the limit: 1. Those that have been delegated the `filesystem_limit` or `snapshot_limit` permission, e.g. with `zfs allow USER filesystem_limit DATASET`. This works properly. 2. A user with elevated system privileges (e.g. root). This does not work - the root user will incorrectly get an error when trying to create a snapshot/filesystem, if it exceeds the `_limit` property. The problem is that `priv_policy_ns()` does not work if the `cred_t` is not that of the current process. This happens when `dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a sync task's check func) to determine the permissions of the corresponding user process. This commit fixes the issue by passing the `task_struct` (typedef'ed as a `proc_t`) to syncing context, and then using `has_capability()` to determine if that process is privileged. Note that we still need to pass the `cred_t` to syncing context so that we can check if the user was delegated this permission with `zfs allow`. This problem only impacts Linux. Wrappers are added to FreeBSD but it continues to use `priv_check_cred()`, which works on arbitrary `cred_t`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8226 Closes #10545	2020-07-11 17:18:02 -07:00
George Amanakis	2054f35e56	Fix a persistent L2ARC bug in l2arc_write_done() In case l2arc_write_done() handles a zio that was not successful check that the list of log block pointers is not empty when restoring them in the device header. Otherwise zero them out. In any case perform the actual write updating the device header after the zio of l2arc_write_buffers() completes as l2arc_write_done() may have touched the memory holding the log block pointers in the device header. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10540 Closes #10543	2020-07-10 14:10:03 -07:00
Mark Johnston	6e00561712	Add a "try" operation for range locks zfs_rangelock_tryenter() bails immediately instead of waiting for the lock to become available. This will be used to resolve a deadlock in the FreeBSD page-in code. No functional change intended. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #10519	2020-07-06 11:53:31 -07:00
Brian Behlendorf	9a49d3f3d3	Add device rebuild feature The device_rebuild feature enables sequential reconstruction when resilvering. Mirror vdevs can be rebuilt in LBA order which may more quickly restore redundancy depending on the pools average block size, overall fragmentation and the performance characteristics of the devices. However, block checksums cannot be verified as part of the rebuild thus a scrub is automatically started after the sequential resilver completes. The new '-s' option has been added to the `zpool attach` and `zpool replace` command to request sequential reconstruction instead of healing reconstruction when resilvering. zpool attach -s <pool> <existing vdev> <new vdev> zpool replace -s <pool> <old vdev> <new vdev> The `zpool status` output has been updated to report the progress of sequential resilvering in the same way as healing resilvering. The one notable difference is that multiple sequential resilvers may be in progress as long as they're operating on different top-level vdevs. The `zpool wait -t resilver` command was extended to wait on sequential resilvers. From this perspective they are no different than healing resilvers. Sequential resilvers cannot be supported for RAIDZ, but are compatible with the dRAID feature being developed. As part of this change the resilver_restart_* tests were moved in to the functional/replacement directory. Additionally, the replacement tests were renamed and extended to verify both resilvering and rebuilding. Original-patch-by: Isaac Huang <he.huang@intel.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: John Poduska <jpoduska@datto.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10349	2020-07-03 11:05:50 -07:00
Matthew Macy	7ddb753d17	freebsd: changes necessary to coexist with dtrace in tree Fix header conflicts when building zfs with openzfs as a vendor import. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10497	2020-07-01 09:10:08 -07:00
Matthew Ahrens	3c42c9ed84	Clean up OS-specific ARC and kmem code OS-specific code (e.g. under `module/os/linux`) does not need to share its code structure with any other operating systems. In particular, the ARC and kmem code need not be similar to the code in illumos, because we won't be syncing this OS-specific code between operating systems. For example, if/when illumos support is added to the common repo, we would add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of this code. Therefore, we can simplify the code in the OS-specific ARC and kmem routines. These changes do not impact system behavior, they are purely code cleanup. The changes are: Arenas are not used on Linux or FreeBSD (they are always `NULL`), so `heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along with code that uses them. In `arc_available_memory()`: * `desfree` is unused, remove it * rename `freemem` to avoid conflict with pre-existing `#define` * remove checks related to arenas * use units of bytes, rather than converting from bytes to pages and then back to bytes `SPL_KMEM_CACHE_REAP` is unused, remove it. `skc_reap` is unused, remove it. The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove it. `vmem_size()` and associated type and macros are unused, remove them. In `arc_memory_throttle()`, use a less confusing variable name to store the result of `arc_free_memory()`. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10499	2020-06-29 09:01:07 -07:00
Matthew Ahrens	67c0f0dedc	ARC shrinking blocks reads/writes ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to allow the ARC to shrink when the kernel experiences memory pressure. The ARC shrinker changes `arc_c` via a call to `arc_reduce_target_size()`. Before commit `3ec34e5527`, the ARC shrinker would also evict data from the ARC to bring `arc_size` down to the new `arc_c`. However, that commit (seemingly inadvertently) made it so that the ARC shrinker no longer evicts any data or waits for eviction to complete. Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often all the way to `arc_c_min`. Since it doesn't wait for the actual eviction of data from the ARC, this creates a situation where `arc_size` is more than `arc_c` for the several seconds/minutes it takes for `arc_adjust_zthr` to evict data from the ARC. During this time, arc_get_data_impl() will block, so ZFS can't process read/write requests (e.g. from iSCSI, NFS, or read/write syscalls). To ensure that `arc_c` doesn't shrink faster than the adjust thread can keep up, this commit makes the ARC shrinker wait for the eviction to complete, resulting in similar behavior to what we had before commit `3ec34e5527`. Note: commit `3ec34e5527` is `OpenZFS 9284 - arc_reclaim_thread has 2 jobs` and was integrated in December 2018, and is part of ZoL 0.8.x but not 0.7.x. Additionally, when the ARC size is reduced drastically, the `arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any threads that are bound to the same CPU that arc_adjust_zthr is running on will not able to run for a long time. To ensure that CPU-bound threads can make progress, this commit changes `arc_evict_state_impl()` make a voluntary preemption call, `cond_resched()`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-70703 Closes #10496	2020-06-26 10:42:27 -07:00
Ryan Moeller	9192f27c1d	Add zfs_multihost_interval tunable handler for FreeBSD This tunable required a handler to be implemented for ZFS_MODULE_PARAM_CALL. Add the handler so the tunable can be declared in common code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10490	2020-06-23 13:32:42 -07:00
Arvind Sankar	0ce2de637b	Add prototypes Add prototypes/move prototypes to header files. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:32 -07:00
Arvind Sankar	60356b1a21	Add include files for prototypes Include the header with prototypes in the file that provides definitions as well, to catch any mismatch between prototype and definition. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:25 -07:00
Arvind Sankar	c3fe42aabd	Remove dead code Delete unused functions. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:18 -07:00
Arvind Sankar	65c7cc49bf	Mark functions as static Mark functions used only in the same translation unit as static. This only includes functions that do not have a prototype in a header file either. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:20:38 -07:00
Matthew Macy	8056a75672	Disambiguate condvar API contract On Illumos callers of cv_timedwait and cv_timedwait_hires can't distinguish between whether or not the cv was signaled or the call timed out. Illumos handles this (for some definition of handles) by calling cv_signal in the return path if we were signaled but the return value indicates instead that we timed out. This would make sense if it were possible to query the the cv for its net signal disposition. However, this isn't possible and, in spite of the fact that there are places in the code that clearly take a different and incompatible path if a timeout value is indicated, this distinction appears to be rather subtle to most developers. This problem is further compounded by the fact that on Linux, calling cv_signal in the return path wouldn't even do the right thing unless there are other waiters. Since it is possible for the caller to independently determine how much time is remaining but it is not possible to query if the cv was in fact signaled, prioritizing signalling over timeout seems like a cleaner solution. In addition, judging from usage patterns within the code itself, it is also less error prone. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10471	2020-06-18 10:17:50 -07:00
Matthew Macy	7564073ed6	Add abd_cache_reap_now for abd_chunk_cache users Apparently missed in the initial port integration was the need to reap the abd_chunk_cache on FreeBSD. This change addresses that oversight. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10474	2020-06-17 21:44:13 -07:00
Jorgen Lundman	4458157bee	zfs_ioctl: saved_poolname can be truncated As it uses kmem_strdup() and kmem_strfree() which both rely on strlen() being the same, but saved_poolname can be truncated causing: SPL: kernel memory allocator: buffer freed to wrong cache SPL: buffer was allocated from kmem_alloc_16, SPL: caller attempting free to kmem_alloc_8. SPL: buffer=0xffffff90acc66a38 bufctl=0x0 cache: kmem_alloc_8 Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10469	2020-06-17 14:30:03 -07:00
Alexander Motin	17ca30185a	Set initial arc_c to arc_c_min instead of arc_c_max For at least 15 years since OpenSolaris arc_c was set by default to arc_c_max, later decreased under memory pressure. I've noticed that if arc_c was set high enough to cause memory pressure as considered by ZFS, setting of arc_no_grow to TRUE in arc_reap_cb_check() makes no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms) return. All that time ZFS can continue increasing its effective ARC size, causing more memory pressure, potentially up to the point when OS low memory handler activates and reduces arc_c, requesting fast reclamation of just allocated memory. The problem seems to be more serious on FreeBSD and I guess Linux, since neither of them implement/use asynchronous kmem reclamation, so arc_kmem_reap_soon() can take more time. On older FreeBSD 11 not supporting multiple memory domains system with lots of RAM can get completely unresponsive for minutes due to heavy lock congestion between ARC reclamation and page daemon kmem reclamation threads. With this change to more conservative arc_c value ARC stops growing just it time and does not need later reclamation. Also while there, since now growing arc_c is a more often situation, use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt() to reduce lock congestion. It is also getting in sync with code in arc_get_data_impl(). Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #10437	2020-06-17 14:27:04 -07:00
Jorgen Lundman	883a40fff4	Add convenience wrappers for common uio usage The macOS uio struct is opaque and the API must be used, this makes the smallest changes to the code for all platforms. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10412	2020-06-14 10:09:55 -07:00
Jorgen Lundman	4f73576ea1	Upstream: zil_commit_waiter() can stall forever On macOS clock_t is unsigned, so when cv_timedwait_hires() returns -1 we loop forever. The conditional was tweaked to ignore signedness. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10445	2020-06-14 10:08:21 -07:00
Arvind Sankar	71504277ae	Cleanup linux module kbuild files The linux module can be built either as an external module, or compiled into the kernel, using copy-builtin. The source and build directories are slightly different between the two cases, and currently, compiling into the kernel still refers to some files from the configured ZFS source tree, instead of the copies inside the kernel source tree. There is also duplication between copy-builtin, which creates a Kbuild file to build ZFS inside the kernel tree, and the top-level module/Makefile.in. Fix this by moving the list of modules and the CFLAGS settings into a new module/Kbuild.in, which will be used by the kernel kbuild infrastructure, and using KBUILD_EXTMOD to distinguish the two cases within the Makefiles, in order to choose appropriate include directories etc. Module CFLAGS setting is simplified by using subdir-ccflags-y (available since 2.6.30) to set them in the top-level Kbuild instead of each individual module. The disabling of -Wunused-but-set-variable is removed from the lua and zfs modules. The variable that the Makefile uses is actually not defined, so this has no effect; and the warning has long been disabled by the kernel Makefile itself. The target_cpu definition in module/{zfs,zcommon} is removed as it was replaced by use of CONFIG_SPARC64 in commit `70835c5b75` ("Unify target_cpu handling") os/linux/{spl,zfs} are removed from obj-m, as they are not modules in themselves, but are included by the Makefile in the spl and zfs module directories. The vestigial Makefiles in os and os/linux are removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10379 Closes #10421	2020-06-10 09:24:15 -07:00
Andrea Gelmini	dd4bc569b9	Fix typos Correct various typos in the comments and tests. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #10423	2020-06-09 21:24:09 -07:00
Matthew Ahrens	7bcb7f0840	File incorrectly zeroed when receiving incremental stream that toggles -L Background: By increasing the recordsize property above the default of 128KB, a filesystem may have "large" blocks. By default, a send stream of such a filesystem does not contain large WRITE records, instead it decreases objects' block sizes to 128KB and splits the large blocks into 128KB blocks, allowing the large-block filesystem to be received by a system that does not support the `large_blocks` feature. A send stream generated by `zfs send -L` (or `--large-block`) preserves the large block size on the receiving system, by using large WRITE records. When receiving an incremental send stream for a filesystem with large blocks, if the send stream's -L flag was toggled, a bug is encountered in which the file's contents are incorrectly zeroed out. The contents of any blocks that were not modified by this send stream will be lost. "Toggled" means that the previous send used `-L`, but this incremental does not use `-L` (-L to no-L); or that the previous send did not use `-L`, but this incremental does use `-L` (no-L to -L). Changes: This commit addresses the problem with several changes to the semantics of zfs send/receive: 1. "-L to no-L" incrementals are rejected. If the previous send used `-L`, but this incremental does not use `-L`, the `zfs receive` will fail with this error message: incremental send stream requires -L (--large-block), to match previous receive. 2. "no-L to -L" incrementals are handled correctly, preserving the smaller (128KB) block size of any already-received files that used large blocks on the sending system but were split by `zfs send` without the `-L` flag. 3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`. This feature indicates that we can correctly handle "no-L to -L" incrementals. This flag is currently not set on any send streams. In the future, we intend for incremental send streams of snapshots that have large blocks to use `-L` by default, and these streams will also have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams from the default use of `zfs send` won't encounter the bug mentioned above, because they can't be received by software with the bug. Implementation notes: To facilitate accessing the ZPL's generation number, `zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and restructured to fill in a struct with ZPL-specific info including owner and generation. In the "no-L to -L" case, if this is a compressed send stream (from `zfs send -cL`), large WRITE records that are being written to small (128KB) blocksize files need to be decompressed so that they can be written split up into multiple blocks. The zio pipeline will recompress each smaller block individually. A new test case, `send-L_toggle`, is added, which tests the "no-L to -L" case and verifies that we get an error for the "-L to no-L" case. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #6224 Closes #10383	2020-06-09 10:41:01 -07:00
George Amanakis	b7654bd794	Trim L2ARC The l2arc_evict() function is responsible for evicting buffers which reference the next bytes of the L2ARC device to be overwritten. Teach this function to additionally TRIM that vdev space before it is overwritten if the device has been filled with data. This is done by vdev_trim_simple() which trims by issuing a new type of TRIM, TRIM_TYPE_SIMPLE. We also implement a "Trim Ahead" feature. It is a zfs module parameter, expressed in % of the current write size. This trims ahead of the current write size. A minimum of 64MB will be trimmed. The default is 0 which disables TRIM on L2ARC as it can put significant stress to underlying storage devices. To enable TRIM on L2ARC we set l2arc_trim_ahead > 0. We also implement TRIM of the whole cache device upon addition to a pool, pool creation or when the header of the device is invalid upon importing a pool or onlining a cache device. This is dependent on l2arc_trim_ahead > 0. TRIM of the whole device is done with TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t. We save the TRIM state for the whole device and the time of completion on-disk in the header, and restore these upon L2ARC rebuild so that zpool status -t can correctly report them. Whole device TRIM is done asynchronously so that the user can export of the pool or remove the cache device while it is trimming (ie if it is too slow). We do not TRIM the whole device if persistent L2ARC has been disabled by l2arc_rebuild_enabled = 0 because we may not want to lose all cached buffers (eg we may want to import the pool with l2arc_rebuild_enabled = 0 only once because of memory pressure). If persistent L2ARC has been disabled by setting the module parameter l2arc_rebuild_blocks_min_l2size to a value greater than the size of the cache device then the whole device is trimmed upon creation or import of a pool if l2arc_trim_ahead > 0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam D. Moss <c@yotes.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9713 Closes #9789 Closes #10224	2020-06-09 10:15:08 -07:00
Pawel Jakub Dawidek	529246df96	Restore support for in-kernel ZFS ioctls In Illumos it is possible to call ioctl functions from within the kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support that, but it doesn't hurt to keep it around, as all the code is there. Before this commit it was a dead code and zc_iflags was always zero. Restore this functionality by allowing to pass a flag to the zfsdev_ioctl_common() function. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #10417	2020-06-08 13:57:22 -07:00
Jorgen Lundman	c9e319faae	Replace sprintf()->snprintf() and strcpy()->strlcpy() The strcpy() and sprintf() functions are deprecated on some platforms. Care is needed to ensure correct size is used. If some platforms miss snprintf, we can add a #define to sprintf, likewise strlcpy(). The biggest change is adding a size parameter to zfs_id_to_fuidstr(). The various *_impl_get() functions are only used on linux and have not yet been updated. Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10400	2020-06-07 11:42:12 -07:00
Paul Dagnelie	99b281f1ae	Fix double mutex_init bug in send code It was possible to cause a kernel panic in the send code by initializing an already-initialized mutex, if a record was created with type DATA, destroyed with a different type (bypassing the mutex_destroy call) and then re-allocated as a DATA record again. We tweak the logic to not change the type of a record once it has been created, avoiding the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10374	2020-06-03 19:53:21 -07:00
Ryan Moeller	a9dcfac51c	Periodically update ARC kstats FreeBSD needs arc_adjust_zthr to run periodically for kstats to be updated. A comment in the code suggests this may have been the original intent in illumos as well: `c946d5a913/module/zfs/arc.c (L4697-L4700)` Create the thread with a 1 second timer. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10371	2020-06-03 09:52:38 -07:00
Jorgen Lundman	70a5fc0530	Memory leak in dsl_destroy_snapshots_nvl error case The dsl_destroy_snapshots_nvl() function has an early error out, and temporary nvlists were not freed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10366	2020-05-26 16:13:41 -07:00
Brian Atkinson	fb822260b1	Gang ABD Type Adding the gang ABD type, which allows for linear and scatter ABDs to be chained together into a single ABD. This can be used to avoid doing memory copies to/from ABDs. An example of this can be found in vdev_queue.c in the vdev_queue_aggregate() function. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian <bwa@clemson.edu> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10069	2020-05-20 18:06:09 -07:00
DeHackEd	57434abae6	Use boot_ncpus in place of max_ncpus in taskq_create Due to hotplug support or BIOS bugs sometimes max_ncpus can be an absurdly high value. I have a system with 32 cores/threads but reports max_ncpus == 440. This many threads potentially cripples the system during arc_prune floods for example. boot_ncpus is the number of working CPUs when called so use that instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #10282	2020-05-20 10:07:21 -07:00
Matthew Ahrens	1b9cd1a9d9	Fix error handling in receive_writer_thread() If `receive_writer_thread()` gets an error from `receive_process_record()`, it should be saved in `rwa->err` so that we will stop processing records, and the main thread will notice that the receive has failed. When an error is first encountered, this happens correctly. However, if there are more records to dequeue, the next time through the loop we will reset `rwa->err` to zero, allowing us to try to process the following record (2 after the failed record). Depending on what types of records remain, we may incorrectly complete the receive "successfully", but without actually having processed all the records. The fix is to only set `rwa->err` if we got a non-zero error. This bug was introduced by #10099 "Improve zfs receive performance by batching writes". Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10320	2020-05-14 20:48:29 -07:00
Brian Behlendorf	2ade659eb4	Fix abd_enter/exit_critical wrappers Commit `fc551d7` introduced the wrappers abd_enter_critical() and abd_exit_critical() to mark critical sections. On Linux these are implemented with the local_irq_save() and local_irq_restore() macros which set the 'flags' argument when saving. By wrapping them with a function the local variable is no longer set by the macro and is no longer properly restored. Convert abd_enter_critical() and abd_exit_critical() to macros to resolve this issue and ensure the flags are properly restored. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10332	2020-05-14 20:45:16 -07:00
Jorgen Lundman	eeb8fae9c7	Upstream: add missing thread_exit() Undo FreeBSD wrapper for thread_create() added to call thread_exit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10314	2020-05-14 15:58:09 -07:00
Matthew Ahrens	8b240f14f9	remove unneeded member drc_err of dmu_recv_cookie_t The member drc_err of dmu_recv_cookie_t is used only locally in receive_read, so we can replace it with a local variable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10319	2020-05-14 12:10:29 -07:00
John Poduska	41035a0496	Resilver restarts unnecessarily when it encounters errors When a resilver finishes, vdev_dtl_reassess is called to hopefully excise DTL_MISSING (amongst other things). If there are errors during the resilver, they are tracked in DTL_SCRUB, as spelled out in the block comment in vdev.c. DTL_SCRUB is in-core only, so it can only be used if the pool was online for the whole resilver. This state is tracked with the spa_scrub_started flag, which only gets set when the scan is initialized. Unfortunately, this flag gets cleared right before vdev_dtl_reassess gets called, so if there are any errors during the scan, DTL_MISSING will never get excised and the resilver will just continually restart. This fix simply moves clearing that flag until after the call to vdev_dtl_reasses. In addition, if a pool is imported and already has scn_errors > 0, this change will restart the resilver immediately instead of doing the rest of the scan and then restarting it from the beginning. On the other hand, if scn_errors == 0 at import, then no errors have been encountered so far, so the spa_scrub_started flag can be safely set. A test has been added to verify that resilver does not restart when relevant DTL's are available. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #10291	2020-05-13 10:54:27 -07:00
Brian Atkinson	fc551d7efb	Combine OS-independent ABD Code into Common Source File Reorganizing ABD code base so OS-independent ABD code has been placed into a common abd.c file. OS-dependent ABD code has been left in each OS's ABD source files, and these source files have been renamed to abd_os. The OS-independent ABD code is now under: module/zfs/abd.c With the OS-dependent code in: module/os/linux/zfs/abd_os.c module/os/freebsd/zfs/abd_os.c Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10293	2020-05-10 12:23:52 -07:00
George Amanakis	657fd33bcf	Improvements on persistent L2ARC Functional changes: We implement refcounts of log blocks and their aligned size on the cache device along with two corresponding arcstats. The refcounts are reflected in the header of the device and provide valuable information as to whether log blocks are accounted for correctly. These are dynamically adjusted as log blocks are committed/evicted. zdb also uses this information in the device header and compares it to the corresponding values as reported by dump_l2arc_log_blocks() which emulates l2arc_rebuild(). If the refcounts saved in the device header report higher values, zdb exits with an error. For this feature to work correctly there should be no active writes on the device. This is also employed in the tests of persistent L2ARC. We extend the structure of the cache device header by adding the two new variables mirroring the refcounts after the existing variables to preserve backward compatibility in terms of persistent L2ARC. 1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which reflect the total aligned size of log blocks on the device. This is also reflected in the header of the cache device as "dh_lb_asize". 2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count" which reflect the total number of L2ARC log blocks present on cache devices. It is also reflected in the header of the cache device as "dh_lb_count". In l2arc_rebuild_vdev() if the amount of committed log entries in a log block is 0 and the device header is valid we update the device header. This will facilitate trimming of the whole device in this case when TRIM for L2ARC is implemented. Improve loop protection in l2arc_rebuild() by using the starting offset of the payload of each log block instead of the starting offset of the log block. If the zio in l2arc_write_buffers() fails, restore the lbps array in the header of the device to its previous state in l2arc_write_done(). If l2arc_rebuild() ends the rebuild process without restoring any L2ARC log blocks in ARC and without any other error, this means that the lbps array in the header is pointing to non-existent or invalid log blocks. Reset the device header in this case. In l2arc_rebuild() change the zfs_dbgmsg messages to spa_history_log_internal() making them user visible with zpool history command. Non-functional changes: Make the first test in persistent L2ARC use `zdb -lll` to increase coverage in `zdb.c`. Rename psize with asize when referring to log blocks, since L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also rename dh_log_blk_entries to dh_log_entries to make it clear that it is a mirror of l2ad_log_entries. Added comments for both changes. Fix inaccurate comments for example in l2arc_log_blk_restore(). Add asserts at the end in l2arc_evict() and l2arc_write_buffers(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10228	2020-05-07 16:34:03 -07:00

... 2 3 4 5 6 ...

2607 Commits