mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Paul Dagnelie	4aa3b3bd47	Always track temporary fses and snapshots for accounting The root cause of the issue is that we only occasionally do as the comments in the code suggest and actually ignore the %recv dataset when it comes to filesystem limit tracking. Specifically, the only time we ignore it is when initializing the filesystem and snapshot limit values; when creating a new %recv dataset or deleting one, we always update the bookkeeping. This causes a problem if you init the fs count on a filesystem that already has a %recv dataset, since the bookkeeping will be decremented but not incremented. This is resolved in this patch by simply always tracking the %recv dataset as a child. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10791	2020-08-26 21:38:27 -07:00
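The accounting mismatch described in the commit above can be illustrated with a small model. This is only a sketch under stated assumptions: the helper names `init_fs_count` and `destroy_child` are hypothetical and do not correspond to functions in the ZFS source; they just mimic "initialization may skip %recv, deletion always decrements".

```python
def init_fs_count(children, skip_temporary):
    """Count child filesystems; optionally ignore temporary '%'-prefixed datasets."""
    return sum(1 for name in children
               if not (skip_temporary and name.startswith("%")))


def destroy_child(children, count, name):
    """Remove a child and always decrement the count (as the delete path does)."""
    remaining = [c for c in children if c != name]
    return remaining, count - 1


children = ["fs1", "fs2", "%recv"]

# Behaviour before the patch: %recv is skipped at init but decremented on delete.
old_count = init_fs_count(children, skip_temporary=True)
old_remaining, old_count = destroy_child(children, old_count, "%recv")

# Behaviour after the patch: %recv is always tracked as a child.
new_count = init_fs_count(children, skip_temporary=False)
new_remaining, new_count = destroy_child(children, new_count, "%recv")

print(old_count, len(old_remaining))  # count is one lower than the real number
print(new_count, len(new_remaining))  # count matches the real number
```

In the first case the count drifts below the number of filesystems that actually remain; counting %recv consistently keeps the two in step.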
Matthew Macy	2dbad44710	FreeBSD: disable neon usage The neon support code does not build on FreeBSD, ifdef out references to fix linker issues on arm64. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10809	2020-08-26 09:54:37 -07:00
Alexander Motin	523e1295fe	Introduce limit on size of L2ARC headers Since L2ARC buffers are not evicted on memory pressure, too many headers on a system with an irrationally large L2ARC can render it slow or even unusable. This change limits L2ARC writes and rebuild if unevictable L2ARC-only headers reach a dangerous level. While there, call arc_adapt() on L2ARC rebuild, so that it could properly grow arc_c, reflecting potentially significant ARC size increase and avoiding slow growth with hopeless eviction attempts later when "overflow" is detected. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #10765	2020-08-25 14:33:36 -07:00
Brian Behlendorf	94dac3e880	Export dmu_offset_next() symbol Export the dmu_offset_next() symbol for use by Lustre. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10796	2020-08-25 08:34:41 -07:00
Sebastian Gottschall	184df27eef	Avoid symbol collision with in-kernel zstdlib For Linux, when zfs is compiled as an in kernel static variant and the in kernel zstd library is compiled statically into the kernel a symbol collision will occur. This wrapper header renames all of the relevant zstd functions to avoid this problem. Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Closes #10775	2020-08-24 12:20:41 -07:00
Ryan Moeller	6fe3498ca3	Import vdev ashift optimization from FreeBSD Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices. Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks: 1. Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased. This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device. 2. The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit. Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal. 
Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Matthew Macy <mmacy@freebsd.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10619	2020-08-21 12:53:17 -07:00
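The logical/physical distinction in the commit above can be sketched as two separate checks. This is an illustrative model only: the function names are hypothetical, and the rule "a device is usable if its logical ashift does not exceed the pool's ashift" is a simplified reading of the commit message, not the actual vdev code.

```python
def ashift_for(sector_size):
    """Return log2 of a power-of-two sector size given in bytes."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_size.bit_length() - 1


def can_use_device(pool_ashift, logical_sector):
    """A device is usable if the pool's allocation size is addressable on it."""
    return ashift_for(logical_sector) <= pool_ashift


def is_suboptimal(pool_ashift, physical_sector):
    """The pool allocates in units smaller than the device's physical sector."""
    return pool_ashift < ashift_for(physical_sector)


# A 512e drive: 512-byte logical sectors, 4K physical sectors.
print(can_use_device(9, 512), is_suboptimal(9, 4096))    # usable, but sub-optimal
print(can_use_device(12, 512), is_suboptimal(12, 4096))  # usable and optimal
# A drive with 4K logical sectors cannot serve an ashift=9 pool.
print(can_use_device(9, 4096))
```

Keeping the two sizes apart is what lets an existing ashift=9 pool accept an identical 512e replacement while still being flagged as sub-optimal.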
Matthew Ahrens	3dc18995bd	Fix indentation in dnode_free_range() Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10744	2020-08-20 11:45:20 -07:00
Matthew Macy	1c2725a157	FreeBSD: 11.x arc_stats compatibility Removing other_size from arc_stats breaks top in 11.x jails running on HEAD. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10745	2020-08-20 10:55:02 -07:00
Michael Niewöhner	10b3c7f5e4	Add zstd support to zfs This PR adds two new compression types, based on ZStandard: - zstd: A basic ZStandard compression algorithm. Available compression levels for zstd are zstd-1 through zstd-19, where the compression increases with every level, but speed decreases. - zstd-fast: A faster version of the ZStandard compression algorithm. zstd-fast is basically a "negative" level of zstd. The compression decreases with every level, but speed increases. Available compression levels for zstd-fast: - zstd-fast-1 through zstd-fast-10 - zstd-fast-20 through zstd-fast-100 (in increments of 10) - zstd-fast-500 and zstd-fast-1000 For more information check the man page. Implementation details: Rather than treat each level of zstd as a different algorithm (as was done historically with gzip), the block pointer `enum zio_compress` value is simply zstd for all levels, including zstd-fast, since they all use the same decompression function. The compress= property (a 64bit unsigned integer) uses the lower 7 bits to store the compression algorithm (matching the number of bits used in a block pointer, as the 8th bit was borrowed for embedded block pointers). The upper bits are used to store the compression level. It is necessary to be able to determine what compression level was used when later reading a block back, so the concept used in LZ4, where the first 32bits of the on-disk value are the size of the compressed data (since the allocation is rounded up to the nearest ashift), was extended, and we store the version of ZSTD and the level as well as the compressed size. This value is returned when decompressing a block, so that if the block needs to be recompressed (L2ARC, nop-write, etc), the same parameters will be used to result in the matching checksum. All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`, `zio_prop_t`, etc.) uses the separated _compress and _complevel variables. 
Only the properties ZAP contains the combined/bit-shifted value. The combined value is split when the compression_changed_cb() callback is called, and sets both objset members (os_compress and os_complevel). The userspace tools all use the combined/bit-shifted value. Additional notes: zdb can now also decode the ZSTD compression header (flag -Z) and inspect the size, version and compression level saved in that header. For each record, if it is ZSTD compressed, the parameters of the decoded compression header get printed. ZSTD is included with all current tests and new tests are added as-needed. Per-dataset feature flags now get activated when the property is set. If a compression algorithm requires a feature flag, zfs activates the feature when the property is set, rather than waiting for the first block to be born. This is currently only used by zstd but can be extended as needed. Portions-Sponsored-By: The FreeBSD Foundation Co-authored-by: Allan Jude <allanjude@freebsd.org> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Co-authored-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #6247 Closes #9024 Closes #10277 Closes #10278	2020-08-20 10:30:06 -07:00
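The combined/bit-shifted property value described above (algorithm in the lower 7 bits, level in the upper bits) can be sketched as follows. The constant and function names are hypothetical, and the algorithm identifier used in the example is an arbitrary illustrative number, not a value taken from the ZFS headers.

```python
COMPRESS_BITS = 7                          # bits reserved for the algorithm
COMPRESS_MASK = (1 << COMPRESS_BITS) - 1   # 0x7f

EXAMPLE_ALGO_ID = 15                       # illustrative algorithm id (assumption)


def combine(algo, level):
    """Pack an algorithm id and a level into one property value."""
    if not 0 <= algo <= COMPRESS_MASK:
        raise ValueError("algorithm id does not fit in 7 bits")
    if level < 0:
        raise ValueError("level must be non-negative")
    return (level << COMPRESS_BITS) | algo


def split(value):
    """Unpack a property value into (algorithm id, level)."""
    return value & COMPRESS_MASK, value >> COMPRESS_BITS


packed = combine(EXAMPLE_ALGO_ID, 19)
print(packed, split(packed))
```

Splitting once, when the property changes, means the rest of the code can work with two plain fields instead of repeating the mask-and-shift everywhere.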
Brian Behlendorf	cfd59f904b	Fix ARC aggsum access after arc_state_fini() Commit `85ec5cbae` updated abd_update_scatter_stats() such that it calls arc_space_consume() and arc_space_return() when updating the scatter stats. This requires that the global aggsum value for the ARC be initialized. Normally this is not an issue, however during module unload the l2arc_do_free_on_write() function was called in l2arc_cleanup() after arc_state_fini() destroyed the aggsum values. We can resolve this issue by performing l2arc_do_free_on_write() slightly earlier in arc_fini(). Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10739	2020-08-18 22:11:34 -07:00
Matthew Macy	716b53d0a1	FreeBSD: Fix UNIX permissions checking Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10727	2020-08-18 09:57:07 -07:00
Ryan Moeller	009cc8e884	Make zc_nvlist_src_size limit tunable We limit the size of nvlists passed to the kernel so a user cannot make the kernel do an unreasonably large allocation. On FreeBSD this limit was 128 kiB, which turns out to be a bit too small when doing some operations involving a large number of datasets or snapshots, for example replication. Make this limit tunable, with a platform-specific auto default. Linux keeps its limit at KMALLOC_MAX_SIZE. FreeBSD uses 1/4 of the system limit on user wired memory, which allows it to scale depending on system configuration. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Issue #6572 Closes #10706	2020-08-18 09:33:55 -07:00
Richard Laager	eaa25f1a8e	Remove GRUB restrictions The GRUB restrictions are based around the pool's bootfs property. Given the current situation where GRUB is not staying current with OpenZFS pool features, having either a non-ZFS /boot or a separate pool with limited features are pretty much the only long-term answers for GRUB support. Only the second case matters in this context. For the restrictions to be useful, the bootfs property would have to be set on the boot pool, because that is where we need the restrictions, as that is the pool that GRUB reads from. The documentation for bootfs describes it as pointing to the root pool. That's also how it's used in the initramfs. ZFS does not allow setting bootfs to point to a dataset in another pool. (If it did, it'd be difficult-to-impossible to enforce these restrictions cross-pool). Accordingly, bootfs is pretty much useless for GRUB scenarios moving forward. Even for users who have only one pool, the existing restrictions for GRUB are incomplete. They don't prevent you from enabling the unsupported checksums, for example. For that reason, I have ripped out all the GRUB restrictions. A little longer-term, I think extending the proposed features=portable system to define a features=grub is a much more useful approach. The user could set that on the boot pool at creation, and things would Just Work. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8627	2020-08-17 23:12:39 -07:00
Matthew Ahrens	85ec5cbae2	Include scatter_chunk_waste in arc_size The ARC caches data in scatter ABD's, which are collections of pages, which are typically 4K. Therefore, the space used to cache each block is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is not aware of the memory used by this round-up, it only accounts for the size that it requested from the ABD subsystem. Therefore, the ARC is effectively using more memory than it is aware of, due to the `scatter_chunk_waste`. This impacts observability, e.g. `arcstat` will show that the ARC is using less memory than it effectively is. It also impacts how the ARC responds to memory pressure. As the amount of `scatter_chunk_waste` changes, it appears to the ARC as memory pressure, so it needs to resize `arc_c`. If the sector size (`1<<ashift`) is the same as the page size (or larger), there won't be any waste. If the (compressed) block size is relatively large compared to the page size, the amount of `scatter_chunk_waste` will be small, so the problematic effects are minimal. However, if using 512B sectors (`ashift=9`), and the (compressed) block size is small (e.g. `compression=on` with the default `volblocksize=8k` or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be very large. On a production system, with `arc_size` at a constant 50% of memory, `scatter_chunk_waste` has been observed to be 10-30% of memory. This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new `waste` field to `arcstat`. As a result, the ARC's memory usage is more observable, and `arc_c` does not need to be adjusted as frequently. 
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10701	2020-08-17 20:04:04 -07:00
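The round-up described in the commit above is simple arithmetic, sketched here for a 4K page size. The function names are hypothetical; only the rule "space used is rounded up to a multiple of the page size" comes from the commit message.

```python
PAGE_SIZE = 4096  # typical page size, per the commit message


def cached_size(block_size, page_size=PAGE_SIZE):
    """Memory actually used to cache a block: rounded up to whole pages."""
    pages = -(-block_size // page_size)  # ceiling division
    return pages * page_size


def chunk_waste(block_size, page_size=PAGE_SIZE):
    """Bytes lost to the round-up for a single cached block."""
    return cached_size(block_size, page_size) - block_size


# An 8K volume block compressed to 2.5K on an ashift=9 pool.
print(chunk_waste(2560))
# Blocks that are already page multiples waste nothing.
print(chunk_waste(4096), chunk_waste(131072))
```

For the 2.5K block, more than a third of the page holding it is round-up, which is why small compressed blocks on 512B-sector pools account for so much unreported memory.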
Ryan Moeller	3df0c2fa32	FreeBSD: fix the build with Clang 11 * Cast void * to uintptr_t before casting to boolean_t. * Avoid clashing definition of __asm when not on Linux to prevent duplicate __volatile__. This was already done in some places but not all. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes #10723	2020-08-17 15:40:17 -07:00
Serapheim Dimitropoulos	b0099072df	Fix typo in btree.c Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #10725	2020-08-17 15:25:37 -07:00
Matthew Macy	5f1984f2f8	FreeBSD: fallback to /boot/ to look for zpool.cache Up until now zpool.cache has always lived in /boot on FreeBSD. For the sake of compatibility fallback to /boot if zpool.cache isn't found in /etc/zfs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10720	2020-08-17 14:43:47 -07:00
Ryan Moeller	3eaf76a8d2	Fix l2arc_dev_rebuild_start thread name `thread_create` on FreeBSD stringifies the argument passed as the thread function to create a name for the thread. The thread name for `l2arc_dev_rebuild_start` ended up with `(void ()(void ))` in it. Change the type signature so the function does not need to be cast when creating the thread. Rename the function to `l2arc_dev_rebuild_thread` for clarity and consistency, as well. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10716	2020-08-17 11:02:32 -07:00
Allan Jude	fc34dfba8e	Fix L2ARC reads when compressed ARC disabled When reading compressed blocks from the L2ARC, with compressed ARC disabled, arc_hdr_size() returns LSIZE rather than PSIZE, but the actual read is PSIZE. This causes l2arc_read_done() to compare the checksum against the wrong size, resulting in checksum failure. This manifests as an increase in the kstat l2_cksum_bad and the read being retried from the main pool, making the L2ARC ineffective. Add new L2ARC tests with Compressed ARC enabled/disabled Blocks are handled differently depending on the state of the zfs_compressed_arc_enabled tunable. If a block is compressed on-disk, and compressed_arc is enabled: - the block is read from disk - It is NOT decompressed - It is added to the ARC in its compressed form - l2arc_write_buffers() may write it to the L2ARC (as is) - l2arc_read_done() compares the checksum to the BP (compressed) However, if compressed_arc is disabled: - the block is read from disk - It is decompressed - It is added to the ARC (uncompressed) - l2arc_write_buffers() will use l2arc_apply_transforms() to recompress the block, before writing it to the L2ARC - l2arc_read_done() compares the checksum to the BP (compressed) - l2arc_read_done() will use l2arc_untransform() to uncompress it This test writes out a test file to a pool consisting of one disk and one cache device, then randomly reads from it. Since the arc_max in the tests is low, this will feed the L2ARC, and result in reads from the L2ARC. We compare the value of the kstat l2_cksum_bad before and after to determine if any blocks failed to survive the trip through the L2ARC. Sponsored-by: The FreeBSD Foundation Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #10693	2020-08-13 23:31:20 -07:00
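The size mismatch behind the checksum failures above can be modelled in a few lines. This is a sketch only: SHA-256 from Python's `hashlib` stands in for whatever checksum the pool uses, and the buffer layout (compressed bytes followed by zero padding up to the logical size) is an assumption made for illustration.

```python
import hashlib


def checksum(buf, size):
    """Checksum the first `size` bytes of a buffer (SHA-256 as a stand-in)."""
    return hashlib.sha256(buf[:size]).digest()


compressed = b"\x01\x02\x03\x04" * 128   # the block as stored, PSIZE bytes
psize = len(compressed)                   # 512
lsize = 4096                              # logical (uncompressed) size

# Checksum recorded for the compressed block, as in the block pointer.
bp_checksum = checksum(compressed, psize)

# Buffer sized for LSIZE, of which only the first PSIZE bytes were read.
l2_buffer = compressed + bytes(lsize - psize)

verified_with_psize = checksum(l2_buffer, psize) == bp_checksum
verified_with_lsize = checksum(l2_buffer, lsize) == bp_checksum
print(verified_with_psize, verified_with_lsize)
```

Verifying over the wrong length fails even though the data itself is intact, which matches the symptom of `l2_cksum_bad` rising while reads fall back to the main pool.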
Jorgen Lundman	faa296c73c	Release onexit/events with any missed zfsdev_state Linux and FreeBSD will most likely never see this issue. On macOS when kext is unloaded, but zed is still connected, zed will be issued ENODEV. As the cdevsw is released, the kernel will not have zfsdev_release() called to release minor/onexit/events, and it "leaks". This ensures it is cleaned up before unload. Changed the for loop from zsprev, to zsnext style, for less code duplication. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10700	2020-08-13 15:03:23 -07:00
Matthew Ahrens	d64c6a2eee	Use zfs_dbgmsg to log metaslab_load/unload Metaslabs are now (usually) loaded and unloaded infrequently, but when that is not the case, it is useful to have a log of when and why these events happened. This commit enables the zfs_dbgmsg() in metaslab_load(), and adds a zfs_dbgmsg() in metaslab_unload(). Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10683	2020-08-12 10:10:50 -07:00
Matthew Macy	e111c80247	Restore ARC MFU/MRU pressure The arc_adapt() function tunes the LRU/MLU balance according to 4 types of cache hits (passed as the state argument): ghost LRU, LRU, MRU, ghost MRU. If this function is called with the wrong cache hit (state), adaptation will be sub-optimal and performance will suffer. Some time ago upstream received the commit "6950 ARC should cache compressed data", which changed the call sequence in arc_read() for accesses to a ghost buffer. Before that commit, a hit on any ghost list was passed to arc_adapt() before the call to arc_access(), which revives the element in the cache and changes its state from ghost to a real hit. After that commit, the order of calls was reversed and arc_adapt() is now called only with «real» hits even if the hit was in one of the two ghost lists, which renders the ghost lists useless and breaks the ARC algorithm. FreeBSD fixed this problem locally in Change D19094 / Commit r348772. This change is an adaptation of the above commit to the current arc code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10548 Closes #10618	2020-08-12 10:03:24 -07:00
Allan Jude	9777044f1c	Fix typo Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #10694	2020-08-11 13:16:57 -07:00
Paul Dagnelie	12045d0278	Clarify error message when a range-tree double-add occurs Various other pieces of logic have resulted in situations where we double-free space in ZFS. This in turn results in a double-add to the range trees. These issues have been much more difficult to diagnose than they should have been, because the error handling around this case is much weaker than around the double remove case. Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10654	2020-08-07 14:13:13 -07:00
Matthew Ahrens	d87676a9fa	Fix i/o error handling of livelists and zap iteration Pool-wide metadata is stored in the MOS (Meta Object Set). This metadata is stored in triplicate, in addition to any pool-level redundancy (e.g. RAIDZ). However, if all 3+ copies of this metadata are not available, we can still get EIO/ECKSUM when reading from the MOS. If we encounter such an error in syncing context, we have typically already committed to making a change that we now can't do because of the corrupt/missing metadata. We typically "handle" this with a `VERIFY()` or `zfs_panic_recover()`. This prevents the system from continuing on in an undefined state, while minimizing the amount of error-handling code. However, there are some code paths that ignore these i/o errors, or `ASSERT()` that they don't happen. Since assertions are disabled on non-debug builds, they effectively ignore them as well. This can lead to ZFS continuing on in an incorrect state, potentially leading to on-disk inconsistencies. This commit adds handling for these i/o errors on MOS metadata, typically with a `VERIFY()`: * Handle error return from `zap_cursor_retrieve()` in 4 places in `dsl_deadlist.c`. * Handle error return from `zap_contains()` in `dsl_dir_hold_obj()`. Turns out this call isn't necessary because we can always call `zap_lookup()`. * Handle error return from `zap_lookup()` in `dsl_fs_ss_limit_check()`. * Handle error return from `zap_remove()` in `dsl_dir_rename_sync()`. * Handle error return from `zap_lookup()` in `dsl_dir_remove_livelist()`. * Handle error return from `dsl_process_sub_livelist()` in `spa_livelist_delete_cb()`. Additionally: * Augment the internal history log message for `zfs destroy` to note which method is used (e.g. bptree, livelist, or synchronous) and the mintxg. * Correct a comment in `dbuf_init()`. * Correct indentation in `dsl_dir_remove_livelist()`. 
Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10643	2020-08-05 10:22:09 -07:00
Matthew Macy	22dcf89181	Add missed thread_exit() to vdev_{autotrim,rebuild}_thread Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10668	2020-08-05 10:17:07 -07:00
George Amanakis	da60484db5	Fix logging in l2arc_rebuild() In case the L2ARC rebuild was canceled, do not log to spa history log as the pool may be in the process of being removed and a panic may occur: BUG: kernel NULL pointer dereference, address: 0000000000000018 RIP: 0010:spa_history_log_internal+0xb1/0x120 [zfs] Call Trace: l2arc_rebuild+0x464/0x7c0 [zfs] l2arc_dev_rebuild_start+0x2d/0x130 [zfs] ? l2arc_rebuild+0x7c0/0x7c0 [zfs] thread_generic_wrapper+0x78/0xb0 [spl] kthread+0xfb/0x130 ? IS_ERR+0x10/0x10 [spl] ? kthread_park+0x90/0x90 ret_from_fork+0x35/0x40 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10659	2020-08-01 11:17:18 -07:00
Allan Jude	8fb79fdddb	Change the error handling for invalid property values ZFS recv should return a useful error message when an invalid index property value is provided in the send stream properties nvlist With a compression= property outside of the understood range: Before: ``` receiving full stream of zof/zstd_send@send2 into testpool/recv@send2 internal error: Invalid argument Aborted (core dumped) ``` Note: the recv completes successfully, the abort() is likely just to make it easier to track the unexpected error code. After: ``` receiving full stream of zof/zstd_send@send2 into testpool/recv@send2 cannot receive compression property on testpool/recv: invalid property value received 28.9M stream in 1 seconds (28.9M/sec) ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #10631	2020-08-01 08:41:31 -07:00
Matthew Macy	47ed79ff60	Changes to make openzfs build within FreeBSD buildworld A collection of header changes to enable FreeBSD to build with vendored OpenZFS. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10635	2020-07-31 21:30:31 -07:00
Matthew Ahrens	3442c2a02d	Revise ARC shrinker algorithm The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the kernel's shrinker mechanism when the system is running low on free pages. This happens via 2 code paths: 1. "direct reclaim": The system is attempting to allocate a page, but we are low on memory. The ARC shrinker callback is invoked from the page-allocation code path. 2. "indirect reclaim": kswapd notices that there aren't many free pages, so it invokes the ARC shrinker callback. In both cases, the kernel's shrinker code requests that the ARC shrinker callback release some of its cache, and then it measures how many pages were released. However, its measurement of released pages does not include pages that are freed via `__free_pages()`, which is how the ARC releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker code is looking for pages to be placed on the lists of reclaimable pages (which is separate from actually-free pages). Because the kernel shrinker code doesn't detect that the ARC has released pages, it may call the ARC shrinker callback many times, resulting in the ARC "collapsing" down to `arc_c_min`. This has several negative impacts: 1. ZFS doesn't use RAM to cache data effectively. 2. In the direct reclaim case, a single page allocation may wait a long time (e.g. more than a minute) while we evict the entire ARC. 3. Even with the improvements made in `67c0f0dedc` ("ARC shrinking blocks reads/writes"), occasionally `arc_size` may stay above `arc_c` for the entire time of the ARC collapse, thus blocking ZFS read/write operations in `arc_get_data_impl()`. To address these issues, this commit limits the ways that the ARC shrinker callback can be used by the kernel shrinker code, and mitigates the impact of arc_is_overflowing() on ZFS read/write operations. With this commit: 1. We limit the amount of data that can be reclaimed from the ARC via the "direct reclaim" shrinker. 
This limits the amount of time it takes to allocate a single page. 2. We do not allow the ARC to shrink via kswapd (indirect reclaim). Instead we rely on `arc_evict_zthr` to monitor free memory and reduce the ARC target size to keep sufficient free memory in the system. Note that we can't simply rely on limiting the amount that we reclaim at once (as for the direct reclaim case), because kswapd's "boosted" logic can invoke the callback an unlimited number of times (see `balance_pgdat()`). 3. When `arc_is_overflowing()` and we want to allocate memory, `arc_get_data_impl()` will wait only for a multiple of the requested amount of data to be evicted, rather than waiting for the ARC to no longer be overflowing. This allows ZFS reads/writes to make progress even while the ARC is overflowing, while also ensuring that the eviction thread makes progress towards reducing the total amount of memory used by the ARC. 4. The amount of memory that the ARC always tries to keep free for the rest of the system, `arc_sys_free` is increased. 5. Now that the shrinker callback is able to provide feedback to the kernel's shrinker code about our progress, we can safely enable the kswapd hook. This will allow the arc to receive notifications when memory pressure is first detected by the kernel. We also re-enable the appropriate kstats to track these callbacks. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10600	2020-07-31 21:10:52 -07:00
Allan Jude	eabf270b2c	Remove duplicate include of sys/zfeature.h in dmu_objset.c Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #10636	2020-07-31 09:04:45 -07:00
Matthew Ahrens	948423a3d1	zfs promote does not delete livelist of origin When a clone is promoted, its livelist is no longer accurate, so it is discarded. If the clone's origin is also a clone (i.e. we are promoting a clone of a clone), then the origin's livelist is also no longer accurate, so it should be discarded, but the code doesn't actually do that. Consider a pool with: * Filesystem A * Clone B, a clone of A * Clone C, a clone of B If we promote C, it discards C's livelist. It should discard B's livelist, but that is not happening. The impact is that when B is destroyed, we use the livelist to find the blocks to free, but the livelist is no longer correct so we end up freeing blocks that are still in use by C. The incorrectly-freed blocks can be reallocated causing checksum errors. And when C is destroyed it can double-free the incorrectly-freed blocks. The problem is that we remove the livelist of `origin_ds->ds_dir`, but the origin snapshot has already been moved to the promoted dsl_dir. So this is actually trying to remove the livelist of the promoted dsl_dir, which was already removed. As explained in a comment in the beginning of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the origin's dsl_dir. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10652	2020-07-31 08:59:00 -07:00
Matthew Ahrens	3a92552f75	Fix error handling of vdev_top_zap In `vdev_load()`, we look up several entries in the `vdev_top_zap` object. In most cases, if we encounter an i/o error, it will be returned to the caller. However, when handling `VDEV_TOP_ZAP_ALLOCATION_BIAS`, if we get an i/o error, we may continue on, which in theory could cause us to not realize that a vdev should be used only for `special` allocations. In practice, if we encountered an i/o error while looking for `VDEV_TOP_ZAP_ALLOCATION_BIAS` in the `vdev_top_zap`, we'd also get an i/o error while looking for other entries in the same object, and thus the zpool open/import would fail. Therefore the impact of this problem is negligible. This commit adds error handling for i/o errors while accessing the `vdev_top_zap`, so that we aren't relying on unrelated code to fail for us. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10637	2020-07-29 17:04:34 -07:00
Matthew Macy	27d96d2254	Rename refcount.h to zfs_refcount.h Renamed to avoid conflicting with refcount.h when a different implementation is already provided by the platform. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10620	2020-07-29 16:35:33 -07:00
Serapheim Dimitropoulos	843e9ca2e1	Introduce names for ZTHRs When debugging issues or generally analyzing the runtime of a system it would be nice to be able to tell the different ZTHRs running by name rather than having to analyze their stack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Co-authored-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #10630	2020-07-29 09:43:33 -07:00
Matthew Macy	5678d3f593	Prefix zfs internal endian checks with _ZFS FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN LITTLE_ENDIAN on every architecture. Trying to do cross builds whilst hiding this from ZFS has proven extremely cumbersome. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10621	2020-07-28 13:02:49 -07:00
Matthew Macy	e64cc4954c	Refactor ccompile.h to not include system headers This is a step toward being able to vendor the OpenZFS code in FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10625	2020-07-25 20:09:50 -07:00
Matthew Macy	6d8da84106	Make use of ZFS_DEBUG consistent within kmod sources Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10623	2020-07-25 20:07:44 -07:00
Matthew Macy	f5b189f937	FreeBSD: Fixes required to build ZFS on PowerPC Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10622	2020-07-25 11:00:23 -07:00
Brian Atkinson	6fba7bfd0e	Add gang ABD child to parent gang ABD By design a gang ABD cannot have another gang ABD as a child. This is to make sure the logical offset in a gang ABD is consistent with the individual ABDs it contains as children. If a gang ABD is added as a child of a gang ABD we will add the individual children of the gang ABD to the parent gang ABD. This allows for a consistent view of offsets within the parent gang ABD. Reviewed-by: Mark Maybee <mmaybee@cray.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10430	2020-07-24 21:09:20 -07:00
Ryan Moeller	8348fac30c	Limit dbuf cache sizes based only on ARC target size by default Set the initial max sizes to ULONG_MAX to allow the caches to grow with the ARC. Recalculate the metadata cache size on demand so it can adapt, too. Update descriptions in zfs-module-parameters(5). Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10563 Closes #10610	2020-07-24 20:38:48 -07:00
Matthew Ahrens	5dd92909c6	Adjust ARC terminology The process of evicting data from the ARC is referred to as `arc_adjust`. This commit changes the term to `arc_evict`, which is more specific. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10592	2020-07-22 09:51:47 -07:00
Matthew Ahrens	026e529cb3	Remove skc_reclaim, hdr_recl, kmem_cache shrinker The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`, whereby individual caches can register a callback to be invoked when there is memory pressure. This mechanism is used in only one place: the ARC registers the `hdr_recl()` reclaim function. This function wakes up the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and `arc_reduce_target_size()`. The `skc_reclaim` callbacks are invoked only by shrinker callbacks and `arc_reap_zthr`, and the only callback only wakes up `arc_reap_zthr`. When called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When called from shrinker callbacks, we are already aware of memory pressure and responding to it. Therefore there is little benefit to ever calling the `hdr_recl()` `skc_reclaim` callback. The `arc_reap_zthr` also wakes once a second, and if memory is low when allocating an ARC buffer. Therefore, additionally waking it from the shrinker callbacks has little benefit. The shrinker callbacks can be invoked very frequently, e.g. 10,000 times per second. Additionally, for each invocation of the shrinker callback, skc_reclaim is invoked many times. Therefore, this mechanism consumes significant amounts of CPU time. The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which, in addition to invoking `skc_reclaim()`, does two things to attempt to free pages for use by the system: 1. Return free objects from the magazine layer to the slab layer 2. Return entirely-free slabs to the page layer (i.e. free pages) These actions apply only to caches implemented by the SPL, not those that use the underlying kernel SLAB/SLUB caches. The SPL caches are used for objects >=32KB, which are primarily linear ABDs cached in the DBUF cache. These actions (freeing objects from the magazine layer and returning entirely-free slabs) are also taken whenever a `kmem_cache_free()` call finds a full magazine.
So there would typically be zero entirely-free slabs, and the number of objects in magazines is limited (typically no more than 64 objects per magazine, and there's one magazine per CPU). Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is modest. We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when memory pressure is detected. Therefore, calling `spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed. This commit removes the `skc_reclaim` mechanism, its only callback `hdr_recl()`, and the kmem_cache shrinker callback. Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10576	2020-07-19 09:58:30 -07:00
Matthew Ahrens	6774931dfa	Extend zdb to print inconsistencies in livelists and metaslabs Livelists and spacemaps are data structures that are logs of allocations and frees. Livelist entries are block pointers (blkptr_t). Spacemap entries are ranges of numbers, most often used to track allocated/freed regions of metaslabs/vdevs. These data structures can become self-inconsistent, for example if a block or range is "double allocated" (two allocation records without an intervening free) or "double freed" (two free records without an intervening allocation). ZDB (as well as zfs running in the kernel) can detect these inconsistencies when loading livelists and metaslabs. However, it generally halts processing when the error is detected. When analyzing an on-disk problem, we often want to know the entire set of inconsistencies, which is not possible with the current behavior. This commit adds a new flag, `zdb -y`, which analyzes the livelist and metaslab data structures and displays all of their inconsistencies. Note that this is different from the leak detection performed by `zdb -b`, which checks for inconsistencies between the spacemaps and the tree of block pointers, but assumes the spacemaps are self-consistent. The specific checks added are: Verify livelists by iterating through each sublivelist and: - report leftover FREEs - report double ALLOCs and double FREEs - record leftover ALLOCs together with their TXG [see Cross Check] Verify spacemaps by iterating over each metaslab and: - iterate over spacemap and then the metaslab's entries in the spacemap log, then report any double FREEs and double ALLOCs Verify that livelists are consistent with spacemaps. The space referenced by livelists (after using the FREEs to cancel out corresponding ALLOCs) should be allocated, according to the spacemaps.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-66031 Closes #10515	2020-07-14 17:51:05 -07:00
Alexander Motin	1743c737f5	Fix LOR between dp_config_rwlock and spa_props_lock During automated API testing our QE team hit a deadlock in ZFS, caused by a lock order reversal. On one side, dsl_sync_task_sync() locks dp_config_rwlock as writer and calls spa_sync_props(), which waits for spa_props_lock. On the other, spa_prop_get() locks spa_props_lock and then calls dsl_pool_config_enter(), trying to lock dp_config_rwlock as reader. This patch makes spa_prop_get() lock dp_config_rwlock before spa_props_lock, making the order consistent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #10553	2020-07-14 12:21:57 -07:00
Brian Atkinson	e4d3d77684	Fixing gang ABD child removal race condition On Linux the list debug code has been setting off a failure when checking that the node->next->prev value is pointing back at the node. At times this check evaluates to 0xdead. When removing a child from a gang ABD we must acquire the child's abd_mtx to make sure that the same ABD is not being added to another gang ABD while it is being removed from a gang ABD. This fixes a race condition when checking if an ABD's link is already active and part of another gang ABD before adding it to a gang. Added additional debug code for the gang ABD in abd_verify() to make sure each child ABD has active links. Also check to make sure another gang ABD is not added to a gang ABD. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10511	2020-07-14 11:04:35 -07:00
Matthew Ahrens	e59a377a8f	filesystem_limit/snapshot_limit is incorrectly enforced against root The filesystem_limit and snapshot_limit properties limit the number of filesystems or snapshots that can be created below this dataset. According to the manpage, "The limit is not enforced if the user is allowed to change the limit." Two types of users are allowed to change the limit: 1. Those that have been delegated the `filesystem_limit` or `snapshot_limit` permission, e.g. with `zfs allow USER filesystem_limit DATASET`. This works properly. 2. A user with elevated system privileges (e.g. root). This does not work - the root user will incorrectly get an error when trying to create a snapshot/filesystem, if it exceeds the `_limit` property. The problem is that `priv_policy_ns()` does not work if the `cred_t` is not that of the current process. This happens when `dsl_enforce_ds_ss_limits()` is called in syncing context (as part of a sync task's check func) to determine the permissions of the corresponding user process. This commit fixes the issue by passing the `task_struct` (typedef'ed as a `proc_t`) to syncing context, and then using `has_capability()` to determine if that process is privileged. Note that we still need to pass the `cred_t` to syncing context so that we can check if the user was delegated this permission with `zfs allow`. This problem only impacts Linux. Wrappers are added to FreeBSD but it continues to use `priv_check_cred()`, which works on arbitrary `cred_t`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8226 Closes #10545	2020-07-11 17:18:02 -07:00
George Amanakis	2054f35e56	Fix a persistent L2ARC bug in l2arc_write_done() In case l2arc_write_done() handles a zio that was not successful check that the list of log block pointers is not empty when restoring them in the device header. Otherwise zero them out. In any case perform the actual write updating the device header after the zio of l2arc_write_buffers() completes as l2arc_write_done() may have touched the memory holding the log block pointers in the device header. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10540 Closes #10543	2020-07-10 14:10:03 -07:00
Mark Johnston	6e00561712	Add a "try" operation for range locks zfs_rangelock_tryenter() bails immediately instead of waiting for the lock to become available. This will be used to resolve a deadlock in the FreeBSD page-in code. No functional change intended. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #10519	2020-07-06 11:53:31 -07:00
Brian Behlendorf	9a49d3f3d3	Add device rebuild feature The device_rebuild feature enables sequential reconstruction when resilvering. Mirror vdevs can be rebuilt in LBA order which may more quickly restore redundancy depending on the pool's average block size, overall fragmentation and the performance characteristics of the devices. However, block checksums cannot be verified as part of the rebuild, thus a scrub is automatically started after the sequential resilver completes. The new '-s' option has been added to the `zpool attach` and `zpool replace` commands to request sequential reconstruction instead of healing reconstruction when resilvering. zpool attach -s <pool> <existing vdev> <new vdev> zpool replace -s <pool> <old vdev> <new vdev> The `zpool status` output has been updated to report the progress of sequential resilvering in the same way as healing resilvering. The one notable difference is that multiple sequential resilvers may be in progress as long as they're operating on different top-level vdevs. The `zpool wait -t resilver` command was extended to wait on sequential resilvers. From this perspective they are no different than healing resilvers. Sequential resilvers cannot be supported for RAIDZ, but are compatible with the dRAID feature being developed. As part of this change the resilver_restart_* tests were moved into the functional/replacement directory. Additionally, the replacement tests were renamed and extended to verify both resilvering and rebuilding. Original-patch-by: Isaac Huang <he.huang@intel.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: John Poduska <jpoduska@datto.com> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10349	2020-07-03 11:05:50 -07:00
Matthew Macy	7ddb753d17	freebsd: changes necessary to coexist with dtrace in tree Fix header conflicts when building zfs with openzfs as a vendor import. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10497	2020-07-01 09:10:08 -07:00
Matthew Ahrens	3c42c9ed84	Clean up OS-specific ARC and kmem code OS-specific code (e.g. under `module/os/linux`) does not need to share its code structure with any other operating systems. In particular, the ARC and kmem code need not be similar to the code in illumos, because we won't be syncing this OS-specific code between operating systems. For example, if/when illumos support is added to the common repo, we would add a file `module/os/illumos/zfs/arc_os.c` for the illumos versions of this code. Therefore, we can simplify the code in the OS-specific ARC and kmem routines. These changes do not impact system behavior, they are purely code cleanup. The changes are: Arenas are not used on Linux or FreeBSD (they are always `NULL`), so `heap_arena`, `zio_arena`, and `zio_alloc_arena` can be removed, along with code that uses them. In `arc_available_memory()`: * `desfree` is unused, remove it * rename `freemem` to avoid conflict with pre-existing `#define` * remove checks related to arenas * use units of bytes, rather than converting from bytes to pages and then back to bytes `SPL_KMEM_CACHE_REAP` is unused, remove it. `skc_reap` is unused, remove it. The `count` argument to `spl_kmem_cache_reap_now()` is unused, remove it. `vmem_size()` and associated type and macros are unused, remove them. In `arc_memory_throttle()`, use a less confusing variable name to store the result of `arc_free_memory()`. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10499	2020-06-29 09:01:07 -07:00
Matthew Ahrens	67c0f0dedc	ARC shrinking blocks reads/writes ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to allow the ARC to shrink when the kernel experiences memory pressure. The ARC shrinker changes `arc_c` via a call to `arc_reduce_target_size()`. Before commit `3ec34e5527`, the ARC shrinker would also evict data from the ARC to bring `arc_size` down to the new `arc_c`. However, that commit (seemingly inadvertently) made it so that the ARC shrinker no longer evicts any data or waits for eviction to complete. Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often all the way to `arc_c_min`. Since it doesn't wait for the actual eviction of data from the ARC, this creates a situation where `arc_size` is more than `arc_c` for the several seconds/minutes it takes for `arc_adjust_zthr` to evict data from the ARC. During this time, arc_get_data_impl() will block, so ZFS can't process read/write requests (e.g. from iSCSI, NFS, or read/write syscalls). To ensure that `arc_c` doesn't shrink faster than the adjust thread can keep up, this commit makes the ARC shrinker wait for the eviction to complete, resulting in similar behavior to what we had before commit `3ec34e5527`. Note: commit `3ec34e5527` is `OpenZFS 9284 - arc_reclaim_thread has 2 jobs` and was integrated in December 2018, and is part of ZoL 0.8.x but not 0.7.x. Additionally, when the ARC size is reduced drastically, the `arc_adjust_zthr` can be on-CPU for many seconds without blocking. Any threads that are bound to the same CPU that arc_adjust_zthr is running on will not be able to run for a long time. To ensure that CPU-bound threads can make progress, this commit changes `arc_evict_state_impl()` to make a voluntary preemption call, `cond_resched()`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-70703 Closes #10496	2020-06-26 10:42:27 -07:00
Ryan Moeller	9192f27c1d	Add zfs_multihost_interval tunable handler for FreeBSD This tunable required a handler to be implemented for ZFS_MODULE_PARAM_CALL. Add the handler so the tunable can be declared in common code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10490	2020-06-23 13:32:42 -07:00
Arvind Sankar	0ce2de637b	Add prototypes Add prototypes/move prototypes to header files. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:32 -07:00
Arvind Sankar	60356b1a21	Add include files for prototypes Include the header with prototypes in the file that provides definitions as well, to catch any mismatch between prototype and definition. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:25 -07:00
Arvind Sankar	c3fe42aabd	Remove dead code Delete unused functions. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:21:18 -07:00
Arvind Sankar	65c7cc49bf	Mark functions as static Mark functions used only in the same translation unit as static. This only includes functions that do not have a prototype in a header file either. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10470	2020-06-18 12:20:38 -07:00
Matthew Macy	8056a75672	Disambiguate condvar API contract On Illumos callers of cv_timedwait and cv_timedwait_hires can't distinguish whether the cv was signaled or the call timed out. Illumos handles this (for some definition of handles) by calling cv_signal in the return path if we were signaled but the return value indicates instead that we timed out. This would make sense if it were possible to query the cv for its net signal disposition. However, this isn't possible and, in spite of the fact that there are places in the code that clearly take a different and incompatible path if a timeout value is indicated, this distinction appears to be rather subtle to most developers. This problem is further compounded by the fact that on Linux, calling cv_signal in the return path wouldn't even do the right thing unless there are other waiters. Since it is possible for the caller to independently determine how much time is remaining but it is not possible to query if the cv was in fact signaled, prioritizing signalling over timeout seems like a cleaner solution. In addition, judging from usage patterns within the code itself, it is also less error prone. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10471	2020-06-18 10:17:50 -07:00
Matthew Macy	7564073ed6	Add abd_cache_reap_now for abd_chunk_cache users Apparently missed in the initial port integration was the need to reap the abd_chunk_cache on FreeBSD. This change addresses that oversight. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10474	2020-06-17 21:44:13 -07:00
Jorgen Lundman	4458157bee	zfs_ioctl: saved_poolname can be truncated As it uses kmem_strdup() and kmem_strfree() which both rely on strlen() being the same, but saved_poolname can be truncated causing: SPL: kernel memory allocator: buffer freed to wrong cache SPL: buffer was allocated from kmem_alloc_16, SPL: caller attempting free to kmem_alloc_8. SPL: buffer=0xffffff90acc66a38 bufctl=0x0 cache: kmem_alloc_8 Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10469	2020-06-17 14:30:03 -07:00
Alexander Motin	17ca30185a	Set initial arc_c to arc_c_min instead of arc_c_max For at least 15 years since OpenSolaris arc_c was set by default to arc_c_max, later decreased under memory pressure. I've noticed that if arc_c was set high enough to cause memory pressure as considered by ZFS, setting arc_no_grow to TRUE in arc_reap_cb_check() has no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms) return. All that time ZFS can continue increasing its effective ARC size, causing more memory pressure, potentially up to the point when the OS low memory handler activates and reduces arc_c, requesting fast reclamation of just-allocated memory. The problem seems to be more serious on FreeBSD and I guess Linux, since neither of them implements/uses asynchronous kmem reclamation, so arc_kmem_reap_soon() can take more time. On older FreeBSD 11, which does not support multiple memory domains, a system with lots of RAM can get completely unresponsive for minutes due to heavy lock congestion between ARC reclamation and page daemon kmem reclamation threads. With this change to a more conservative arc_c value, the ARC stops growing just in time and does not need later reclamation. Also while there, since growing arc_c is now a more frequent situation, use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt() to reduce lock congestion. This also brings it in sync with the code in arc_get_data_impl(). Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes #10437	2020-06-17 14:27:04 -07:00
Jorgen Lundman	883a40fff4	Add convenience wrappers for common uio usage The macOS uio struct is opaque and the API must be used, this makes the smallest changes to the code for all platforms. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10412	2020-06-14 10:09:55 -07:00
Jorgen Lundman	4f73576ea1	Upstream: zil_commit_waiter() can stall forever On macOS clock_t is unsigned, so when cv_timedwait_hires() returns -1 we loop forever. The conditional was tweaked to ignore signedness. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10445	2020-06-14 10:08:21 -07:00
Arvind Sankar	71504277ae	Cleanup linux module kbuild files The linux module can be built either as an external module, or compiled into the kernel, using copy-builtin. The source and build directories are slightly different between the two cases, and currently, compiling into the kernel still refers to some files from the configured ZFS source tree, instead of the copies inside the kernel source tree. There is also duplication between copy-builtin, which creates a Kbuild file to build ZFS inside the kernel tree, and the top-level module/Makefile.in. Fix this by moving the list of modules and the CFLAGS settings into a new module/Kbuild.in, which will be used by the kernel kbuild infrastructure, and using KBUILD_EXTMOD to distinguish the two cases within the Makefiles, in order to choose appropriate include directories etc. Module CFLAGS setting is simplified by using subdir-ccflags-y (available since 2.6.30) to set them in the top-level Kbuild instead of each individual module. The disabling of -Wunused-but-set-variable is removed from the lua and zfs modules. The variable that the Makefile uses is actually not defined, so this has no effect; and the warning has long been disabled by the kernel Makefile itself. The target_cpu definition in module/{zfs,zcommon} is removed as it was replaced by use of CONFIG_SPARC64 in commit `70835c5b75` ("Unify target_cpu handling") os/linux/{spl,zfs} are removed from obj-m, as they are not modules in themselves, but are included by the Makefile in the spl and zfs module directories. The vestigial Makefiles in os and os/linux are removed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Closes #10379 Closes #10421	2020-06-10 09:24:15 -07:00
Andrea Gelmini	dd4bc569b9	Fix typos Correct various typos in the comments and tests. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Closes #10423	2020-06-09 21:24:09 -07:00
Matthew Ahrens	7bcb7f0840	File incorrectly zeroed when receiving incremental stream that toggles -L Background: By increasing the recordsize property above the default of 128KB, a filesystem may have "large" blocks. By default, a send stream of such a filesystem does not contain large WRITE records, instead it decreases objects' block sizes to 128KB and splits the large blocks into 128KB blocks, allowing the large-block filesystem to be received by a system that does not support the `large_blocks` feature. A send stream generated by `zfs send -L` (or `--large-block`) preserves the large block size on the receiving system, by using large WRITE records. When receiving an incremental send stream for a filesystem with large blocks, if the send stream's -L flag was toggled, a bug is encountered in which the file's contents are incorrectly zeroed out. The contents of any blocks that were not modified by this send stream will be lost. "Toggled" means that the previous send used `-L`, but this incremental does not use `-L` (-L to no-L); or that the previous send did not use `-L`, but this incremental does use `-L` (no-L to -L). Changes: This commit addresses the problem with several changes to the semantics of zfs send/receive: 1. "-L to no-L" incrementals are rejected. If the previous send used `-L`, but this incremental does not use `-L`, the `zfs receive` will fail with this error message: incremental send stream requires -L (--large-block), to match previous receive. 2. "no-L to -L" incrementals are handled correctly, preserving the smaller (128KB) block size of any already-received files that used large blocks on the sending system but were split by `zfs send` without the `-L` flag. 3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`. This feature indicates that we can correctly handle "no-L to -L" incrementals. This flag is currently not set on any send streams. 
In the future, we intend for incremental send streams of snapshots that have large blocks to use `-L` by default, and these streams will also have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams from the default use of `zfs send` won't encounter the bug mentioned above, because they can't be received by software with the bug. Implementation notes: To facilitate accessing the ZPL's generation number, `zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and restructured to fill in a struct with ZPL-specific info including owner and generation. In the "no-L to -L" case, if this is a compressed send stream (from `zfs send -cL`), large WRITE records that are being written to small (128KB) blocksize files need to be decompressed so that they can be written split up into multiple blocks. The zio pipeline will recompress each smaller block individually. A new test case, `send-L_toggle`, is added, which tests the "no-L to -L" case and verifies that we get an error for the "-L to no-L" case. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #6224 Closes #10383	2020-06-09 10:41:01 -07:00
George Amanakis	b7654bd794	Trim L2ARC The l2arc_evict() function is responsible for evicting buffers which reference the next bytes of the L2ARC device to be overwritten. Teach this function to additionally TRIM that vdev space before it is overwritten if the device has been filled with data. This is done by vdev_trim_simple() which trims by issuing a new type of TRIM, TRIM_TYPE_SIMPLE. We also implement a "Trim Ahead" feature. It is a zfs module parameter, expressed in % of the current write size. This trims ahead of the current write size. A minimum of 64MB will be trimmed. The default is 0 which disables TRIM on L2ARC as it can put significant stress on underlying storage devices. To enable TRIM on L2ARC we set l2arc_trim_ahead > 0. We also implement TRIM of the whole cache device upon addition to a pool, pool creation or when the header of the device is invalid upon importing a pool or onlining a cache device. This is dependent on l2arc_trim_ahead > 0. TRIM of the whole device is done with TRIM_TYPE_MANUAL so that its status can be monitored by zpool status -t. We save the TRIM state for the whole device and the time of completion on-disk in the header, and restore these upon L2ARC rebuild so that zpool status -t can correctly report them. Whole device TRIM is done asynchronously so that the user can export the pool or remove the cache device while it is trimming (i.e. if it is too slow). We do not TRIM the whole device if persistent L2ARC has been disabled by l2arc_rebuild_enabled = 0 because we may not want to lose all cached buffers (e.g. we may want to import the pool with l2arc_rebuild_enabled = 0 only once because of memory pressure). If persistent L2ARC has been disabled by setting the module parameter l2arc_rebuild_blocks_min_l2size to a value greater than the size of the cache device then the whole device is trimmed upon creation or import of a pool if l2arc_trim_ahead > 0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam D. Moss <c@yotes.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #9713 Closes #9789 Closes #10224	2020-06-09 10:15:08 -07:00
Pawel Jakub Dawidek	529246df96	Restore support for in-kernel ZFS ioctls In Illumos it is possible to call ioctl functions from within the kernel by passing the FKIOCTL flag. Neither FreeBSD nor Linux support that, but it doesn't hurt to keep it around, as all the code is there. Before this commit it was a dead code and zc_iflags was always zero. Restore this functionality by allowing to pass a flag to the zfsdev_ioctl_common() function. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #10417	2020-06-08 13:57:22 -07:00
Jorgen Lundman	c9e319faae	Replace sprintf()->snprintf() and strcpy()->strlcpy() The strcpy() and sprintf() functions are deprecated on some platforms. Care is needed to ensure correct size is used. If some platforms miss snprintf, we can add a #define to sprintf, likewise strlcpy(). The biggest change is adding a size parameter to zfs_id_to_fuidstr(). The various *_impl_get() functions are only used on linux and have not yet been updated. Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10400	2020-06-07 11:42:12 -07:00
Paul Dagnelie	99b281f1ae	Fix double mutex_init bug in send code It was possible to cause a kernel panic in the send code by initializing an already-initialized mutex, if a record was created with type DATA, destroyed with a different type (bypassing the mutex_destroy call) and then re-allocated as a DATA record again. We tweak the logic to not change the type of a record once it has been created, avoiding the issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10374	2020-06-03 19:53:21 -07:00
Ryan Moeller	a9dcfac51c	Periodically update ARC kstats FreeBSD needs arc_adjust_zthr to run periodically for kstats to be updated. A comment in the code suggests this may have been the original intent in illumos as well: `c946d5a913/module/zfs/arc.c (L4697-L4700)` Create the thread with a 1 second timer. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10371	2020-06-03 09:52:38 -07:00
Jorgen Lundman	70a5fc0530	Memory leak in dsl_destroy_snapshots_nvl error case The dsl_destroy_snapshots_nvl() function has an early error out, and temporary nvlists were not freed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10366	2020-05-26 16:13:41 -07:00
Brian Atkinson	fb822260b1	Gang ABD Type Adding the gang ABD type, which allows for linear and scatter ABDs to be chained together into a single ABD. This can be used to avoid doing memory copies to/from ABDs. An example of this can be found in vdev_queue.c in the vdev_queue_aggregate() function. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian <bwa@clemson.edu> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10069	2020-05-20 18:06:09 -07:00
DeHackEd	57434abae6	Use boot_ncpus in place of max_ncpus in taskq_create Due to hotplug support or BIOS bugs sometimes max_ncpus can be an absurdly high value. I have a system with 32 cores/threads but reports max_ncpus == 440. This many threads potentially cripples the system during arc_prune floods for example. boot_ncpus is the number of working CPUs when called so use that instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #10282	2020-05-20 10:07:21 -07:00
Matthew Ahrens	1b9cd1a9d9	Fix error handling in receive_writer_thread() If `receive_writer_thread()` gets an error from `receive_process_record()`, it should be saved in `rwa->err` so that we will stop processing records, and the main thread will notice that the receive has failed. When an error is first encountered, this happens correctly. However, if there are more records to dequeue, the next time through the loop we will reset `rwa->err` to zero, allowing us to try to process the following record (2 after the failed record). Depending on what types of records remain, we may incorrectly complete the receive "successfully", but without actually having processed all the records. The fix is to only set `rwa->err` if we got a non-zero error. This bug was introduced by #10099 "Improve zfs receive performance by batching writes". Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10320	2020-05-14 20:48:29 -07:00
Brian Behlendorf	2ade659eb4	Fix abd_enter/exit_critical wrappers Commit `fc551d7` introduced the wrappers abd_enter_critical() and abd_exit_critical() to mark critical sections. On Linux these are implemented with the local_irq_save() and local_irq_restore() macros which set the 'flags' argument when saving. By wrapping them with a function the local variable is no longer set by the macro and is no longer properly restored. Convert abd_enter_critical() and abd_exit_critical() to macros to resolve this issue and ensure the flags are properly restored. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10332	2020-05-14 20:45:16 -07:00
Jorgen Lundman	eeb8fae9c7	Upstream: add missing thread_exit() Undo FreeBSD wrapper for thread_create() added to call thread_exit. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #10314	2020-05-14 15:58:09 -07:00
Matthew Ahrens	8b240f14f9	remove unneeded member drc_err of dmu_recv_cookie_t The member drc_err of dmu_recv_cookie_t is used only locally in receive_read, so we can replace it with a local variable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10319	2020-05-14 12:10:29 -07:00
John Poduska	41035a0496	Resilver restarts unnecessarily when it encounters errors When a resilver finishes, vdev_dtl_reassess is called to hopefully excise DTL_MISSING (amongst other things). If there are errors during the resilver, they are tracked in DTL_SCRUB, as spelled out in the block comment in vdev.c. DTL_SCRUB is in-core only, so it can only be used if the pool was online for the whole resilver. This state is tracked with the spa_scrub_started flag, which only gets set when the scan is initialized. Unfortunately, this flag gets cleared right before vdev_dtl_reassess gets called, so if there are any errors during the scan, DTL_MISSING will never get excised and the resilver will just continually restart. This fix simply moves clearing that flag until after the call to vdev_dtl_reassess. In addition, if a pool is imported and already has scn_errors > 0, this change will restart the resilver immediately instead of doing the rest of the scan and then restarting it from the beginning. On the other hand, if scn_errors == 0 at import, then no errors have been encountered so far, so the spa_scrub_started flag can be safely set. A test has been added to verify that resilver does not restart when relevant DTL's are available. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Zuchowski <pzuchowski@datto.com> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #10291	2020-05-13 10:54:27 -07:00
Brian Atkinson	fc551d7efb	Combine OS-independent ABD Code into Common Source File Reorganizing ABD code base so OS-independent ABD code has been placed into a common abd.c file. OS-dependent ABD code has been left in each OS's ABD source files, and these source files have been renamed to abd_os. The OS-independent ABD code is now under: module/zfs/abd.c With the OS-dependent code in: module/os/linux/zfs/abd_os.c module/os/freebsd/zfs/abd_os.c Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes #10293	2020-05-10 12:23:52 -07:00
George Amanakis	657fd33bcf	Improvements on persistent L2ARC Functional changes: We implement refcounts of log blocks and their aligned size on the cache device along with two corresponding arcstats. The refcounts are reflected in the header of the device and provide valuable information as to whether log blocks are accounted for correctly. These are dynamically adjusted as log blocks are committed/evicted. zdb also uses this information in the device header and compares it to the corresponding values as reported by dump_l2arc_log_blocks() which emulates l2arc_rebuild(). If the refcounts saved in the device header report higher values, zdb exits with an error. For this feature to work correctly there should be no active writes on the device. This is also employed in the tests of persistent L2ARC. We extend the structure of the cache device header by adding the two new variables mirroring the refcounts after the existing variables to preserve backward compatibility in terms of persistent L2ARC. 1) a new arcstat "l2_log_blk_asize" and refcount "l2ad_lb_asize" which reflect the total aligned size of log blocks on the device. This is also reflected in the header of the cache device as "dh_lb_asize". 2) a new arcstat "l2arc_log_blk_count" and refcount "l2ad_lb_count" which reflect the total number of L2ARC log blocks present on cache devices. It is also reflected in the header of the cache device as "dh_lb_count". In l2arc_rebuild_vdev() if the amount of committed log entries in a log block is 0 and the device header is valid we update the device header. This will facilitate trimming of the whole device in this case when TRIM for L2ARC is implemented. Improve loop protection in l2arc_rebuild() by using the starting offset of the payload of each log block instead of the starting offset of the log block. If the zio in l2arc_write_buffers() fails, restore the lbps array in the header of the device to its previous state in l2arc_write_done(). 
If l2arc_rebuild() ends the rebuild process without restoring any L2ARC log blocks in ARC and without any other error, this means that the lbps array in the header is pointing to non-existent or invalid log blocks. Reset the device header in this case. In l2arc_rebuild() change the zfs_dbgmsg messages to spa_history_log_internal() making them user visible with zpool history command. Non-functional changes: Make the first test in persistent L2ARC use `zdb -lll` to increase coverage in `zdb.c`. Rename psize with asize when referring to log blocks, since L2ARC_SET_PSIZE stores the vdev aligned size for log blocks. Also rename dh_log_blk_entries to dh_log_entries to make it clear that it is a mirror of l2ad_log_entries. Added comments for both changes. Fix inaccurate comments for example in l2arc_log_blk_restore(). Add asserts at the end in l2arc_evict() and l2arc_write_buffers(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10228	2020-05-07 16:34:03 -07:00
Paul Dagnelie	108a454a46	Add support for boot environment data to be stored in the label Modern bootloaders leverage data stored in the root filesystem to enable some of their powerful features. GRUB specifically has a grubenv file which can store large amounts of configuration data that can be read and written at boot time and during normal operation. This allows sysadmins to configure useful features like automated failover after failed boot attempts. Unfortunately, due to the Copy-on-Write nature of ZFS, the standard behavior of these tools cannot handle writing to ZFS files safely at boot time. We need an alternative way to store data that allows the bootloader to make changes to the data. This work is very similar to work that was done on Illumos to enable similar functionality in the FreeBSD bootloader. This patch is different in that the data being stored is a raw grubenv file; this file can store arbitrary variables and values, and the scripting provided by grub is powerful enough that special structures are not required to implement advanced behavior. We repurpose the second padding area in each label to store the grubenv file, protected by an embedded checksum. We add two ioctls to get and set this data, and libzfs_core and libzfs functions to access them more easily. There are no direct command line interfaces to these functions; these will be added directly to the bootloader utilities. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #10009	2020-05-07 09:36:33 -07:00
George Amanakis	1b664952ae	Enable splitting mirrors with indirect vdevs When a top-level vdev is removed from a pool it is converted to an indirect vdev. Until now splitting such mirrored pools was not possible with zpool split. This patch enables handling of indirect vdevs and splitting of those pools with zpool split. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10283	2020-05-06 10:32:28 -07:00
George Amanakis	fa25460538	Add missing zfs_refcount_destroy() in key_mapping_rele() Otherwise when running with reference_tracking_enable=TRUE mounting and unmounting an encrypted dataset panics with: Call Trace: dump_stack+0x66/0x90 slab_err+0xcd/0xf2 ? __kmalloc+0x174/0x260 ? __kmem_cache_shutdown+0x158/0x240 __kmem_cache_shutdown.cold+0x1d/0x115 shutdown_cache+0x11/0x140 kmem_cache_destroy+0x210/0x230 spl_kmem_cache_destroy+0x122/0x3e0 [spl] zfs_refcount_fini+0x11/0x20 [zfs] spa_fini+0x4b/0x120 [zfs] zfs_kmod_fini+0x6b/0xa0 [zfs] _fini+0xa/0x68c [zfs] __x64_sys_delete_module+0x19c/0x2b0 do_syscall_64+0x5b/0x1a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-By: Tom Caputi <tcaputi@datto.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10246	2020-04-28 09:53:45 -07:00
Tom Caputi	aa646323db	Fix missing ivset guid with resumed raw base recv This patch corrects a bug introduced in `61152d1069`. When resuming a raw base receive, the dmu_recv code always sets drc->drc_fromsnapobj to the object ID of the previous snapshot. For incrementals, this is correct, but for base sends, this should be left at 0. The presence of this ID eventually allows a check to run which determines whether or not the incoming stream and the previous snapshot have matching IVset guids. This check fails because it is not meant to run when there is no previous snapshot. When it does fail, the user receives an error stating that the incoming stream has the problem outlined in errata 4. This patch corrects this issue by simply ensuring drc->drc_fromsnapobj is left as 0 for base receives. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #10234 Closes #10239	2020-04-24 19:00:32 -07:00
Matthew Ahrens	196bee4cfd	Remove deduplicated send/receive code Deduplicated send streams (i.e. `zfs send -D` and `zfs receive` of such streams) are deprecated. Deduplicated send streams can be received by first converting them to non-deduplicated with the `zstream redup` command. This commit removes the code for sending and receiving deduplicated send streams. `zfs send -D` will now print a warning, ignore the `-D` flag, and generate a regular (non-deduplicated) send stream. `zfs receive` of a deduplicated send stream will print an error message and fail. The resulting code simplification (especially in the kernel's support for receiving dedup streams) should help enable future performance enhancements. Several new tests are added which leverage `zstream redup`. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Issue #7887 Issue #10117 Issue #10156 Closes #10212	2020-04-23 10:06:57 -07:00
Matthew Ahrens	32d805c3e2	Use a struct to organize metaslab-group-allocator fields Each metaslab group (of which there is one per top-level vdev) has several (4, by default) "metaslab group allocators". Each "allocator" has its own metaslab that it prefers to allocate from (the "primary" allocator), and each can perform allocations concurrently with the other allocators. In addition to the primary metaslab, there are several other fields that need to be tracked separately for each allocator. These are currently stored as several arrays in the metaslab_group_t, each array indexed by allocator number. This change organizes all the metaslab-group-allocator-specific fields into a new struct, metaslab_group_allocator_t. The metaslab_group_t now needs only one array indexed by the allocator number - which contains the metaslab_group_allocator_t's. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10213	2020-04-22 10:26:56 -07:00
Matthew Ahrens	1f043c8be1	Fix zfs send progress reporting The progress of a send is supposed to be reported by `zfs send -v`, but it is not. This works by creating a new user thread (with pthread_create()) which does ZFS_IOC_SEND_PROGRESS ioctls to check how much progress has been made. This IOCTL finds the specified send (since there may be multiple concurrent sends in the system). The IOCTL also checks that the specified send was started by the current process. On Linux, different threads of the same process are represented as different `struct task_struct`s (and, confusingly, have different PID's). To check if two threads are in the same process, we need to check if they have the same `struct task_struct:group_leader`. We used to do this correctly, but it was inadvertently changed by `30af21b025` (Redacted Send) to simply check if the current `struct task_struct` is the one that started the send. This commit changes the code back to checking if the send was started by a `struct task_struct` with the same `group_leader` as the calling thread. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Chris Wedgwood <cw@f00f.org> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #10215 Closes #10216	2020-04-20 10:12:48 -07:00
George Amanakis	9249f1272e	Persistent L2ARC minor fixes Minor fixes on persistent L2ARC improving code readability and fixing a typo in zdb.c when byte-swapping a log block. It also improves the pesist_l2arc_007_pos.ksh test by giving it more time to retrieve log blocks on the cache device. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Adam D. Moss <c@yotes.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10210	2020-04-17 09:27:40 -07:00
Ryan Moeller	a7929f3137	Update FreeBSD tunables Remove some obsolete legacy compat, rename some misnamed, and add some missing tunables for FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10203	2020-04-15 11:14:47 -07:00
Brian Behlendorf	791e480c6a	Disable user space reference tracking The memory and cpu cost of reference count tracking with the current implementation is significant. For this reason it has always been disabled by default for the kmods. Apply this same default to user space so ztest doesn't always incur this performance penalty. Our intention is to re-enable this by default for ztest once the code has been optimized. Since we expect to at some point provide a FUSE implementation we wouldn't want this enabled by default for libzpool. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #10189	2020-04-13 10:51:44 -07:00
George Amanakis	77f6826b83	Persistent L2ARC This commit makes the L2ARC persistent across reboots. We implement a light-weight persistent L2ARC metadata structure that allows L2ARC contents to be recovered after a reboot. This significantly eases the impact a reboot has on read performance on systems with large caches. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Saso Kiselkov <skiselkov@gmail.com> Co-authored-by: Jorgen Lundman <lundman@lundman.net> Co-authored-by: George Amanakis <gamanakis@gmail.com> Ported-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #925 Closes #1823 Closes #2672 Closes #3744 Closes #9582	2020-04-10 10:33:35 -07:00
Ryan Moeller	36a6e2335c	Don't ignore zfs_arc_max below allmem/32 Set arc_c_min before arc_c_max so that when zfs_arc_min is set lower than the default allmem/32 zfs_arc_max can also be set lower. Add warning messages when tunables are being ignored. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10157 Closes #10158	2020-04-09 15:39:48 -07:00
Matthew Macy	8b27e08ed8	Add separate field for indicating that spa is in middle of split By default it's not possible to open a device already owned by an active vdev. It's necessary to make an exception to this for vdev split. The FreeBSD platform code will make an exception if spa_is_splitting is set to true. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10178	2020-04-09 09:59:31 -07:00
Matthew Macy	01c4f2bf29	Use vn_io_fault_uiomove on FreeBSD to avoid potential deadlock Added to prevent a possible deadlock; the following comments from FreeBSD explain the issue. The comment describing vn_io_fault_uiomove: /* * Helper function to perform the requested uiomove operation using * the held pages for io->uio_iov[0].iov_base buffer instead of * copyin/copyout. Access to the pages with uiomove_fromphys() * instead of iov_base prevents page faults that could occur due to * pmap_collect() invalidating the mapping created by * vm_fault_quick_hold_pages(), or pageout daemon, page laundry or * object cleanup revoking the write access from page mappings. * * Filesystems specified MNTK_NO_IOPF shall use vn_io_fault_uiomove() * instead of plain uiomove(). */ This is used for vn_io_fault, which has the following motivation: /* * The vn_io_fault() is a wrapper around vn_read() and vn_write() to * prevent the following deadlock: * * Assume that the thread A reads from the vnode vp1 into userspace * buffer buf1 backed by the pages of vnode vp2. If a page in buf1 is * currently not resident, then system ends up with the call chain * vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] -> * vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2) * which establishes lock order vp1->vn_lock, then vp2->vn_lock. * If, at the same time, thread B reads from vnode vp2 into buffer buf2 * backed by the pages of vnode vp1, and some page in buf2 is not * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock. * * To prevent the lock order reversal and deadlock, vn_io_fault() does * not allow page faults to happen during VOP_READ() or VOP_WRITE(). * Instead, it first tries to do the whole range i/o with pagefaults * disabled. If all pages in the i/o buffer are resident and mapped, * VOP will succeed (ignoring the genuine filesystem errors). * Otherwise, we get back EFAULT, and vn_io_fault() falls back to do * i/o in chunks, with all pages in the chunk prefaulted and held * using vm_fault_quick_hold_pages(). * * Filesystems using this deadlock avoidance scheme should use the * array of the held pages from uio, saved in the curthread->td_ma, * instead of doing uiomove(). A helper function * vn_io_fault_uiomove() converts uiomove request into * uiomove_fromphys() over td_ma array. * * Since vnode locks do not cover the whole i/o anymore, rangelocks * make the current i/o request atomic with respect to other i/os and * truncations. */ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes #10177	2020-04-08 10:30:27 -07:00
Ryan Moeller	7e3df9db12	Finish refactoring for ZFS_MODULE_PARAM_CALL Linux and FreeBSD have different parameters for tunable proc handler. This has prevented FreeBSD from implementing the ZFS_MODULE_PARAM_CALL macro. To complete the sharing of ZFS_MODULE_PARAM_CALL declarations, create per-platform definitions of the parameter list, ZFS_MODULE_PARAM_ARGS. With the declarations wired up we discovered an incorrect scope prefix for spa_slop_shift, so this is now fixed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10179	2020-04-07 10:06:22 -07:00
Paul Dagnelie	5a42ef04fd	Add 'zfs wait' command Add a mechanism to wait for delete queue to drain. When doing redacted send/recv, many workflows involve deleting files that contain sensitive data. Because of the way zfs handles file deletions, snapshots taken quickly after a rm operation can sometimes still contain the file in question, especially if the file is very large. This can result in issues for redacted send/recv users who expect the deleted files to be redacted in the send streams, and not appear in their clones. This change duplicates much of the zpool wait related logic into a zfs wait command, which can be used to wait until the internal deleteq has been drained. Additional wait activities may be added in the future. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #9707	2020-04-01 10:02:06 -07:00
George Amanakis	37c22948e5	Reset l2ad_hand and l2ad_first in l2arc_evict Increasing l2arc_write_size or l2arc_write_boost can result in l2arc_write_buffers() not having enough space to perform its writes and panic zio_write_phys(). Instead of resetting l2ad_hand to l2ad_start at the end of l2arc_write_buffers() and not taking into account a possible user-mediated increase of l2arc_write_max, we do this in l2arc_evict(), right after l2arc_write_size() has run. If there is not enough space to evict (ie we will exceed l2ad_end) we evict to the end of the device, reset l2ad_hand to l2ad_start, set l2ad_first to 0 and iterate l2arc_evict(). We avoid infinite iteration of l2arc_evict() by making sure in l2arc_write_size() that l2ad_start + size does not exceed l2ad_end. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #10154	2020-03-31 10:46:48 -07:00
Ryan Moeller	9a51738b60	Let default arc_c_max be platform dependent Linux changed the default max ARC size to 1/2 of physical memory to deal with shortcomings of the Linux SLUB allocator. Other platforms do not require the same logic. Implement an arc_default_max() function to determine a default max ARC size in platform code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes #10155	2020-03-27 09:14:46 -07:00

2389 Commits