The ARC shrinker callback `arc_shrinker_count/_scan()` is invoked by the
kernel's shrinker mechanism when the system is running low on free
pages. This happens via 2 code paths:
1. "direct reclaim": The system is attempting to allocate a page, but we
are low on memory. The ARC shrinker callback is invoked from the
page-allocation code path.
2. "indirect reclaim": kswapd notices that there aren't many free pages,
so it invokes the ARC shrinker callback.
In both cases, the kernel's shrinker code requests that the ARC shrinker
callback release some of its cache, and then it measures how many pages
were released. However, it's measurement of released pages does not
include pages that are freed via `__free_pages()`, which is how the ARC
releases memory (via `abd_free_chunks()`). Rather, the kernel shrinker
code is looking for pages to be placed on the lists of reclaimable pages
(which is separate from actually-free pages).
Because the kernel shrinker code doesn't detect that the ARC has
released pages, it may call the ARC shrinker callback many times,
resulting in the ARC "collapsing" down to `arc_c_min`. This has several
negative impacts:
1. ZFS doesn't use RAM to cache data effectively.
2. In the direct reclaim case, a single page allocation may wait a long
time (e.g. more than a minute) while we evict the entire ARC.
3. Even with the improvements made in 67c0f0dedc ("ARC shrinking blocks
reads/writes"), occasionally `arc_size` may stay above `arc_c` for the
entire time of the ARC collapse, thus blocking ZFS read/write operations
in `arc_get_data_impl()`.
To address these issues, this commit limits the ways that the ARC
shrinker callback can be used by the kernel shrinker code, and mitigates
the impact of arc_is_overflowing() on ZFS read/write operations.
With this commit:
1. We limit the amount of data that can be reclaimed from the ARC via
the "direct reclaim" shrinker. This limits the amount of time it takes
to allocate a single page.
2. We do not allow the ARC to shrink via kswapd (indirect reclaim).
Instead we rely on `arc_evict_zthr` to monitor free memory and reduce
the ARC target size to keep sufficient free memory in the system. Note
that we can't simply rely on limiting the amount that we reclaim at once
(as for the direct reclaim case), because kswapd's "boosted" logic can
invoke the callback an unlimited number of times (see
`balance_pgdat()`).
3. When `arc_is_overflowing()` and we want to allocate memory,
`arc_get_data_impl()` will wait only for a multiple of the requested
amount of data to be evicted, rather than waiting for the ARC to no
longer be overflowing. This allows ZFS reads/writes to make progress
even while the ARC is overflowing, while also ensuring that the eviction
thread makes progress towards reducing the total amount of memory used
by the ARC.
4. The amount of memory that the ARC always tries to keep free for the
rest of the system, `arc_sys_free` is increased.
5. Now that the shrinker callback is able to provide feedback to the
kernel's shrinker code about our progress, we can safely enable
the kswapd hook. This will allow the arc to receive notifications
when memory pressure is first detected by the kernel. We also
re-enable the appropriate kstats to track these callbacks.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10600
FreeBSD recently integrated a change which causes \s in a regex to
throw an error instead of silently being misinterpreted as an s.
Change the regex in zpool_colors.ksh to use [[:space:]].
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes#10651
FreeBSD uses more stack space in debug configurations and can overflow
the stack while formatting the error message when the call depth limit
of 20 frames is reached. This is readily reproduced by running the
gsub recursion test with increased kstack size. I hit the panic with
16 pages per kstack, and noticed it go away when bumped to 17.
Reserve an additional 64 bytes on the stack when building for FreeBSD.
This is enough to avoid the panic with a deep stack while not wasting
too much space when the default stack size is used.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10634
Rather than just saying there was an internal error, provide any
context we might have to the user to help them understand the issue.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#10632
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#10636
This was causing all later errno's to have the incorrect value.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#10649
When a clone is promoted, its livelist is no longer accurate, so it is
discarded. If the clone's origin is also a clone (i.e. we are promoting
a clone of a clone), then the origin's livelist is also no longer
accurate, so it should be discarded, but the code doesn't actually do
that.
Consider a pool with:
* Filesystem A
* Clone B, a clone of A
* Clone C, a clone of B
If we promote C, it discards C's livelist. It should discard B's
livelist, but that is not happening. The impact is that when B is
destroyed, we use the livelist to find the blocks to free, but the
livelist is no longer correct so we end up freeing blocks that are still
in use by C. The incorrectly-freed blocks can be reallocated causing
checksum errors. And when C is destroyed it can double-free the
incorrectly-freed blocks.
The problem is that we remove the livelist of `origin_ds->ds_dir`, but
the origin snapshot has already been moved to the promoted dsl_dir. So
this is actually trying to remove the livelist of the promoted dsl_dir,
which was already removed. As explained in a comment in the beginning
of `dsl_dataset_promote_sync()`, we need to use the saved `odd` for the
origin's dsl_dir.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10652
* Fixed a typo that cause one of the variations to be a no-op
* Added additional coverage for adding special vdev after pool create
* Added additional coverage for using 4K sector size
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#10641
In `vdev_load()`, we look up several entries in the `vdev_top_zap`
object. In most cases, if we encounter an i/o error, it will be
returned to the caller. However, when handling
`VDEV_TOP_ZAP_ALLOCATION_BIAS`, if we get an i/o error, we may continue
on, which in theory could cause us to not realize that a vdev should be
used only for `special` allocations.
In practice, if we encountered an i/o error while looking for
`VDEV_TOP_ZAP_ALLOCATION_BIAS` in the `vdev_top_zap`, we'd also get an
i/o error while looking for other entries in the same object, and thus
the zpool open/import would fail. Therefore the impact of this problem
is negligible.
This commit adds error handling for i/o errors while accessing the
`vdev_top_zap`, so that we aren't relying on unrelated code to fail for
us.
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10637
This is a minor change to the systemd service templates that verifies the zfs
kernel module is loaded by the kernel prior to attempting to import any zpool.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jonathon Fernyhough <jonathon.fernyhough@york.ac.uk>
Closes#10627
Renamed to avoid conflicting with refcount.h when a different
implementation is already provided by the platform.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10620
When debugging issues or generally analyzing the runtime of
a system it would be nice to be able to tell the different
ZTHRs running by name rather than having to analyze their
stack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#10630
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10621
The `zfs program` subcommand invokes a LUA interpreter to run ZFS
"channel programs". This interpreter runs in a constrained environment,
with defined memory limits. The LUA stack (used for LUA functions that
call each other) is allocated in the kernel's heap, and is limited by
the `-m MEMORY-LIMIT` flag and the `zfs_lua_max_memlimit` module
parameter. The C stack is used by certain LUA features that are
implemented in C. The C stack is limited by `LUAI_MAXCCALLS=20`, which
limits call depth.
Some LUA C calls use more stack space than others, and `gsub()` uses an
unusually large amount. With a programming trick, it can be invoked
recursively using the C stack (rather than the LUA stack). This
overflows the 16KB Linux kernel stack after about 11 iterations, less
than the limit of 20.
One solution would be to decrease `LUAI_MAXCCALLS`. This could be made
to work, but it has a few drawbacks:
1. The existing test suite does not pass with `LUAI_MAXCCALLS=10`.
2. There may be other LUA functions that use a lot of stack space, and
the stack space may change depending on compiler version and options.
This commit addresses the problem by adding a new limit on the amount of
free space (in bytes) remaining on the C stack while running the LUA
interpreter: `LUAI_MINCSTACK=4096`. If there is less than this amount
of stack space remaining, a LUA runtime error is generated.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10611Closes#10613
This is a step toward being able to vendor the OpenZFS code in FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10625
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10623
i386 has some additional memory reservation logic that limits the size
of the reported available memory. This was accidentally being used on
all arches due to a missing header.
Include machine/vmparam.h in freebsd/zfs/arc_os.c to pull in the
missing UMA_MD_SMALL_ALLOC definition.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10616
This is only used for the kstat, but something other than 0 is nice.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10626
By design a gang ABD can not have another gang ABD as a child. This is
to make sure the logical offset in a gang ABD is consistent with the
individual ABDS it contains as children. If a gang ABD is added as a
child of a gang ABD we will add the individual children of the gang ABD
to the parent gang ABD. This allows for a consistent view of offsets
within the parent gang ABD.
Reviewed-by: Mark Maybee <mmaybee@cray.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#10430
Set the initial max sizes to ULONG_MAX to allow the caches to grow
with the ARC.
Recalculate the metadata cache size on demand so it can adapt, too.
Update descriptions in zfs-module-parameters(5).
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10563Closes#10610
By default, `spl_kmem_cache_expire` is `KMC_EXPIRE_MEM`, meaning that
objects will be removed from kmem cache magazines by
`spl_kmem_cache_reap_now()`.
There is also a module parameter to change this to `KMC_EXPIRE_AGE`,
which establishes a maximum lifetime for objects to stay in the
magazine. This setting has rarely, if ever, been used, and is not
regularly tested.
This commit removes the code for `KMC_EXPIRE_AGE`, and associated module
parameters.
Additionally, the unused module parameter
`spl_kmem_cache_obj_per_slab_min` is removed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10608
Adding a new subcommand to zstream called token. This
now allows users to decode a resume token to retrieve the toname
field. This can be useful for tools that need this information.
The syntax works as follows zstream token <resume_token>.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Tony Perkins <tperkins@datto.com>
Closes#10558
* libspl: umem: These are obviously and intentionally unused; annotate
them as such to appease -Wunused-parameter builds that include this
header.
* sys/dmu.h: In this case, clear_on_evict_dbufp is only used for
ZFS_DEBUG builds, so annotate it as __maybe_unused to appease
-Wunused-parameter.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes#10606
Drop unnecessary redefinition's of several arcstat values.
Put missing extern declaration of arc_no_grow_shift in arc_impl.h.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10609
zfs_path_to_zhandle has no need to mutate the path argument,
most notably:
- zfs_open takes path as const
- getextmntent takes path as const
- fprintf most clearly doesn't need to mutate it
It's hard to foresee any reason that libzfs could conceivably
want to mutate it in the future, either, so const'ify it.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes#10605
Update stale references to "ZFS on Linux" to "OpenZFS" in
CONTRIBUTING.md.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10602
When libudev is installed on FreeBSD, configure finds it and sets
WANT_DEVNAME2DEVID, but it isn't found by the linker because we
didn't specify where it is.
Use LIBUDEV_LIBS so the location of the library gets added to the
linker flags for devname2devid.
Also use LIBUDEV_CFLAGS here in case some other platform needs it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10590
Commit 109d2c9310 ("Move zfs_gitrev.h to build directory") stopped
distributing zfs_gitrev.h, as it is a generated file. Add it back, with
some changes in behavior.
Change the logic for gitrev as follows
- if the source tree is a git repository, the behavior for build is
unchanged. For make dist, append -dist to the git tag in the
distributed version of zfs_gitrev.h.
- otherwise, check if the source tree contains zfs_gitrev.h, and use it
if so, falling back to "unknown" if it doesn't exist.
- clean it only in make maintainer-clean, so we don't remove it from the
source tree on make clean or make distclean.
This allows disted sources to track what git tag they originally came
from, with the -dist suffix indicating that the code wasn't built
directly from git and so might contain additional changes beyond the git
tag.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Eli Schwartz <eschwartz@archlinux.org>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10595
Commit 109d2c9310 ("Move zfs_gitrev.h to build directory") removed
scripts/make_gitrev.sh, putting the logic into the Makefile itself.
However, at least the Arch Linux packager wants the script so that the
file can be generated without having to run configure first, for
DKMS packaging purposes.
So move the make recipe back into the script.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Eli Schwartz <eschwartz@archlinux.org>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10595
The process of evicting data from the ARC is referred to as
`arc_adjust`.
This commit changes the term to `arc_evict`, which is more specific.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10592
The DKMS module installs the entire source tree, including the .in files
that will later be substituted when building. This makes
brp_mangle_shebangs complain about shebang lines in the .in files.
Exclude everything under /usr/src from shebang mangling in the DKMS
package.
The KMOD package doesn't contain any of the files it excludes from
mangling, so just drop the exclusion.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: João Carlos Mendes Luís <jonny@jonny.eng.br>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10581Closes#10582
These tunables were renamed from vfs.zfs.arc_min and
vfs.zfs.arc_max to vfs.zfs.arc.min and vfs.zfs.arc.max.
Add legacy compat tunables for the old names.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10579
The unit was failing instead of stopping if someone manually unloaded
the key before stopping the unit (zfs unload-key is failing on an
unavailable key).
Follow a similar logic than for loading the key, checking for the key
status before unloading it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Co-authored-by: Didier Roche <didrocks@ubuntu.com>
Signed-off-by: Didier Roche <didrocks@ubuntu.com>
Closes#10477
We need a stronger dependency between the mount unit and its keyload unit
when we know that the dataset is encrypted.
If the keyload unit fails, Wants= will still try to mount the dataset,
which will then fail.
It’s better to show that the failure is due to a dependency failing, the
keyload unit, by tighting up the dependency. We can do this as we know
that we generate both units in the generator and so, it’s not an
optional dependency.
BindsTo enable as well that if the keyload unit fails at any point, the
associated mountpoint will be then unmounted.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Didier Roche <didrocks@ubuntu.com>
Signed-off-by: Didier Roche <didrocks@ubuntu.com>
Closes#10477
Drop Before=zfs.mount dependency explicity on generated key-load .service
unit.
Indeed, the associated mount unit is After=<dataset-key-load>.service.
This is thus the mount point which controls at what point it wants to be
mounted (Before=zfs-mount.service in stock generator), but this can be
an automount point, or triggered by another service.
This additional dependency from the key load service is not needed thus.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Didier Roche <didrocks@ubuntu.com>
Signed-off-by: Didier Roche <didrocks@ubuntu.com>
Closes#10477
The SPL kmem_cache implementation provides a mechanism, `skc_reclaim`,
whereby individual caches can register a callback to be invoked when
there is memory pressure. This mechanism is used in only one place: the
ARC registers the `hdr_recl()` reclaim function. This function wakes up
the `arc_reap_zthr`, whose job is to call `kmem_cache_reap()` and
`arc_reduce_target_size()`.
The `skc_reclaim` callbacks are invoked only by shrinker callbacks and
`arc_reap_zthr`, and only callback only wakes up `arc_reap_zthr`. When
called from `arc_reap_zthr`, waking `arc_reap_zthr` is a no-op. When
called from shrinker callbacks, we are already aware of memory pressure
and responding to it. Therefore there is little benefit to ever calling
the `hdr_recl()` `skc_reclaim` callback.
The `arc_reap_zthr` also wakes once a second, and if memory is low when
allocating an ARC buffer. Therefore, additionally waking it from the
shrinker calbacks has little benefit.
The shrinker callbacks can be invoked very frequently, e.g. 10,000 times
per second. Additionally, for invocation of the shrinker callback,
skc_reclaim is invoked many times. Therefore, this mechanism consumes
significant amounts of CPU time.
The kmem_cache shrinker calls `spl_kmem_cache_reap_now()`, which,
in addition to invoking `skc_reclaim()`, does two things to attempt to
free pages for use by the system:
1. Return free objects from the magazine layer to the slab layer
2. Return entirely-free slabs to the page layer (i.e. free pages)
These actions apply only to caches implemented by the SPL, not those
that use the underlying kernel SLAB/SLUB caches. The SPL caches are
used for objects >=32KB, which are primarily linear ABD's cached in the
DBUF cache.
These actions (freeing objects from the magazine layer and returning
entirely-free slabs) are also taken whenever a `kmem_cache_free()` call
finds a full magazine. So there would typically be zero entirely-free
slabs, and the number of objects in magazines is limited (typically no
more than 64 objects per magazine, and there's one magazine per CPU).
Therefore the benefit of `spl_kmem_cache_reap_now()`, while nonzero, is
modest.
We also call `spl_kmem_cache_reap_now()` from the `arc_reap_zthr`, when
memory pressure is detected. Therefore, calling
`spl_kmem_cache_reap_now()` from the kmem_cache shrinker is not needed.
This commit removes the `skc_reclaim` mechanism, its only callback
`hdr_recl()`, and the kmem_cache shrinker callback.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10576
Stock kernels older than 4.10 do not export the has_capability()
function which is required by commit e59a377. To avoid breaking
the build on older kernels revert to the safe legacy behavior and
return EACCES when privileges cannot be checked.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10565Closes#10573
`arc_free_memory()` returns the amount of memory that the ARC considers
to be free. This includes pages that are not actually free, but can be
evicted with essentially zero cost (without doing any i/o), for example
the page cache. The ARC can "squeeze out" any pages included in this
calculation, leaving only `arc_sys_free` (1/64th of RAM) for these
free/evictable pages.
Included in the count of free/evictable pages is
`nr_inactive_anon_pages()`, which is described as "Anonymous memory that
has not been used recently and can be swapped out". These pages would
have to be written out to disk (swap) in order to evict them, and they
are not included in `/proc/meminfo`'s `MemAvailable`.
Therefore it is not appropriate for `nr_inactive_anon_pages()` to be
included in the free/evictable memory returned by `arc_free_memory()`,
because the ARC shouldn't (intentionally) make the system swap.
This commit removes `nr_inactive_anon_pages()` from the memory returned
by `arc_free_memory()`. This is a step towards enabling the ARC to
manage free memory by monitoring it and reducing the ARC size as we
notice that there is insufficient free memory (in the `arc_reap_zthr`),
rather than the current method of relying on the `arc_shrinker`
callback.
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10575
Update the zfs commands such that they're backwards compatible with
the version of ZFS is the base FreeBSD.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10542
The following test cases have been observed to fail frequently
enough to be a problem when reporting CI results. Until they can
be updated to be entirely reliable add them to the zts-report.py
script.
alloc_class/alloc_class_011_neg
cli_root/zpool_import/zpool_import_012_pos
mmp/mmp_on_uberblocks
rsend/send_partial_dataset
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10578
FreeBSD stat uses -f to specify the format string rather than -c.
list_file_blocks in blkdev.shlib uses stat -c %i to get a file's
object ID for zdb. We already have a library function to do this
portably.
Use get_objnum to get the file's object ID.
Take log_must off of the call to list_free_blocks in
corrupt_blocks_at_level, which had masked the error. It was not good
to pipe the output of log_must into the while-loop, anyway.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10572
Move/add include of <linux/percpu_compat.h> to satisfy missing
requirements.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain@dolbeau.org>
Closes#10568Closes#10569
Livelists and spacemaps are data structures that are logs of allocations
and frees. Livelists entries are block pointers (blkptr_t). Spacemaps
entries are ranges of numbers, most often used as to track
allocated/freed regions of metaslabs/vdevs.
These data structures can become self-inconsistent, for example if a
block or range can be "double allocated" (two allocation records without
an intervening free) or "double freed" (two free records without an
intervening allocation).
ZDB (as well as zfs running in the kernel) can detect these
inconsistencies when loading livelists and metaslab. However, it
generally halts processing when the error is detected.
When analyzing an on-disk problem, we often want to know the entire set
of inconsistencies, which is not possible with the current behavior.
This commit adds a new flag, `zdb -y`, which analyzes the livelist and
metaslab data structures and displays all of their inconsistencies.
Note that this is different from the leak detection performed by
`zdb -b`, which checks for inconsistencies between the spacemaps and the
tree of block pointers, but assumes the spacemaps are self-consistent.
The specific checks added are:
Verify livelists by iterating through each sublivelists and:
- report leftover FREEs
- report double ALLOCs and double FREEs
- record leftover ALLOCs together with their TXG [see Cross Check]
Verify spacemaps by iterating over each metaslab and:
- iterate over spacemap and then the metaslab's entries in the
spacemap log, then report any double FREEs and double ALLOCs
Verify that livelists are consistenet with spacemaps. The space
referenced by livelists (after using the FREE's to cancel out
corresponding ALLOCs) should be allocated, according to the spacemaps.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-66031
Closes#10515
A bunch of places need to edit files to incorporate the configured paths
i.e. bindir, sbindir etc. Move this logic into a common file.
Create arc_summary by copying arc_summary[23] as appropriate at build
time instead of install time.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10559
The configure variables won't be defined if CONFIG_KERNEL is disabled
and defining empty macros causes errors. The spec files do provide some
defaults if the macros are undefined.
Remove config conditionals in the tgz Makefile.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10564