The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC. That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks. This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.
The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it. This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.
As such, we revert this patch in favor of a proper fix for both issues.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC. Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system. To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs. This will prevent them from attempting
to initiate any I/O during direct reclaim.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Generate an assertion if we're going to deadlock the system by
attempting to acquire a mutex the process is already holding.
There are currently no known instances of this under normal
operation, but it _might_ be possible when using a ZVOL as a
swap device. I want to ensure we catch this immediately if it
were to occur.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
PF_NOFS is a per-process debug flag which is set in current->flags to
detect when a process is performing an unsafe allocation. All tasks
with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because
if they enter direct reclaim and initiate I/O they may deadlock.
When debugging is disabled, any incorrect usage will be detected and
a call stack with a warning will be printed to the console. The flags
will then be automatically corrected to allow for safe execution. If
debugging is enabled this will be treated as a fatal condition.
To avoid any risk of conflicting with the existing PF_ flags. The
PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit. Only when
CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused,
will the PF_NOFS bit be valid. Happily, most existing distributions
ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 2092cf68d8. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit b8b6e4c453. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address. Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 372c257233. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs. The issue is that the Linux kernel does
not honor the flags passed to __vmalloc(). This makes it unsafe
to use in a writeback context. Unfortunately, this is a use case
ZFS depends on for correct operation.
Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.
While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code. This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.
This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue. This doesn't prevent the posibility
of the worker thread from deadlocking. However, the caller can now
safely block on a wait queue for the slab allocation to complete.
Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,. The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.
However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out. When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.
As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.
These emergency allocations will likely be slow since they require
contiguous pages. However, their use should be rare so the impact
is expected to be minimal. If that turns out not to be the case in
practice further optimizations are possible.
One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed. Is they accumulate on a system and the list
grows freeing objects will become more expensive. This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.
Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented. See issue https://github.com/zfsonlinux/spl/issues/26
for full details. This work would also help reduce ZFS's memory
fragmentation problems.
The /proc/spl/kmem/slab file has had two new columns added at the
end. The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case. These value should give us a good idea
of how often these objects are needed. Based on these values under
real use cases we can tune the default behavior.
Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock. This should manifest itself as reduced cpu usage for
the system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since removing the check for CONFIG_PREEMPT, there are no consumers of
the SPL_LINUX_CONFIG macro. As such, there is no reason to keep it
around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#164
The autoconf macro which failed if CONFIG_PREEMPT was set in the kernel
config was removed. With the inclusion of a few previous patches
targeting support for preempt enabled kernels, it is now safe to run
with this kernel config option enabled.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#83
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#718
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#718
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries. However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):
ENOTEMPTY
pathname contains entries other than . and .. ; or, pathname has
.. as its final component. POSIX.1-2001 also allows EEXIST for
this condition.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#895
Make name in Linux menuconfig consistent with those of other filesystems
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#897
ZFS fails to build when SPL is built into the kernel on unless
--with-spl=/path/to/kernel/sources is specified. We fallback to the
kernel sources directory when SPL is not found elsewhere to resolve
that.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closed#896
When calling sa_update() and friends it is possible that a spill
buffer will be needed to accomidate the update. When this happens
a hold is taken on the new dbuf and that hold must be released
before calling dmu_tx_commit(). Failing to release the hold will
cause a copy of the dbuf to be made in dbuf_sync_leaf(). This is
done to ensure further updates to the dbuf never sneak in to the
syncing txg.
This could be left to the sa_update() caller. But then the caller
would need to be aware of this internal SA implementation detail.
It is therefore preferable to handle this all internally in the
SA implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#503Closes#513
This reverts commit ec2626ad3f which
caused consistency problems between the shared and private handles.
Reverting this change should resolve issues #709 and #727. It
will also reintroduce an arc_anon memory leak which is addressed
by the next commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#709Closes#727
To support preempt enabled kernels in ZFS on Linux, there are a couple
places where the ZFS code needs to disable interrupts. This change adds
the Solaris preempt functions and maps them to the equivalent ZFS
functions, allowing the ZFS to make use of them.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.
The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
./configure erroneously detects absence of dops->d_automount
when built against a GrSecurity patched kernel.
Summerized error message found in config.log:
checking whether dops->d_automount() exists
...
In function 'main': ... error: constified variable 'dops'
cannot be local
The "dops" variable cannot be a local variable, so it's
moved to the global scope.
This test also fails if the prototype of the dops->d_automount
function pointer is changed.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes#884
When building SPL support into the kernel, ./copy-builtin will copy
non-toplevel .gitignore files. These files list /Makefile, which causes
git-archive to omit ./module/{spl,splat}/Makefile. The absence of these
files result in build failures when SPL is selected. ZFS is unaffected
because it puts Makefile in the toplevel .gitignore, which is not
copied. We fix SPL by emulating that behavior.
Reported-by: Fabio Erculiani <lxnay@gentoo.org>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#152
1796 "ZFS HOLD" should not be used when doing "ZFS SEND" from a read-only pool
2871 support for __ZFS_POOL_RESTRICT used by ZFS test suite
2903 zfs destroy -d does not work
2957 zfs destroy -R/r sometimes fails when removing defer-destroyed snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/1796https://www.illumos.org/issues/2871https://www.illumos.org/issues/2903https://www.illumos.org/issues/2957
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/2635
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#717
To properly support CONFIG_PREEMPT enabled kernels, we must refrain from
using a CPU index when preemption is enabled. As a result, this change
moves the trace_set_debug_header call (which calls smp_processor_id)
within trace_get_tcd and trace_put_tcd (which disable and enable
preemption respectively).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#160
This regression was accidentally introduced by commit
330d06f90d due to ZoL
specific code. The fix is to simply ensure the passed
nvlist is initialized and freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#876
While I'd like to remove the various pragmas in module/zfs/dbuf.c.
There are consumers such as Lustre which still depend on dmu_buf_*
versions of the symbols. Until all consumers can be converted to
use only the dbuf_* names leave this symbol exported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the /usr/src/zfs-<version>-<release>/<kernel> directory to
the zfs-modules-devel package. This ensures that this directory
will be removed when the package is removed.
We do not include the higher level /usr/src/zfs-<version>-<release>
directory since there may be builds for multiple kernels. Instead,
a %postun rmdir is added which attempts to remove this directory.
It will only succeed when the last zfs-modules-devel-* package
for this specific release is removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the /usr/src/spl-<version>-<release>/<kernel> directory to
the spl-modules-devel package. This ensures that this directory
will be removed when the package is removed.
We do not include the higher level /usr/src/spl-<version>-<release>
directory since there may be builds for multiple kernels. Instead,
a %postun rmdir is added which attempts to remove this directory.
It will only succeed when the last spl-modules-devel-* package
for this specific release is removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When mutex debugging is enabled in your kernel the increased
size of the mutex structures can push the zfs_sb_t type beyond
the 8k warning threshold. This isn't harmful so we suppress
the warning for this case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#628
RPM versions 4.8 and 4.9 differ in the definition of macro %_mandir:
$ rpm --version ; rpm --showrc | grep ^-14:._mandir
RPM version 4.9.0
-14: _mandir %{_prefix}/share/man
$ rpm --version ; rpm --showrc | grep ^-14:._mandir
RPM version 4.8.0
-14: _mandir /usr/share/man
zfs.spec.in defines %_prefix as /, so man pages end up getting
installed in /share/man on RPM 4.9 systems. To fix this, define
%_mandir relative to %_datadir in the spec file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#353
Export these symbols so they may be used by other ZFS consumers
besides the ZPL.
Remove three stale prototype definites from dbuf.h. The actual
implementations of these functions were removed/renamed long ago.
It would be good in the long term to remove the existing pragmas
we inherited from Solaris and simply use the dbuf_* names.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit adds support for building a zfs-modules-dkms sub package
built around Dynamic Kernel Module Support. This is to allow building
packages using the DKMS infrastructure which is intended to ease the
burden of kernel version changes, upgrades, etc.
By default zfs-modules-dkms-* sub package will be built as part of
the 'make rpm' target. Alternately, you can build only the DKMS
module package using the 'make rpm-dkms' target.
Examples:
# To build packaged binaries as well as a dkms packages
$ ./configure && make rpm
# To build only the packaged binary utilities and dkms packages
$ ./configure && make rpm-utils rpm-dkms
Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are
supported for building the dkms sub package.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #535
When checking for the SPL Module.symvers file, a timeout can now be
passed in which will pause the configure step while it waits for this
file to be generated. By default, the configure behavior is unchanged as
a timeout of 0 is used. If a positive number of seconds is passed,
configure will wait that number of seconds for the Module.symvers file
before moving on.
The main motivation for this change was to support parallel execution of
'./configure && make' for the SPL and ZFS packages in preparation of
supporting DKMS based packages.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit adds support for building a spl-modules-dkms sub package
built around Dynamic Kernel Module Support. This is to allow building
packages using the DKMS infrastructure which is intended to ease the
burden of kernel version changes, upgrades, etc.
By default spl-modules-dkms-* sub package will be built as part of
the 'make rpm' target. Alternately, you can build only the DKMS
module package using the 'make rpm-dkms' target.
Examples:
# To build packaged binaries as well as a dkms packages
$ ./configure && make rpm
# To build only the packaged binary utilities and dkms packages
$ ./configure && make rpm-utils rpm-dkms
Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are
supported for building the dkms sub package.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#535
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1693
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#678