Allowing the spl_cache_grow_work() function to reclaim inodes
allows for two unlikely deadlocks. Therefore, we clear __GFP_FS
for these allocations. The two deadlocks are:
* While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function
calls kmem_cache_alloc() which happens to need to allocate a
new slab. To allocate the new slab we enter FS level reclaim
and attempt to evict several inodes. To evict these inodes we
need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it
just happens that obj1 and obj2 use the same hashed lock.
* Similar to the first case however instead of getting blocked
on the hash lock we block in txg_wait_open() which is waiting
for the next txg which isn't coming because the txg_sync
thread is blocked in kmem_cache_alloc().
Note this isn't a 100% fix because vmalloc() won't strictly
honor __GFP_FS. However, it practice this is sufficient because
several very unlikely things must all occur concurrently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1101
If we are reaping from the cache and a concurrent allocation
occurs then the caller must block until the reaping is complete.
This is signaled by the clearing of the KMC_BIT_REAPING bit.
Otherwise the caller will be in a tight loop which takes and
releases the skc->skc_cache lock. When there are multiple
concurrent callers the system will thrash on the lock and
appear to lock up.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Because only virtual slabs may have emergency objects and these
objects are guaranteed to have physical addresses. It can be
easily determined if the passed object is a virtual slab object
or an emergency object. This allows us to completely optimize
the emergency object free case out of the common free path.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the initial implementation emergency objects were tracked on a
per-cache list. The assumption was that under normal operation we
would never allocate more than a handful of these objects. So the
cost of walking the list during free was expected to be negligible.
However real world usage has shown that emergency objects tend to
be allocated in batches. A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.
Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimize contention on the assoicated spin lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks. In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.
With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time. Once a cache is
is marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects. This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.
The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them latter in kmem_cache_free().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Disable this test because it may result in an OOM event on the
system which can result in the test infrastructure being killed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The Fedora 3.6 debug kernel identified the following issue where
we create a thread under a spin lock. This isn't safe because
sleeping could result in a deadlock. Therefore the lock is changed
to a mutex so it's safe to sleep.
BUG: sleeping function called from invalid context at mm/slub.c:930
in_atomic(): 1, irqs_disabled(): 0, pid: 10583, name: splat
1 lock held by splat/10583:
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The Fedora 3.6 debug kernel identified the following issue where
we call copy_to_user() under a spin lock(). This used to be safe
in older kernels but no longer appears to be true so the spin
lock was changed to a mutex. None of this code is performance
critical so allowing the process to sleep is harmless.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Restructure the the SPLAT headers such that each test only
includes the minimal set of headers it requires.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reference count every entry and exit from the condition variable
functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast().
This allows us to safely block in cv_destroy() until all consumers
have been scheduled and are no longer accessing the condition
variable memory.
In addition poison the magic value at the start of cv_destroy() to
ensure there are never any new callers after cv_destroy() is called.
The consumer is responsible for ensuring this never occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a new kstat type for tracking useful statistics about a TXG.
The new KSTAT_TYPE_TXG type can be used to tracks the following
statistics per-txg.
txg - Unique txg number
state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
birth; - Creation time
nread - Bytes read
nwritten; - Bytes written
reads - IOPs read
writes - IOPs write
open_time; - Length in nanoseconds the txg was open
quiesce_time - Length in nanoseconds the txg was quiescing
sync_time; - Length in nanoseconds the txg was syncing
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Move the kstat ks_update() callback under the ks_lock. This
enables dynamically sized kstats without modification to the
kstat API.
* Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
* Register a ->ks_update() callback which does:
o Frees any existing ks_data buffer.
o Set ks_data_size to the kstat array size.
o Set ks_data to an allocated buffer of size ks_data_size
o Populate the array of buffers with the required data.
The buffer allocated in the ks_update() callback is guaranteed
to remain allocated and valid while the proc sequence handler
iterates over the buffer. The lock will not be dropped until
kstat_seq_stop() function is run making it safe for concurrent
access. To allow the ks_update() callback to perform memory
allocations the lock was changed to a mutex.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec()
function out of include/linux/fdtable.h and in to fs/file.c
making it unavailable to the SPL.
Now as it turns out we only used this function to tear down
some test infrastructure for the vn_getf()/vn_releasef() SPLAT
regression tests. Rather than implement even more autoconf
compatibilty code to handle this we just remove the test case.
This also allows us to drop three existing autoconf tests.
This does mean the SPLAT tests will no longer verify these
functions but historically they have never been a problem.
And if we feel we absolutely need this test coverage I'm
sure a more portable version of the test case could be added.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#183
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.
This is good news for the vn implementation because it removes the
need for us to handle the locking. However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.
Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels. This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.
Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen util we either abondon the zpool.cache file
or implement alternate infrastructure to update is correctly in
user space.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#154
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.
This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.
This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.
On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().
This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#168
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system. To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs. This will prevent them from attempting
to initiate any I/O during direct reclaim.
This change was originally part of cd5ca4b but was reverted by
330fe01. It always should have had its own commit for exactly
this reason.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP. Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags. When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.
Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts. Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast(). We had some troubles with this previously but
there may still be a small race, see commit d599e4f.
The following patch closes one small race and improves the ASSERTs
such that they log the offending value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
The workspace required by zlib to perform compression is roughly
512MB (order-7). These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.
It is far preferable to asynchronously vmalloc an additional slab
in case it's needed. Then simply block waiting for an existing
object to be released or for the new slab to be allocated.
This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
Provide a flag to disable the use of emergency objects for a
specific kmem cache. There may be instances where under no
circumstances should you kmalloc() an emergency object. For
example, when you cache contains very large objects (>128k).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When various kernel debuging options are enabled this allocation
may be larger than usual as shown by the following warning. It
is in no way harmful so we suppress the warning.
SPL: large kmem_alloc(40960, 0x80d0) at
tsd_hash_table_init:358 (76495/76495)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#93
After the emergency slab objects were merged I started observing
timeout failures in the kmem:slab_overcommit test. These were
due to the ineffecient way the slab_overcommit reclaim function
was implemented. And due to the additional cost of potentially
allocating ten of thousands of emergency objects and tracking
them on a single list.
This patch addresses the first concern by enhansing the test
case to trace all of the allocations objects as a linked list.
This allows for a cleaner version of the reclaim function to
simply release SPLAT_KMEM_OBJ_RECLAIM objects.
Since this touches some common code all the tests which share
these data structions were also updated. After making these
changes slab_overcommit is reliably passing. However, there
is certainly additional cleanup which could be done here.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system. To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs. This will prevent them from attempting
to initiate any I/O during direct reclaim.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 2092cf68d8. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit b8b6e4c453. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address. Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 372c257233. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs. The issue is that the Linux kernel does
not honor the flags passed to __vmalloc(). This makes it unsafe
to use in a writeback context. Unfortunately, this is a use case
ZFS depends on for correct operation.
Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.
While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code. This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.
This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue. This doesn't prevent the posibility
of the worker thread from deadlocking. However, the caller can now
safely block on a wait queue for the slab allocation to complete.
Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,. The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.
However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out. When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.
As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.
These emergency allocations will likely be slow since they require
contiguous pages. However, their use should be rare so the impact
is expected to be minimal. If that turns out not to be the case in
practice further optimizations are possible.
One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed. Is they accumulate on a system and the list
grows freeing objects will become more expensive. This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.
Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented. See issue https://github.com/zfsonlinux/spl/issues/26
for full details. This work would also help reduce ZFS's memory
fragmentation problems.
The /proc/spl/kmem/slab file has had two new columns added at the
end. The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case. These value should give us a good idea
of how often these objects are needed. Based on these values under
real use cases we can tune the default behavior.
Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock. This should manifest itself as reduced cpu usage for
the system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.
The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
When building SPL support into the kernel, ./copy-builtin will copy
non-toplevel .gitignore files. These files list /Makefile, which causes
git-archive to omit ./module/{spl,splat}/Makefile. The absence of these
files result in build failures when SPL is selected. ZFS is unaffected
because it puts Makefile in the toplevel .gitignore, which is not
copied. We fix SPL by emulating that behavior.
Reported-by: Fabio Erculiani <lxnay@gentoo.org>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#152
To properly support CONFIG_PREEMPT enabled kernels, we must refrain from
using a CPU index when preemption is enabled. As a result, this change
moves the trace_set_debug_header call (which calls smp_processor_id)
within trace_get_tcd and trace_put_tcd (which disable and enable
preemption respectively).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#160
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Currently, the SPL tries to determine the hostid at module load. The
hostid is usually determined by running the userland program "hostid"
during module initialization.
Unfortunately, when the module initializes, it may be way too soon to be
able to run any userland programs. This is especially true when the
module is compiled directly inside the kernel (built-in); in that case,
the SPL would try to run hostid when the kernel is still initializing,
which of course is doomed to fail.
This patch fixes the issue by deferring hostid generation until
something actually needs the hostid (that is, when zone_get_hostid() is
called), thus switching to a "on-initialization" model to a "on-demand"
(lazy loading) model. ZFS only needs the hostid when some pool
operations are requested, and this always happens way after the kernel
has finished initialization, thus solving the problem.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building SPL as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the spl source package.
To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the spl
source package.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
Commit 3160d4f56b changed the set of
conditions under which spl_mutex_spin_max would be implemented as a
function by changing an #if in sys/mutex.h. The corresponding
implementation file spl-mutex.c, however, has not been updated to
reflect the change. This results in undefined reference errors on
spl_mutex_spin_max under the following condition:
((!CONFIG_SMP || CONFIG_DEBUG_MUTEXES) && HAVE_MUTEX_OWNER && HAVE_TASK_CURR)
This patch fixes the issue by using the same #if in sys/mutex.h and
spl-mutex.c.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
In zfs, each module Makefile contains a MODULE variable which contains
the name of the module, and the following declarations reference this
variable.
In spl, there is a MODULES variable which is never used. Rename it to
MODULE and use it like in zfs. This improves consistency between the two
build systems.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
Explicitly cast the sizeof in hostid_read() to prevent the
following compiler warning on 32-bit systems.
module/spl/spl-generic.c:490:10: error: format '%lu' expects
argument of type 'long unsigned int', but argument 4 has type
'unsigned int' [-Werror=format]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/spl@2092cf68d8 used
PF_MEMALLOC to workaround a bug in the Linux kernel where
allocations did not honor the gfp flags passed to vmalloc().
Unfortunately, PF_MEMALLOC has the side effect of permitting
allocations to allocate pages outside of ZONE_NORMAL. This
has been observed to result in the depletion of ZONE_DMA32.
A kernel patch is available in the Gentoo bug tracker for
this issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
This negates any benefit PF_MEMALLOC provides, so we introduce
an autotools check to disable the use of PF_MEMALLOC on
systems with patched kernels.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#126
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#131
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#119
torvalds/linux@1dce27c5aa introduced
__clear_close_on_exec() as a replacement for FD_CLR. Further commits
appear to have removed FD_CLR from the Linux source tree. This
causes the following failure:
error: implicit declaration of function '__FD_CLR'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current
__clear_close_on_exec() interface for readability. Then we introduce
an autotools check to determine if __clear_close_on_exec() is available.
If it isn't then we define some compatibility logic which used the older
FD_CLR() interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#124
Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test
as potentially unitialized. In practice this will never occur but
to keep gcc happy we initialize the variable to zero.
Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>
In the module unload path the vm_file_cache was being destroyed
under a spin lock. Because this operation might sleep it was
possible, although very very unlikely, that this could result
in a deadlock.
This issue was indentified by using a Linux debug kernel and
has been fixed by moving the kmem_cache_destroy() out from under
the spin lock. There is no need to lock this operation here.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#771
Correctly implementating 64-bit division for ARM requires more than
just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols.
They are need to be implemented is such a way that the quotient and
remainder and left in specific registers after the division operation
completes. This change updates the wrapper functions to accomplish
this according to the official ARM Run-time ABI.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#706
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straight
forward and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#109
This test is designed to verify that direct reclaim is functioning as
expected. We allocate a large number of objects thus creating a large
number of slabs. We then apply memory pressure and expect that the
direct reclaim path can easily recover those slabs. The registered
reclaim function will free the objects and the slab shrinker will call
it repeatedly until at least a single slab can be freed.
Note it may not be possible to reclaim every last slab via direct reclaim
without a failure because the shrinker_rwsem may be contended. For this
reason, quickly reclaiming 3/4 of the slabs is considered a success.
This should all be possible within 10 seconds. For reference, on a
system with 2G of memory this test takes roughly 0.2 seconds to run.
It may take longer on larger memory systems but should still easily
complete in the alloted 10 seconds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
To minimize the chance of triggering an OOM during direct reclaim.
The kmem caches have been improved to make a best effort to reclaim
at least one slab when a reclaim function is registered. This helps
avoid the case where objects are released but they are spread over
multiple slabs so no memory gets reclaimed.
Care has been taken to avoid deadlocking if the reclaim function
is unable to make forward progress. Additionally, the reclaim
function may be skipped entirely if there are already free slabs
which can be safely reaped.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made. Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations. However, since we are using none
of that infrastructure we're responsible for incrementing this
count.
If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked. If there was forward
progress additional reclaim will be attempted via the page
cache and registerd shrinker until the allocation succeeds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107