These kstat interfaces are required to port
"Illumos #3537 want pool io kstats" to ZFS on Linux.
kstat_waitq_enter()
kstat_waitq_exit()
kstat_runq_enter()
kstat_runq_exit()
Additionally, zero out the ks_data buffer in __kstat_create() so
that the kstat_io_t counters are initialized to zero.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
While porting Illumos #3537 I found that ks_lock member of kstat_t
structure is different between Illumos and SPL. It is a pointer to
the kmutex_t in Illumos, but the mutex lock itself in SPL.
Apparently Illumos kstat API allows consumer to override the lock
if required. With SPL implementation it is not possible anymore.
Things were alright until the first attempt to actually override
the lock. Porting of Illumos #3537 introduced such code for the
first time.
In order to provide the Solaris/Illumos like functionality we:
1. convert ks_lock to "kmutex_t *ks_lock"
2. create a new field "kmutex_t ks_private_lock"
3. On kstat_create() ks_lock = &ks_private_lock
Thus if consumer doesn't care we still have our internal lock in use.
If, however, consumer does care she has a chance to set ks_lock to
anything else before calling kstat_install().
The rest of the code will use ks_lock regardless of its origin.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
This reverts commit dba79fcbf2 in
favor of using the generic KSTAT_TYPE_RAW callbacks. The advantage
of this approach is that arbitrary types can be added without the
need to add them to the SPL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
This change adds simple wrappers for accessing a thread's PID and
command character string.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
The current implementation for displaying kstats of type KSTAT_TYPE_RAW
is rather crude. This patch attempts to enhance this handling by
allowing a kstat user to register formatting callbacks which can
optionally be used.
The callbacks allow the user to implement functions for interpreting
their data and transposing it into a character buffer. This buffer,
containing a string representation of the raw data, is then be displayed
through the current /proc textual interface.
Additionally the kstats are made writable because it's now possible
to provide a useful handler via the existing ks_update() interface.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
This is needed for the Illumos #4045 write throttle patch. It is used
in the arc eviction code to avoid blocking all arc activity by sitting on
arcs_mtx too long.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.
We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.
Original-patch-by: DHE
Rewrite-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#260
num_physpages was removed by
torvalds/linux@cfa11e08ed, so lets replace
it with totalram_pages.
This is a bug fix as much as it is a compatibility fix because
num_physpages did not reflect the number of pages actually available to
the kernel:
http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html
Also, there are known issues with memory calculations when ZFS is in a
Xen dom0. There is a chance that using totalram_pages could resolve
them. This conjecture is untested at the time of writing.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#273
Linux kernel commit torvalds/linux#59d8053f moved the definition of
struct proc_dir_entry from include/linux/proc_fs.h to the private
header fs/proc/internal.h. The SPL relied on that to map Solaris'
kstat to entries in /proc/spl/kstat.
Since the proc_dir_entry structure is now private the only safe
thing to do is wrap the opaque proc handle with our own structure.
This actually ends up simplify the code and is good because it
moves us away from depending on implementation details of /proc.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
Linux kernel commmit torvalds/linux@db3808c1 moved the
vmalloc_info structure from a private to a public header.
Now that it's available for kernel modules use it.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
Ensure the value is cast to a 'long long' for printing purposes. The
expectation is that ASSERT0/VERIFY0 are mostly used for validating
return values and thus may commonly be negative.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #246
The Illumos code introduced the ASSERT0 and VERIFY0 macros which
are to be used instead of ASSERT3S(x, ==, 0) and VERIFY3S(x, ==, 0).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Madhav Suresh <madhav.suresh@delphix.com>
Closes#246
Somewhat amazingly it went unnoticed that the delay() function
doesn't actually cause the task to block. Since the task state
is never changed from TASK_RUNNING before schedule_timeout() the
scheduler allows to task to continue running without any delay.
Using schedule_timeout_interruptible() resolves the issue by
correctly setting TASK_UNINTERRUPTIBLE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add wrappers for the Solaris MSEC_TO_TICK, USEC_TO_TICK, and
NSEC_TO_TICK conversion functions. They are mapped directly to
their Linux counterparts with the exception of NSEC_TO_TICK
can cannot use usecs_to_jiffies() because it is not exported
by the kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Install the common spl kernel development headers under
/usr/src/spl-<version>/ rather than in a kernel specific
directory. The kernel specific build products such as
spl_config.h and Modules.symvers are left installed under
/usr/src/spl-<version>/<kernel>.
This was done to be consistent with where dkms expects
kernel module source to be packaged. It also allows for
a common spl-kmod-devel package which includes the headers,
and per-kernel spl-kmod-devel-<kernel> packages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Linux 3.9 reorganized sched.h, splitting it into numerous files.
torvalds/linux@8bd75c77b7 moved MAX_PRIO
and MAX_RT_PRIO to linux/sched/rt.h.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Rather than use a custom install target it is cleaner to define
a 'kerneldir' and set 'kernel_HEADERS' appropriately. This
allows us to leverage the standing configure install support.
Additionally, I took this opertunity add the missing make files
to the include subdirectories.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The default permissions used by install are 755. Since this
file isn't executable 644 is more appropriate.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The new lz4 compression algorithm, zfsonlinux/zfs@9759c60, requires
the generic BE_IN16 and BE_IN32 functions. These are added to the SPL
for other consumers to take advantage of.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior. The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache. On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.
This behavior is now configurable through the 'spl_kmem_cache_expire'
module option. The value is a bit mask with the following meaning.
0x1 - Solaris style cache aging eviction is enabled.
0x2 - Linux style low memory eviction is enabled.
Both methods may be safely enabled simultaneously, but by default
both are disabled. It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good. It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes#210
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure. Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().
Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.
To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Linux when a task is waiting on I/O it should call the
io_schedule() function for proper accounting. The Solaris
cv_wait() function provides no way to specify what the cv
is waiting on therefore cv_wait_io() is introduced.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#206
Due to I/O buffering the helper may return successfully before
the proc handler has a chance to execute. To catch this case
wait up to 1 second to verify spl_kallsyms_lookup_name_fn was
updated to a non SYMBOL_POISON value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#699Closeszfsonlinux/zfs#859
All consumers of the kernel delayed work queues have been shifted
over to rely on the taskq implementation. This compatibility code
can now be removed. Any new callers which need this functionality
should use the taskq interfaces for delayed work items.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernels delayed work queue interface
and all the compatibility code it requires.
This code never actually used the delay functionality it was just
done this way to leverage the existing compatibility code. All that
is required is a thread context to perform the allocation in. The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces. This allows us to abandon the kernels
delayed work queue interface and all the compatibility code it
requires.
However, the delayed taskq interface does not allow us to schedule
a task for a specfic cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function. The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu. It then blocks
waiting for them to complete and promptly reclaims any free slabs.
When restructing the code wasn't the primary goal I think the
new code is far more understable and maintainable. It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the ability to dispatch a delayed task to a taskq. The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached. To achieve this two
new functions were added.
* taskq_dispatch_delay() -
This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument. The caller should pass the
desired time the task should be executed as an absolute value in
jiffies. The task is guarenteed not to run before this time, it
may run slightly latter if all the worker threads are busy.
* taskq_cancel_id() -
Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task. There are three possible
return values.
0 - The task was found and canceled before it was executed.
ENOENT - The task was not found, either it was already run or an
invalid task id was supplied by the caller.
EBUSY - The task is currently executing any may not be canceled.
This function will block until the task has been completed.
* taskq_wait_all() -
The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior. It is only curreny used by
the splat taskq regression tests.
* taskq_wait_id() -
Historically, the only difference between this function and
taskq_wait() was that you passed the task id. In both functions you
would block until ALL lower task ids which executed. This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.
To better accomidate the delay tasks this function was reimplemnted.
It will now only block until the passed task id has been completed.
This is actually a fairly low risk change for a few reasons.
* Only new ZFS callers will make use of the new interfaces and
very little common code was changed to support the new functions.
* The existing taskq_wait() implementation was not changed just
slightly refactored.
* The newly optimized taskq_wait_id() implementation was never
used by ZFS we can't accidentally introduce a new bug there.
NOTE: This functionality does not exist in the Illumos taskqs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the taskq implementation was originally written I wrapped all
the API functions in #define's. This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.
However, in practice the taskq symbols never conflicted. The only
major conflicts occured with the kmem cache API. Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the taskq implementation to conform with the style used
throughout the rest of the code. There are no functional
changes in this commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the initial implementation emergency objects were tracked on a
per-cache list. The assumption was that under normal operation we
would never allocate more than a handful of these objects. So the
cost of walking the list during free was expected to be negligible.
However real world usage has shown that emergency objects tend to
be allocated in batches. A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.
Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimize contention on the assoicated spin lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks. In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.
With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time. Once a cache is
is marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects. This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.
The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them latter in kmem_cache_free().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Restructure the the SPLAT headers such that each test only
includes the minimal set of headers it requires.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reference count every entry and exit from the condition variable
functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast().
This allows us to safely block in cv_destroy() until all consumers
have been scheduled and are no longer accessing the condition
variable memory.
In addition poison the magic value at the start of cv_destroy() to
ensure there are never any new callers after cv_destroy() is called.
The consumer is responsible for ensuring this never occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a new kstat type for tracking useful statistics about a TXG.
The new KSTAT_TYPE_TXG type can be used to tracks the following
statistics per-txg.
txg - Unique txg number
state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
birth; - Creation time
nread - Bytes read
nwritten; - Bytes written
reads - IOPs read
writes - IOPs write
open_time; - Length in nanoseconds the txg was open
quiesce_time - Length in nanoseconds the txg was quiescing
sync_time; - Length in nanoseconds the txg was syncing
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Move the kstat ks_update() callback under the ks_lock. This
enables dynamically sized kstats without modification to the
kstat API.
* Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
* Register a ->ks_update() callback which does:
o Frees any existing ks_data buffer.
o Set ks_data_size to the kstat array size.
o Set ks_data to an allocated buffer of size ks_data_size
o Populate the array of buffers with the required data.
The buffer allocated in the ks_update() callback is guaranteed
to remain allocated and valid while the proc sequence handler
iterates over the buffer. The lock will not be dropped until
kstat_seq_stop() function is run making it safe for concurrent
access. To allow the ks_update() callback to perform memory
allocations the lock was changed to a mutex.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec()
function out of include/linux/fdtable.h and in to fs/file.c
making it unavailable to the SPL.
Now as it turns out we only used this function to tear down
some test infrastructure for the vn_getf()/vn_releasef() SPLAT
regression tests. Rather than implement even more autoconf
compatibilty code to handle this we just remove the test case.
This also allows us to drop three existing autoconf tests.
This does mean the SPLAT tests will no longer verify these
functions but historically they have never been a problem.
And if we feel we absolutely need this test coverage I'm
sure a more portable version of the test case could be added.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#183
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.
This is good news for the vn implementation because it removes the
need for us to handle the locking. However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.
Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels. This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.
Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen util we either abondon the zpool.cache file
or implement alternate infrastructure to update is correctly in
user space.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#154
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.
This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.
This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.
On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().
This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#168
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP. Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags. When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.
Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts. Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Provide a flag to disable the use of emergency objects for a
specific kmem cache. There may be instances where under no
circumstances should you kmalloc() an emergency object. For
example, when you cache contains very large objects (>128k).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
See dechamps/zfs@cc6cd40ad7 for details.
This harmless addition was merged to simplify testing the ZFS TRIM
support patches.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#167
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system. To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs. This will prevent them from attempting
to initiate any I/O during direct reclaim.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Generate an assertion if we're going to deadlock the system by
attempting to acquire a mutex the process is already holding.
There are currently no known instances of this under normal
operation, but it _might_ be possible when using a ZVOL as a
swap device. I want to ensure we catch this immediately if it
were to occur.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
PF_NOFS is a per-process debug flag which is set in current->flags to
detect when a process is performing an unsafe allocation. All tasks
with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because
if they enter direct reclaim and initiate I/O they may deadlock.
When debugging is disabled, any incorrect usage will be detected and
a call stack with a warning will be printed to the console. The flags
will then be automatically corrected to allow for safe execution. If
debugging is enabled this will be treated as a fatal condition.
To avoid any risk of conflicting with the existing PF_ flags. The
PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit. Only when
CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused,
will the PF_NOFS bit be valid. Happily, most existing distributions
ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 372c257233. The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks. Those issues are believed to be resolved so this
workaround can be safely reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs. The issue is that the Linux kernel does
not honor the flags passed to __vmalloc(). This makes it unsafe
to use in a writeback context. Unfortunately, this is a use case
ZFS depends on for correct operation.
Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.
https://bugs.gentoo.org/show_bug.cgi?id=416685
However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.
While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code. This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.
This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue. This doesn't prevent the posibility
of the worker thread from deadlocking. However, the caller can now
safely block on a wait queue for the slab allocation to complete.
Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,. The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.
However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out. When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.
As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.
These emergency allocations will likely be slow since they require
contiguous pages. However, their use should be rare so the impact
is expected to be minimal. If that turns out not to be the case in
practice further optimizations are possible.
One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed. Is they accumulate on a system and the list
grows freeing objects will become more expensive. This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.
Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented. See issue https://github.com/zfsonlinux/spl/issues/26
for full details. This work would also help reduce ZFS's memory
fragmentation problems.
The /proc/spl/kmem/slab file has had two new columns added at the
end. The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case. These value should give us a good idea
of how often these objects are needed. Based on these values under
real use cases we can tune the default behavior.
Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock. This should manifest itself as reduced cpu usage for
the system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>