Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes#474
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#472
This reverts commit 076821e due to a locking issue uncovered in
subsequent testing. An ASSERT is hit due to tq->tq_nspawn being
updated outside the lock. The patch will need to be reworked.
VERIFY3(0 == tq->tq_nspawn) failed (0 == -1)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #472
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#472
Starting from Linux 4.1, bio_vec will be allowed to pass into filesystem via
iter_read/iter_write, so we add a bio_vec field in uio_t to hold it, and use
UIO_BVEC in segflg to determine which "vec".
Also, to be consistent to newer kernel, we make iovec and bio_vec immutable,
and make uio act as an iterator with the new uio_skip field indicating number
of bytes to skip in the first segment.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3511
Issue zfsonlinux/zfs#3640
Closes#468
Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error. Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface. As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.
vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3653
Prevents ARC collapse when non-ZFS filesystems, the block layer or other
memory consumers use a lot of reclaimable memory in the page cache.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#3680
Closes#471
This patch reverts 77ab5dd. This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places. Those places have subsequently been updated to use
the Linux equivalent Linux functionality.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3637
On Linux the meaning of a processes priority is inverted with respect
to illumos. High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
Note this change also reverts some of the priorities changes in prior
commit 62aa81a. The rational is as follows:
spl_kmem_cache - High priority may result in blocked memory allocs
spl_system_taskq - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607
As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy. They are not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.
Removing this code is desirable because there is some evidence that
contention on this atomic negative impacts performance on large-scale
NUMA systems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority. Neither the minclsyspri or maxclsyspri map
to the default Linux kernel thread priority. This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.
All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri. The vast majority of callers were
part of the test suite which won't have an external impact. The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri. This makes it more likely the process will be scheduled
which may help performance.
To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added. When disabled (0) all newly created kernel
threads will use the default kernel thread priority. When enabled (1)
the specified taskq priority will be used. By default this value is
enabled (1).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads with kmem
debugging is:
Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,
Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issues #463
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082
The function vmem_qcache_reap() and global variables 'needfree',
'desfree', and 'lotsfree' are all used in the upstream. While
these variables have no meaning under Linux they're being defined
as 0's to avoid needing to make additional changes to the ARC code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs
to reduce the number of idle threads on the system. Additional
threads will be created on demand up to the previous maximum
thread counts. This should have minimal, if any, impact on
performance.
This makes the system taskq consistent with illumos which is
always created as a dynamic taskq with up to 64 threads.
The task limits for the kmem_cache have been increased to avoid
any unnessisary throttling and to keep a larger reserve of
task_t structures on the free list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#458
Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics. Initially only a single worker thread will be created
to service tasks dispatched to the queue. As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'. When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.
Due to the low cost of creating and destroying threads under Linux
by default new threads and spawned and terminated aggressively.
There are two modules options which can be tuned to adjust this
behavior if needed.
* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which needed to be handled by a worker
thread before a new worker thread is spawned. Default 4.
* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system. This is provided
for the purposes of debugging and troubleshooting. Default 1
(enabled).
This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#458
Added for upstream compatibility, they are of the form:
* IMPLY(a, b) - if (a) then (b)
* EQUIV(a, b) - if (a) then (b) *AND* if (b) then (a)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit f752b46e added the cv_wait_interruptible() function to allow
condition variables to be woken by signals. This function and its
timed wait counterpart should have been named cv_wait_sig() to match
the illumos interface which provides the same functionality.
This patch renames the symbol but leaves a #define compatibility
wrapper in place until the ZFS code can be moved to the correct
name.
This patch also makes a small number of cosmetic changes to make
the condvar source and header cstyle clean.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#456
Stock Linux 2.6.32 and earlier kernels contained a broken version of
rwsem_is_locked() which could return an incorrect value. Because of
this compatibility code was added to detect the broken implementation
and replace it with our own if needed.
The fix for this issue was merged in to the mainline Linux kernel as
of 2.6.33 and the major enterprise distributions based on 2.6.32 have
all backported the fix. Therefore there is no longer a need to carry
this code and it can be removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#454
Under Illumos taskq_wait() returns when there are no more tasks
in the queue. This behavior differs from ZoL and FreeBSD where
taskq_wait() returns when all the tasks in the queue at the
beginning of the taskq_wait() call are complete. New tasks
added whilst taskq_wait() is running will be ignored.
This difference in semantics makes it possible that new subtle
issues could be introduced when porting changes from Illumos.
To avoid that possibility the taskq_wait() function is being
updated such that it blocks until the queue in empty.
The previous behavior remains available through the
taskq_wait_outstanding() interface. Note that this function
was previously called taskq_wait_all() but has been renamed
to avoid confusion.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#455
This patch only addresses the issues identified by the style checker
in spl-tsd.c. It contains no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To prevent leaking tsd entries, we make tsd_set(key, NULL) remove the tsd
entry for the current thread. This is alright since tsd_get() returns NULL
when the entry doesn't exist.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#443
C type coercion rules require that negative numbers be converted into
positive numbers via wraparound such that a negative -1 becomes a
positive 1. This causes vn_getf to return a file handle when it should
return NULL whenever a positive file descriptor existed with the same
value. We should check for a negative file descriptor and return NULL
instead.
This was caught by ClusterHQ's unit testing.
Reference:
http://stackoverflow.com/questions/50605/signed-to-unsigned-conversion-in-c-is-it-always-safe
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#450
When layered on XFS the following warning will be emitted under CentOS7
when entering vfs_fsync() with PF_FSTRANS already set. This is not an
issue for other stock Linux file systems and the warning was removed
for newer kernels. However, to avoid triggering this error PF_FSTRANS
is cleared and then reset in vn_fsync().
WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0
Call Trace:
[<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
[<ffffffffa01706fb>] xfs_vm_writepage+0x5ab/0x5c0 [xfs]
[<ffffffff8114b833>] __writepage+0x13/0x50
[<ffffffff8114c341>] write_cache_pages+0x251/0x4d0
[<ffffffff8114c60d>] generic_writepages+0x4d/0x80
[<ffffffffa016fc93>] xfs_vm_writepages+0x43/0x50 [xfs]
[<ffffffff8114d68e>] do_writepages+0x1e/0x40
[<ffffffff81142bd5>] __filemap_fdatawrite_range+0x65/0x80
[<ffffffff81142cea>] filemap_write_and_wait_range+0x2a/0x70
[<ffffffffa017a5b6>] xfs_file_fsync+0x66/0x1f0 [xfs]
[<ffffffff811df54b>] vfs_fsync+0x2b/0x40
[<ffffffffa03a88bd>] vn_fsync+0x2d/0x90 [spl]
[<ffffffffa0520c33>] spa_config_sync+0x503/0x680 [zfs]
[<ffffffffa0520ee4>] spa_config_update+0x134/0x170 [zfs]
[<ffffffffa0520eba>] spa_config_update+0x10a/0x170 [zfs]
[<ffffffffa051c54f>] spa_import+0x5bf/0x7b0 [zfs]
[<ffffffffa055c754>] zfs_ioc_pool_import+0x104/0x150 [zfs]
[<ffffffffa056294f>] zfsdev_ioctl+0x4cf/0x5c0 [zfs]
[<ffffffffa0562480>] ? pool_status_check+0xf0/0xf0 [zfs]
[<ffffffff811c2c85>] do_vfs_ioctl+0x2e5/0x4c0
[<ffffffff811c2f01>] SyS_ioctl+0xa1/0xc0
[<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Avoid deadlocks when entering the shrinker from a PF_FSTRANS context.
This patch also reverts commit d0d5dd7 which added MUTEX_FSTRANS. Its
use has been deprecated within ZFS as it was an ineffective mechanism
to eliminate deadlocks. Among other things, it introduced the need for
strict ordering of mutex locking and unlocking in order that the
PF_FSTRANS flag wouldn't set incorrectly.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#446
Provide a Redhat specific spl-kmod.spec file which uses the old style
kmods (not kmods2) packaging. By using the provided kmodtool script
packages can be built which support weak modules. This allows for the
kernel to be updated without having to rebuild the SPL kernel modules.
Packages for RHEL/Centos/SL/TOSS which use this spec file can by built
as follows:
$ ./configure --with-spec=redhat
$ make rpms
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Originally it was thought that custom spec files might be required
for Fedora. Happily that has turns out not to be the case. Since
this directory just contains symlinks to the generic spec files it
can be removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of automake 1.14.2, currently shipped with Ubuntu 14.04, automake
warns about AM_INIT_AUTOMAKE having more than one argument:
configure.ac:41: warning: AM_INIT_AUTOMAKE: two- and three-arguments forms are deprecated. For more info, see:
configure.ac:41: http://www.gnu.org/software/automake/manual/automake.html#Modernize-AM_005fINIT_005fAUTOMAKE-invocation
This commit fixes the warnings by following above link's advice, so
AM_INIT gets called with the package's name and version. As both are
defined in the META file we're parsing it with `grep`, `cut` and `tr`.
NOTE: autoconf < 1.14 not supporting m4_esyscmd_s so m4_esyscmd was
used and modified `tr` to truncate newlines, too.
Signed-off-by: Hajo M<C3><B6>ller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#438
If kernel lock debugging is enabled, the fs_struct structure exceeds the
typical 1024 byte limit of CONFIG_FRAME_WARN and isn't enabled when it
otherwise should be.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#440
Also add support for the "name" parameter in mutex_init(). The name
allows for better diagnostics, namely in /proc/lock_stats when
lock debugging is enabled. Nested mutexes are necessary to support
CONFIG_PROVE_LOCKING. ZoL can use mutex_enter_nested()'s "class" argument
to to convey the locking hierarchy.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#439
Slightly increasing the size of a kmutex_t has caused us to exceed
the stack frame warning size in splat_taskq_test2_impl(). To address
this the tq_args have been moved to the heap.
cc1: warnings being treated as errors
spl-0.6.3/module/splat/splat-taskq.c:358:
error: the frame size of 1040 bytes is larger than 1024 bytes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held. The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.
1) It would require changes to a significant amount of the ZFS
code. This would complicate applying patches from upstream.
2) It would be easy to accidentally miss a critical region in
the initial patch or to have an future change introduce a
new one.
Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit 9099312 - Merge branch 'kmem-rework'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
To minimize the size of a kmutex_t a MUTEX_OWNER check was added.
It allowed the kmutex_t wrapper to leverage the mutex owner which was
already stored in the mutex for certain kernel configurations.
The upside to this was that it reduced the size of the kmutex_t wrapper
structure by the size of a task_struct pointer (4/8 bytes). The
downside was that two mutex implementations needed to be maintained.
Depending on your exact kernel configuration the correct one would
be selected.
Over the years this solution worked but it could be fragile since it
depending heavily on assumed kernel mutex implementation details. For
example the SPL_AC_MUTEX_OWNER_TASK_STRUCT configure check needed to
be added when the kernel changed how the owner was stored. It also
made the code more complicated than it needed to be.
Therefore, in the name of simplicity and portability this optimization
is being retired. It will slightly increase the memory requirements
for a kmutex_t but only very slightly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
This patch only addresses the issues identified by the style checker
in mutex.h. It contains no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup. This was done to abstract
away any compatibility code which might be needed for the SPL.
As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#2985
Currently, spl_hostid module parameter doesn't do anything, because it will
always be overwritten when calling into hostid_read().
Instead, we should only call into hostid_read() when spl_hostid is not zero,
just as the comment describes.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#427
For performance reasons the reworked kmem code maps vmem_alloc() to
kmalloc_node() for allocations less than spa_kmem_alloc_max. This
allows for more concurrency in the system and less contention of
the virtual address space. Generally, this is a good thing.
However, in the case when the kmalloc_node() fails it makes little
sense to retry it using kmalloc_node() again. It will likely fail
in exactly the same way. A smarter strategy is to abandon this
optimization and retry using spl_vmalloc() which is very likely
to succeed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#428
The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and vn_openat()
functions should all use the kmem_flags_convert() function to generate
the GFP_* flags. This ensures that they can be safely called in any
context and the correct flags will be used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#426
The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms. This simplifies the process of porting changes to
Linux from other platforms. This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.
The key reason this is now possible is due to the addition of
Linux features such as the thread-specific PF_FSTRANS bit which
was introduced for XFS.
This patch stack also performs some refactoring and cleanup
designed to make the code more maintainable and understandable.
Finally, in the context of making and testing these changes
several bugs were identified and resolved resulting in a
more robust implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#414
The __get_free_pages() function must be used in place of kmalloc()
to ensure the __GFP_COMP is strictly honored. This is due to
kmalloc() being layered on the generic Linux slab caches. It
wasn't until recently that all caches were created using __GFP_COMP.
This means that it is possible for a kmalloc() which passed the
__GFP_COMP flag to be returned a non-compound allocation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation. This is
done because large slabs must be allocated using vmalloc(). It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored. This can result in a deadlock.
Therefore, a deadlock detection strategy was implemented to deal with
this case. When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab. Then all callers
attempting to allocate from the cache fall back to using kmalloc()
which does honor all passed flags.
This logic was correct but an optimization in the code allowed for a
deadlock. Because only slabs backed by vmalloc() can deadlock in the
way described above. An optimization was made to only invoke this
deadlock detection code for vmalloc() backed caches. This had the
advantage of making it easy to distinguish these objects when they
were freed.
But this isn't strictly safe. If all the spl_kmem_cache threads end
up deadlocked than we can't grow any of the other caches either. This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.
The fix here is to remove the optimization which limits this fall back
allocation stratagy to vmalloc() backed caches. Doing this means we
may need to take the cache lock in spl_kmem_cache_free() call path.
But this small cost can be mitigated by ignoring objects with virtual
addresses.
For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4, and made tunable. This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked. However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread doesn't block allocations from
other caches until the timeout is reached.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is designed to improve the memory utilization of
slabs by more carefully setting their size. The way the code
currently works is problematic for slabs which contain large
objects (>1MB). This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.
The reason the existing code rounds up every slab is because it
assumes it will backed by the buddy allocator. Since the buddy
allocator can only performs power of two allocations this is
desirable because it avoids wasting any space. However, this
logic breaks down if slab is backed by vmalloc() which operates
at a page level granularity. In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).
Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.
This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches. But depending on the workload it may result in slabs
being allocated and freed more frequently. In practice, this
has been shown to be a better default for most workloads.
Also the maximum slab size has been reduced to 4MB on 32-bit
systems. Due to the limited virtual address space it's critical
the we be as frugal as possible. A limit of 4M still lets us
reasonably comfortably allocate a limited number of 1MB objects.
Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments. Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10. The reduced value is still several orders of
magnitude large enough to avoid being triggered incorrectly. By
reducing it we allow the system to resolve the issue more quickly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The spl-module-parameters(5) was not kept up to date. Refresh
the man page so that it lists all the possible module options,
describes what the do, and justify why the default values are
set they way the are.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many people have noticed that the kmem cache implementation is slow
to release its memory. This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty. Unused
objects which are cached in the magazines will still prevent a slab
from being freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>