The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made. Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations. However, since we are using none
of that infrastructure we're responsible for incrementing this
count.
If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked. If there was forward
progress additional reclaim will be attempted via the page
cache and registerd shrinker until the allocation succeeds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
When memory pressure triggers direct memory reclaim, a slabs age
and delay should not prevent it from being freed. This patch ensures
these values are ignored, allowing an empty slab to be freed in this
code path no matter the value of its age and delay.
This prevents needless scanning of the partial slabs and has been
observed to significantly reduce the total cpu usage. In addition,
it should allow for snappier reclaim under memory pressure.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#102
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.
Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#101
Leverage the existing generic 64-bit division operations which
were originally implemented for x86 to support ARM. All that is
required is to make the symbols available to the linker with the
expected names.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of the removal of the taskq work list made in commit:
commit 2c02b71b14
Author: Prakash Surya <surya1@llnl.gov>
Date: Mon Dec 5 17:32:48 2011 -0800
Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
the comment above taskq_wait_check has been incorrect. This change is an
attempt at bringing that description more in line with the current
implementation. Essentially, references to the old task work list had to
be updated to reference the new taskq thread active list.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
Long ago I added support to the spl for condition variable names
because I thought they might be needed. It turns out they aren't.
In fact the official Solaris cv_init(9F) man page discourages
their use in the kernel.
cv_init(9F)
Parameters
name - Descriptive string. This is obsolete and should be
NULL. (Non-NULL strings are legal, but they're a
waste of kernel memory.)
Therefore, I'm removing them from the spl to reclaim this memory
and adding an ASSERT() to ensure no new consumers are added which
make use of the name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indicate exactly what version of the SPL has
been loaded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Improve the distribution detection by moving the tests for
distribution specific files first. The Ubuntu and Debian
checks are left for last because they are the least likely
to be unique. This is particularly true in the case of Debian
since so many distributions are based on Debian.
Since this is currently only used to identify the correct
packaging method for this system the result in many instances
is simply cosmetic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Allow a source rpm to be rebuilt with debugging enabled. This
avoids the need to have to manually modify the spec file. By
default debugging is still largely disabled. To enable specific
debugging features use the following options with rpmbuild.
'--with debug' - Enables ASSERTs
'--with debug-log' - Enables the internal debug log
'--with debug-kmem' - Enables basic memory accounting
'--with debug-kmem-tracking' - Enables detailed memory tracking
# For example:
$ rpmbuild --rebuild --with debug spl-modules-0.6.0-rc6.src.rpm
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When building the spl with --disable-debug-log the __SDEBUG()
macro and spl_debug_* helper functions were undefined. This
change adds the missing functions so the upper layers compiling
against the spl don't need to be aware of how the spl was built.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the bare minimum functionality to support dynamic kstats. A
complete kstat implementation should be done as part of issue #84.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s. This patch clarifies things
by cleanly breaking the two subsystem apart. The result of this
is the following behavior.
--enable-debug - Enable/disable code wrapped in ASSERT()s.
--disable-debug ASSERT()s are used to check invariants and
are never required for correct operation.
They are disabled by default because they
may impact performance.
--enable-debug-log - Enable/disable the debug log infrastructure.
--disable-debug-log This infrastructure allows the spl code and
its consumer to log messages to an in-kernel
log. The granularity of the logging can be
controlled by a debug mask. By default the
mask disables most debug messages resulting
in a negligible performance impact. Because
of this the debug log is enabled by default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched. The lock hold time
is reduced by the following changes:
1) Use exclusive threads in the work_waitq
When a single work item is dispatched we only need to wake a single
thread to service it. The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.
2) Conditionally add/remove threads from work waitq
Taskq threads need only add themselves to the work wait queue if
there are no pending work items.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
This reverts commit ec2b41049f.
A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there is no pending work items,
leading to deadlock:
1. taksq thread enables interrupts
2. dispatcher thread runs, queues work item, call wake_up()
3. taskq thread runs, adds self to waitq, sleeps
This could easily happen if an interrupt for an IO completion was
outstanding at the point where the taskq thread reenables interrupts,
just before the call to add_wait_queue_exclusive(). The handler would
run immediately within the race window.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
This change updates the rpm spec files to have strictly correct
package dependencies. That means a few things:
* Add a dependency to the spl package for the spl-modules package.
This ensures that when running 'yum install spl' that newest
version of the spl-modules will be installed.
* Remove the redundant distribution release extension. This
is already added once because it is part of the kernel package
release name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the original build system code was added the release
component was accidentally omited from the development header
install path. This patch adds the missing path component so
it's always clear exactly what release your compiling against.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched. The lock hold time
is reduced by the following changes:
1) Use exclusive threads in the work_waitq
When a single work item is dispatched we only need to wake a single
thread to service it. The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.
2) Conditionally add/remove threads from work waitq outside of tq_lock
Taskq threads need only add themselves to the work wait queue if there
are no pending work items. Furthermore, the add and remove function
calls can be made outside of the taskq lock since the wait queues are
protected from concurrent access by their own spinlocks.
3) Call wake_up() outside of tq->tq_lock
Again, the wait queues are protected by their own spinlock, so the
dispatcher functions can drop tq->tq_lock before calling wake_up().
A new splat test taskq:contention was added in a prior commit to measure
the impact of these changes. The following table summarizes the
results using data from the kernel lock profiler.
tq_lock time %diff Wall clock (s) %diff
original: 39117614.10 0 41.72 0
exclusive threads: 31871483.61 18.5 34.2 18.0
unlocked add/rm waitq: 13794303.90 64.7 16.17 61.2
unlocked wake_up(): 1589172.08 95.9 16.61 60.2
Each row reflects the average result over 5 test runs.
/proc/lock_stats was zeroed out before and collected after each run.
Column 1 is the cumulative hold time in microseconds for tq->tq_lock.
The tests are cumulative; each row reflects the code changes of the
previous rows. %diff is calculated with respect to "original" as
100*(orig-new)/orig.
Although calling wake_up() outside of the taskq lock dramatically
reduced the taskq lock hold time, the test actually took slightly more
wall clock time. This is because the point of contention shifts from
the taskq lock to the wait queue lock. But the change still seems
worthwhile since it removes our taskq implementation as a bottleneck,
assuming the small increase in wall clock time to be statistical
noise.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#32
Add a test designed to generate contention on the taskq spinlock by
using a large number of threads (100) to perform a large number (131072)
of trivial work items from a single queue. This simulates conditions
that may occur with the zio free taskq when a 1TB file is removed from a
ZFS filesystem, for example. This test should always pass. Its purpose
is to provide a benchmark to easily measure the effectiveness of taskq
optimizations using statistics from the kernel lock profiler.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
Apply the same fix to SPL that was applied to ZFS earlier at:
zfsonlinux/zfs@d433c20651
Additionally quote @LINUX_SYMBOLS@ because it is a null substitution
in this configuration, which results in a `[ -f ]` expression that
incorrectly evaluates to true.
# ./configure --with-config=user
# make distclean
Making distclean in module
make[1]: Entering directory `/spl/module'
make -C SUBDIRS=`pwd` clean
make: Entering an unknown directory
make: *** SUBDIRS=/spl/module: No such file or directory. Stop.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Unfortunately, Arch's package manager `pacman` shares it's name with a
popular arcade video game. Thus, in order to refrain from executing the
video game when we mean to execute the package manager, SPL_AC_PACMAN is
now only run when $VENDOR is determined to be "arch".
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#517
The wait_lock member of the rw_semaphore struct became a raw_spinlock_t
in Linux 3.2 at torvalds/linux@ddb6c9b58a.
Wrap spin_lock_* function calls in a new spl_rwsem_* interface to
ensure type safety if raw_spinlock_t becomes architecture specific,
and to satisfy these compiler warnings:
warning: passing argument 1 of ‘spinlock_check’
from incompatible pointer type [enabled by default]
note: expected ‘struct spinlock_t *’
but argument is of type ‘struct raw_spinlock_t *’
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #76Closes: zfsonlinux/zfs#463
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check(). In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers. Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#58
If the lsb-release package is installed on an Arch Linux distribution,
the configure step will incorrectly detect the running distribution as
Ubuntu. This is a result of both distributions providing an
/etc/lsb-release file, and the Ubuntu VENDOR check being performed
first.
Since the Arch Linux test check's for a file more specific to the Arch
Linux distribution, moving Arch Linux's VENDOR check above Unbuntu's
check provides a quick and easy solution.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#72
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.
In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).
In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.
Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#71
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.
Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.
Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).
To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on an Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the spl userland utilties and
another for the spl kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be built using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #68
The splat-taskq test functions were slightly modified to exercise
the new taskq interface in addition to the old interface. If the
old interface passes each of its tests, the new interface is
exercised. Both sub tests (old interface and new interface) must
pass for each test as a whole to pass.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#65
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit. It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq. This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.
commit 5aeb94743e3be0c51e86f73096334611ae3a058e
Author: Garrett D'Amore <garrett@nexenta.com>
Date: Wed Jul 27 07:13:44 2011 -0700
734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
Added another splat taskq test to ensure tasks can be recursively
submitted to a single task queue without issue. When the
taskq_dispatch_prealloc() interface is introduced, this use case
can potentially cause a deadlock if a taskq_ent_t is dispatched
while its tqent_list field is not empty. This _should_ never be
a problem with the existing taskq_dispatch() interface.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.
The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
The spl_task structure was renamed to taskq_ent, and all of
its fields were renamed to have a prefix of 'tqent' rather
than 't'. This was to align with the naming convention which
the ZFS code assumes. Previously these fields were private
so the name never mattered.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
This change adds the neglected SPLAT_TEST_FINI call for the
SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_*
tests.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#64
A call site of the MUTEX macro had incorrectly placed its closing
parenthesis, causing two parameters to be passed rather than one. This
change moves the misplaced parenthesis to fix the typographical error.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#70
ZFS and 64-bit linux are perfectly capable of dealing with 64-bit
timestamps, but ZFS deliberately prevents setting them. Adjust
the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit
values and instead use the correct values for your kernel build.
This effectively allows 64-bit timestamps on 64-bit systems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #487
The splat_taskq_test4_common function was incorrectly referencing
the splat_taskq-test13_func symbol, when it meant to be using the
splat_taskq_test4_func symbol.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#61
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict. Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#56
The depmod utility from module-init-tools 3.12-pre3 generates a
warning when the -e option is used without -E or -F. This was
observed under OpenSuse 11.4. To resolve the issue when the
exact System.map-* for your kernel cannot be found fallback to
a generic safe '/sbin/depmod -a'.
WARNING: -e needs -E or -F
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed. This same task is now accomplished
more cleanly with per super block shrinkers. This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.
This support has always been entirely optional. So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.
The downside of this change is that it will prevent the zfs arc
meta data limts from being enforced. However, the current zfs
implementation in this regard is already flawed and needs to
be reworked. If the arc needs to enfore a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Preferentially use the vfs_fsync() function. This function was
initially introduced in 2.6.29 and took three arguments. As
of 2.6.35 the dentry argument was dropped from the function.
For older kernels fall back to using file_fsync() which also
took three arguments including the dentry.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules. As of Linux 3.1 it is now longer easily
available. To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure. It may have already been set when entering
kv_alloc() in which case it must remain set on exit. In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim. By clearing
it we allow the following NULL deref to potentially occur.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #287
The old define assumed a specific layout of the kmutex_t struct. This
patch makes the macro independent from the actual struct layout.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The m_owner variable is protected by the mutex itself. Reading the variable
is guaranteed to be atomic (due to it being a word-sized reference) and
ACCESS_ONCE() takes care of read cache effects.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
On kernels with CONFIG_DEBUG_MUTEXES mutex_exit() clears the mutex
owner after releasing the mutex. This would cause mutex_owner()
to return an incorrect owner if another thread managed to lock the
mutex before mutex_exit() had a chance to clear the owner.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #167
This would cause problems when using 'zfs send' with a file as the
target (rather than a pipe or a socket as is usually the case) as
for each write the destination offset in the file would be 0.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #391
The URL field in the spl-modules and spl package spec files were
updated to point to the ZFS on Linux repository hosted by github.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>