191 Commits

Author SHA1 Message Date
Ned Bass
3c6ed5410b Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq

Taskq threads need only add themselves to the work wait queue if
there are no pending work items.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:49 -08:00
Ned Bass
0bb43ca282 Revert "Taskq locking optimizations"
This reverts commit ec2b41049f.

A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there are no pending work items,
leading to deadlock:

1. taskq thread enables interrupts
2. dispatcher thread runs, queues work item, calls wake_up()
3. taskq thread runs, adds self to waitq, sleeps

This could easily happen if an interrupt for an IO completion was
outstanding at the point where the taskq thread reenables interrupts,
just before the call to add_wait_queue_exclusive().  The handler would
run immediately within the race window.
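
The ordering problem can be illustrated with a small single-threaded
model.  This is only a sketch: the struct, the function names, and the
simplified wake-up rule (a wake-up reaches only a thread that is already
on the wait queue) are assumptions made for illustration and are not the
SPL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one taskq thread and its wait queue. */
struct model {
    int  pending;    /* queued work items */
    bool on_waitq;   /* thread has added itself to the wait queue */
    bool asleep;     /* thread is sleeping until woken */
};

/* Dispatcher: queue one item, then wake a waiter if one is registered. */
void dispatch(struct model *m)
{
    assert(m != NULL);
    m->pending++;
    if (m->on_waitq)
        m->asleep = false;   /* a wake-up only reaches registered waiters */
}

/* Racy order: check for work, THEN register on the wait queue and sleep. */
bool racy_thread_deadlocks(void)
{
    struct model m = { 0, false, false };
    bool saw_work = (m.pending > 0);   /* step 1: no work seen */

    dispatch(&m);                      /* step 2: the wake-up is lost */
    if (!saw_work) {                   /* step 3: sleep on a stale decision */
        m.on_waitq = true;
        m.asleep = true;
    }
    return m.asleep && m.pending > 0;
}

/* Safe order: register first, then re-check for work before sleeping. */
bool safe_thread_deadlocks(void)
{
    struct model m = { 0, false, false };

    m.on_waitq = true;                 /* register before checking */
    dispatch(&m);                      /* dispatcher runs in the same window */
    if (m.pending == 0)                /* the re-check sees the new item */
        m.asleep = true;
    return m.asleep && m.pending > 0;
}
```

In this model racy_thread_deadlocks() ends with the thread asleep while
work is pending, whereas safe_thread_deadlocks() does not, because the
thread is on the wait queue before it decides whether to sleep.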

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:39 -08:00
Ned Bass
ec2b41049f Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq outside of tq_lock

Taskq threads need only add themselves to the work wait queue if there
are no pending work items.  Furthermore, the add and remove function
calls can be made outside of the taskq lock since the wait queues are
protected from concurrent access by their own spinlocks.

3) Call wake_up() outside of tq->tq_lock

Again, the wait queues are protected by their own spinlock, so the
dispatcher functions can drop tq->tq_lock before calling wake_up().

A new splat test taskq:contention was added in a prior commit to measure
the impact of these changes.  The following table summarizes the
results using data from the kernel lock profiler.

                        tq_lock time    %diff   Wall clock (s)  %diff
original:               39117614.10     0       41.72           0
exclusive threads:      31871483.61     18.5    34.2            18.0
unlocked add/rm waitq:  13794303.90     64.7    16.17           61.2
unlocked wake_up():     1589172.08      95.9    16.61           60.2

Each row reflects the average result over 5 test runs.
/proc/lock_stats was zeroed out before and collected after each run.
Column 1 is the cumulative hold time in microseconds for tq->tq_lock.
The tests are cumulative; each row reflects the code changes of the
previous rows.  %diff is calculated with respect to "original" as
100*(orig-new)/orig.
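
As a worked check of the %diff formula, a minimal sketch follows.  The
function name pct_diff is made up for this illustration; the inputs in
the note below are taken from the table above.

```c
#include <assert.h>

/* Percentage improvement of new_val relative to the original value. */
double pct_diff(double orig, double new_val)
{
    assert(orig != 0.0);
    return 100.0 * (orig - new_val) / orig;
}
```

For example, pct_diff(39117614.10, 31871483.61) comes out at roughly
18.5, matching the "exclusive threads" row for tq_lock time.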

Although calling wake_up() outside of the taskq lock dramatically
reduced the taskq lock hold time, the test actually took slightly more
wall clock time.  This is because the point of contention shifts from
the taskq lock to the wait queue lock.  But the change still seems
worthwhile since it removes our taskq implementation as a bottleneck,
assuming the small increase in wall clock time to be statistical
noise.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #32
2012-01-18 10:36:57 -08:00
Ned Bass
cf5d23fa1e Add taskq contention splat test
Add a test designed to generate contention on the taskq spinlock by
using a large number of threads (100) to perform a large number (131072)
of trivial work items from a single queue.  This simulates conditions
that may occur with the zio free taskq when a 1TB file is removed from a
ZFS filesystem, for example.  This test should always pass.  Its purpose
is to provide a benchmark to easily measure the effectiveness of taskq
optimizations using statistics from the kernel lock profiler.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-18 10:36:51 -08:00
Darik Horn
966e5200d3 Fix make distclean for --with-config=user
Apply the same fix to SPL that was applied to ZFS earlier at:
zfsonlinux/zfs@d433c20651

Additionally quote @LINUX_SYMBOLS@ because it is a null substitution
in this configuration, which results in a `[ -f  ]` expression that
incorrectly evaluates to true.

  # ./configure --with-config=user
  # make distclean

  Making distclean in module
  make[1]: Entering directory `/spl/module'
  make -C  SUBDIRS=`pwd`  clean
  make: Entering an unknown directory
  make: *** SUBDIRS=/spl/module: No such file or directory.  Stop.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-17 10:06:00 -08:00
Brian Behlendorf
5f6c14b1ed Proxmox VE kernel compat, invalidate_inodes()
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check().  In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers.  Therefore, if either
of these functions is exported invalidate_inodes() can be
safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #58
2011-12-21 14:29:45 -08:00
Prakash Surya
8f2503e0af Store copy of tqent_flags prior to servicing task
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.

In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears its tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).

In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has overwritten the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.

Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
to determine if the entry needs to be passed down to the task_done
function.
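
The idea can be sketched in a few lines.  The struct layout, the flag
value, and the function names below are simplified assumptions for
illustration only and do not reproduce the actual taskq code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_FLAG_PREALLOC 0x1u   /* hypothetical flag value */

/* Hypothetical, pared-down task entry. */
struct ent_sketch {
    unsigned flags;
    void (*func)(struct ent_sketch *);
};

/* A task that clears its own entry's flags, as an owner may after handoff. */
void clobber_flags(struct ent_sketch *e)
{
    e->flags = 0;
}

/* Unsafe: reads the flags after the task ran.  Returns true when the
 * entry would be treated as NOT preallocated (and put on the free list). */
bool service_unsafe(struct ent_sketch *e)
{
    assert(e != NULL && e->func != NULL);
    e->func(e);
    return !(e->flags & SKETCH_FLAG_PREALLOC);
}

/* Safe: snapshots the flags before the task runs and tests the copy. */
bool service_safe(struct ent_sketch *e)
{
    assert(e != NULL && e->func != NULL);
    unsigned saved = e->flags;
    e->func(e);
    return !(saved & SKETCH_FLAG_PREALLOC);
}
```

With a preallocated entry whose task clears the flags, service_unsafe()
misclassifies the entry while service_safe() does not.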

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #71
2011-12-16 16:54:00 -08:00
Prakash Surya
e7e5f78e7b Swap taskq_ent_t with taskqid_t in taskq_thread_t
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.

Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.

Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).

To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (and has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.
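
A minimal sketch of why the copy matters is shown here.  The types and
helper names are simplified stand-ins invented for this example, and an
array takes the place of the real active thread list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for taskq_ent_t and taskq_thread_t. */
struct ent_sketch { unsigned long id; };

struct thread_sketch {
    struct ent_sketch *ent;   /* old scheme: sort key read through pointer */
    unsigned long id_copy;    /* new scheme: sort key copied at insert time */
};

/* Attach an entry to a thread, snapshotting its id as the sort key. */
void thread_attach(struct thread_sketch *t, struct ent_sketch *e)
{
    assert(t != NULL && e != NULL);
    t->ent = e;
    t->id_copy = e->id;
}

/* Is the array ordered when keys are read through the entry pointer? */
bool sorted_by_pointer(const struct thread_sketch *t, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (t[i - 1].ent->id > t[i].ent->id)
            return false;
    return true;
}

/* Is the array ordered when keys are the per-thread copies? */
bool sorted_by_copy(const struct thread_sketch *t, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (t[i - 1].id_copy > t[i].id_copy)
            return false;
    return true;
}
```

If the first entry's id is bumped after both threads are attached, the
pointer-based ordering can break while the copy-based ordering holds.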

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
2011-12-16 13:26:54 -08:00
Prakash Surya
699d5ee8a9 Exercise new taskq interface in splat-taskq tests
The splat-taskq test functions were slightly modified to exercise
the new taskq interface in addition to the old interface.  If the
old interface passes each of its tests, the new interface is
exercised.  Both sub tests (old interface and new interface) must
pass for each test as a whole to pass.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #65
2011-12-13 16:10:57 -08:00
Prakash Surya
44217f7aad Implement taskq_dispatch_prealloc() interface
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit.  It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq.  This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.

    commit 5aeb94743e3be0c51e86f73096334611ae3a058e
    Author: Garrett D'Amore <garrett@nexenta.com>
    Date:   Wed Jul 27 07:13:44 2011 -0700

    734 taskq_dispatch_prealloc() desired
    943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya
ac1e5b6033 Add Test: "Single task queue, recursive dispatch"
Added another splat taskq test to ensure tasks can be recursively
submitted to a single task queue without issue. When the
taskq_dispatch_prealloc() interface is introduced, this use case
can potentially cause a deadlock if a taskq_ent_t is dispatched
while its tqent_list field is not empty. This _should_ never be
a problem with the existing taskq_dispatch() interface.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya
2c02b71b14 Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.

The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.

The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, its tqent_list field
will be empty).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:50 -08:00
Prakash Surya
046a70c93b Replace struct spl_task with struct taskq_ent
The spl_task structure was renamed to taskq_ent, and all of
its fields were renamed to have a prefix of 'tqent' rather
than 't'. This was to align with the naming convention which
the ZFS code assumes.  Previously these fields were private
so the name never mattered.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 12:28:09 -08:00
Prakash Surya
ed948fa72b Add SPLAT_TEST_FINI call for SPLAT_TASKQ_TEST6_ID
This change adds the neglected SPLAT_TEST_FINI call for the
SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_*
tests.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #64
2011-12-13 12:26:16 -08:00
Prakash Surya
e05bec805b Fix a typo referencing an incorrect symbol
The splat_taskq_test4_common function was incorrectly referencing
the splat_taskq_test13_func symbol, when it meant to be using the
splat_taskq_test4_func symbol.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #61
2011-11-21 16:52:36 -08:00
Brian Behlendorf
1114ae6ae7 Prepend spl_ to all init/fini functions
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict.  Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #56
2011-11-11 09:18:28 -08:00
Brian Behlendorf
fe71c0e567 Linux 3.1 compat, shrink_*cache_memory
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed.  This same task is now accomplished
more cleanly with per super block shrinkers.  This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.

This support has always been entirely optional.  So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.

The downside of this change is that it will prevent the zfs arc
meta data limits from being enforced.  However, the current zfs
implementation in this regard is already flawed and needs to
be reworked.  If the arc needs to enforce a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 19:36:30 -08:00
Brian Behlendorf
12ff95ff57 Linux 3.1 compat, kern_path_parent()
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules.  As of Linux 3.1 it is no longer easily
available.  To handle this case the spl will now dynamically
look up the address of the missing symbol at module load time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 16:51:25 -08:00
Brian Behlendorf
b8b6e4c453 Fix NULL deref in balance_pgdat()
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
kv_alloc() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
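
The save-and-restore pattern described above can be sketched as follows.
The flag value and function names are assumptions for illustration, and
a plain unsigned integer stands in for the task's flags word.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PF_MEMALLOC 0x800u   /* stand-in bit; value is an assumption */

/* Buggy pattern: always clears the bit on exit. */
void alloc_clearing(unsigned *task_flags)
{
    assert(task_flags != NULL);
    *task_flags |= SKETCH_PF_MEMALLOC;
    /* ... the allocation would happen here ... */
    *task_flags &= ~SKETCH_PF_MEMALLOC;
}

/* Fixed pattern: only clears the bit if this call was the one to set it. */
void alloc_preserving(unsigned *task_flags)
{
    assert(task_flags != NULL);
    unsigned was_set = *task_flags & SKETCH_PF_MEMALLOC;

    *task_flags |= SKETCH_PF_MEMALLOC;
    /* ... the allocation would happen here ... */
    if (!was_set)
        *task_flags &= ~SKETCH_PF_MEMALLOC;
}
```

A caller that entered with the bit already set keeps it after
alloc_preserving() but loses it after alloc_clearing().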

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #287
2011-11-03 09:50:22 -07:00
Gunnar Beutner
f3989ed322 vn_rdwr() didn't properly advance the file position
This would cause problems when using 'zfs send' with a file as the
target (rather than a pipe or a socket as is usually the case) as
for each write the destination offset in the file would be 0.
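
The effect can be modelled with an in-memory file.  This is a sketch
under stated assumptions: struct memfile and memfile_write() are
invented for the example and are unrelated to the real vn_rdwr()
signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory file with an explicit position. */
struct memfile {
    char   data[64];
    size_t pos;
};

/* Write len bytes at the current position and advance the position. */
size_t memfile_write(struct memfile *f, const char *buf, size_t len)
{
    assert(f != NULL && buf != NULL);
    assert(f->pos + len <= sizeof(f->data));
    memcpy(f->data + f->pos, buf, len);
    f->pos += len;   /* without this, every write would land at offset 0 */
    return len;
}
```

Two consecutive writes of "abc" and "def" leave "abcdef" in the buffer;
if the position were never advanced, the second write would overwrite
the first.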

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #391
2011-10-18 16:51:35 -07:00
Brian Behlendorf
ecc3981007 Fix various typos in comments
Just clean up some of the typos and spelling mistakes in the
comments of spl-kmem.c.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:32:49 -07:00
Gunnar Beutner
8d177c181f Fixed typo in spl_slab_alloc()
The typo did not have any effect (apart from a negligible performance
impact) because skc->skc_flags * KMC_OFFSLAB is always non-zero when
at least one bit in skc->skc_flags is set.
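
Assuming the typo was a multiplication where a bitwise AND was intended,
the difference can be sketched as below.  The flag value and function
names are hypothetical, and the comparison is only meaningful for flag
values small enough that the multiplication does not overflow.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_KMC_OFFSLAB 0x80u   /* hypothetical flag value */

/* Typo form: true for any non-zero flags word (barring overflow). */
bool offslab_typo(unsigned flags)
{
    return (flags * SKETCH_KMC_OFFSLAB) != 0;
}

/* Intended form: true only when the specific bit is set. */
bool offslab_fixed(unsigned flags)
{
    return (flags & SKETCH_KMC_OFFSLAB) != 0;
}
```

With flags equal to 0x1 the typo form reports true while the intended
form reports false; both agree when the bit is actually set or when no
flags are set at all.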

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:03:43 -07:00
Gunnar Beutner
64c075c3f4 Properly destroy work items in spl_kmem_cache_destroy()
In a non-debug build the ASSERT() would be optimized away
which could cause pending work items to not be cancelled.

We must also use cancel_delayed_work_sync() rather than just
cancel_delayed_work() to actually wait until work items have
completed.  Otherwise they might accidentally access free'd
memory.
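
The hazard of placing a call with side effects inside an assertion can
be sketched in user space.  SKETCH_ASSERT, SKETCH_DEBUG, struct work and
the function names are all invented for this example; they only mimic a
debug-only assertion macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical debug-only assertion: expands to nothing unless
 * SKETCH_DEBUG is defined, so its argument is never evaluated. */
#ifdef SKETCH_DEBUG
#define SKETCH_ASSERT(cond) assert(cond)
#else
#define SKETCH_ASSERT(cond) ((void)0)
#endif

struct work { bool pending; };

/* Cancel the work item; returns whether it had been pending. */
bool cancel_work(struct work *w)
{
    bool was_pending = w->pending;
    w->pending = false;
    return was_pending;
}

/* Buggy: the cancel call disappears together with the assertion. */
void destroy_buggy(struct work *w)
{
    SKETCH_ASSERT(cancel_work(w));
    (void)w;
}

/* Fixed: always perform the call, assert only on its result. */
void destroy_fixed(struct work *w)
{
    bool cancelled = cancel_work(w);
    SKETCH_ASSERT(cancelled);
    (void)cancelled;
}
```

Built without SKETCH_DEBUG, destroy_buggy() leaves the work item
pending, while destroy_fixed() cancels it in either build.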

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS bugs #279, #62, #363, #418
2011-10-11 09:59:19 -07:00
Gunnar Beutner
763b2f3b57 Fixed invalid resource re-use in file_find()
File descriptors are a per-process resource. The same descriptor
in different processes can refer to different files. file_find()
incorrectly assumed that file descriptors are globally unique.
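
A minimal sketch of the lookup problem follows.  The table layout, the
integer owner id, and both function names are assumptions made for the
example and do not mirror the SPL data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry: a descriptor, its owning process, a file id. */
struct file_sketch {
    int fd;
    int owner;
    int file_id;
};

/* Lookup by descriptor only: wrongly treats descriptors as global. */
const struct file_sketch *
find_by_fd(const struct file_sketch *tbl, size_t n, int fd)
{
    assert(tbl != NULL);
    for (size_t i = 0; i < n; i++)
        if (tbl[i].fd == fd)
            return &tbl[i];
    return NULL;
}

/* Lookup by descriptor and owner: descriptors are scoped per process. */
const struct file_sketch *
find_by_fd_and_owner(const struct file_sketch *tbl, size_t n, int fd, int owner)
{
    assert(tbl != NULL);
    for (size_t i = 0; i < n; i++)
        if (tbl[i].fd == fd && tbl[i].owner == owner)
            return &tbl[i];
    return NULL;
}
```

When two processes both hold descriptor 3 for different files, the
descriptor-only lookup returns whichever entry comes first, which may
belong to the other process.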

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #386
2011-10-11 09:51:51 -07:00
Brian Behlendorf
6b3b569df3 Remove /etc/hostid missing warning
No longer print the following warning to the console when the
/etc/hostid file is missing.  This is the expected default behavior.
Keeping the hostid in sync with the initramfs is now accomplished
by creating the /etc/hostid in the initramfs not on the system.

  SPL: The /etc/hostid file is not found.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-06 14:58:09 -07:00
Brian Behlendorf
e80cd06b8e Fix 'make install' overly broad 'rm'
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels.  This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against.  This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.

The fix for this issue is to only remove extraneous build products
when DESTDIR is set.  This almost exclusively indicates we are
building packages and installing the build products into a temporary
staging location.  Additionally, limit the removal of the unneeded
build products to the target kernel version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #328
2011-07-20 09:37:41 -07:00
Darik Horn
0d54dcb566 Read the /etc/hostid file directly.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.

Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.

Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
2011-06-24 09:58:03 -07:00
Brian Behlendorf
bf0c60c060 Add linux compatibility tests
While the splat tests were originally designed to stress test
the Solaris primitives, I am extending them to include some kernel
compatibility tests.  Certain linux APIs have changed frequently.
These tests ensure that added compatibility is working properly
and no unnoticed regressions have slipped in.

Test 1 and 2 add basic regression tests for shrink_icache_memory
and shrink_dcache_memory.  These are simply functional tests to
ensure we can call these functions safely.  Checking for correct
behavior is more difficult since other running processes will
influence the behavior.  However, these functions are provided
by the kernel so if we can successfully call them we assume they
are working correctly.

Test 3 checks that shrinker functions are being registered and
called correctly.  As of Linux 3.0 the shrinker API has changed
four different times so I felt the need to add a trivial test
case to ensure each variant works as expected.
2011-06-21 14:02:46 -07:00
Brian Behlendorf
a55bcaad18 Linux 3.0: Shrinker compatibility
Update the wrapper macros for the memory shrinker to handle
this 4th API change.  The callback function now takes a
shrink_control structure.  This is certainly a step in the
right direction but it's annoying to have to accommodate yet
another version of the API.
2011-06-21 14:02:39 -07:00
Brian Behlendorf
372c257233 Add TASKQ_NORECLAIM flag
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs.  To support
this the TASKQ_NORECLAIM flag has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
2011-05-06 15:23:58 -07:00
Darik Horn
c95b308d12 Correct typos in the spl proc handler.
Correct a format typo that causes /proc/sys/kernel/spl/hostid
to return a decimal number instead of a hexadecimal number.
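
The difference comes down to a single conversion specifier.  The helper
below is a hypothetical user-space sketch, not the proc handler itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format a hostid as hexadecimal; "%lu" here would print decimal. */
int format_hostid(char *buf, size_t len, unsigned long hostid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%lx", hostid);
}
```

For a hostid of 0x12345678 this writes the eight characters "12345678"
into the buffer.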
2011-04-24 20:56:07 -05:00
Darik Horn
5b8f76ea16 Make the SPL kernel messages consistent with ZFS.
Change the SPL kernel messages for module loading and module
unloading so that they are similar to the ZFS kernel messages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:13 -07:00
Darik Horn
ad35b6a6e9 Remove the gawk dependency.
This reverts commit 1814251453.

Demote the gawk call back to awk and ensure that stderr is attached.  GNU gawk
tolerates a missing stderr handle, but many utilities do not, which could be
why a regular awk call was inexplicably failing on some systems.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:09 -07:00
Darik Horn
fa6f7d8f9d Import spl_hostid as a module parameter.
Provide a call_usermodehelper() alternative by letting the hostid be passed as
a module parameter like this:

  $ modprobe spl spl_hostid=0x12345678

Internally change the spl_hostid variable to unsigned long because that is the
type that the coreutils /usr/bin/hostid returns.

Move the hostid command into GET_HOSTID_CMD for consistency with the similar
GET_KALLSYMS_ADDR_CMD invocation.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:01 -07:00
Brian Behlendorf
3dfc591ac4 Linux 2.6.39 compat, zlib_deflate_workspacesize()
The function zlib_deflate_workspacesize() now takes 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K).  The caller can now specify the windowBits and
memLevel compression parameters to get a smaller workspace.

For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments.  When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through.  When it's not we assume the worst case and
a maximally sized workspace is used.
2011-04-20 14:39:15 -07:00
Brian Behlendorf
b1cbc4610c Linux 2.6.39 compat, kern_path_parent()
The path_lookup() function has been renamed to kern_path_parent()
and the flags argument has been removed.  The only behavior now
offered is that of LOOKUP_PARENT.  The spl already always passed
this flag so dropping the flag does not impact us.
2011-04-20 12:30:17 -07:00
Brian Behlendorf
83c623aa1a Linux 2.6.39 compat, DEFINE_SPINLOCK()
This is a long overdue compatibility change.  Way, way, way back
in 2007 there was a push to remove all consumers of SPIN_LOCK_UNLOCKED.
Finally, in 2011 with 2.6.39 all the consumers have been updated
and SPIN_LOCK_UNLOCKED was removed.  It's about time we use the
new API as well, this change does exactly that.  DEFINE_SPINLOCK()
was available as far back as 2.6.12 so there doesn't need to be
any additional autoconf-foo for this change.
2011-04-20 12:01:11 -07:00
Brian Behlendorf
98e2afd1c5 Fix unused variable
Flagged by the default -Wunused-but-set-variable gcc option when
running under Fedora 15.  Since it is correct that this variable is
entirely unused this commit removes it.
2011-04-19 09:45:36 -07:00
Brian Behlendorf
9b0f9079d2 Linux 2.6.39 compat, invalidate_inodes()
To resolve a potential filesystem corruption issue a second
argument was added to invalidate_inodes().  This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block.  When only the legacy API is available the second
argument will be dropped for compatibility.
2011-04-19 09:08:08 -07:00
Brian Behlendorf
e76f4bf11d Add dnlc_reduce_cache() support
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache.  After the entries are
pruned any slabs which they may have been using are reaped.

Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count.  However, in practice this doesn't need to be
exactly correct.  We simply need to reclaim some useful fraction
(but not all) of the cache.  The caller can determine if more
needs to be done.
2011-04-06 20:06:03 -07:00
Brian Behlendorf
3336e29cc2 Add slab usage summaries to /proc
One of the most common things you want to know when looking at
the slab is how much memory is being used.  This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.

  slab_kmem_total - Total kmem slab size
  slab_kmem_avail - Alloc'd kmem slab size
  slab_kmem_max   - Max observed kmem slab size
  slab_vmem_total - Total vmem slab size
  slab_vmem_avail - Alloc'd vmem slab size
  slab_vmem_max   - Max observed vmem slab size

NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
2011-04-06 20:06:03 -07:00
Brian Behlendorf
d0a1038ff3 Update /proc/spl/kmem/slab output
The 'slab_fail', 'slab_create', and 'slab_destroy' columns in the slab
output have been removed because they are virtually always zero and
not very useful.

The much more useful 'size' and 'alloc' columns have been added which
show the total slab size and how much of the total size has been
allocated to objects.

Finally, the formatting has been updated to be much more human
readable while still being friendly for tools like awk to parse.
2011-04-06 20:06:03 -07:00
Brian Behlendorf
495bd532ab Linux shrinker compat
The Linux shrinker has gone through three API changes since 2.6.22.
Rather than force every caller to understand all three APIs this
change consolidates the compatibility code in to the mm-compat.h
header.  The caller can then use a single spl provided
shrinker API which does the right thing for your kernel.

SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask);
SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks);
spl_register_shrinker(&shrinker_struct);
spl_unregister_shrinker(&shrinker_struct);
spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);
2011-04-06 20:06:03 -07:00
Brian Behlendorf
734fcac78d Add crgetfsuid()/crgetfsgid() helpers
Solaris credentials don't have an fsuid/fsgid field but Linux
credentials do.  To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.

Additionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code.  This simplification
means we only have one version of the helper to keep up to date.
2011-03-22 12:18:44 -07:00
Brian Behlendorf
2092cf68d8 Disable vmalloc() direct reclaim
As part of vmalloc() a __pte_alloc_kernel() allocation may occur.  This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur.  This reclaim can trigger file IO which
can result in a deadlock.  This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.

An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3).  This code can be properly autoconf'ed
when the upstream bug is fixed.

1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
2011-03-20 15:12:08 -07:00
Brian Behlendorf
47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to provide the full xvattr
support.  By removing it from the spl we ensure no conflict
with the higher level packages.

This just leaves minimal vnode support for basic manipulation
of files.  This code does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removes the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf
19c1eb829d Add zlib regression test
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress().  The test case simply takes
a 128k buffer, compresses the buffer, then uncompresses the
buffer, and finally compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
2011-02-25 16:56:46 -08:00
Brian Behlendorf
5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() were in place, in reality the current implementation
was non-functional; it was merely compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids the need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additionally ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf
914b063133 Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules.  This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.

Because this function still exists in the kernel and the prototype
is available in a common header all we strictly need is the symbol
address.  The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn.  Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address.  All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
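
The lookup-then-redirect technique can be sketched in user space.
Everything here is an assumption for illustration: sketch_lookup_name()
stands in for a symbol lookup, hidden_invalidate_inodes() stands in for
the unexported function, and the simplified one-argument prototype is
not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*invalidate_inodes_t)(void *sb);
typedef void (*generic_fn_t)(void);

/* Stand-in for a function that exists but is not exported. */
static int hidden_invalidate_inodes(void *sb)
{
    (void)sb;
    return 0;   /* pretend no inodes were busy */
}

/* Hypothetical symbol lookup: maps a name to a function address. */
generic_fn_t sketch_lookup_name(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "invalidate_inodes") == 0)
        return (generic_fn_t)hidden_invalidate_inodes;
    return NULL;
}

/* Filled in once at initialization time. */
invalidate_inodes_t invalidate_inodes_fn;

int sketch_init(void)
{
    invalidate_inodes_fn =
        (invalidate_inodes_t)sketch_lookup_name("invalidate_inodes");
    return invalidate_inodes_fn != NULL ? 0 : -1;
}

/* Callers keep writing invalidate_inodes(sb); the macro redirects them. */
#define invalidate_inodes(sb) invalidate_inodes_fn(sb)
```

After sketch_init() succeeds, a call written as invalidate_inodes(sb)
goes through the looked-up pointer without the caller needing to know.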

Long term we should try to get this, or another, interface made
available to modules again.
2011-02-23 12:44:32 -08:00
Brian Behlendorf
d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying to avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00