Commit Graph

355 Commits

Author SHA1 Message Date
Richard Yao
50a0749eba Linux 3.13 compat: Pass NULL for new delegated inode argument
This check was originally added for SLES10, a093c6a, to check for
a 'struct vfsmount *' argument which they added.  However, since
SLES10 is based on a 2.6.16 kernel which is no longer supported
this functionality was dropped.  The checks were refactored to
support Linux 3.13 without concern for historical versions.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #312
2013-12-02 10:37:49 -08:00
Richard Yao
3e96de17d7 Linux 3.13 compat: Remove unused flags variable from __cv_init()
GCC 4.8.1 complained about an unused flags variable when building
against Linux 2.6.26.8:

/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:
In function ‘__cv_init’:
/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:39:6:
error: variable ‘flags’ set but not used
[-Werror=unused-but-set-variable]
  int flags = KM_SLEEP;
        ^
	cc1: all warnings being treated as errors

Additionally, the superfluous code uses a preempt_count variable that is
no longer available on Linux 3.13. Deleting the unnecessary code fixes a
Linux 3.13 compatibility issue.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #312
2013-12-02 10:11:19 -08:00
Ned Bass
184c687387 Emulate illumos interface cv_timedwait_hires()
Needed for Illumos #3582. This interface is supposed to support
a variable-resolution timeout with nanosecond granularity.  This
implementation rounds up to microsecond resolution, as nanosecond-
precision timing is rarely needed for real-world performance
tuning and may incur unnecessary busy-waiting.  usleep_range() is
used if available, otherwise udelay() or msleep() are used
depending on the length of the delay interval.

Add flags from sys/callo.h as these are used to control the behavior of
cv_timedwait_hires().  Specifically,

CALLOUT_FLAG_ABSOLUTE
    Normally, the expiration passed to the timeout API functions is
    an expiration interval. If this flag is specified, then it is
    interpreted as the expiration time itself.

CALLOUT_FLAG_ROUNDUP
    Roundup the expiration time to the next resolution boundary. If this
    flag is not specified, the expiration time is rounded down.

References:
    https://www.illumos.org/issues/3582
    illumos/illumos-gate@0689f76

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #304
2013-11-04 09:49:24 -08:00
Ned Bass
f483a97a41 3537 add kstat_waitq_enter and friends
These kstat interfaces are required to port
"Illumos #3537 want pool io kstats" to ZFS on Linux.

kstat_waitq_enter()
kstat_waitq_exit()
kstat_runq_enter()
kstat_runq_exit()

Additionally, zero out the ks_data buffer in __kstat_create() so
that the kstat_io_t counters are initialized to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:41:52 -07:00
Cyril Plisko
ffbf0e57c2 Kstat to use private lock by default
While porting Illumos #3537 I found that ks_lock member of kstat_t
structure is different between Illumos and SPL. It is a pointer to
the kmutex_t in Illumos, but the mutex lock itself in SPL.
Apparently Illumos kstat API allows consumer to override the lock
if required. With SPL implementation it is not possible anymore.

Things were alright until the first attempt to actually override
the lock. Porting of Illumos #3537 introduced such code for the
first time.

In order to provide the Solaris/Illumos like functionality we:
  1. convert ks_lock to "kmutex_t *ks_lock"
  2. create a new field "kmutex_t ks_private_lock"
  3. On kstat_create() ks_lock = &ks_private_lock

Thus if consumer doesn't care we still have our internal lock in use.
If, however, consumer does care she has a chance to set ks_lock to
anything else before calling kstat_install().

The rest of the code will use ks_lock regardless of its origin.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
2013-10-25 13:41:30 -07:00
Brian Behlendorf
ce07767f79 Revert "Add KSTAT_TYPE_TXG type"
This reverts commit dba79fcbf2 in
favor of using the generic KSTAT_TYPE_RAW callbacks.  The advantage
of this approach is that arbitrary types can be added without the
need to add them to the SPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
2013-10-16 14:48:35 -07:00
Prakash Surya
56d40a686b Add callbacks for displaying KSTAT_TYPE_RAW kstats
The current implementation for displaying kstats of type KSTAT_TYPE_RAW
is rather crude. This patch attempts to enhance this handling by
allowing a kstat user to register formatting callbacks which can
optionally be used.

The callbacks allow the user to implement functions for interpreting
their data and transposing it into a character buffer. This buffer,
containing a string representation of the raw data, is then be displayed
through the current /proc textual interface.

Additionally the kstats are made writable because it's now possible
to provide a useful handler via the existing ks_update() interface.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
2013-10-16 14:48:35 -07:00
Brian Behlendorf
429fe89cee Consistently use local_irq_disable/local_irq_enable
It was observed that spl_kmem_cache_alloc() uses local_irq_save()
and saves the interrupt state in a local variable.  This would
normally be fine except that spl_kmem_cache_alloc() calls
spl_cache_refill() which re-enables interrupts.  It is then
possible that while interrupts are enabled the process is
rescheduled to a different cpu before being disable again.
This could result in us restoring the saved interrupt state
from one cpu to another.

What the consequences of this are aren't perfectly clear, but
this is clearly a bug and it has the potential to cause issues.
The code has been updated to just use local_irq_enable() and
local_irq_disable() to avoid this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-09 14:00:56 -07:00
Richard Yao
df2c0f1849 Replace current_kernel_time() with getnstimeofday()
current_kernel_time() is used by the SPLAT, but it is not meant for
performance measurement. We modify the SPLAT to use getnstimeofday(),
which is equivalent to the gethrestime() function on Solaris.
Additionally, we update gethrestime() to invoke getnstimeofday().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #279
2013-10-09 13:28:30 -07:00
Richard Yao
f7fd6ddd96 Linux 3.8 compat: Use kuid_t/kgid_t when required
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.

Original-patch-by: DHE
Rewrite-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #260
2013-08-09 10:09:29 -07:00
Richard Yao
e3c4d44886 PaX/GrSecurity Linux 3.8.y compat: Use __no_const on struct ctl_table
The PaX team started constifying `struct ctl_table` as of their Linux
3.8.0 patchset. This lead to zfsonlinux/spl#225 and Gentoo bug #463012.

While investigating our options, I learned that there is a preprocessor
directive called CONSTIFY_PLUGIN that we can use to detect the presence
of the PaX changes and adjust the code accordingly.

The PaX Team had suggested adopting ctl_table_no_const, but supporting
older kernels required declaring that whenever the CONSTIFY_PLUGIN was
set. Future compiler changes could potentially cause that to break in
the presence of -Werror, so instead we define our own spl_ctl_table
typdef and use that. This should be compatible with all PaX kernels.

This introduces a Linux kernel version number check to prevent a build
failure on versions of the PaX GCC plugin that existed for kernels
before Linux 3.8.0. Affected versions of the PaX plugin will trigger a
compiler error when they see no_const cast on a non-constified
structure.  Ordinarily, we would need an autotools check to catch that.
However, it is safe to do a kernel version check instead of an autotools
check in this specific instance because the affected versions of the PaX
GCC plugin only exist for Linux kernels before 3.8.0 and the
constification of `struct ctl_table` by the PaX developers only occurs
in Linux 3.8.0 and later.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #225
2013-08-08 09:51:34 -07:00
Richard Yao
251e7a779b Fix race in spl_kmem_cache_reap_now()
The current code contains a race condition that triggers when bit 2 in
spl.spl_kmem_cache_expire is set, spl_kmem_cache_reap_now() is invoked
and another thread is concurrently accessing its magazine.

spl_kmem_cache_reap_now() currently invokes spl_cache_flush() on each
magazine in the same thread when bit 2 in spl.spl_kmem_cache_expire is
set. This is unsafe because there is one magazine per CPU and the
magazines are lockless, so it is impossible to guarentee that another
CPU is not using its magazine when this function is called.

The solution is to only touch the local CPU's magazine and leave other
CPU's magazines to other CPUs.

Reported-by: DHE
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #274
2013-08-08 09:14:41 -07:00
Richard Yao
ba06298072 Linux 3.11 compat: Replace num_physpages with totalram_pages
num_physpages was removed by
torvalds/linux@cfa11e08ed, so lets replace
it with totalram_pages.

This is a bug fix as much as it is a compatibility fix because
num_physpages did not reflect the number of pages actually available to
the kernel:

http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html

Also, there are known issues with memory calculations when ZFS is in a
Xen dom0. There is a chance that using totalram_pages could resolve
them. This conjecture is untested at the time of writing.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #273
2013-08-08 09:14:29 -07:00
Brian Behlendorf
ceb3872825 Fix KMC_OFFSLAB type caches
Because spl_slab_size() was always returning -ENOSPC for caches of
type KMC_OFFSLAB the cache could never be created.  Additionally
the slab size is rounded up to a page which is what kv_alloc()
expects.  The kv_alloc() code will minimally allocate a page,
in the KMC_OFFSLAB case this could be reduced.

The basic regression tests kmem:slab_small, kmem:slab_large,
and kmem:slab_align regression were updated to test KMC_OFFSLAB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Closes #266
2013-07-30 15:39:23 -07:00
Brian Behlendorf
b9b3715346 Return -1 for generic kmem cache shrinker
It has been observed that it's possible to get in a state where
shrink_slabs() will spin repeated invoking the generic kmem cache
shrinker.  It fails to detect it's not making forward progress
reclaiming from the cache and doesn't give up.  To ensure this
never occurs we unconditionally return -1 after reclaiming what
we can.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes zfsonlinux/zfs#1276
Closes zfsonlinux/zfs#1598
Closes zfsonlinux/zfs#1432
2013-07-30 15:33:24 -07:00
James H
c47efbc7fd Modify gethrestime to use current_kernel_time()
This allows us to get nanosecond resolution. It also means
we use the same time source as utimensat(now) etc.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #255
2013-07-15 09:17:19 -07:00
Brian Behlendorf
ab4e74cc38 Fix bogus kmem leak warning
Commit 5c7a036 correctly relocated the creation of a taskq
and the registraction of the kmem_cache_shrinker after the
initialization of the kmem tracking code.  However, the
cleanup of these structures was not done before the leak
checks in spl_kmem_fini().  This resulted in an incorrect
'kmem leaked' warning even though there was no actual leak.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1569
2013-07-10 15:08:22 -07:00
Brian Behlendorf
b1424adda5 Fix --enable-debug-kmem-tracking option
This code has gotten something stale and no longer builds cleanly
against modern kernels.  The two issues addressed here are as
follows:

* The hlist_*_rcu interfaces in the kernel have been relatively
  unstable.  Since this isn't performance critical code just use
  the long standing hlist_* variants.

* In older kernels the hash_ptr() function takes a 'void *' but
  in newer kernels it expects a 'const void *'.  To silence the
  compiler warnings about this explicitly cast it to a 'void *'.
  The memset function is a similar case but it always expects
  a 'void *'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #256
2013-07-09 09:23:54 -07:00
Richard Yao
f2a745c41d Linux 3.10 compat: Do not rely on struct proc_dir_entry definition
Linux kernel commit torvalds/linux#59d8053f moved the definition of
struct proc_dir_entry from include/linux/proc_fs.h to the private
header fs/proc/internal.h. The SPL relied on that to map Solaris'
kstat to entries in /proc/spl/kstat.

Since the proc_dir_entry structure is now private the only safe
thing to do is wrap the opaque proc handle with our own structure.
This actually ends up simplify the code and is good because it
moves us away from depending on implementation details of /proc.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:25:18 -07:00
Yuxuan Shui
79a7ab2581 Linux 3.10 compat: add missing include of linux/slab.h
Linux kernel commit torvalds/linux@0d01ff2 changes some
includes we were depending on through linux/proc_fs.h.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:21:28 -07:00
Yuxuan Shui
1ddf9722dc Linux 3.10 compat: replace PDE()->data with PDE_DATA()
Linux kernel commit torvalds/linux@d9dda78b renamed PDE() to
PDE_DATA().  To handle this detect the prefered interface
and define a PDE_DATA() wrapper for consistency.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:14:21 -07:00
Tim Chase
5c7a0369e2 Fix --enable-debug-kmem-tracking option
Re-order initialization in spl_kmem_init to allow for kmem tracing
to work.  The spl_kmem_init function calls taskq_create prior to
initializing the tracking (calling spl_kmem_init_tracking).  Since
taskq_create uses kmem_alloc, NULL dereferences occur because the
global kmem_list hasn't had its next & prev pointers initialized yet.

This commit moves the calls to spl_kmem_init_tracking earlier in the
spl_kmem_init function in order that the subsequent kmem_alloc calls
(by taskq_create) work properly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #243
2013-06-18 11:40:33 -07:00
Brian Behlendorf
99c452bbba Fix taskq_wait_id()
The existing taskq_wait_id() function can incorrectly block
indefinitely.  Reimplement it more simply using wait_event()
in a similar fashion to taskq_wait_all().

This flaw was uncovered in the context of moving vn_rdwr() to
a taskq.  Previously taskq_wait_id() had no consumers outside
the SPLAT task framework which is why the issue went unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-03 14:32:29 -07:00
Jan Engelhardt
a9e86ac4fd gitignore: anchor entries at their respective directory
.ko is specific to module, .m4 to config, etc.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 11:07:52 -07:00
Richard Yao
feaf1e321d Do not call cond_resched() in spl_slab_reclaim()
Calling cond_resched() after each object is freed and then after each
slab is freed can cause slabs of objects to live for excessive periods
of time following reclaimation. This interferes with the kernel's own
memory management when called from kswapd and can cause direct reclaim
to occur in response to memory pressure that should have been resolved.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2013-03-21 12:58:44 -07:00
Richard Yao
4a31e5aa9b Linux 3.9 compat: Switch to hlist_for_each{,_rcu}
torvalds/linux@b67bfe0d42 changed
hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle
this by switching to hlist_for_each{,_rcu}, which works across all
supported kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:34 -07:00
Richard Yao
8274ed5988 Drop support for 3 argument version of set_fs_pwd
This was a suggestion that Brian Behlendorf made when reviewing an early
pull request for Linux 3.9 support. This commit was made intentionally
easy to revert should we ever have a reason to reintroduce support for
older kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:31 -07:00
Richard Yao
a54718cfe0 Linux 3.9 compat: set_fs_root takes const struct path *
torvalds/linux@dcf787f391 enforces
const-correctness in passing struct path *.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:29 -07:00
Richard Yao
2a305c34c8 Linux 3.9 compat: vfs_getattr takes two arguments
The function prototype of vfs_getattr previoulsy took struct vfsmount *
and struct dentry * as arguments. These would always be defined together
in a struct path *.

torvalds/linux@3dadecce20 modified
vfs_getattr to take struct path * is taken as an argument instead.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:26 -07:00
Richard Yao
bc90df6688 Linux 3.9 compat: Do not depend on f_vfsmnt
torvalds/linux@182be68478 removed the
preprocessor definition for f_vfsmnt. The ability to access the
mountpoint via ->f_path.mnt has been stable for a long time, so we
switch to that.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:23 -07:00
Ned Bass
3d6af2dd6d Refresh links to web site
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-04 19:09:34 -08:00
Brian Behlendorf
0298f3d67f Add KMODDIR to install target
Provide a mechanism to control the directory name the modules
are installed in.  The kernel privdes INSTALL_MOD_DIR for
this but it was hardcoded to be 'addon/spl'.

Add a KMODDIR variable which can be passed to 'make install'
to override the default directory name.  While we're here
change the default from 'addon/spl' to 'extra' which is the
kernel.org default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-01 16:55:06 -08:00
Brian Behlendorf
4bf3909e51 Disable automatic log dumping
Long ago infrastructure was added to the SPL to keep an internal
debug log of the last few seconds of activity.  This was helpful
during the early development, but these days it is no longer
needed.  I haven't had to resort to this debug buffer to resolve
an issue for several years now.

Today better more generic tools like systemtap and ftrace have
evolved to the point where they can be used for this purpose.
Along with the stack trace dumped to the system console, and in
rare cases a crash dump we almost always have the debug we need.

Therefore, I'm disabling the code which automatically dumps
this log to disk during an assertion except for the case where
spl_debug_panic_on_bug is set (disabled by default).

This should be viewed as a first step towards either.

  a) Retiring this infrastructure and complexity entirely, or
  b) Integrating this logging more properly with ftrace.

As part of this change I'm also removing from the packages the
undocumented spl utility which is used to decode the binary logs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-05 16:13:27 -08:00
Brian Behlendorf
6ef94aa67a Fix tsd_get/set() race with tsd_exit/destroy()
The tsd_exit() and tsd_destroy() functions remove entries from
hash bins without taking the hash bin lock.  They do take the
table lock, but tsd_get() and tsd_set() only take the hash bin
lock to allow for maximum concurency.

The result is that while tsd_get() and tsd_set() are traversing
the hash bin list it can be modified by another thread in which
happens to hash to the same value.  To avoid this add the needed
locking to tsd_exit() and tsd_destroy().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2013-01-31 13:54:59 -08:00
Brian Behlendorf
0936c3449f Add spl_kmem_cache_expire module option
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior.  The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache.  On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.

This behavior is now configurable through the 'spl_kmem_cache_expire'
module option.  The value is a bit mask with the following meaning.

  0x1 - Solaris style cache aging eviction is enabled.
  0x2 - Linux style low memory eviction is enabled.

Both methods may be safely enabled simultaneously, but by default
both are disabled.  It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good.  It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes #210
2013-01-28 09:34:12 -08:00
Brian Behlendorf
84dd1f4f15 Remove spl_invalidate_inodes()
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
2013-01-17 11:40:47 -08:00
Brian Behlendorf
d4899f4747 kmem-cache: Fix slab ageing soft lockup
Commit a10287e00d slightly reworked
the slab ageing code such that it is no longer dependent on the
Linux delayed work queue interfaces.

This was good for portability and performance, but it requires us
to use the on_each_cpu() function to execute the spl_magazine_age()
function.  That means that the function is now executing in interrupt
context whereas before it was scheduled in normal process context.
And that means we need to be slightly more careful about the locking
in the interrupt handler.

With the reworked code it's possible that we'll be holding the
skc->skc_lock and be interrupted to handle the spl_magazine_age()
IRQ.  This will result in a deadlock and soft lockup errors unless
we're careful to detect the contention and avoid taking the lock in
the interupt handler.  So that's what this patch does.

Alternately, (and slightly more conventionally) we could have used
spin_lock_irqsave() to prevent this race entirely but I'd perfer to
avoid disabling interrupts as much as possible due to performance
concerns.  There is absolutely no penalty for us not aging objects
out of the magazine due to contention.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes zfsonlinux/zfs#1193
2013-01-14 10:07:58 -08:00
Ned Bass
8842263bd0 call_usermodehelper() should wait for process
As of Linux 3.4 the UMH_WAIT_* constants were renumbered.  In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process).  A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with
this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-09 16:54:19 -08:00
Brian Behlendorf
1c7b3eaf87 RHEL 6.4 compat, fallocate()
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure.  Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().

Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.

To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 09:53:13 -08:00
Matt Johnston
46a75aadb7 Add cv_wait_io() to account I/O time
Under Linux when a task is waiting on I/O it should call the
io_schedule() function for proper accounting.  The Solaris
cv_wait() function provides no way to specify what the cv
is waiting on therefore cv_wait_io() is introduced.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #206
2013-01-07 10:29:26 -08:00
Brian Behlendorf
034f1b331e Fix spl_kmem_init_kallsyms_lookup() panic
Due to I/O buffering the helper may return successfully before
the proc handler has a chance to execute.  To catch this case
wait up to 1 second to verify spl_kallsyms_lookup_name_fn was
updated to a non SYMBOL_POISON value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#699
Closes zfsonlinux/zfs#859
2012-12-19 09:06:35 -08:00
Brian Behlendorf
33e94ef1dd kmem-cache: Use a taskq for async allocations
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernels delayed work queue interface
and all the compatibility code it requires.

This code never actually used the delay functionality it was just
done this way to leverage the existing compatibility code.  All that
is required is a thread context to perform the allocation in.  The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf
a10287e00d kmem-cache: Use taskqs for ageing
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces.  This allows us to abandon the kernels
delayed work queue interface and all the compatibility code it
requires.

However, the delayed taskq interface does not allow us to schedule
a task for a specfic cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function.  The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu.  It then blocks
waiting for them to complete and promptly reclaims any free slabs.

When restructing the code wasn't the primary goal I think the
new code is far more understable and maintainable.  It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf
296a8e596d kmem-cache: spl_kmem_cache_create() may always sleep
When this code was originally written I went overboard and allowed
for the possibility of creating a cache in an atomic context.  In
practice there are no callers which ever do this.  This makes sense
since a cache is by design a long lived data structure.

To prevent abuse of this function going forward I'm removing the
code which is supported to handle an atomic context.  All allocators
have been updated to use KM_SLEEP and the might_sleep() debug macro
has been added to immediately detect atomic callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf
a5a98e7260 splat taskq:front: Reduce stack frame
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:front splat test over frame size
limit.  To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.

  In function 'splat_taskq_test6_impl'
  error: the frame size of 1648 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf
94ff5d38e3 splat taskq:order: Reduce stack frame
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:order splat test over frame size
limit.  To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.

  In function 'splat_taskq_test5_impl'
  error: the frame size of 1680 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf
3238e71763 splat taskq:cancel: Add test case
Add a test case for taskq_cancel_id() to verify it is working
properly.  Just like taskq:delay we start by dispatching 100
tasks.  However this time 1/3 of the tasks use taskq_dispatch()
and will be run immediately, and 2/3 use taskq_dispatch_delay().
The idea is to create a busy taskq with both active, pending,
and delayed tasks.

After all the items have been successfully dispatched the test
begins randomly canceling known task ids.  It will do this for
5 seconds randomly canceling a task id and then sleeping for a
few milliseconds.   The task being canceled may have already run,
still be on the pending list, or may be currently being executed
by a worker thread.  The idea is to ensure we catch any subtle
race conditions.

Once all the non-canceled tasks have completed we cross check
the number of tasks which ran with the number of tasks which
were successfully canceled.  Additionally, we verify that the
taskq_cancel_id() function never blocks longer than needed.
This time is bounded by the longest run time of the task which
was dispatched.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:49 -08:00
Brian Behlendorf
2f35782620 splat taskq:delay: Add test case
Add a test case for taskq_dispatch_delay() to verify it is working
properly.  The test dispatchs 100 tasks to a taskq with random
expiration times spread over 5 seconds.  As each task expires and
gets executed by a worker thread it verifies that it was run at
the correct time.  Once all the delayed tasks have been executed
we double check that all the dispatched tasks were successful.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf
d9acd930b5 taskq delay/cancel functionality
Add the ability to dispatch a delayed task to a taskq.  The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached.  To achieve this two
new functions were added.

* taskq_dispatch_delay() -

  This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument.  The caller should pass the
desired time the task should be executed as an absolute value in
jiffies.  The task is guarenteed not to run before this time, it
may run slightly latter if all the worker threads are busy.

* taskq_cancel_id() -

  Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task.  There are three possible
return values.

  0      - The task was found and canceled before it was executed.
  ENOENT - The task was not found, either it was already run or an
           invalid task id was supplied by the caller.
  EBUSY  - The task is currently executing any may not be canceled.
           This function will block until the task has been completed.

* taskq_wait_all() -

  The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior.  It is only curreny used by
the splat taskq regression tests.

* taskq_wait_id() -

  Historically, the only difference between this function and
taskq_wait() was that you passed the task id.  In both functions you
would block until ALL lower task ids which executed.  This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.

  To better accomidate the delay tasks this function was reimplemnted.
It will now only block until the passed task id has been completed.

This is actually a fairly low risk change for a few reasons.

* Only new ZFS callers will make use of the new interfaces and
  very little common code was changed to support the new functions.

* The existing taskq_wait() implementation was not changed just
  slightly refactored.

* The newly optimized taskq_wait_id() implementation was never
  used by ZFS we can't accidentally introduce a new bug there.

NOTE: This functionality does not exist in the Illumos taskqs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf
aed8671cb0 taskq style, remove #define wrappers
When the taskq implementation was originally written I wrapped all
the API functions in #define's.  This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.

However, in practice the taskq symbols never conflicted.  The only
major conflicts occured with the kmem cache API.  Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf
472a34caff taskq style, convert spaces to soft tabs
Update the taskq implementation to conform with the style used
throughout the rest of the code.  There are no functional
changes in this commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Steven Johnson
794f145bf9 splat linux:shrinker: Fix fail-safe
Ensure the fail-safe is reset between successive tests.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:04:29 -08:00
Steven Johnson
ca072ee70f splat linux:shrinker: Fix race condition
Ensure the test thread blocks until the shrinker has completed its
work.  This is done by putting the test thread to sleep and waking
it each time the shrinker callback runs.  Once the shrinker size
drops to zero or we time out the test is allowed to proceed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #96
Closes #125
Closes #182
2012-12-12 09:04:11 -08:00
Steven Johnson
9b88fa165f splat taskq:front: Fix race
The taskq:front test has a race condition where task 4 and 8
race to complete, due to an incorrectly calculated set of delay
"factors" (T). If task 4 wins and actually finishes first, the
verification of the order of completion will fail.

The delays calculated to order task completion do not take into
account the terminal line in the table, and so are all off by
a factor of 1. This causes all the tasks in all queues to finish
sooner than expected and the accumulated error is the root cause
of tasks 4 and 8 racing to complete first. Before the change the
"actual" table looks like I commented in #130.

I changed:

* the table in the comment to correctly reflect the test and the
  factor timings needed.
* the individual task delay factors of T so that ONLY 1 task will
  every 2T. (on average)
* 1T was reduced from 100ms to 50ms. This halves the duration of
  the test and makes any remaining raciness more likely to cause
  failures, but it did not cause the test to fail.
* simplified the delay factor logic by using a table look-up
  instead of a switch.
* Added a "task started" message so that with -v it is possible
  to see the order tasks are started.
* Moved the "task completed" message inside the spinlock so that
  with -v the message truly reflects the absolute order of
  completion as guaranteed by the spinlock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #130
2012-12-05 12:23:40 -08:00
Brian Behlendorf
053678f3b0 Handle errors from spl_kern_path_locked()
When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added
by commit bcb1589 an entirely new vn_remove() implementation was
added.  That function did not properly handle an error from
spl_kern_path_locked() which would result in an panic.  This
patch addresses the issue by returning the error to the caller.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #187
2012-12-03 12:06:25 -08:00
Brian Behlendorf
b84412a6e8 Linux compat 3.7, kernel_thread()
The preferred kernel interface for creating threads has been
kthread_create() for a long time now.  However, several of the
SPLAT tests still use the legacy kernel_thread() function which
has finally been dropped (mostly).

Update the condvar and rwlock SPLAT tests to use the modern
interface.  Frankly this is something we should have done a
long time ago.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #194
2012-12-03 09:36:21 -08:00
Brian Behlendorf
043f9b5724 Disable FS reclaim when allocating new slabs
Allowing the spl_cache_grow_work() function to reclaim inodes
allows for two unlikely deadlocks.  Therefore, we clear __GFP_FS
for these allocations.  The two deadlocks are:

* While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function
  calls kmem_cache_alloc() which happens to need to allocate a
  new slab.  To allocate the new slab we enter FS level reclaim
  and attempt to evict several inodes.  To evict these inodes we
  need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it
  just happens that obj1 and obj2 use the same hashed lock.

* Similar to the first case however instead of getting blocked
  on the hash lock we block in txg_wait_open() which is waiting
  for the next txg which isn't coming because the txg_sync
  thread is blocked in kmem_cache_alloc().

Note this isn't a 100% fix because vmalloc() won't strictly
honor __GFP_FS.  However, it practice this is sufficient because
several very unlikely things must all occur concurrently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1101
2012-11-27 13:43:27 -08:00
Brian Behlendorf
dc1b30224f Never spin in kmem_cache_alloc()
If we are reaping from the cache and a concurrent allocation
occurs then the caller must block until the reaping is complete.
This is signaled by the clearing of the KMC_BIT_REAPING bit.

Otherwise the caller will be in a tight loop which takes and
releases the skc->skc_cache lock.  When there are multiple
concurrent callers the system will thrash on the lock and
appear to lock up.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 15:48:39 -08:00
Brian Behlendorf
a1af8fb1ea Optimize spl_kmem_cache_free()
Because only virtual slabs may have emergency objects and these
objects are guaranteed to have physical addresses.  It can be
easily determined if the passed object is a virtual slab object
or an emergency object.  This allows us to completely optimize
the emergency object free case out of the common free path.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf
ed3163484d Track emergency object in rbtree
In the initial implementation emergency objects were tracked on a
per-cache list.  The assumption was that under normal operation we
would never allocate more than a handful of these objects.  So the
cost of walking the list during free was expected to be negligible.

However real world usage has shown that emergency objects tend to
be allocated in batches.  A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.

Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimize contention on the assoicated spin lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf
165f13c33a Improved vmem cached deadlock detection
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks.  In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.

With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time.  Once a cache is
is marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects.  This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.

The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them latter in kmem_cache_free().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:15 -08:00
Brian Behlendorf
1112486356 splat kmem:slab_overcommit: Disabled
Disable this test because it may result in an OOM event on the
system which can result in the test infrastructure being killed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:57 -08:00
Brian Behlendorf
b8296bf3e6 splat atomic:64-bit: Create thread outside spin lock
The Fedora 3.6 debug kernel identified the following issue where
we create a thread under a spin lock.  This isn't safe because
sleeping could result in a deadlock.  Therefore the lock is changed
to a mutex so it's safe to sleep.

  BUG: sleeping function called from invalid context at mm/slub.c:930
  in_atomic(): 1, irqs_disabled(): 0, pid: 10583, name: splat
  1 lock held by splat/10583:

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:57 -08:00
Brian Behlendorf
0e149d4204 splat: Fix log buffer locking
The Fedora 3.6 debug kernel identified the following issue where
we call copy_to_user() under a spin lock().  This used to be safe
in older kernels but no longer appears to be true so the spin
lock was changed to a mutex.  None of this code is performance
critical so allowing the process to sleep is harmless.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:56 -08:00
Brian Behlendorf
df870a697f splat: Cleanup headers
Restructure the the SPLAT headers such that each test only
includes the minimal set of headers it requires.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:56 -08:00
Brian Behlendorf
d2733258d0 Condition variable reference counts
Reference count every entry and exit from the condition variable
functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast().

This allows us to safely block in cv_destroy() until all consumers
have been scheduled and are no longer accessing the condition
variable memory.

In addition poison the magic value at the start of cv_destroy() to
ensure there are never any new callers after cv_destroy() is called.
The consumer is responsible for ensuring this never occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:55 -08:00
Brian Behlendorf
dba79fcbf2 Add KSTAT_TYPE_TXG type
Add a new kstat type for tracking useful statistics about a TXG.
The new KSTAT_TYPE_TXG type can be used to tracks the following
statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth;       - Creation time
  nread        - Bytes read
  nwritten;    - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time;   - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time;   - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-02 15:17:40 -07:00
Brian Behlendorf
71c9f0b003 Make kstat.ks_update() callback atomic
Move the kstat ks_update() callback under the ks_lock.  This
enables dynamically sized kstats without modification to the
kstat API.

  * Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
  * Register a ->ks_update() callback which does:
    o Frees any existing ks_data buffer.
    o Set ks_data_size to the kstat array size.
    o Set ks_data to an allocated buffer of size ks_data_size
    o Populate the array of buffers with the required data.

The buffer allocated in the ks_update() callback is guaranteed
to remain allocated and valid while the proc sequence handler
iterates over the buffer.  The lock will not be dropped until
kstat_seq_stop() function is run making it safe for concurrent
access.  To allow the ks_update() callback to perform memory
allocations the lock was changed to a mutex.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-23 09:36:19 -07:00
Brian Behlendorf
1e0c2c2ccf Linux 3.7 compat, __clear_close_on_exec() removed
Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec()
function out of include/linux/fdtable.h and in to fs/file.c
making it unavailable to the SPL.

Now as it turns out we only used this function to tear down
some test infrastructure for the vn_getf()/vn_releasef() SPLAT
regression tests.  Rather than implement even more autoconf
compatibilty code to handle this we just remove the test case.
This also allows us to drop three existing autoconf tests.

This does mean the SPLAT tests will no longer verify these
functions but historically they have never been a problem.
And if we feel we absolutely need this test coverage I'm
sure a more portable version of the test case could be added.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #183
2012-10-18 13:36:44 -07:00
Yuxuan Shui
bcb15891ab Linux 3.6 compat, kern_path_locked() added
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.

This is good news for the vn implementation because it removes the
need for us to handle the locking.  However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.

Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels.  This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.

Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen util we either abondon the zpool.cache file
or implement alternate infrastructure to update is correctly in
user space.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #154
2012-10-14 16:26:21 -07:00
Massimo Maggi
dea3505dff Switch KM_SLEEP to KM_PUSHPAGE
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 16:22:29 -07:00
Etienne Dechamps
bbdc6ae495 Add interface for file hole punching.
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.

This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.

This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.

On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().

This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #168
2012-10-04 16:22:07 -07:00
Brian Behlendorf
3050c9314f Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

This change was originally part of cd5ca4b but was reverted by
330fe01.  It always should have had its own commit for exactly
this reason.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 12:27:09 -07:00
Brian Behlendorf
9b51f21841 Remove TQ_SLEEP -> KM_SLEEP mapping
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP.  Unfortunately, this
assumed that the TQ_* flags would never confict with any of the
Linux GFP_* flags.  When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.

Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts.  Instead a simple mapping
function is introduce to convert TQ_* -> KM_* where needed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
2012-09-12 11:41:42 -07:00
Brian Behlendorf
330fe010e4 Revert "Switch KM_SLEEP to KM_PUSHPAGE"
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 10:07:48 -07:00
Brian Behlendorf
3c60f5054c Debug cv_destroy() with mutex held
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast().  We had some troubles with this previously but
there may still be a small race, see commit d599e4f.

The following patch closes one small race and improves the ASSERTs
such that they log the offending value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
2012-09-10 10:23:26 -07:00
Brian Behlendorf
95331f4437 Set KMC_NOEMERGENCY for zlib workspaces
The workspace required by zlib to perform compression is roughly
512MB (order-7).  These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.

It is far preferable to asynchronously vmalloc an additional slab
in case it's needed.  Then simply block waiting for an existing
object to be released or for the new slab to be allocated.

This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
2012-09-07 14:36:26 -07:00
Brian Behlendorf
cb5c2acebb Add KMC_NOEMERGENCY slab flag
Provide a flag to disable the use of emergency objects for a
specific kmem cache.  There may be instances where under no
circumstances should you kmalloc() an emergency object.  For
example, when you cache contains very large objects (>128k).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-07 14:27:03 -07:00
Brian Behlendorf
46b3945d5d Suppress task_hash_table_init() large allocation warning
When various kernel debuging options are enabled this allocation
may be larger than usual as shown by the following warning.  It
is in no way harmful so we suppress the warning.

  SPL: large kmem_alloc(40960, 0x80d0) at
  tsd_hash_table_init:358 (76495/76495)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #93
2012-08-30 21:02:52 -07:00
Brian Behlendorf
efcd0ca32d Enhance SPLAT kmem:slab_overcommit test
After the emergency slab objects were merged I started observing
timeout failures in the kmem:slab_overcommit test.  These were
due to the ineffecient way the slab_overcommit reclaim function
was implemented.  And due to the additional cost of potentially
allocating ten of thousands of emergency objects and tracking
them on a single list.

This patch addresses the first concern by enhansing the test
case to trace all of the allocations objects as a linked list.
This allows for a cleaner version of the reclaim function to
simply release SPLAT_KMEM_OBJ_RECLAIM objects.

Since this touches some common code all the tests which share
these data structions were also updated.  After making these
changes slab_overcommit is reliably passing.  However, there
is certainly additional cleanup which could be done here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-30 15:49:00 -07:00
Brian Behlendorf
cd5ca4b2f8 Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf
500e95c884 Revert "Disable vmalloc() direct reclaim"
This reverts commit 2092cf68d8.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf
617f79de6a Revert "Fix NULL deref in balance_pgdat()"
This reverts commit b8b6e4c453.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf
bc03e07a7c Revert "Detect kernels that honor gfp flags passed to vmalloc()"
This reverts commit 36811b4430.
Which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address.  Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf
d47e664ad4 Revert "Add TASKQ_NORECLAIM flag"
This reverts commit 372c257233.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf
e2dcc6e2b8 Emergency slab objects
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs.  The issue is that the Linux kernel does
not honor the flags passed to __vmalloc().  This makes it unsafe
to use in a writeback context.  Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accomidate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code.  This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accomidate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue.  This doesn't prevent the posibility
of the worker thread from deadlocking.  However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available,.  The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will timeout out.  When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.

These emergency allocations will likely be slow since they require
contiguous pages.  However, their use should be rare so the impact
is expected to be minimal.  If that turns out not to be the case in
practice further optimizations are possible.

One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed.  Is they accumulate on a system and the list
grows freeing objects will become more expensive.  This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.

Additionally, these emeregency objects could be repacked in to existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented.  See issue https://github.com/zfsonlinux/spl/issues/26
for full details.  This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end.  The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case.  These value should give us a good idea
of how often these objects are needed.  Based on these values under
real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock.   This should manifest itself as reduced cpu usage for
the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Prakash Surya
08850eddcb Avoid calling smp_processor_id in spl_magazine_age
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.

The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 09:43:22 -07:00
Richard Yao
15d0411297 Remove Makefile from non-toplevel .gitignore files
When building SPL support into the kernel, ./copy-builtin will copy
non-toplevel .gitignore files. These files list /Makefile, which causes
git-archive to omit ./module/{spl,splat}/Makefile. The absence of these
files result in build failures when SPL is selected. ZFS is unaffected
because it puts Makefile in the toplevel .gitignore, which is not
copied. We fix SPL by emulating that behavior.

Reported-by: Fabio Erculiani <lxnay@gentoo.org>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #152
2012-08-23 12:49:04 -07:00
Prakash Surya
9baf44bc17 Wrap trace_set_debug_header in trace_[get|put]_tcd
To properly support CONFIG_PREEMPT enabled kernels, we must refrain from
using a CPU index when preemption is enabled. As a result, this change
moves the trace_set_debug_header call (which calls smp_processor_id)
within trace_get_tcd and trace_put_tcd (which disable and enable
preemption respectively).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #160
2012-08-23 10:01:20 -07:00
Richard Yao
6576a1a70d Fix incorrect type in spl_kmem_cache_set_move() parameter
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2012-08-01 16:35:18 -07:00
Etienne Dechamps
a9f2397ee9 Determine the hostid on demand.
Currently, the SPL tries to determine the hostid at module load. The
hostid is usually determined by running the userland program "hostid"
during module initialization.

Unfortunately, when the module initializes, it may be way too soon to be
able to run any userland programs. This is especially true when the
module is compiled directly inside the kernel (built-in); in that case,
the SPL would try to run hostid when the kernel is still initializing,
which of course is doomed to fail.

This patch fixes the issue by deferring hostid generation until
something actually needs the hostid (that is, when zone_get_hostid() is
called), thus switching to a "on-initialization" model to a "on-demand"
(lazy loading) model. ZFS only needs the hostid when some pool
operations are requested, and this always happens way after the kernel
has finished initialization, thus solving the problem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:14:02 -07:00
Etienne Dechamps
c167aadb27 Add script for builtin module building.
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building SPL as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the spl source package.

To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the spl
source package.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:13:09 -07:00
Etienne Dechamps
38b5ff4d07 Fix undefined reference on spl_mutex_spin_max().
Commit 3160d4f56b changed the set of
conditions under which spl_mutex_spin_max would be implemented as a
function by changing an #if in sys/mutex.h. The corresponding
implementation file spl-mutex.c, however, has not been updated to
reflect the change. This results in undefined reference errors on
spl_mutex_spin_max under the following condition:

((!CONFIG_SMP || CONFIG_DEBUG_MUTEXES) && HAVE_MUTEX_OWNER && HAVE_TASK_CURR)

This patch fixes the issue by using the same #if in sys/mutex.h and
spl-mutex.c.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:54:53 -07:00
Etienne Dechamps
94aac9c9bc Use MODULE variable in module Makefile like zfs.
In zfs, each module Makefile contains a MODULE variable which contains
the name of the module, and the following declarations reference this
variable.

In spl, there is a MODULES variable which is never used. Rename it to
MODULE and use it like in zfs. This improves consistency between the two
build systems.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:53:48 -07:00
Brian Behlendorf
e8267acd25 32-bit compat, hostid_read()
Explicitly cast the sizeof in hostid_read() to prevent the
following compiler warning on 32-bit systems.

  module/spl/spl-generic.c:490:10: error: format '%lu' expects
  argument of type 'long unsigned int', but argument 4 has type
  'unsigned int' [-Werror=format]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 11:14:04 -07:00
Richard Yao
36811b4430 Detect kernels that honor gfp flags passed to vmalloc()
zfsonlinux/spl@2092cf68d8 used
PF_MEMALLOC to workaround a bug in the Linux kernel where
allocations did not honor the gfp flags passed to vmalloc().
Unfortunately, PF_MEMALLOC has the side effect of permitting
allocations to allocate pages outside of ZONE_NORMAL. This
has been observed to result in the depletion of ZONE_DMA32.

A kernel patch is available in the Gentoo bug tracker for
this issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

This negates any benefit PF_MEMALLOC provides, so we introduce
an autotools check to disable the use of PF_MEMALLOC on
systems with patched kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #126
2012-07-11 11:44:27 -07:00
Richard Yao
973e8269bd Constify memory management functions
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #131
2012-07-03 16:07:27 -07:00
Brian Behlendorf
44e406d712 PowerPC Compatibility
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #119
2012-07-02 09:33:09 -07:00
Richard Yao
e0093fea58 Linux 3.4 compat, __clear_close_on_exec replaces FD_CLR
torvalds/linux@1dce27c5aa introduced
__clear_close_on_exec() as a replacement for FD_CLR. Further commits
appear to have removed FD_CLR from the Linux source tree.  This
causes the following failure:

  error: implicit declaration of function '__FD_CLR'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current
__clear_close_on_exec() interface for readability.  Then we introduce
an autotools check to determine if __clear_close_on_exec() is available.
If it isn't then we define some compatibility logic which used the older
FD_CLR() interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #124
2012-06-13 16:18:51 -07:00
Brian Behlendorf
eaac9ba510 Fix uninit variable in slab reclaim test
Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test
as potentially unitialized.  In practice this will never occur but
to keep gcc happy we initialize the variable to zero.

Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>
2012-06-13 16:17:22 -07:00