I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.
Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#513
Correct the redhat specfile so that working debuginfo rpms are created
for the kernel modules. The generic specfile already does the right
thing.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#4224
To prevent taskq_member holding tq_lock and doing linear search, thus causing
contention. We store the taskq pointer to which the thread belongs in tsd.
This way taskq_member will not need to touch tq_lock, and tsd has per slot
spinlock. So the contention should be reduced greatly.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#500Closes#504Closes#505
The pfn_t typedef was inherited from Illumos but never directly
used by any SPL consumers. This didn't cause any issues until
the Linux 4.5 kernel introduced a typedef of the same name.
See torvalds/linux/commit/34c0fd54, this patch removes the
unused Illumos version to prevent a conflict.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#524
In b4ad50a, we abandoned memalloc_noio_save in favor of spl_fstrans_mark
because earlier kernel with it doesn't turn off __GFP_FS. However, for newer
kernel, we would prefer PF_MEMALLOC_NOIO because it would work for allocation
in kernel which we cannot control otherwise. So in this patch, we turn on both
PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#523
If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.
We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.
This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.
Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue zfsonlinux/zfs#4166
Issue zfsonlinux/zfs#4106
The spl_kmem_cache_kmem_threads module option was accidentally
omitted from the documentation. Add it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#512
The do_div() macro expects unsigned types and this is detected in
powerpc implementation of do_div().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#516
For earlier versions of the kernel with memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.
Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.
This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#515
Issue zfsonlinux/zfs#4111
This patch provides 2 new kstats to display task queues:
/proc/spl/taskqs-all - Display all task queues
/proc/spl/taskqs - Display only "active" task queues
A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.
If the task queue has running threads, displays each thread function's
address (symbolically, if possibly) and its argument.
If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and arguemnt.
If the task queue has any waiters, displays each waiting task's pid.
Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
This reverts commit 61bbbd9a77 because
older versions of autoconf (2.63) do not support the cross-compile
argument to AC_RUN_IFELSE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #507
This patch only addresses the issues identified by the style checker.
It contains no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The flags argument in spin_lock_irqsave is modified out side of spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for them. This
patch removes it and uses local variable for the flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#506
When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking. This is
a false positive.
One such call chain is as follows, when a taskq needs more threads:
taskq_dispatch->taskq_thread_spawn->taskq_dispatch
The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads. The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock. Without subclassing, lockdep believes these
could potentially be the same lock and complains. A similar case occurs
when taskq_dispatch() then calls task_alloc().
This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:
subclass taskq
TQ_LOCK_DYNAMIC dynamic_taskq
TQ_LOCK_GENERAL any other
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
spl_inode_{lock,unlock} are triggering possible recursive locking
warnings from lockdep. The warning is a false positive.
The lock is used to protect a parent directory during delete/add
operations, used in zfs when writing/removing the cache file. The inode
lock is taken on both the parent inode and the file inode.
VFS provides an enum to subclass the lock. This patch changes the
spin_lock call to _nested version and uses the provided enum.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.
When lockdep detects these conditions, it disables further lock analysis
for all locks. This causes /proc/lock_stats not to reflect full
information about lock contention, even in locks without dependency
issues.
This commit creates a new type of mutex, MUTEX_NOLOCKDEP. This mutex
type causes subsequent attempts to take or release those locks to be
wrapped in lockdep_off() and lockdep_on().
This commit also creates an RW_NOLOCKDEP type analagous to
MUTEX_NOLOCKDEP.
MUTEX_NOLOCKDEP and RW_NOLOCKDEP are also defined in zfs, in a commit to
that repo, for userspace builds.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
The SPL fails to build with some "Configured" kernels (ex. openSUSE
xen Kernel) this change should make same binaries with C compiler
optimization.
Signed-off-by: zgock <zgock@nuc.base.zgock-lab.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#510
For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 of _LP64 would be defined. Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither are defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Issue zfsonlinux/zfs#4048
This reverts commit a430c11f0b. Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500
The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame. This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.
This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#500
If a vnode is released asynchronously through areleasef(), it is
possible for the user process to reuse the file descriptor before
areleasef is called. When this happens, getf() will return a stale
reference, any operations in the kernel on that file descriptor will
fail (as it is closed) and the operations meant for that fd will
never occur from userspace's perspective.
We correct this by detecting this condition in getf(), doing a putf
on the old file handle, updating the file descriptor and proceeding
as if everything was fine. When the areleasef() is done, it will
harmlessly decrement the reference counter on the Illumos file handle.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#492
This was originally in e80cd06b8e, but somehow
was changed and not working anymore. And it will cause the following error:
modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#501
Useful when looking for the info on ZFS/SPL related memory consumption.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#460
Replace DKIOCTRIM with DKIOCFREE and add additional support required
for Nextenta's TRIM support.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#469
This is needed for architectures that do not have a builtin prefetchw()
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#502
Adding VPATH support, commit 37d7cd9, required that a `src`
and `obj` line be added to the top of the Makefiles. They
must be removed from the Makefiles when builtin.
The code which adds the `spl/` directory to the top level
Makefile was failing due to the addition of the `certs/` path.
The search pattern has been adjusted to be more tolerant.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #481
Issue #498
Limit the maximum object size to 1/128 of total system memory for
the kmem cache tests. Large values can result in out of memory errors
for systems with less the 512M of memory. Additionally, use the
known number of objects per-slab for calculating the number of
objects to use for a test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.
Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.
Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2505
Closes#488
Currently taskq_dispatch() will spawn new task with a condition that the caller
is also a member of the taskq. However, under this condition, it will still
cause deadlock where a task on tq1 is waiting another thread, who is trying to
dispatch a task on tq1. So this patch removes the check.
For example when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy
This will easily deadlock before this patch.
Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#496
Linux slab will automatically free empty slab when number of partial slab is
over min_partial, so we don't need to explicitly shrink it. In fact, calling
kmem_cache_shrink from shrinker will cause heavy contention on
kmem_cache_node->list_lock, to the point that it might cause __slab_free to
livelock (see zfsonlinux/zfs#3936)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3936
Closes#487
Allocate a kmem cache magazine for every possible CPU which might
be added to the system. This ensures that when one of these CPUs
is enabled it can be safely used immediately.
For many systems the number of online CPUs is identical to the
number of present CPUs so this does imply an increased memory
footprint. In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#482
If CONFIG_RWSEM_SPIN_ON_OWNER is defined, rw_semaphore will have an owner
field, so we don't need our own.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
The spin lock around rw_owner is completely unnecessary. The reason is that it
is only modified in the down_write context. If you race against another thread
modifying it, that means that you aren't holding the rwlock, so taking the
spin lock don't eliminate the race.
Also, we only check rw_owner in RW_WRITE_HELD because spl_rwsem_is_locked
is unnecessary and might need to take spin lock.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
Modern versions of dkms cleanup the build directory after installing.
This resulted in 'dkms uninstall' never running because the check
added by commit 4cdcdbf which verifies the existence of the
spl.release build product would never be true.
This patch resolves the issue by updating the conditional to check
in the explicitly installed spl_config.h file for the version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#478
Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS are enabled. When this kernel option
is enabled slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.
Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.
Verified-by: Wuffleton <null@wuffleton.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2977
Issue #3796
Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes#474
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#472
This reverts commit 076821e due to a locking issue uncovered in
subsequent testing. An ASSERT is hit due to tq->tq_nspawn being
updated outside the lock. The patch will need to be reworked.
VERIFY3(0 == tq->tq_nspawn) failed (0 == -1)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #472
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if calling thread depends on
the recursively-dispatched thread for its return condition.
This patch attempts to create a new thread for recursive dispatch when
none are available.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#472
Starting from Linux 4.1, bio_vec will be allowed to pass into filesystem via
iter_read/iter_write, so we add a bio_vec field in uio_t to hold it, and use
UIO_BVEC in segflg to determine which "vec".
Also, to be consistent to newer kernel, we make iovec and bio_vec immutable,
and make uio act as an iterator with the new uio_skip field indicating number
of bytes to skip in the first segment.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3511
Issue zfsonlinux/zfs#3640
Closes#468
Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error. Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface. As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.
vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3653
Prevents ARC collapse when non-ZFS filesystems, the block layer or other
memory consumers use a lot of reclaimable memory in the page cache.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#3680
Closes#471
This patch reverts 77ab5dd. This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places. Those places have subsequently been updated to use
the Linux equivalent Linux functionality.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3637