mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-23 02:44:41 +03:00

Author	SHA1	Message	Date
Chunwei Chen	39cd90ef08	Add cv_timedwait_sig_hires to allow interruptible sleep Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #548	2016-05-12 14:54:15 -07:00
Tim Chase	ea2633ad26	Clear PF_FSTRANS over spl_filp_fallocate() The problem described in `2a5d574` also applies to XFS's file or inode fallocate method. Both paths may trigger writeback and expose this issue, see the full stack below. When layered on XFS a warning will be emitted under CentOS7 when entering either the file or inode fallocate method with PF_FSTRANS already set. To avoid triggering this error PF_FSTRANS is cleared and then reset in vn_space(). WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0 Call Trace: [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0 [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs] [<ffffffff81173ed7>] __writepage+0x17/0x40 [<ffffffff81176f81>] write_cache_pages+0x251/0x530 [<ffffffff811772b1>] generic_writepages+0x51/0x80 [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs] [<ffffffff81177300>] do_writepages+0x20/0x30 [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100 [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0 [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs] [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs] [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl] [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs] [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs] [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs] [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl] [<ffffffff810c1777>] kthread+0xd7/0xf0 [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Nikolay Borisov <kernel@kyup.com> Closes zfsonlinux/zfs#4529	2016-04-26 11:22:43 -07:00
Tim Chase	7bb5d92de8	Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another thread to be spawned. This will cause TQ_NOQUEUE to behave similarly as it does with non-dynamic taskqs. Add support for TQ_NOQUEUE to taskq_dispatch_ent(). Signed-off-by: Tim Chase <tim@onlight.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #530	2016-03-17 09:52:35 -07:00
Brian Behlendorf	a6ae97caed	Add rw_tryupgrade() This implementation of rw_tryupgrade() behaves slightly differently from its counterparts on other platforms. It drops the RW_READER lock and then acquires the RW_WRITER lock leaving a small window where no lock is held. On other platforms the lock is never released during the upgrade process. This is necessary under Linux because the kernel does not provide an upgrade function. There are currently no callers in the ZFS code where this change in behavior is a problem. In fact, in most cases the code is already written such that if the upgrade fails the RW_READER lock is dropped and the caller blocks waiting to acquire the lock as RW_WRITER. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Matthew Thode <prometheanfire@gentoo.org> Closes zfsonlinux/zfs#4388 Closes #534	2016-03-10 13:05:25 -08:00
Richard Yao	0b43696e66	random_get_pseudo_bytes() need not provide cryptographic strength entropy Perf profiling of dd on a zvol revealed that my system spent 3.16% of its time in random_get_pseudo_bytes(). No SPL consumers need cryptographic strength entropy, so we can reduce our overhead by changing the implementation to utilize a fast PRNG. The Linux kernel did not export a suitable PRNG function until it exported get_random_int() in Linux 3.10. While we could implement an autotools check so that we use it when it is available or even try to access the symbol on older kernels where it is not exported using the fact that it is exported on newer ones as justification, we can instead implement our own pseudo-random data generator. For this purpose, I have written one based on a 128-bit pseudo-random number generator proposed in a paper by Sebastiano Vigna that itself was based on work by the late George Marsaglia. http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf Profiling the same benchmark with an earlier variant of this patch that used a slightly different generator (roughly same number of instructions) by the same author showed that time spent in random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50 improvement. This particular generator algorithm is also well known to be fast: http://xorshift.di.unimi.it/#speed The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14 GBps of throughput on an Intel Core i7-4770 in what is presumably a single-threaded context. Using it in `random_get_pseudo_bytes()` in the manner I have will probably not reach that level of performance, but it should be fairly high and many times higher than the Linux `get_random_bytes()` function that we use now, which runs at 16.3 MB/s on my Intel Xeon E3-1276v3 processor when measured by using dd on /dev/urandom. Also, putting this generator's seed into per-CPU variables allows us to eliminate overhead from both spin locks and CPU memory barriers, which is NUMA friendly. We could have alternatively modified consumers to use something like `gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but that has a few potential problems that this approach avoids: 1. Switching to `gethrtime() % 3` in hot code paths today requires diverging from illumos-gate and does nothing about potential future patches from illumos-gate that call our slow `random_get_pseudo_bytes()` in different hot code paths. Reimplementing `random_get_pseudo_bytes()` with a per-CPU PRNG avoids both of those things entirely, which means less work for us in the future. 2. Looking at the code that implements `gethrtime()`, I think it is unlikely to be faster than this per-CPU PRNG implementation of `random_get_pseudo_bytes()`. It would be best to go with something fast now so that there is no point in revisiting this from a performance perspective. 3. `gethrtime() % 3` can vary in behavior from system to system based on kernel version, architecture and clock source. In comparison, this per-CPU PRNG is about ~40 lines of code in `random_get_pseudo_bytes()` that should behave consistently across all systems regardless of kernel version, system architecture or machine clock source. It is unlikely that we would ever need to revisit this per-CPU PRNG while the same cannot be said for `gethrtime() % 3`. 4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions depending on the clock source, so replacing `random_get_pseudo_bytes()` with `gethrtime()` in hot code paths could still require a future person working on NUMA scalability to reimplement it anyway while this per-CPU PRNG would not by virtue of using neither CPU memory barriers nor atomic instructions. Note that I did not check various clock sources for the presence of atomic instructions. There is simply too much code to read and given the drawbacks versus this per-cpu PRNG, there is no point in being certain. 5. I have heard of instances where poor quality pseudo-random numbers caused problems for HPC code in ways that took more than a year to identify and were remedied by switching to a higher quality source of pseudo-random numbers. While filesystems are different than HPC code, I do not think it is impossible for us to have instances where poor quality pseudo-random numbers can cause problems. Opting for a well studied PRNG algorithm that passes tests for statistical randomness over changing callers to use `gethrtime() % 3` bypasses the need to think about both whether poor quality pseudo-random numbers can cause problems and the statistical quality of numbers from `gethrtime() % 3`. 6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is probably not a huge issue, but anyone using kgdb would never be able to step through a seqlock critical section, which is not a problem either now or with the per-CPU PRNG: https://en.wikipedia.org/wiki/Seqlock The only downside that I can see is that this code's memory requirement is O(N) where N is NR_CPUS, versus the current code and `gethrtime() % 3`, which are O(1), but that should not be a problem. The seeds will use 64KB of memory at the high end (i.e `NR_CPU == 4096`) and 16 bytes of memory at the low end (i.e. `NR_CPU == 1`). In either case, we should only use a few hundred bytes of code for text, especially since `spl_rand_jump()` should be inlined into `spl_random_init()`, which should be removed during early boot as part of "Freeing unused kernel memory". In either case, the memory requirements are minuscule. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #372	2016-02-17 09:49:09 -08:00
Chunwei Chen	8f3b403a73	Allow kicking a taskq to spawn more threads This patch add a module parameter spl_taskq_kick. When writing non-zero value to it, it will scan all the taskq, if a taskq contains a task pending for more than 5 seconds, it will be forced to spawn a new thread. This is use as an emergency recovery from deadlock, not a general solution. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #529	2016-02-05 14:08:31 -08:00
Brian Behlendorf	6b38e7510f	Remove RLIM64_INFINITY assert in vn_rdwr() Previous commit `be29e6a` updated kobj_read_file() so it no longer unconditionally passes RLIM64_INFINITY. The vn_rdwr() function needs to be updated accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #513	2016-01-23 11:16:23 -08:00
Richard Yao	be29e6a6e6	kobj_read_file: Return -1 on vn_rdwr() error I noticed that the SPL implementation of kobj_read_file is not correct after comparing it with the userland implementation of kobj_read_file() in zfsonlinux/zfs#4104. Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr implementation did not support it anyway, so there is no difference. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #513	2016-01-23 10:10:44 -08:00
Chunwei Chen	16522ac290	Use tsd to store tq for taskq_member To prevent taskq_member holding tq_lock and doing linear search, thus causing contention. We store the taskq pointer to which the thread belongs in tsd. This way taskq_member will not need to touch tq_lock, and tsd has per slot spinlock. So the contention should be reduced greatly. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #500 Closes #504 Closes #505	2016-01-20 13:07:45 -08:00
Chunwei Chen	e843553d03	Don't hold mutex until release cv in cv_wait If a thread is holding mutex when doing cv_destroy, it might end up waiting a thread in cv_wait. The waiter would wake up trying to aquire the same mutex and cause deadlock. We solve this by move the mutex_enter to the bottom of cv_wait, so that the waiter will release the cv first, allowing cv_destroy to succeed and have a chance to free the mutex. This would create race condition on the cv_mutex. We use xchg to set and check it to ensure we won't be harmed by the race. This would result in the cv_mutex debugging becomes best-effort. Also, the change reveals a race, which was unlikely before, where we call mutex_destroy while test threads are still holding the mutex. We use kthread_stop to make sure the threads are exit before mutex_destroy. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue zfsonlinux/zfs#4166 Issue zfsonlinux/zfs#4106	2016-01-12 15:18:44 -08:00
Brian Behlendorf	2a552736b7	Fix do_div() types in condvar:timeout The do_div() macro expects unsigned types and this is detected in powerpc implementation of do_div(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #516	2015-12-22 10:32:35 -08:00
Chunwei Chen	b4ad50ac5f	Use spl_fstrans_mark instead of memalloc_noio_save For earlier versions of the kernel with memalloc_noio_save, it only turns off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This would cause threads to direct reclaim into ZFS and cause deadlock. Instead, we should stick to using spl_fstrans_mark. Since we would explicitly turn off both __GFP_IO and __GFP_FS before allocation, it will work on every version of the kernel. This impacts kernel versions 3.9-3.17, see upstream kernel commit torvalds/linux@934f307 for reference. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #515 Issue zfsonlinux/zfs#4111	2015-12-18 13:24:52 -08:00
Tim Chase	200366f23f	Provide kstat for taskqs This patch provides 2 new kstats to display task queues: /proc/spl/taskqs-all - Display all task queues /proc/spl/taskqs - Display only "active" task queues A task queue is considered to be "active" if it currently has active (running) threads or if any of its pending, priority, delay or waitq lists are not empty. If the task queue has running threads, displays each thread function's address (symbolically, if possibly) and its argument. If the task queue has a non-empty list of pending, priority or delayed task queue entries (taskq_ent_t), displays each entry's thread function address and arguemnt. If the task queue has any waiters, displays each waiting task's pid. Note: This patch also updates some comments in taskq.h which referred to "taskq_t" when they should have referred to "taskq_ent_t". Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #491	2015-12-16 09:35:22 -08:00
Brian Behlendorf	2c4332cf79	Fix cstyle issues in spl-taskq.c and taskq.h This patch only addresses the issues identified by the style checker. It contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-12-11 16:20:22 -08:00
Chunwei Chen	066b89e685	Don't use tq->tq_lock_flags The flags argument in spin_lock_irqsave is modified out side of spin_lock context. We cannot use a shared variable like tq->tq_lock_flags for them. This patch removes it and uses local variable for the flags. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #506	2015-12-11 16:20:03 -08:00
Olaf Faaland	326172d854	Subclass tq_lock to eliminate a lockdep warning When taskq_dispatch() calls taskq_thread_spawn() to create a new thread for a taskq, linux lockdep warns of possible recursive locking. This is a false positive. One such call chain is as follows, when a taskq needs more threads: taskq_dispatch->taskq_thread_spawn->taskq_dispatch The initial taskq_dispatch() holds tq_lock on the taskq that needed more worker threads. The later call into taskq_dispatch() takes dynamic_taskq->tq_lock. Without subclassing, lockdep believes these could potentially be the same lock and complains. A similar case occurs when taskq_dispatch() then calls task_alloc(). This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one of two new lock subclasses: subclass taskq TQ_LOCK_DYNAMIC dynamic_taskq TQ_LOCK_GENERAL any other Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #480	2015-12-11 16:19:56 -08:00
Brian Behlendorf	c5a8b1e163	Revert "Make taskq_member() use ->journal_info" This reverts commit `a430c11f0b`. Using journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425! Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #500	2015-12-08 17:12:36 -08:00
Richard Yao	a430c11f0b	Make taskq_member() use ->journal_info The ->journal_info pointer in the task_struct is reserved for use by filesystems and because the kernel can have multiple file systems on the same stack due to direct reclaim, each filesystem that touches ->journal_info in a callback function will save the value at the start of its frame and restore it at the end of its frame. This allows us to safely use ->journal_info to store a pointer to the taskq's struct in taskq threads so that ZFS code paths can detect the presence of a taskq. This could break if the ZFS code were to use taskq_member from the context of direct reclaim. However, there are no such uses of it in that manner, so this is safe. This eliminates an O(N) list traversal under a spinlock with an O(1) unlocked pointer comparison. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: tuxoko <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #500	2015-12-08 13:24:47 -08:00
Richard Yao	1683e75edc	Fix race between getf() and areleasef() If a vnode is released asynchronously through areleasef(), it is possible for the user process to reuse the file descriptor before areleasef is called. When this happens, getf() will return a stale reference, any operations in the kernel on that file descriptor will fail (as it is closed) and the operations meant for that fd will never occur from userspace's perspective. We correct this by detecting this condition in getf(), doing a putf on the old file handle, updating the file descriptor and proceeding as if everything was fine. When the areleasef() is done, it will harmlessly decrement the reference counter on the Illumos file handle. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #492	2015-12-03 15:44:47 -08:00
tuxoko	d28c5c4f04	Prevent rm modules.* when make install This was originally in `e80cd06b8e`, but somehow was changed and not working anymore. And it will cause the following error: modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin' Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #501	2015-12-02 14:38:20 -08:00
Dimitri John Ledkov	9f456111ab	spl-kmem-cache: include linux/prefetch.h for prefetchw() This is needed for architectures that do not have a builtin prefetchw() Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #502	2015-12-02 12:45:06 -08:00
Brian Behlendorf	e7b75d9b46	Limit maximum object size in kmem tests Limit the maximum object size to 1/128 of total system memory for the kmem cache tests. Large values can result in out of memory errors for systems with less the 512M of memory. Additionally, use the known number of objects per-slab for calculating the number of objects to use for a test. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-11-16 15:02:24 -08:00
loli10K	31f24932a4	Remove superfluous `newline` character Remove superfluous `newline` character from spl_kmem_cache_magazine_size module parameter description. Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #499	2015-11-13 15:27:45 -08:00
tuxoko	f5f2b87df0	Fix taskq dynamic spawning Currently taskq_dispatch() will spawn new task with a condition that the caller is also a member of the taskq. However, under this condition, it will still cause deadlock where a task on tq1 is waiting another thread, who is trying to dispatch a task on tq1. So this patch removes the check. For example when you do: zfs send pp/fs0@001 \| zfs recv pp/fs0_copy This will easily deadlock before this patch. Also, move the seq_task check from taskq_thread_spawn() to taskq_thread() because it's not used by the caller from taskq_dispatch(). Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #496	2015-11-13 15:02:55 -08:00
Chunwei Chen	3e7e6f34d0	Don't call kmem_cache_shrink from shrinker Linux slab will automatically free empty slab when number of partial slab is over min_partial, so we don't need to explicitly shrink it. In fact, calling kmem_cache_shrink from shrinker will cause heavy contention on kmem_cache_node->list_lock, to the point that it might cause __slab_free to livelock (see zfsonlinux/zfs#3936) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#3936 Closes #487	2015-11-11 13:48:31 -08:00
Brian Behlendorf	9b13f65d28	Fix CPU hotplug Allocate a kmem cache magazine for every possible CPU which might be added to the system. This ensures that when one of these CPUs is enabled it can be safely used immediately. For many systems the number of online CPUs is identical to the number of present CPUs so this does imply an increased memory footprint. In fact, dynamically allocating the array of magazine pointers instead of using the worst case NR_CPUS can end up decreasing our memory footprint. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #482	2015-10-13 09:50:40 -07:00
Brian Behlendorf	2ebe396046	Fix PAX Patch/Grsec SLAB_USERCOPY panic Support grsecurity/PaX kernel configurations where CONFIG_PAX_USERCOPY_SLABS are enabled. When this kernel option is enabled slabs which are used to copy between user and kernel space must be created with SLAB_USERCOPY. Stock Linux kernels do not have a SLAB_USERCOPY definition so this causes no change in behavior for non-PAX-enabled kernels. Verified-by: Wuffleton <null@wuffleton.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2977 Issue #3796	2015-09-28 09:18:29 -07:00
Richard Yao	d4bf6d8429	Disable direct reclaim in taskq worker threads on Linux 3.9+ Illumos does not have direct reclaim and code run inside taskq worker threads is not designed to deal with it. Allowing direct reclaim inside a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through memalloc_noio_save() to indicate to the kernel's reclaim code that we are inside a context where memory allocations cannot be allowed to block on filesystem activity. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#1274 Issue zfsonlinux/zfs#2390 Closes #474	2015-09-09 14:30:47 -07:00
Brian Behlendorf	4fa4cab972	Linux 4.2 compat: misc_deregister() The misc_deregister() function was changed to a void return type. Rather than add compatibility code to detect this change simply ignore the return code on all kernels. It was only used to log an informational error message of no real value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-09-01 09:20:45 -07:00
Tim Chase	a64e55752f	Create a new thread during recursive taskq dispatch if necessary When dynamic taskq is enabled and all threads for a taskq are occupied, a recursive dispatch can cause a deadlock if calling thread depends on the recursively-dispatched thread for its return condition. This patch attempts to create a new thread for recursive dispatch when none are available. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #472	2015-09-01 08:46:41 -07:00
Brian Behlendorf	801b56090b	Revert "Create a new thread during recursive taskq dispatch if necessary" This reverts commit `076821e` due to a locking issue uncovered in subsequent testing. An ASSERT is hit due to tq->tq_nspawn being updated outside the lock. The patch will need to be reworked. VERIFY3(0 == tq->tq_nspawn) failed (0 == -1) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #472	2015-08-31 17:03:01 -07:00
Tim Chase	076821eaff	Create a new thread during recursive taskq dispatch if necessary When dynamic taskq is enabled and all threads for a taskq are occupied, a recursive dispatch can cause a deadlock if calling thread depends on the recursively-dispatched thread for its return condition. This patch attempts to create a new thread for recursive dispatch when none are available. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #472	2015-08-31 15:52:06 -07:00
Brian Behlendorf	ebc2c9374b	Linux 4.2 compat: vfs_rename() Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels results in an EACCES error. Rather than attempting to add and maintain more ugly compatibility code it's best to just retire this interface. As a first step the SPLAT test is disabled for Linux 4.2 and newer kernels. vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#3653	2015-08-19 16:03:29 -07:00
Brian Behlendorf	9dc5ffbec8	Invert minclsyspri and maxclsyspri On Linux the meaning of a processes priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high value on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro. It could also lead to confusion. Note this change also reverts some of the priorities changes in prior commit `62aa81a`. The rational is as follows: spl_kmem_cache - High priority may result in blocked memory allocs spl_system_taskq - May perform I/O for file backed VDEVs spl_dynamic_taskq - New taskq threads should be spawned promptly Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue zfsonlinux/zfs#3607	2015-07-28 13:59:03 -07:00
Brian Behlendorf	4699d76d19	Remove skc_ref from alloc/free paths As described in spl_kmem_cache_destroy() the ->skc_ref count was added to address the case of a cache reap or grow racing with a destroy. They are not strictly needed in the alloc/free paths because consumers of the cache are responsible for not using it while it's being destroyed. Removing this code is desirable because there is some evidence that contention on this atomic negative impacts performance on large-scale NUMA systems. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #463	2015-07-24 11:11:45 -07:00
Brian Behlendorf	62aa81a577	Add defclsyspri macro Add a new defclsyspri macro which can be used to request the default Linux scheduler priority. Neither the minclsyspri or maxclsyspri map to the default Linux kernel thread priority. This makes it awkward to create taskqs which run with the same priority as the rest of the kernel threads on the system which can lead to performance issues. All SPL callers which previously used minclsyspri or maxclsyspri have been changed to use defclsyspri. The vast majority of callers were part of the test suite which won't have an external impact. The few places where it could impact performance the change was from maxclsyspri to defclsyspri. This makes it more likely the process will be scheduled which may help performance. To facilitate further performance analysis the spl_taskq_thread_priority module option has been added. When disabled (0) all newly created kernel threads will use the default kernel thread priority. When enabled (1) the specified taskq priority will be used. By default this value is enabled (1). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-07-23 13:25:49 -07:00
Brian Behlendorf	9eb361aaa5	Default to --disable-debug-kmem The default kmem debugging (--enable-debug-kmem) can severely impact performance on large-scale NUMA systems due to the atomic operations used in the memory accounting. A 32-thread fio test running on a 40-core 80-thread system and performing 100% cached reads with kmem debugging is: Enabled: READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s, Disabled: READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s, Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issues #463	2015-07-21 11:47:10 -07:00
Turbo Fredriksson	37d7cd94f3	Support parallel build trees (VPATH builds) Build products from an out of tree build should be written relative to the build directory. Sources should be referred to by their locations in the source directory. This is accomplished by adding the 'src' and 'obj' variables for the module Makefile.am, using relative paths to reference source files, and by setting VPATH when source files are not co-located with the Makefile. This enables the following: $ mkdir build $ cd build $ ../configure $ make -s This change also has the advantage of resolving the following warning which is generated by modern versions of automake. Makefile.am:00: warning: source file 'xxx' is in a subdirectory, Makefile.am:00: but option 'subdir-objects' is disabled Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#1082	2015-07-17 12:53:11 -07:00
Brian Behlendorf	3c82160ff2	Set TASKQ_DYNAMIC for kmem and system taskqs Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs to reduce the number of idle threads on the system. Additional threads will be created on demand up to the previous maximum thread counts. This should have minimal, if any, impact on performance. This makes the system taskq consistent with illumos which is always created as a dynamic taskq with up to 64 threads. The task limits for the kmem_cache have been increased to avoid any unnessisary throttling and to keep a larger reserve of task_t structures on the free list. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #458	2015-06-24 15:14:25 -07:00
Brian Behlendorf	f7a973d99b	Add TASKQ_DYNAMIC feature Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic semantics. Initially only a single worker thread will be created to service tasks dispatched to the queue. As additional threads are needed they will be dynamically spawned up to the max number specified by 'nthreads'. When the threads are no longer needed, because the taskq is empty, they will automatically terminate. Due to the low cost of creating and destroying threads under Linux by default new threads and spawned and terminated aggressively. There are two modules options which can be tuned to adjust this behavior if needed. * spl_taskq_thread_sequential - The number of sequential tasks, without interruption, which needed to be handled by a worker thread before a new worker thread is spawned. Default 4. * spl_taskq_thread_dynamic - Provides the ability to completely disable the use of dynamic taskqs on the system. This is provided for the purposes of debugging and troubleshooting. Default 1 (enabled). This behavior is fundamentally consistent with the dynamic taskq implementation found in both illumos and FreeBSD. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #458	2015-06-24 15:14:18 -07:00
Brian Behlendorf	2345368646	Rename cv_wait_interruptible() to cv_wait_sig() Commit `f752b46e` added the cv_wait_interruptible() function to allow condition variables to be woken by signals. This function and its timed wait counterpart should have been named cv_wait_sig() to match the illumos interface which provides the same functionality. This patch renames the symbol but leaves a #define compatibility wrapper in place until the ZFS code can be moved to the correct name. This patch also makes a small number of cosmetic changes to make the condvar source and header cstyle clean. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #456	2015-06-10 16:36:12 -07:00
Chris Dunlop	a876b0305e	Make taskq_wait() block until the queue is empty Under Illumos taskq_wait() returns when there are no more tasks in the queue. This behavior differs from ZoL and FreeBSD where taskq_wait() returns when all the tasks in the queue at the beginning of the taskq_wait() call are complete. New tasks added whilst taskq_wait() is running will be ignored. This difference in semantics makes it possible that new subtle issues could be introduced when porting changes from Illumos. To avoid that possibility the taskq_wait() function is being updated such that it blocks until the queue in empty. The previous behavior remains available through the taskq_wait_outstanding() interface. Note that this function was previously called taskq_wait_all() but has been renamed to avoid confusion. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #455	2015-06-09 12:20:12 -07:00
Brian Behlendorf	62e2eb2329	Fix cstyle issues in spl-tsd.c This patch only addresses the issues identified by the style checker in spl-tsd.c. It contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-24 14:23:07 -07:00
Chunwei Chen	3d39d0afab	Make tsd_set(key, NULL) remove the tsd entry for current thread To prevent leaking tsd entries, we make tsd_set(key, NULL) remove the tsd entry for the current thread. This is alright since tsd_get() returns NULL when the entry doesn't exist. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #443	2015-04-24 14:15:22 -07:00
Richard Yao	d3c677bcd3	Implement areleasef() Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #449	2015-04-24 13:02:37 -07:00
Richard Yao	313b1ea622	vn_getf/vn_releasef should not accept negative file descriptors C type coercion rules require that negative numbers be converted into positive numbers via wraparound such that a negative -1 becomes a positive 1. This causes vn_getf to return a file handle when it should return NULL whenever a positive file descriptor existed with the same value. We should check for a negative file descriptor and return NULL instead. This was caught by ClusterHQ's unit testing. Reference: http://stackoverflow.com/questions/50605/signed-to-unsigned-conversion-in-c-is-it-always-safe Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #450	2015-04-24 13:02:00 -07:00
Brian Behlendorf	2a5d574eca	Clear PF_FSTRANS over vfs_sync() When layered on XFS the following warning will be emitted under CentOS7 when entering vfs_fsync() with PF_FSTRANS already set. This is not an issue for other stock Linux file systems and the warning was removed for newer kernels. However, to avoid triggering this error PF_FSTRANS is cleared and then reset in vn_fsync(). WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0 Call Trace: [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80 [<ffffffffa01706fb>] xfs_vm_writepage+0x5ab/0x5c0 [xfs] [<ffffffff8114b833>] __writepage+0x13/0x50 [<ffffffff8114c341>] write_cache_pages+0x251/0x4d0 [<ffffffff8114c60d>] generic_writepages+0x4d/0x80 [<ffffffffa016fc93>] xfs_vm_writepages+0x43/0x50 [xfs] [<ffffffff8114d68e>] do_writepages+0x1e/0x40 [<ffffffff81142bd5>] __filemap_fdatawrite_range+0x65/0x80 [<ffffffff81142cea>] filemap_write_and_wait_range+0x2a/0x70 [<ffffffffa017a5b6>] xfs_file_fsync+0x66/0x1f0 [xfs] [<ffffffff811df54b>] vfs_fsync+0x2b/0x40 [<ffffffffa03a88bd>] vn_fsync+0x2d/0x90 [spl] [<ffffffffa0520c33>] spa_config_sync+0x503/0x680 [zfs] [<ffffffffa0520ee4>] spa_config_update+0x134/0x170 [zfs] [<ffffffffa0520eba>] spa_config_update+0x10a/0x170 [zfs] [<ffffffffa051c54f>] spa_import+0x5bf/0x7b0 [zfs] [<ffffffffa055c754>] zfs_ioc_pool_import+0x104/0x150 [zfs] [<ffffffffa056294f>] zfsdev_ioctl+0x4cf/0x5c0 [zfs] [<ffffffffa0562480>] ? pool_status_check+0xf0/0xf0 [zfs] [<ffffffff811c2c85>] do_vfs_ioctl+0x2e5/0x4c0 [<ffffffff811c2f01>] SyS_ioctl+0xa1/0xc0 [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-04-07 15:03:47 -07:00
Tim Chase	ae26dd0039	Don't allow shrinking a PF_FSTRANS context Avoid deadlocks when entering the shrinker from a PF_FSTRANS context. This patch also reverts commit `d0d5dd7` which added MUTEX_FSTRANS. Its use has been deprecated within ZFS as it was an ineffective mechanism to eliminate deadlocks. Among other things, it introduced the need for strict ordering of mutex locking and unlocking in order that the PF_FSTRANS flag wouldn't set incorrectly. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #446	2015-04-03 11:32:31 -07:00
Brian Behlendorf	6ab08667a4	Reduce splat_taskq_test2_impl() stack frame size Slightly increasing the size of a kmutex_t has caused us to exceed the stack frame warning size in splat_taskq_test2_impl(). To address this the tq_args have been moved to the heap. cc1: warnings being treated as errors spl-0.6.3/module/splat/splat-taskq.c:358: error: the frame size of 1040 bytes is larger than 1024 bytes Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #435	2015-03-03 10:18:31 -08:00
Brian Behlendorf	d0d5dd7144	Add MUTEX_FSTRANS mutex type There are regions in the ZFS code where it is desirable to be able to be set PF_FSTRANS while a specific mutex is held. The ZFS code could be updated to set/clear this flag in all the correct places, but this is undesirable for a few reasons. 1) It would require changes to a significant amount of the ZFS code. This would complicate applying patches from upstream. 2) It would be easy to accidentally miss a critical region in the initial patch or to have an future change introduce a new one. Both of these concerns can be addressed by adding a new mutex type which is responsible for managing PF_FSTRANS, support for which was added to the SPL in commit `9099312` - Merge branch 'kmem-rework'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #435	2015-03-03 10:18:24 -08:00

1 2 3 4 5 ...

473 Commits