mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-23 10:54:35 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	95331f4437	Set KMC_NOEMERGENCY for zlib workspaces The workspace required by zlib to perform compression is roughly 512KB (order-7). These allocations are so large that we should never attempt to directly kmalloc an emergency object for them. It is far preferable to asynchronously vmalloc an additional slab in case it's needed. Then simply block waiting for an existing object to be released or for the new slab to be allocated. This can be accomplished by disabling emergency slab objects by passing the KMC_NOEMERGENCY flag at slab creation time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> zfsonlinux/zfs#917	2012-09-07 14:36:26 -07:00
Brian Behlendorf	cb5c2acebb	Add KMC_NOEMERGENCY slab flag Provide a flag to disable the use of emergency objects for a specific kmem cache. There may be instances where under no circumstances should you kmalloc() an emergency object. For example, when your cache contains very large objects (>128k). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-07 14:27:03 -07:00
Brian Behlendorf	46b3945d5d	Suppress task_hash_table_init() large allocation warning When various kernel debugging options are enabled this allocation may be larger than usual as shown by the following warning. It is in no way harmful so we suppress the warning. SPL: large kmem_alloc(40960, 0x80d0) at tsd_hash_table_init:358 (76495/76495) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #93	2012-08-30 21:02:52 -07:00
Brian Behlendorf	efcd0ca32d	Enhance SPLAT kmem:slab_overcommit test After the emergency slab objects were merged I started observing timeout failures in the kmem:slab_overcommit test. These were due to the inefficient way the slab_overcommit reclaim function was implemented, and to the additional cost of potentially allocating tens of thousands of emergency objects and tracking them on a single list. This patch addresses the first concern by enhancing the test case to track all of the allocated objects as a linked list. This allows for a cleaner version of the reclaim function to simply release SPLAT_KMEM_OBJ_RECLAIM objects. Since this touches some common code, all the tests which share these data structures were also updated. After making these changes slab_overcommit is reliably passing. However, there is certainly additional cleanup which could be done here. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-30 15:49:00 -07:00
Brian Behlendorf	cd5ca4b2f8	Switch KM_SLEEP to KM_PUSHPAGE Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	500e95c884	Revert "Disable vmalloc() direct reclaim" This reverts commit `2092cf68d8`. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	617f79de6a	Revert "Fix NULL deref in balance_pgdat()" This reverts commit `b8b6e4c453`. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	bc03e07a7c	Revert "Detect kernels that honor gfp flags passed to vmalloc()" This reverts commit `36811b4430`. Which is no longer required because there is now SPL code in place to safely handle the deadlocks the kernel patch was designed to address. Therefore we can unconditionally use vmalloc() and drop all the PF_MEMALLOC code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	d47e664ad4	Revert "Add TASKQ_NORECLAIM flag" This reverts commit `372c257233`. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:42 -07:00
Brian Behlendorf	e2dcc6e2b8	Emergency slab objects This patch is designed to resolve a deadlock which can occur with __vmalloc() based slabs. The issue is that the Linux kernel does not honor the flags passed to __vmalloc(). This makes it unsafe to use in a writeback context. Unfortunately, this is a use case ZFS depends on for correct operation. Fixing this issue in the upstream kernel was pursued and patches are available which resolve the issue. https://bugs.gentoo.org/show_bug.cgi?id=416685 However, these changes were rejected because upstream felt that using __vmalloc() in the context of writeback should never be done. Their solution was for us to rewrite parts of ZFS to accommodate the Linux VM. While that is probably the right long term solution, and it is something we want to pursue, it is not a trivial task and will likely destabilize the existing code. This work has been planned for the 0.7.0 release but in the meantime we want to improve the SPL slab implementation to accommodate this expected ZFS usage. This is accomplished by performing the __vmalloc() asynchronously in the context of a work queue. This doesn't prevent the possibility of the worker thread deadlocking. However, the caller can now safely block on a wait queue for the slab allocation to complete. Normally this will occur in a reasonable amount of time and the caller will be woken up when the new slab is available. The objects will then get cached in the per-cpu magazines and everything will proceed as usual. However, if the __vmalloc() deadlocks for the reasons described above, or is just very slow, then the callers on the wait queues will time out. When this rare situation occurs they will attempt to kmalloc() a single minimally sized object using the GFP_NOIO flags. This allocation will not deadlock because kmalloc() will honor the passed flags and the caller will be able to make forward progress.
As long as forward progress can be maintained then even if the worker thread is deadlocked the critical thread will make progress. This will eventually allow the deadlocked worker thread to complete and normal operation will resume. These emergency allocations will likely be slow since they require contiguous pages. However, their use should be rare so the impact is expected to be minimal. If that turns out not to be the case in practice further optimizations are possible. One additional concern is if these emergency objects are long lived. Right now they are simply tracked on a list which must be walked when an object is freed. If they accumulate on a system and the list grows, freeing objects will become more expensive. This could be handled relatively easily by using a hash instead of a list, but that optimization (if needed) is left for a follow up patch. Additionally, these emergency objects could be repacked into existing slabs as objects are freed if the kmem_cache_set_move() functionality was implemented. See issue https://github.com/zfsonlinux/spl/issues/26 for full details. This work would also help reduce ZFS's memory fragmentation problems. The /proc/spl/kmem/slab file has had two new columns added at the end. The 'emerg' column reports the current number of these emergency objects in use for the cache, and the following 'max' column shows the historical worst case. These values should give us a good idea of how often these objects are needed. Based on these values under real use cases we can tune the default behavior. Lastly, as a side benefit using a single work queue for the slab allocations should reduce cpu contention on the global virtual address space lock. This should manifest itself as reduced cpu usage for the system. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:42 -07:00
Prakash Surya	08850eddcb	Avoid calling smp_processor_id in spl_magazine_age The spl_magazine_age function had the implied assumption that it will remain on its current cpu through its execution. In order to support preempt enabled kernels, this assumption had to be removed. The spl_kmem_magazine structure now holds the cpu id of the cpu it is local to. This allows spl_magazine_age to use this field when scheduling work to be done by the magazine's local cpu. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #98	2012-08-24 09:43:22 -07:00
Richard Yao	15d0411297	Remove Makefile from non-toplevel .gitignore files When building SPL support into the kernel, ./copy-builtin will copy non-toplevel .gitignore files. These files list /Makefile, which causes git-archive to omit ./module/{spl,splat}/Makefile. The absence of these files results in build failures when SPL is selected. ZFS is unaffected because it puts Makefile in the toplevel .gitignore, which is not copied. We fix SPL by emulating that behavior. Reported-by: Fabio Erculiani <lxnay@gentoo.org> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #152	2012-08-23 12:49:04 -07:00
Prakash Surya	9baf44bc17	Wrap trace_set_debug_header in trace_[get|put]_tcd To properly support CONFIG_PREEMPT enabled kernels, we must refrain from using a CPU index when preemption is enabled. As a result, this change moves the trace_set_debug_header call (which calls smp_processor_id) within trace_get_tcd and trace_put_tcd (which disable and enable preemption respectively). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #160	2012-08-23 10:01:20 -07:00
Richard Yao	6576a1a70d	Fix incorrect type in spl_kmem_cache_set_move() parameter A preprocessor definition renders this harmless. However, it is a good idea to change this to be consistent. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>	2012-08-01 16:35:18 -07:00
Etienne Dechamps	a9f2397ee9	Determine the hostid on demand. Currently, the SPL tries to determine the hostid at module load. The hostid is usually determined by running the userland program "hostid" during module initialization. Unfortunately, when the module initializes, it may be way too soon to be able to run any userland programs. This is especially true when the module is compiled directly inside the kernel (built-in); in that case, the SPL would try to run hostid when the kernel is still initializing, which of course is doomed to fail. This patch fixes the issue by deferring hostid generation until something actually needs the hostid (that is, when zone_get_hostid() is called), thus switching from an "on-initialization" model to an "on-demand" (lazy loading) model. ZFS only needs the hostid when some pool operations are requested, and this always happens way after the kernel has finished initialization, thus solving the problem. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 15:14:02 -07:00
Etienne Dechamps	c167aadb27	Add script for builtin module building. This commit introduces a "copy-builtin" script designed to prepare a kernel source tree for building SPL as a builtin module. The script makes a full copy of all needed files, thus making the kernel source tree fully independent of the spl source package. To achieve that, some compilation flags (-include, -I) have been moved to module/Makefile. This Makefile is only used when compiling external modules; when compiling builtin modules, a Kbuild file generated by the configure-builtin script is used instead. This makes sure Makefiles inside the kernel source tree do not contain references to the spl source package. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 15:13:09 -07:00
Etienne Dechamps	38b5ff4d07	Fix undefined reference on spl_mutex_spin_max(). Commit `3160d4f56b` changed the set of conditions under which spl_mutex_spin_max would be implemented as a function by changing an #if in sys/mutex.h. The corresponding implementation file spl-mutex.c, however, has not been updated to reflect the change. This results in undefined reference errors on spl_mutex_spin_max under the following condition: ((!CONFIG_SMP || CONFIG_DEBUG_MUTEXES) && HAVE_MUTEX_OWNER && HAVE_TASK_CURR) This patch fixes the issue by using the same #if in sys/mutex.h and spl-mutex.c. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 14:54:53 -07:00
Etienne Dechamps	94aac9c9bc	Use MODULE variable in module Makefile like zfs. In zfs, each module Makefile contains a MODULE variable which contains the name of the module, and the following declarations reference this variable. In spl, there is a MODULES variable which is never used. Rename it to MODULE and use it like in zfs. This improves consistency between the two build systems. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#851	2012-07-26 14:53:48 -07:00
Brian Behlendorf	e8267acd25	32-bit compat, hostid_read() Explicitly cast the sizeof in hostid_read() to prevent the following compiler warning on 32-bit systems. module/spl/spl-generic.c:490:10: error: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'unsigned int' [-Werror=format] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-07-20 11:14:04 -07:00
Richard Yao	36811b4430	Detect kernels that honor gfp flags passed to vmalloc() zfsonlinux/spl@2092cf68d8 used PF_MEMALLOC to workaround a bug in the Linux kernel where allocations did not honor the gfp flags passed to vmalloc(). Unfortunately, PF_MEMALLOC has the side effect of permitting allocations to allocate pages outside of ZONE_NORMAL. This has been observed to result in the depletion of ZONE_DMA32. A kernel patch is available in the Gentoo bug tracker for this issue. https://bugs.gentoo.org/show_bug.cgi?id=416685 This negates any benefit PF_MEMALLOC provides, so we introduce an autotools check to disable the use of PF_MEMALLOC on systems with patched kernels. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #126	2012-07-11 11:44:27 -07:00
Richard Yao	973e8269bd	Constify memory management functions This prevents warnings in ZFS that were caused by changes necessary to support PaX patched kernels. When debugging is enabled, these warnings become build failures. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #131	2012-07-03 16:07:27 -07:00
Brian Behlendorf	44e406d712	PowerPC Compatibility Usage of get_current() is not supported across all architectures. The correct interface to use is the '#define current' which will map to the appropriate function, usually current_thread_info(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #119	2012-07-02 09:33:09 -07:00
Richard Yao	e0093fea58	Linux 3.4 compat, __clear_close_on_exec replaces FD_CLR torvalds/linux@1dce27c5aa introduced __clear_close_on_exec() as a replacement for FD_CLR. Further commits appear to have removed FD_CLR from the Linux source tree. This causes the following failure: error: implicit declaration of function '__FD_CLR' [-Werror=implicit-function-declaration] To correct this we update the code to use the current __clear_close_on_exec() interface for readability. Then we introduce an autotools check to determine if __clear_close_on_exec() is available. If it isn't then we define some compatibility logic which uses the older FD_CLR() interface. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #124	2012-06-13 16:18:51 -07:00
Brian Behlendorf	eaac9ba510	Fix uninit variable in slab reclaim test Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test as potentially uninitialized. In practice this will never occur but to keep gcc happy we initialize the variable to zero. Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>	2012-06-13 16:17:22 -07:00
Brian Behlendorf	2371321e8a	Fix invalid context bug In the module unload path the vm_file_cache was being destroyed under a spin lock. Because this operation might sleep it was possible, although very very unlikely, that this could result in a deadlock. This issue was identified by using a Linux debug kernel and has been fixed by moving the kmem_cache_destroy() out from under the spin lock. There is no need to lock this operation here. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#771	2012-06-11 09:17:45 -07:00
Jorgen Lundman	93b0dc92ea	Fix ARM 64-bit division Correctly implementing 64-bit division for ARM requires more than just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols. They need to be implemented in such a way that the quotient and remainder are left in specific registers after the division operation completes. This change updates the wrapper functions to accomplish this according to the official ARM Run-time ABI. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#706	2012-05-22 09:27:11 -07:00
Brian Behlendorf	38d31a1e57	Remove Solaris module emulation Originally I believed that these interfaces would be needed. However, in practice it turned out that it was more straightforward and maintainable to use the native Linux interfaces. As such, this is all dead code and can be safely removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #109	2012-05-18 13:57:44 -07:00
Prakash Surya	a9a7a01cf5	Add SPLAT test to exercise slab direct reclaim This test is designed to verify that direct reclaim is functioning as expected. We allocate a large number of objects thus creating a large number of slabs. We then apply memory pressure and expect that the direct reclaim path can easily recover those slabs. The registered reclaim function will free the objects and the slab shrinker will call it repeatedly until at least a single slab can be freed. Note it may not be possible to reclaim every last slab via direct reclaim without a failure because the shrinker_rwsem may be contended. For this reason, quickly reclaiming 3/4 of the slabs is considered a success. This should all be possible within 10 seconds. For reference, on a system with 2G of memory this test takes roughly 0.2 seconds to run. It may take longer on larger memory systems but should still easily complete in the allotted 10 seconds. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #107	2012-05-07 11:55:59 -07:00
Brian Behlendorf	b78d4b9d98	Ensure a minimum of one slab is reclaimed To minimize the chance of triggering an OOM during direct reclaim, the kmem caches have been improved to make a best effort to reclaim at least one slab when a reclaim function is registered. This helps avoid the case where objects are released but they are spread over multiple slabs so no memory gets reclaimed. Care has been taken to avoid deadlocking if the reclaim function is unable to make forward progress. Additionally, the reclaim function may be skipped entirely if there are already free slabs which can be safely reaped. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #107	2012-05-07 11:54:28 -07:00
Brian Behlendorf	06089b9e19	Ensure direct reclaim forward progress The Linux direct reclaim path uses this out of band value to determine if forward progress is being made. Normally this is incremented by kmem_freepages() which is part of the various Linux slab implementations. However, since we are using none of that infrastructure we're responsible for incrementing this count. If no forward progress is detected and a subsequent allocation fails the OOM killer will be invoked. If there was forward progress additional reclaim will be attempted via the page cache and registered shrinker until the allocation succeeds. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #107	2012-05-07 11:54:19 -07:00
Prakash Surya	c0e0fc14e3	Ignore slab cache age and delay in direct reclaim When memory pressure triggers direct memory reclaim, a slab's age and delay should not prevent it from being freed. This patch ensures these values are ignored, allowing an empty slab to be freed in this code path no matter the value of its age and delay. This prevents needless scanning of the partial slabs and has been observed to significantly reduce the total cpu usage. In addition, it should allow for snappier reclaim under memory pressure. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #102	2012-05-07 11:50:04 -07:00
Prakash Surya	cef7605c34	Throttle number of freed slabs based on nr_to_scan Previously, the SPL tried to maintain Solaris semantics by freeing all available (empty) slabs from its slab caches when the shrinker was called. This is not desirable when running on Linux. To make the SPL shrinker more Linux friendly, the actual number of freed slabs from each of the slab caches is now derived from nr_to_scan and skc_slab_objs. Additionally, an accounting bug was fixed in spl_slab_reclaim() which could cause us to reclaim one more slab than requested. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #101	2012-05-07 11:46:15 -07:00
Jorgen Lundman	ef6f91ce0c	Add missing 64-bit divide for 32-bit ARM Leverage the existing generic 64-bit division operations which were originally implemented for x86 to support ARM. All that is required is to make the symbols available to the linker with the expected names. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-05-03 10:07:54 -07:00
Prakash Surya	05b8f50c33	Update a comment to reflect new taskq internals As of the removal of the taskq work list made in commit: commit `2c02b71b14` Author: Prakash Surya <surya1@llnl.gov> Date: Mon Dec 5 17:32:48 2011 -0800 Replace tq_work_list and tq_threads in taskq_t To lay the ground work for introducing the taskq_dispatch_prealloc() interface, the tq_work_list and tq_threads fields had to be replaced with new alternatives in the taskq_t structure. the comment above taskq_wait_check has been incorrect. This change is an attempt at bringing that description more in line with the current implementation. Essentially, references to the old task work list had to be updated to reference the new taskq thread active list. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2012-04-30 10:49:15 -07:00
Brian Behlendorf	b29012b999	Remove condition variable names Long ago I added support to the spl for condition variable names because I thought they might be needed. It turns out they aren't. In fact the official Solaris cv_init(9F) man page discourages their use in the kernel. cv_init(9F) Parameters name - Descriptive string. This is obsolete and should be NULL. (Non-NULL strings are legal, but they're a waste of kernel memory.) Therefore, I'm removing them from the spl to reclaim this memory and adding an ASSERT() to ensure no new consumers are added which make use of the name. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-06 12:06:19 -07:00
Brian Behlendorf	0835057ee7	Add SPL_META_RELEASE to module load/unload messages Include the SPL_META_RELEASE in the module load/unload messages to more clearly indicate exactly what version of the SPL has been loaded. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:11:50 -07:00
Brian Behlendorf	9a8b7a7458	Add basic dynamic kstat support Add the bare minimum functionality to support dynamic kstats. A complete kstat implementation should be done as part of issue #84. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #84	2012-02-02 11:28:00 -08:00
Brian Behlendorf	4b2220f0b9	Add --enable-debug-log configure option Until now the notion of an internal debug logging infrastructure was conflated with enabling ASSERT()s. This patch clarifies things by cleanly breaking the two subsystems apart. The result of this is the following behavior. --enable-debug - Enable/disable code wrapped in ASSERT()s. --disable-debug ASSERT()s are used to check invariants and are never required for correct operation. They are disabled by default because they may impact performance. --enable-debug-log - Enable/disable the debug log infrastructure. --disable-debug-log This infrastructure allows the spl code and its consumer to log messages to an in-kernel log. The granularity of the logging can be controlled by a debug mask. By default the mask disables most debug messages resulting in a negligible performance impact. Because of this the debug log is enabled by default. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-02 11:27:54 -08:00
Ned Bass	3c6ed5410b	Taskq locking optimizations Testing has shown that tq->tq_lock can be highly contended when a large number of small work items are dispatched. The lock hold time is reduced by the following changes: 1) Use exclusive threads in the work_waitq When a single work item is dispatched we only need to wake a single thread to service it. The current implementation uses non-exclusive threads so all threads are woken when the dispatcher calls wake_up(). If a large number of threads are in the queue this overhead can become non-negligible. 2) Conditionally add/remove threads from work waitq Taskq threads need only add themselves to the work wait queue if there are no pending work items. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #32	2012-01-19 14:42:49 -08:00
Ned Bass	0bb43ca282	Revert "Taskq locking optimizations" This reverts commit `ec2b41049f`. A race condition was introduced by which a wake_up() call can be lost after the taskq thread determines there are no pending work items, leading to deadlock: 1. taskq thread enables interrupts 2. dispatcher thread runs, queues work item, calls wake_up() 3. taskq thread runs, adds self to waitq, sleeps This could easily happen if an interrupt for an IO completion was outstanding at the point where the taskq thread reenables interrupts, just before the call to add_wait_queue_exclusive(). The handler would run immediately within the race window. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #32	2012-01-19 14:42:39 -08:00
Ned Bass	ec2b41049f	Taskq locking optimizations Testing has shown that tq->tq_lock can be highly contended when a large number of small work items are dispatched. The lock hold time is reduced by the following changes: 1) Use exclusive threads in the work_waitq When a single work item is dispatched we only need to wake a single thread to service it. The current implementation uses non-exclusive threads so all threads are woken when the dispatcher calls wake_up(). If a large number of threads are in the queue this overhead can become non-negligible. 2) Conditionally add/remove threads from work waitq outside of tq_lock Taskq threads need only add themselves to the work wait queue if there are no pending work items. Furthermore, the add and remove function calls can be made outside of the taskq lock since the wait queues are protected from concurrent access by their own spinlocks. 3) Call wake_up() outside of tq->tq_lock Again, the wait queues are protected by their own spinlock, so the dispatcher functions can drop tq->tq_lock before calling wake_up(). A new splat test taskq:contention was added in a prior commit to measure the impact of these changes. The following table summarizes the results using data from the kernel lock profiler. tq_lock time %diff Wall clock (s) %diff original: 39117614.10 0 41.72 0 exclusive threads: 31871483.61 18.5 34.2 18.0 unlocked add/rm waitq: 13794303.90 64.7 16.17 61.2 unlocked wake_up(): 1589172.08 95.9 16.61 60.2 Each row reflects the average result over 5 test runs. /proc/lock_stats was zeroed out before and collected after each run. Column 1 is the cumulative hold time in microseconds for tq->tq_lock. The tests are cumulative; each row reflects the code changes of the previous rows. %diff is calculated with respect to "original" as 100*(orig-new)/orig. Although calling wake_up() outside of the taskq lock dramatically reduced the taskq lock hold time, the test actually took slightly more wall clock time. 
This is because the point of contention shifts from the taskq lock to the wait queue lock. But the change still seems worthwhile since it removes our taskq implementation as a bottleneck, assuming the small increase in wall clock time to be statistical noise. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #32	2012-01-18 10:36:57 -08:00
Ned Bass	cf5d23fa1e	Add taskq contention splat test Add a test designed to generate contention on the taskq spinlock by using a large number of threads (100) to perform a large number (131072) of trivial work items from a single queue. This simulates conditions that may occur with the zio free taskq when a 1TB file is removed from a ZFS filesystem, for example. This test should always pass. Its purpose is to provide a benchmark to easily measure the effectiveness of taskq optimizations using statistics from the kernel lock profiler. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #32	2012-01-18 10:36:51 -08:00
Darik Horn	966e5200d3	Fix `make distclean` for `--with-config=user` Apply the same fix to SPL that was applied to ZFS earlier at: zfsonlinux/zfs@d433c20651 Additionally quote @LINUX_SYMBOLS@ because it is a null substitution in this configuration, which results in a `[ -f ]` expression that incorrectly evaluates to true. # ./configure --with-config=user # make distclean Making distclean in module make[1]: Entering directory `/spl/module' make -C SUBDIRS=`pwd` clean make: Entering an unknown directory make: *** SUBDIRS=/spl/module: No such file or directory. Stop. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-01-17 10:06:00 -08:00
Brian Behlendorf	5f6c14b1ed	Proxmox VE kernel compat, invalidate_inodes() The Proxmox VE kernel contains a patch which renames the function invalidate_inodes() to invalidate_inodes_check(). In the process it adds a 'check' argument and a '#define invalidate_inodes(x)' compatibility wrapper for legacy callers. Therefore, if either of these functions are exported invalidate_inodes() can be safely used. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #58	2011-12-21 14:29:45 -08:00
Prakash Surya	8f2503e0af	Store copy of tqent_flags prior to servicing task A preallocated taskq_ent_t's tqent_flags must be checked prior to servicing the taskq_ent_t. Once a preallocated taskq entry is serviced, the ownership of the entry is handed back to the caller of taskq_dispatch, thus the entry's contents can potentially be mangled. In particular, this is a problem in the case where a preallocated taskq entry is serviced, and the caller clears its tqent_flags field. Thus, when the function returns and task_done is called, it looks as though the entry is not a preallocated task (when in fact it is a preallocated task). In this situation, task_done will place the preallocated taskq_ent_t structure onto the taskq_t's free list. This is a huge mistake. If the taskq_ent_t is then freed by the caller of taskq_dispatch, the taskq_t's free list will hold a pointer to garbage data. Even worse, if nothing has overwritten the freed memory before the pointer is dereferenced, it may still look as though it points to a valid list_head belonging to a taskq_ent_t structure. Thus, the task entry's flags are now copied prior to servicing the task. This copy is then checked to see if it is a preallocated task, and determine if the entry needs to be passed down to the task_done function. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #71	2011-12-16 16:54:00 -08:00
Prakash Surya	e7e5f78e7b	Swap taskq_ent_t with taskqid_t in taskq_thread_t The taskq_t's active thread list is sorted based on its tqt_ent->tqent_id field. The list is kept sorted solely by inserting new taskq_thread_t's in their correct sorted location; no other means is used. This means that once inserted, if a taskq_thread_t's tqt_ent->tqent_id field changes, the list runs the risk of no longer being sorted. Prior to the introduction of the taskq_dispatch_prealloc() interface, this was not a problem as a taskq_ent_t actively being serviced under the old interface should always have a static tqent_id field. Thus, once the taskq_thread_t is added to the taskq_t's active thread list, the taskq_thread_t's tqt_ent->tqent_id field would remain constant. Now, this is no longer the case. Currently, if using the taskq_dispatch_prealloc() interface, any given taskq_ent_t actively being serviced _may_ have its tqent_id value incremented. This happens when the preallocated taskq_ent_t structure is recursively dispatched. Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id field silently modified from under its feet. If this were to happen to a taskq_thread_t on a taskq_t's active thread list, this would compromise the integrity of the order of the list (as the list _may_ no longer be sorted). To get around this, the taskq_thread_t's taskq_ent_t pointer was replaced with its own static copy of the tqent_id. So, as a taskq_ent_t is pulled off of the taskq_t's pending list, a static copy of its tqent_id is made and this copy is used to sort the active thread list. Using a static copy is key in ensuring the integrity of the order of the active thread list. Even if the underlying taskq_ent_t is recursively dispatched (and has its tqent_id modified), this static copy stored inside the taskq_thread_t will remain constant. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #71	2011-12-16 13:26:54 -08:00
Prakash Surya	699d5ee8a9	Exercise new taskq interface in splat-taskq tests The splat-taskq test functions were slightly modified to exercise the new taskq interface in addition to the old interface. If the old interface passes each of its tests, the new interface is exercised. Both sub tests (old interface and new interface) must pass for each test as a whole to pass. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #65	2011-12-13 16:10:57 -08:00
Prakash Surya	44217f7aad	Implement taskq_dispatch_prealloc() interface This patch implements the taskq_dispatch_prealloc() interface which was introduced by the following illumos-gate commit. It allows for a preallocated taskq_ent_t to be used when dispatching items to a taskq. This eliminates a memory allocation which helps minimize lock contention in the taskq when dispatching functions. commit 5aeb94743e3be0c51e86f73096334611ae3a058e Author: Garrett D'Amore <garrett@nexenta.com> Date: Wed Jul 27 07:13:44 2011 -0700 734 taskq_dispatch_prealloc() desired 943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:57 -08:00
Prakash Surya	ac1e5b6033	Add Test: "Single task queue, recursive dispatch" Added another splat taskq test to ensure tasks can be recursively submitted to a single task queue without issue. When the taskq_dispatch_prealloc() interface is introduced, this use case can potentially cause a deadlock if a taskq_ent_t is dispatched while its tqent_list field is not empty. This _should_ never be a problem with the existing taskq_dispatch() interface. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:57 -08:00
Prakash Surya	2c02b71b14	Replace tq_work_list and tq_threads in taskq_t To lay the groundwork for introducing the taskq_dispatch_prealloc() interface, the tq_work_list and tq_threads fields had to be replaced with new alternatives in the taskq_t structure. The tq_threads field was replaced with tq_thread_list. Rather than storing the pointers to the taskq's kernel threads in an array, they are now stored as a list. In addition to laying the groundwork for the taskq_dispatch_prealloc() interface, this change could also enable taskq threads to be dynamically created and destroyed as threads can now be added to and removed from this list relatively easily. The tq_work_list field was replaced with tq_active_list. Instead of keeping a list of taskq_ent_t's which are currently being serviced, a list of taskq_threads currently servicing a taskq_ent_t is kept. This frees up the taskq_ent_t's tqent_list field when it is being serviced (i.e. now when a taskq_ent_t is being serviced, its tqent_list field will be empty). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:50 -08:00
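
The fix described in commit 8f2503e0af (copy tqent_flags before servicing the task) can be illustrated with a small userspace sketch. This is not the SPL code: the names ent_t, ENT_PREALLOC, ent_done, clobber_flags and service are hypothetical stand-ins, and the "free list" is reduced to a counter. It only shows why the flags have to be snapshotted before the task function runs, assuming the task function may modify the entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag and type; illustrative only, not the SPL definitions. */
#define ENT_PREALLOC 0x1u

typedef struct ent {
    unsigned int flags;
    void (*func)(void *);
    void *arg;
} ent_t;

/* Counts entries the worker handed back to the queue's free list. */
static int recycled = 0;

/* Stand-in for task_done(): the queue takes the entry back. */
static void ent_done(ent_t *e)
{
    (void)e;
    recycled++;
}

/* Task that mimics a caller reusing its entry: it clears the flags. */
static void clobber_flags(void *arg)
{
    ((ent_t *)arg)->flags = 0;
}

/*
 * Service one entry. The flags are copied before func runs, because a
 * preallocated entry belongs to the caller again once it is serviced
 * and its contents may change. Returns true if the entry was recycled.
 */
static bool service(ent_t *e)
{
    unsigned int flags = e->flags;  /* snapshot taken before servicing */

    e->func(e->arg);

    if (!(flags & ENT_PREALLOC)) {
        ent_done(e);
        return true;
    }
    return false;
}
```

If service() tested e->flags after calling func instead of the local copy, the preallocated entry in this sketch would be treated as a dynamically allocated one and recycled, which is the failure the commit message describes.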

329 Commits