mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	de77e24590	Linux 4.5 compat: pfn_t typedef The pfn_t typedef was inherited from Illumos but never directly used by any SPL consumers. This didn't cause any issues until the Linux 4.5 kernel introduced a typedef of the same name. See torvalds/linux/commit/34c0fd54, this patch removes the unused Illumos version to prevent a conflict. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #524	2016-01-20 11:39:18 -08:00
Chunwei Chen	d348f23a6a	Turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark In `b4ad50a`, we abandoned memalloc_noio_save in favor of spl_fstrans_mark because earlier kernel with it doesn't turn off __GFP_FS. However, for newer kernel, we would prefer PF_MEMALLOC_NOIO because it would work for allocation in kernel which we cannot control otherwise. So in this patch, we turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #523	2016-01-20 11:38:31 -08:00
Alex McWhirter	466bcf3be5	_ILP32 is always defined on SPARC Signed-off-by: Alex McWhirter <alexmcwhirter@triadic.us> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #520	2016-01-08 11:59:38 -08:00
Chunwei Chen	b4ad50ac5f	Use spl_fstrans_mark instead of memalloc_noio_save For earlier versions of the kernel with memalloc_noio_save, it only turns off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This would cause threads to direct reclaim into ZFS and cause deadlock. Instead, we should stick to using spl_fstrans_mark. Since we would explicitly turn off both __GFP_IO and __GFP_FS before allocation, it will work on every version of the kernel. This impacts kernel versions 3.9-3.17, see upstream kernel commit torvalds/linux@934f307 for reference. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #515 Issue zfsonlinux/zfs#4111	2015-12-18 13:24:52 -08:00
Tim Chase	200366f23f	Provide kstat for taskqs This patch provides 2 new kstats to display task queues: /proc/spl/taskqs-all - Display all task queues /proc/spl/taskqs - Display only "active" task queues A task queue is considered to be "active" if it currently has active (running) threads or if any of its pending, priority, delay or waitq lists are not empty. If the task queue has running threads, displays each thread function's address (symbolically, if possibly) and its argument. If the task queue has a non-empty list of pending, priority or delayed task queue entries (taskq_ent_t), displays each entry's thread function address and arguemnt. If the task queue has any waiters, displays each waiting task's pid. Note: This patch also updates some comments in taskq.h which referred to "taskq_t" when they should have referred to "taskq_ent_t". Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #491	2015-12-16 09:35:22 -08:00
Brian Behlendorf	2c4332cf79	Fix cstyle issues in spl-taskq.c and taskq.h This patch only addresses the issues identified by the style checker. It contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-12-11 16:20:22 -08:00
Chunwei Chen	066b89e685	Don't use tq->tq_lock_flags The flags argument in spin_lock_irqsave is modified out side of spin_lock context. We cannot use a shared variable like tq->tq_lock_flags for them. This patch removes it and uses local variable for the flags. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #506	2015-12-11 16:20:03 -08:00
Olaf Faaland	326172d854	Subclass tq_lock to eliminate a lockdep warning When taskq_dispatch() calls taskq_thread_spawn() to create a new thread for a taskq, linux lockdep warns of possible recursive locking. This is a false positive. One such call chain is as follows, when a taskq needs more threads: taskq_dispatch->taskq_thread_spawn->taskq_dispatch The initial taskq_dispatch() holds tq_lock on the taskq that needed more worker threads. The later call into taskq_dispatch() takes dynamic_taskq->tq_lock. Without subclassing, lockdep believes these could potentially be the same lock and complains. A similar case occurs when taskq_dispatch() then calls task_alloc(). This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one of two new lock subclasses: subclass taskq TQ_LOCK_DYNAMIC dynamic_taskq TQ_LOCK_GENERAL any other Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #480	2015-12-11 16:19:56 -08:00
Olaf Faaland	692ae8d398	Add new lock types MUTEX_NOLOCKDEP, and RW_NOLOCKDEP When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible recursive locking in some cases and possible circular locking dependency in others, within the SPL and ZFS modules. When lockdep detects these conditions, it disables further lock analysis for all locks. This causes /proc/lock_stats not to reflect full information about lock contention, even in locks without dependency issues. This commit creates a new type of mutex, MUTEX_NOLOCKDEP. This mutex type causes subsequent attempts to take or release those locks to be wrapped in lockdep_off() and lockdep_on(). This commit also creates an RW_NOLOCKDEP type analagous to MUTEX_NOLOCKDEP. MUTEX_NOLOCKDEP and RW_NOLOCKDEP are also defined in zfs, in a commit to that repo, for userspace builds. Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #480	2015-12-11 16:18:54 -08:00
zgock	0da84d1574	Fix build issue on some configured kernels The SPL fails to build with some "Configured" kernels (ex. openSUSE xen Kernel) this change should make same binaries with C compiler optimization. Signed-off-by: zgock <zgock@nuc.base.zgock-lab.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #510	2015-12-11 15:27:53 -08:00
Brian Behlendorf	225c110675	Either _ILP32 or _LP64 must be defined For some arm, powerpc, and sparc platforms it was possible that neither _ILP32 of _LP64 would be defined. Update the isa_defs.h header to explicitly set these macros and generate a compile error in the case neither are defined. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: tuxoko <tuxoko@gmail.com> Issue zfsonlinux/zfs#4048	2015-12-10 11:53:29 -08:00
Brian Behlendorf	c5a8b1e163	Revert "Make taskq_member() use ->journal_info" This reverts commit `a430c11f0b`. Using journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425! Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #500	2015-12-08 17:12:36 -08:00
Richard Yao	a430c11f0b	Make taskq_member() use ->journal_info The ->journal_info pointer in the task_struct is reserved for use by filesystems and because the kernel can have multiple file systems on the same stack due to direct reclaim, each filesystem that touches ->journal_info in a callback function will save the value at the start of its frame and restore it at the end of its frame. This allows us to safely use ->journal_info to store a pointer to the taskq's struct in taskq threads so that ZFS code paths can detect the presence of a taskq. This could break if the ZFS code were to use taskq_member from the context of direct reclaim. However, there are no such uses of it in that manner, so this is safe. This eliminates an O(N) list traversal under a spinlock with an O(1) unlocked pointer comparison. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: tuxoko <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #500	2015-12-08 13:24:47 -08:00
Richard Yao	1683e75edc	Fix race between getf() and areleasef() If a vnode is released asynchronously through areleasef(), it is possible for the user process to reuse the file descriptor before areleasef is called. When this happens, getf() will return a stale reference, any operations in the kernel on that file descriptor will fail (as it is closed) and the operations meant for that fd will never occur from userspace's perspective. We correct this by detecting this condition in getf(), doing a putf on the old file handle, updating the file descriptor and proceeding as if everything was fine. When the areleasef() is done, it will harmlessly decrement the reference counter on the Illumos file handle. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #492	2015-12-03 15:44:47 -08:00
Tim Chase	e5f9a9afd2	Additional dkio support for TRIM/Discard Replace DKIOCTRIM with DKIOCFREE and add additional support required for Nextenta's TRIM support. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #469	2015-12-02 13:44:35 -08:00
Jason Zaman	8fc851b7b5	sysmacros: Make P2ROUNDUP not trigger int overflow The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which triggers PaX's integer overflow detection for unsigned integers. Replace the macros with an equivalent version that does not trigger the overflow. Axioms: A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement. B. ~(x & y) === ((~(x)) \| (~(y))) under De Morgan's law. C. ~(~x) === x under the law of excluded middle. Proof: 0. (-(-(x) & -(align))) original 1. (~(-(x) & -(align)) + 1) by A 2. (((~(-(x))) \| (~(-(align)))) + 1) by B 3. (((~(~((x) - 1))) \| (~(~((align) - 1)))) + 1) by A 4. (((((x) - 1)) \| (((align) - 1))) + 1) by C Q.E.D. Signed-off-by: Jason Zaman <jason@perfinion.com> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#2505 Closes #488	2015-11-13 15:21:52 -08:00
Brian Behlendorf	9b13f65d28	Fix CPU hotplug Allocate a kmem cache magazine for every possible CPU which might be added to the system. This ensures that when one of these CPUs is enabled it can be safely used immediately. For many systems the number of online CPUs is identical to the number of present CPUs so this does imply an increased memory footprint. In fact, dynamically allocating the array of magazine pointers instead of using the worst case NR_CPUS can end up decreasing our memory footprint. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #482	2015-10-13 09:50:40 -07:00
Chunwei Chen	374303a3c9	Use tab indent in rwlock.h Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #473	2015-10-02 11:21:35 -07:00
Chunwei Chen	a00b3eb58f	rwsem use kernel provided owner when possible If CONFIG_RWSEM_SPIN_ON_OWNER is defined, rw_semaphore will have an owner field, so we don't need our own. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #473	2015-10-02 11:21:32 -07:00
Chunwei Chen	4f8e643afe	Don't take spin lock on rwlock owner The spin lock around rw_owner is completely unnecessary. The reason is that it is only modified in the down_write context. If you race against another thread modifying it, that means that you aren't holding the rwlock, so taking the spin lock don't eliminate the race. Also, we only check rw_owner in RW_WRITE_HELD because spl_rwsem_is_locked is unnecessary and might need to take spin lock. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #473	2015-10-02 11:20:55 -07:00
Chunwei Chen	ae89cf0f34	Restructure uio to accommodate bio_vec Starting from Linux 4.1, bio_vec will be allowed to pass into filesystem via iter_read/iter_write, so we add a bio_vec field in uio_t to hold it, and use UIO_BVEC in segflg to determine which "vec". Also, to be consistent to newer kernel, we make iovec and bio_vec immutable, and make uio act as an iterator with the new uio_skip field indicating number of bytes to skip in the first segment. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#3511 Issue zfsonlinux/zfs#3640 Closes #468	2015-08-24 10:10:21 -07:00
Tim Chase	851a549305	Include other sources of freeable memory in the freemem calculation Prevents ARC collapse when non-ZFS filesystems, the block layer or other memory consumers use a lot of reclaimable memory in the page cache. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes zfsonlinux/zfs#3680 Closes #471	2015-08-19 09:25:30 -07:00
Brian Behlendorf	8ac6ffecaf	Remove needfree, desfree, lotsfree #defines This patch reverts `77ab5dd`. This is now possible because upstream has refactored the ARC in such a way that these values are only used in a few key places. Those places have subsequently been updated to use the Linux equivalent Linux functionality. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#3637	2015-07-30 11:45:24 -07:00
Brian Behlendorf	9dc5ffbec8	Invert minclsyspri and maxclsyspri On Linux the meaning of a processes priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high value on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro. It could also lead to confusion. Note this change also reverts some of the priorities changes in prior commit `62aa81a`. The rational is as follows: spl_kmem_cache - High priority may result in blocked memory allocs spl_system_taskq - May perform I/O for file backed VDEVs spl_dynamic_taskq - New taskq threads should be spawned promptly Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue zfsonlinux/zfs#3607	2015-07-28 13:59:03 -07:00
Brian Behlendorf	62aa81a577	Add defclsyspri macro Add a new defclsyspri macro which can be used to request the default Linux scheduler priority. Neither the minclsyspri or maxclsyspri map to the default Linux kernel thread priority. This makes it awkward to create taskqs which run with the same priority as the rest of the kernel threads on the system which can lead to performance issues. All SPL callers which previously used minclsyspri or maxclsyspri have been changed to use defclsyspri. The vast majority of callers were part of the test suite which won't have an external impact. The few places where it could impact performance the change was from maxclsyspri to defclsyspri. This makes it more likely the process will be scheduled which may help performance. To facilitate further performance analysis the spl_taskq_thread_priority module option has been added. When disabled (0) all newly created kernel threads will use the default kernel thread priority. When enabled (1) the specified taskq priority will be used. By default this value is enabled (1). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-07-23 13:25:49 -07:00
Brian Behlendorf	77ab5dd33a	Add memory compatibility wrappers The function vmem_qcache_reap() and global variables 'needfree', 'desfree', and 'lotsfree' are all used in the upstream. While these variables have no meaning under Linux they're being defined as 0's to avoid needing to make additional changes to the ARC code. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-29 09:26:29 -07:00
Brian Behlendorf	f7a973d99b	Add TASKQ_DYNAMIC feature Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic semantics. Initially only a single worker thread will be created to service tasks dispatched to the queue. As additional threads are needed they will be dynamically spawned up to the max number specified by 'nthreads'. When the threads are no longer needed, because the taskq is empty, they will automatically terminate. Due to the low cost of creating and destroying threads under Linux by default new threads and spawned and terminated aggressively. There are two modules options which can be tuned to adjust this behavior if needed. * spl_taskq_thread_sequential - The number of sequential tasks, without interruption, which needed to be handled by a worker thread before a new worker thread is spawned. Default 4. * spl_taskq_thread_dynamic - Provides the ability to completely disable the use of dynamic taskqs on the system. This is provided for the purposes of debugging and troubleshooting. Default 1 (enabled). This behavior is fundamentally consistent with the dynamic taskq implementation found in both illumos and FreeBSD. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #458	2015-06-24 15:14:18 -07:00
Brian Behlendorf	5acb2307b2	Add IMPLY() and EQUIV() macros Added for upstream compatibility, they are of the form: * IMPLY(a, b) - if (a) then (b) * EQUIV(a, b) - if (a) then (b) AND if (b) then (a) Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-24 14:44:47 -07:00
Brian Behlendorf	2345368646	Rename cv_wait_interruptible() to cv_wait_sig() Commit `f752b46e` added the cv_wait_interruptible() function to allow condition variables to be woken by signals. This function and its timed wait counterpart should have been named cv_wait_sig() to match the illumos interface which provides the same functionality. This patch renames the symbol but leaves a #define compatibility wrapper in place until the ZFS code can be moved to the correct name. This patch also makes a small number of cosmetic changes to make the condvar source and header cstyle clean. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #456	2015-06-10 16:36:12 -07:00
Brian Behlendorf	86c16c59fe	Retire rwsem_is_locked() compat Stock Linux 2.6.32 and earlier kernels contained a broken version of rwsem_is_locked() which could return an incorrect value. Because of this compatibility code was added to detect the broken implementation and replace it with our own if needed. The fix for this issue was merged in to the mainline Linux kernel as of 2.6.33 and the major enterprise distributions based on 2.6.32 have all backported the fix. Therefore there is no longer a need to carry this code and it can be removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #454	2015-06-10 16:35:48 -07:00
Chris Dunlop	a876b0305e	Make taskq_wait() block until the queue is empty Under Illumos taskq_wait() returns when there are no more tasks in the queue. This behavior differs from ZoL and FreeBSD where taskq_wait() returns when all the tasks in the queue at the beginning of the taskq_wait() call are complete. New tasks added whilst taskq_wait() is running will be ignored. This difference in semantics makes it possible that new subtle issues could be introduced when porting changes from Illumos. To avoid that possibility the taskq_wait() function is being updated such that it blocks until the queue in empty. The previous behavior remains available through the taskq_wait_outstanding() interface. Note that this function was previously called taskq_wait_all() but has been renamed to avoid confusion. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #455	2015-06-09 12:20:12 -07:00
Brian Behlendorf	dc5e8b7041	Add boot_ncpus macro For compatibility define boot_ncpus as num_online_cpus(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-05-21 09:58:01 -07:00
Richard Yao	d3c677bcd3	Implement areleasef() Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #449	2015-04-24 13:02:37 -07:00
Tim Chase	ae26dd0039	Don't allow shrinking a PF_FSTRANS context Avoid deadlocks when entering the shrinker from a PF_FSTRANS context. This patch also reverts commit `d0d5dd7` which added MUTEX_FSTRANS. Its use has been deprecated within ZFS as it was an ineffective mechanism to eliminate deadlocks. Among other things, it introduced the need for strict ordering of mutex locking and unlocking in order that the PF_FSTRANS flag wouldn't set incorrectly. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #446	2015-04-03 11:32:31 -07:00
Chris Dunlop	c089961110	Add crgetzoneid() stub Illumos 3897 introduces a dependency on crgetzoneid(). Stub it out until such time as zones are implemented. References: https://www.illumos.org/issues/3897 https://github.com/illumos/illumos-gate/commit/fb7001f Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #444	2015-04-02 09:49:55 -07:00
Tim Chase	79a0056e13	Add mutex_enter_nested() which maps to mutex_lock_nested() Also add support for the "name" parameter in mutex_init(). The name allows for better diagnostics, namely in /proc/lock_stats when lock debugging is enabled. Nested mutexes are necessary to support CONFIG_PROVE_LOCKING. ZoL can use mutex_enter_nested()'s "class" argument to to convey the locking hierarchy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #439	2015-03-20 13:53:31 -07:00
Brian Behlendorf	d0d5dd7144	Add MUTEX_FSTRANS mutex type There are regions in the ZFS code where it is desirable to be able to be set PF_FSTRANS while a specific mutex is held. The ZFS code could be updated to set/clear this flag in all the correct places, but this is undesirable for a few reasons. 1) It would require changes to a significant amount of the ZFS code. This would complicate applying patches from upstream. 2) It would be easy to accidentally miss a critical region in the initial patch or to have an future change introduce a new one. Both of these concerns can be addressed by adding a new mutex type which is responsible for managing PF_FSTRANS, support for which was added to the SPL in commit `9099312` - Merge branch 'kmem-rework'. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #435	2015-03-03 10:18:24 -08:00
Brian Behlendorf	5f920fbee1	Retire MUTEX_OWNER checks To minimize the size of a kmutex_t a MUTEX_OWNER check was added. It allowed the kmutex_t wrapper to leverage the mutex owner which was already stored in the mutex for certain kernel configurations. The upside to this was that it reduced the size of the kmutex_t wrapper structure by the size of a task_struct pointer (4/8 bytes). The downside was that two mutex implementations needed to be maintained. Depending on your exact kernel configuration the correct one would be selected. Over the years this solution worked but it could be fragile since it depending heavily on assumed kernel mutex implementation details. For example the SPL_AC_MUTEX_OWNER_TASK_STRUCT configure check needed to be added when the kernel changed how the owner was stored. It also made the code more complicated than it needed to be. Therefore, in the name of simplicity and portability this optimization is being retired. It will slightly increase the memory requirements for a kmutex_t but only very slightly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #435	2015-03-03 10:13:33 -08:00
Brian Behlendorf	a900e28e71	Fix cstyle issue in mutex.h This patch only addresses the issues identified by the style checker in mutex.h. It contains no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #435	2015-03-03 10:13:25 -08:00
Brian Behlendorf	ee33517452	Use __get_free_pages() for emergency objects The __get_free_pages() function must be used in place of kmalloc() to ensure the __GFP_COMP is strictly honored. This is due to kmalloc() being layered on the generic Linux slab caches. It wasn't until recently that all caches were created using __GFP_COMP. This means that it is possible for a kmalloc() which passed the __GFP_COMP flag to be returned a non-compound allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:58:11 -08:00
Brian Behlendorf	3018bffa9b	Refine slab cache sizing This change is designed to improve the memory utilization of slabs by more carefully setting their size. The way the code currently works is problematic for slabs which contain large objects (>1MB). This is due to slabs being unconditionally rounded up to a power of two which may result in unused space at the end of the slab. The reason the existing code rounds up every slab is because it assumes it will backed by the buddy allocator. Since the buddy allocator can only performs power of two allocations this is desirable because it avoids wasting any space. However, this logic breaks down if slab is backed by vmalloc() which operates at a page level granularity. In this case, the optimal thing to do is calculate the minimum required slab size given certain constraints (object size, alignment, objects/slab, etc). Therefore, this patch reworks the spl_slab_size() function so that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs. KMC_KMEM slabs are rounded up to the nearest power of two, and KMC_VMEM slabs are allowed to be the minimum required size. This change also reduces the default number of objects per slab. This reduces how much memory a single cache object can pin, which can result in significant memory saving for highly fragmented caches. But depending on the workload it may result in slabs being allocated and freed more frequently. In practice, this has been shown to be a better default for most workloads. Also the maximum slab size has been reduced to 4MB on 32-bit systems. Due to the limited virtual address space it's critical the we be as frugal as possible. A limit of 4M still lets us reasonably comfortably allocate a limited number of 1MB objects. Finally, the kmem:slab_small and kmem:slab_large SPLAT tests were extended to provide better test coverage of various object sizes and alignments. Caches are created with random parameters and their basic functionality is verified by allocating several slabs worth of objects. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Richard Yao	c2fa09454e	Add hooks for disabling direct reclaim The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit that is used to mark contexts which are processing transactions. When set, allocations in this context can dip into kernel memory reserves to avoid deadlocks during writeback. Linux 3.9 provided the additional PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS began using in 3.15. This patch implements hooks for marking transactions via PF_FSTRANS. When an allocation is performed in the context of PF_FSTRANS, any KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation. Additionally, when using a Linux 3.9 or newer kernel, it will set PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on on any KM_PUSHPAGE or KM_NOSLEEP allocation. This effectively allows the spl_vmalloc() helper function to be used safely in a thread which is responsible for IO. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	c3eabc75b1	Refactor generic memory allocation interfaces This patch achieves the following goals: 1. It replaces the preprocessor kmem flag to gfp flag mapping with proper translation logic. This eliminates the potential for surprises that were previously possible where kmem flags were mapped to gfp flags. 2. It maps vmem_alloc() allocations to kmem_alloc() for allocations sized less than or equal to the newly-added spl_kmem_alloc_max parameter. This ensures that small allocations will not contend on a single global lock, large allocations can still be handled, and potentially limited virtual address space will not be squandered. This behavior is entirely different than under Illumos due to different memory management strategies employed by the respective kernels. However, this functionally provides the semantics required. 3. The --disable-debug-kmem, --enable-debug-kmem (default), and --enable-debug-kmem-tracking allocators have been unified in to a single spl_kmem_alloc_impl() allocation function. This was done to simplify the code and make it more maintainable. 4. Improve portability by exposing an implementation of the memory allocations functions that can be safely used in the same way they are used on Illumos. Specifically, callers may safely use KM_SLEEP in contexts which perform filesystem IO. This allows us to eliminate an entire class of Linux specific changes which were previously required to avoid deadlocking the system. This change will be largely transparent to existing callers but there are a few caveats: 1. Because the headers were refactored and extraneous includes removed callers may find they need to explicitly add additional #includes. In particular, kmem_cache.h must now be explicitly includes to access the SPL's kmem cache implementation. This behavior is different from Illumos but it was done to avoid always masking the Linux slab functions when kmem.h is included. 2. Callers, like Lustre, which made assumptions about the definitions of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated. Other callers such as ZFS which did not will not require changes. 3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO. It retains its original meaning of allowing allocations to access reserved memory. KM_PUSHPAGE callers can be converted back to KM_SLEEP. 4. The KM_NODEBUG flags has been retired and the default warning threshold increased to 32k. 5. The kmem_virt() functions has been removed. For callers which need to distinguish between a physical and virtual address use is_vmalloc_addr(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	b34b95635a	Fix kmem cstyle issues Address all cstyle issues in the kmem, vmem, and kmem_cache source and headers. This will done to make it easier to review subsequent changes which will rework the kmem/vmem implementation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:09 -08:00
Brian Behlendorf	e5b9b344c7	Refactor existing code This change introduces no functional changes to the memory management interfaces. It only restructures the existing codes by separating the kmem, vmem, and kmem cache implementations in the separate source and header files. Splitting this functionality in to separate files required the addition of spl_vmem_{init,fini}() and spl_kmem_cache_{initi,fini}() functions. Additionally, several minor changes to the #include's were required to accommodate the removal of extraneous header from kmem.h. But again, while large this patch introduces no functional changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:08 -08:00
Richard Yao	6ecf6d7228	Revert "Add PF_NOFS debugging flag" This reverts commit `eb0f407a2b` in preperation for updating the kmem/vmem infrastructure to use the PF_FSTRANS flag. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-01-16 13:55:08 -08:00
Tim Chase	47af4b76ff	Use current_kernel_time() in the time compatibility wrappers Since the Linux kernel's utimens family of functions uses current_kernel_time(), we need to do the same in the context of ZFS or else there can be discrepencies in timestamps (they go backward) if userland code does: fd = creat(FNAME, 0600); (void) futimens(fd, NULL); The getnstimeofday() function generally returns a slightly lower time value. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#3006	2015-01-16 13:54:35 -08:00
Chunwei Chen	a3c1eb7772	mutex: force serialization on mutex_exit() to fix races It is known that mutexes in Linux are not safe when using them to synchronize the freeing of object in which the mutex is embedded: http://lwn.net/Articles/575477/ The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx. * zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done(). * dbuf uses dbuf->db_mtx to protect reference counting. This patch fixes this kind of race by forcing serialization on mutex_exit() with a spin lock, making the mutex safe by sacrificing a bit of performance and memory overhead. This issue most commonly manifests itself as a deadlock in the zio pipeline caused by a process spinning on the damaged mutex. Similar deadlocks have been reported for the dbuf->db_mtx mutex. And it can also cause a NULL dereference or bad paging request under the right circumstances. This issue any many like it are linked off the zfsonlinux/zfs#2523 issue. Specifically this fix resolves at least the following outstanding issues: zfsonlinux/zfs#401 zfsonlinux/zfs#2523 zfsonlinux/zfs#2679 zfsonlinux/zfs#2684 zfsonlinux/zfs#2704 zfsonlinux/zfs#2708 zfsonlinux/zfs#2517 zfsonlinux/zfs#2827 zfsonlinux/zfs#2850 zfsonlinux/zfs#2891 zfsonlinux/zfs#2897 zfsonlinux/zfs#2247 zfsonlinux/zfs#2939 Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #421	2014-12-19 10:18:47 -08:00
Ned Bass	52479ecf58	Remove compat includes from sys/types.h Don't include the compatibility code in linux/_compat.h in the public header sys/types.h. This causes problems when an external code base includes the ZFS headers and has its own conflicting compatibility code. Lustre, in particular, defined SHRINK_STOP for compatibility with pre-3.12 kernels in a way that conflicted with the SPL's definition. Because Lustre ZFS OSD includes ZFS headers it fails to build due to a '"SHRINK_STOP" redefined' compiler warning. To avoid such conflicts only include the compat headers from .c files or private headers. Also, for consistency, include sys/.h before linux/*.h then sort by header name. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #411	2014-11-19 10:35:12 -08:00
Brian Behlendorf	8d9a23e82c	Retire legacy debugging infrastructure When the SPL was originally written Linux tracepoints were still in their infancy. Therefore, an entire debugging subsystem was added to facilite tracing which served us well for many years. Now that Linux tracepoints have matured they provide all the functionality of the previous tracing subsystem. Rather than maintain parallel functionality it makes sense to fully adopt tracepoints. Therefore, this patch retires the legacy debugging infrastructure. See zfsonlinux/zfs@bc9f413 for the tracepoint changes. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #408	2014-11-19 10:35:07 -08:00

1 2 3 4 5 ...

429 Commits