mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Richard Yao	f2a745c41d	Linux 3.10 compat: Do not rely on struct proc_dir_entry definition Linux kernel commit torvalds/linux#59d8053f moved the definition of struct proc_dir_entry from include/linux/proc_fs.h to the private header fs/proc/internal.h. The SPL relied on that to map Solaris' kstat to entries in /proc/spl/kstat. Since the proc_dir_entry structure is now private the only safe thing to do is wrap the opaque proc handle with our own structure. This actually ends up simplifying the code and is good because it moves us away from depending on implementation details of /proc. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #257	2013-07-08 15:25:18 -07:00
Yuxuan Shui	c02ab72fb9	Linux 3.10 compat: struct vmalloc_info moved Linux kernel commit torvalds/linux@db3808c1 moved the vmalloc_info structure from a private to a public header. Now that it's available for kernel modules use it. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #257	2013-07-08 15:09:20 -07:00
Brian Behlendorf	ab0fdfef52	Fix ASSERT0 and VERIFY0 macro typo Ensure the value is cast to a 'long long' for printing purposes. The expectation is that ASSERT0/VERIFY0 are mostly used for validating return values and thus may commonly be negative. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #246	2013-06-21 15:38:46 -07:00
Brian Behlendorf	1c6d149feb	Add ASSERT0 and VERIFY0 macros The Illumos code introduced the ASSERT0 and VERIFY0 macros which are to be used instead of ASSERT3S(x, ==, 0) and VERIFY3S(x, ==, 0). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Madhav Suresh <madhav.suresh@delphix.com> Closes #246	2013-06-18 11:41:55 -07:00
Brian Behlendorf	ab59be7bc7	Fix delay() Somewhat amazingly it went unnoticed that the delay() function doesn't actually cause the task to block. Since the task state is never changed from TASK_RUNNING before schedule_timeout() the scheduler allows the task to continue running without any delay. Using schedule_timeout_interruptible() resolves the issue by correctly setting TASK_UNINTERRUPTIBLE. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-05-01 16:35:47 -07:00
Brian Behlendorf	f6437b60c2	Add msec/usec/nsec to tick convertors Add wrappers for the Solaris MSEC_TO_TICK, USEC_TO_TICK, and NSEC_TO_TICK conversion functions. They are mapped directly to their Linux counterparts with the exception of NSEC_TO_TICK, which cannot use usecs_to_jiffies() because it is not exported by the kernel. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-05-01 12:07:56 -07:00
Brian Behlendorf	4a6d8d2c3e	Change spl-kmod-devel install path Install the common spl kernel development headers under /usr/src/spl-<version>/ rather than in a kernel specific directory. The kernel specific build products such as spl_config.h and Modules.symvers are left installed under /usr/src/spl-<version>/<kernel>. This was done to be consistent with where dkms expects kernel module source to be packaged. It also allows for a common spl-kmod-devel package which includes the headers, and per-kernel spl-kmod-devel-<kernel> packages. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-14 12:01:05 -07:00
Richard Yao	10087fe1fa	Linux 3.9 compat: Include linux/sched/rt.h Linux 3.9 reorganized sched.h, splitting it into numerous files. torvalds/linux@8bd75c77b7 moved MAX_PRIO and MAX_RT_PRIO to linux/sched/rt.h. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-14 10:43:19 -07:00
Ned Bass	3d6af2dd6d	Refresh links to web site Update links to refer to the official ZFS on Linux website instead of @behlendorf's personal fork on github. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-04 19:09:34 -08:00
Brian Behlendorf	d1142fbffe	Remove custom install-data-local for headers Rather than use a custom install target it is cleaner to define a 'kerneldir' and set 'kernel_HEADERS' appropriately. This allows us to leverage the standing configure install support. Additionally, I took this opportunity to add the missing make files to the include subdirectories. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-01 16:55:06 -08:00
Brian Behlendorf	fea77534f0	Fix spl_config.h install permissions The default permissions used by install are 755. Since this file isn't executable 644 is more appropriate. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-03-01 16:55:06 -08:00
Eric Dillmann	3cbfd259b7	Define BE_IN16 & BE_IN32 for lz4 compression The new lz4 compression algorithm, zfsonlinux/zfs@9759c60, requires the generic BE_IN16 and BE_IN32 functions. These are added to the SPL for other consumers to take advantage of. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-29 09:30:23 -08:00
Brian Behlendorf	0936c3449f	Add spl_kmem_cache_expire module option Cache aging was implemented because it was part of the default Solaris kmem_cache behavior. The idea is that per-cpu objects which haven't been accessed in several seconds should be returned to the cache. On the other hand Linux slabs never move objects back to the slabs unless there is memory pressure on the system. This behavior is now configurable through the 'spl_kmem_cache_expire' module option. The value is a bit mask with the following meaning. 0x1 - Solaris style cache aging eviction is enabled. 0x2 - Linux style low memory eviction is enabled. Both methods may be safely enabled simultaneously, but by default both are disabled. It has never been clear if the kmem cache aging (which has been around from day one) actually does any good. It has however been the source of numerous bugs so I wouldn't mind retiring it entirely. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#1227 Closes #210	2013-01-28 09:34:12 -08:00
Brian Behlendorf	84dd1f4f15	Remove spl_invalidate_inodes() This functionality is no longer required by ZFS, see commit zfsonlinux/zfs@7b3e34ba5a. Since there are no other consumers, and because it adds additional autoconf complexity which must be maintained the spl_invalidate_inodes() function has been removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#795	2013-01-17 11:40:47 -08:00
Brian Behlendorf	1c7b3eaf87	RHEL 6.4 compat, fallocate() In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was introduced after the fallocate() function was moved from the inode_operations to the file_operations structure. Therefore, the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined it was safe to use f_ops->fallocate(). Unfortunately, the RHEL6.4 kernel has only backported the FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change. To address this compatibility issue the spl_filp_fallocate() helper function was added to properly detect which interface is available. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2013-01-08 09:53:13 -08:00
Matt Johnston	46a75aadb7	Add cv_wait_io() to account I/O time Under Linux when a task is waiting on I/O it should call the io_schedule() function for proper accounting. The Solaris cv_wait() function provides no way to specify what the cv is waiting on therefore cv_wait_io() is introduced. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #206	2013-01-07 10:29:26 -08:00
Brian Behlendorf	034f1b331e	Fix spl_kmem_init_kallsyms_lookup() panic Due to I/O buffering the helper may return successfully before the proc handler has a chance to execute. To catch this case wait up to 1 second to verify spl_kallsyms_lookup_name_fn was updated to a non SYMBOL_POISON value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/zfs#699 Closes zfsonlinux/zfs#859	2012-12-19 09:06:35 -08:00
Brian Behlendorf	eb0be2ed46	Removed SPL_AC_3ARGS_INIT_WORK check All consumers of the kernel delayed work queues have been shifted over to rely on the taskq implementation. This compatibility code can now be removed. Any new callers which need this functionality should use the taskq interfaces for delayed work items. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:57:10 -08:00
Brian Behlendorf	33e94ef1dd	kmem-cache: Use a taskq for async allocations Shift the asynchronous allocations over to use the taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires. This code never actually used the delay functionality; it was just done this way to leverage the existing compatibility code. All that is required is a thread context to perform the allocation in. The only thing clever in this change is that we take advantage of the preallocated task queue entries to avoid a memory allocation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	a10287e00d	kmem-cache: Use taskqs for ageing Shift the cache and magazine ageing functionality over to the new delayed taskq interfaces. This allows us to abandon the kernel's delayed work queue interface and all the compatibility code it requires. However, the delayed taskq interface does not allow us to schedule a task for a specific cpu so the ageing code was slightly reworked. The magazine ageing delay has been directly linked to the cache ageing function. The spl_cache_age() function invokes on_each_cpu() in order to run spl_magazine_age() on each cpu. It then blocks waiting for them to complete and promptly reclaims any free slabs. While restructuring the code wasn't the primary goal I think the new code is far more understandable and maintainable. It also should help minimize magazine thrashing because free slabs are immediately released after the magazine is aged. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:56:54 -08:00
Brian Behlendorf	d9acd930b5	taskq delay/cancel functionality Add the ability to dispatch a delayed task to a taskq. The desired behavior is for the task to be queued but not executed by a worker thread until the expiration time is reached. To achieve this two new functions were added. * taskq_dispatch_delay() - This function behaves exactly like taskq_dispatch() however it takes a third 'expire_time' argument. The caller should pass the desired time the task should be executed as an absolute value in jiffies. The task is guaranteed not to run before this time, it may run slightly later if all the worker threads are busy. * taskq_cancel_id() - Given a task id attempt to cancel the task before it gets executed. This is primarily useful for canceling delay tasks but can be used for canceling any previously dispatched task. There are three possible return values. 0 - The task was found and canceled before it was executed. ENOENT - The task was not found, either it was already run or an invalid task id was supplied by the caller. EBUSY - The task is currently executing and may not be canceled. This function will block until the task has been completed. * taskq_wait_all() - The taskq_wait_id() function was renamed taskq_wait_all() to more clearly reflect its actual behavior. It is only currently used by the splat taskq regression tests. * taskq_wait_id() - Historically, the only difference between this function and taskq_wait() was that you passed the task id. In both functions you would block until ALL lower task ids had executed. This was semantically correct but could be very slow particularly if there were delay tasks submitted. To better accommodate the delay tasks this function was reimplemented. It will now only block until the passed task id has been completed. This is actually a fairly low risk change for a few reasons. * Only new ZFS callers will make use of the new interfaces and very little common code was changed to support the new functions. 
* The existing taskq_wait() implementation was not changed just slightly refactored. * The newly optimized taskq_wait_id() implementation was never used by ZFS, so we can't accidentally introduce a new bug there. NOTE: This functionality does not exist in the Illumos taskqs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	aed8671cb0	taskq style, remove #define wrappers When the taskq implementation was originally written I wrapped all the API functions in #define's. This was done as a preventative measure to ensure that a taskq symbol never conflicted with an existing kernel symbol. However, in practice the taskq symbols never conflicted. The only major conflicts occurred with the kmem cache API. Since this added layer of obfuscation never bought us anything for the taskq's I'm removing it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	472a34caff	taskq style, convert spaces to soft tabs Update the taskq implementation to conform with the style used throughout the rest of the code. There are no functional changes in this commit. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-12-12 09:54:07 -08:00
Brian Behlendorf	ed3163484d	Track emergency object in rbtree In the initial implementation emergency objects were tracked on a per-cache list. The assumption was that under normal operation we would never allocate more than a handful of these objects. So the cost of walking the list during free was expected to be negligible. However real world usage has shown that emergency objects tend to be allocated in batches. A deadlock will be detected and several thousand emergency objects will be allocated before the original blocked slab allocation can complete. Therefore the original list has been replaced by a red black tree which is sorted by the memory address of each allocated object. This bounds the worst case insertion and removal time to O(log n) which minimizes contention on the associated spin lock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:19 -08:00
Brian Behlendorf	165f13c33a	Improved vmem cached deadlock detection The entire goal of performing the slab allocations asynchronously is to be able to detect when a vmalloc() deadlocks. In this case, and only this case, do we want to start allocating emergency objects. The trick here is to minimize false positives because the overhead of tracking emergency objects is far higher than normal slab objects. With that goal in mind the code was reworked to be less sensitive to slow allocations by increasing the wait time. Once a cache is marked deadlocked all subsequent allocations which can not be satisfied with existing cache objects will immediately allocate new emergency objects. This behavior persists until the asynchronous allocation completes and clears the deadlocked flag. The result of these tweaks is that far fewer emergency objects get created which is important because this minimizes the cost of releasing them later in kmem_cache_free(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:54:15 -08:00
Brian Behlendorf	df870a697f	splat: Cleanup headers Restructure the SPLAT headers such that each test only includes the minimal set of headers it requires. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:56 -08:00
Brian Behlendorf	d2733258d0	Condition variable reference counts Reference count every entry and exit from the condition variable functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast(). This allows us to safely block in cv_destroy() until all consumers have been scheduled and are no longer accessing the condition variable memory. In addition poison the magic value at the start of cv_destroy() to ensure there are never any new callers after cv_destroy() is called. The consumer is responsible for ensuring this never occurs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-06 14:48:55 -08:00
Brian Behlendorf	dba79fcbf2	Add KSTAT_TYPE_TXG type Add a new kstat type for tracking useful statistics about a TXG. The new KSTAT_TYPE_TXG type can be used to track the following statistics per-txg. txg - Unique txg number state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted birth - Creation time nread - Bytes read nwritten - Bytes written reads - IOPs read writes - IOPs write open_time - Length in nanoseconds the txg was open quiesce_time - Length in nanoseconds the txg was quiescing sync_time - Length in nanoseconds the txg was syncing Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-11-02 15:17:40 -07:00
Brian Behlendorf	71c9f0b003	Make kstat.ks_update() callback atomic Move the kstat ks_update() callback under the ks_lock. This enables dynamically sized kstats without modification to the kstat API. * Create a kstat with the KSTAT_FLAG_VIRTUAL flag. * Register a ->ks_update() callback which does: o Frees any existing ks_data buffer. o Set ks_data_size to the kstat array size. o Set ks_data to an allocated buffer of size ks_data_size o Populate the array of buffers with the required data. The buffer allocated in the ks_update() callback is guaranteed to remain allocated and valid while the proc sequence handler iterates over the buffer. The lock will not be dropped until kstat_seq_stop() function is run making it safe for concurrent access. To allow the ks_update() callback to perform memory allocations the lock was changed to a mutex. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-10-23 09:36:19 -07:00
Brian Behlendorf	1e0c2c2ccf	Linux 3.7 compat, __clear_close_on_exec() removed Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec() function out of include/linux/fdtable.h and into fs/file.c making it unavailable to the SPL. Now as it turns out we only used this function to tear down some test infrastructure for the vn_getf()/vn_releasef() SPLAT regression tests. Rather than implement even more autoconf compatibility code to handle this we just remove the test case. This also allows us to drop three existing autoconf tests. This does mean the SPLAT tests will no longer verify these functions but historically they have never been a problem. And if we feel we absolutely need this test coverage I'm sure a more portable version of the test case could be added. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #183	2012-10-18 13:36:44 -07:00
Yuxuan Shui	bcb15891ab	Linux 3.6 compat, kern_path_locked() added The kern_path_parent() function was removed from Linux 3.6 because it was observed that all the callers just want the parent dentry. The simpler kern_path_locked() function replaces kern_path_parent() and does the lookup while holding the ->i_mutex lock. This is good news for the vn implementation because it removes the need for us to handle the locking. However, it makes it harder to implement a single readable vn_remove()/vn_rename() function which is usually what we prefer. Therefore, we implement a new version of vn_remove()/vn_rename() for Linux 3.6 and newer kernels. This allows us to leave the existing working implementation untouched, and to add a simpler version for newer kernels. Long term I would very much like to see all of the vn code removed since what this code enabled is generally frowned upon in the kernel. But that can't happen until we either abandon the zpool.cache file or implement alternate infrastructure to update it correctly in user space. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #154	2012-10-14 16:26:21 -07:00
Etienne Dechamps	bbdc6ae495	Add interface for file hole punching. This adds an interface to "punch holes" (deallocate space) in VFS files. The interface is identical to the Solaris VOP_SPACE interface. This interface is necessary for TRIM support on file vdevs. This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which was introduced in 2.6.38. For a brief time before 2.6.38 this was done using the truncate_range inode operation, which was quickly deprecated. This patch only supports FALLOC_FL_PUNCH_HOLE. This adds support for the truncate_range() inode operation to VOP_SPACE() for file hole punching. This API is deprecated and removed in 3.5, so it's only useful for old kernels. On tmpfs, the truncate_range() inode operation translates to shmem_truncate_range(). Unfortunately, this function expects the end offset to be inclusive and aligned to the end of a page. If it is not, the kernel will stop with a BUG_ON(). This patch fixes the issue by adapting to the constraints set forth by shmem_truncate_range(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #168	2012-10-04 16:22:07 -07:00
Brian Behlendorf	9b51f21841	Remove TQ_SLEEP -> KM_SLEEP mapping When the taskq code was originally written it seemed like a good idea to simply map TQ_SLEEP to KM_SLEEP. Unfortunately, this assumed that the TQ_* flags would never conflict with any of the Linux GFP_* flags. When adding the TQ_PUSHPAGE support in commit `cd5ca4b` this invariant was accidentally broken. Therefore to support TQ_PUSHPAGE, which is needed for Linux, and prevent any further confusion I have removed this direct mapping. The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined in terms of their KM_* counterparts. Instead a simple mapping function is introduced to convert TQ_* -> KM_* where needed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #171	2012-09-12 11:41:42 -07:00
Brian Behlendorf	330fe010e4	Revert "Switch KM_SLEEP to KM_PUSHPAGE" This reverts commit `cd5ca4b2f8` due to conflicts in the higher TQ_ bits which caused incorrect behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-12 10:07:48 -07:00
Brian Behlendorf	cb5c2acebb	Add KMC_NOEMERGENCY slab flag Provide a flag to disable the use of emergency objects for a specific kmem cache. There may be instances where under no circumstances should you kmalloc() an emergency object. For example, when your cache contains very large objects (>128k). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-09-07 14:27:03 -07:00
Etienne Dechamps	ac8ca67a88	Add DKIOCTRIM for TRIM support. See dechamps/zfs@cc6cd40ad7 for details. This harmless addition was merged to simplify testing the ZFS TRIM support patches. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #167	2012-09-02 14:22:01 -07:00
Brian Behlendorf	cd5ca4b2f8	Switch KM_SLEEP to KM_PUSHPAGE Under certain circumstances the following functions may be called in a context where KM_SLEEP is unsafe and can result in a deadlocked system. To avoid this problem the unconditional KM_SLEEPs are converted to KM_PUSHPAGEs. This will prevent them from attempting to initiate any I/O during direct reclaim. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	3e904f40b4	Mutex ASSERT on self deadlock Generate an assertion if we're going to deadlock the system by attempting to acquire a mutex the process is already holding. There are currently no known instances of this under normal operation, but it _might_ be possible when using a ZVOL as a swap device. I want to ensure we catch this immediately if it were to occur. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	eb0f407a2b	Add PF_NOFS debugging flag PF_NOFS is a per-process debug flag which is set in current->flags to detect when a process is performing an unsafe allocation. All tasks with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because if they enter direct reclaim and initiate I/O they may deadlock. When debugging is disabled, any incorrect usage will be detected and a call stack with a warning will be printed to the console. The flags will then be automatically corrected to allow for safe execution. If debugging is enabled this will be treated as a fatal condition. To avoid any risk of conflicting with the existing PF_ flags, the PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit. Only when CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused, will the PF_NOFS bit be valid. Happily, most existing distributions ship a kernel with CONFIG_RT_MUTEX_TESTER disabled. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:55 -07:00
Brian Behlendorf	d47e664ad4	Revert "Add TASKQ_NORECLAIM flag" This reverts commit `372c257233`. The use of the PF_MEMALLOC flag was always a hack to work around memory reclaim deadlocks. Those issues are believed to be resolved so this workaround can be safely reverted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:42 -07:00
Brian Behlendorf	e2dcc6e2b8	Emergency slab objects This patch is designed to resolve a deadlock which can occur with __vmalloc() based slabs. The issue is that the Linux kernel does not honor the flags passed to __vmalloc(). This makes it unsafe to use in a writeback context. Unfortunately, this is a use case ZFS depends on for correct operation. Fixing this issue in the upstream kernel was pursued and patches are available which resolve the issue. https://bugs.gentoo.org/show_bug.cgi?id=416685 However, these changes were rejected because upstream felt that using __vmalloc() in the context of writeback should never be done. Their solution was for us to rewrite parts of ZFS to accommodate the Linux VM. While that is probably the right long term solution, and it is something we want to pursue, it is not a trivial task and will likely destabilize the existing code. This work has been planned for the 0.7.0 release but in the meanwhile we want to improve the SPL slab implementation to accommodate this expected ZFS usage. This is accomplished by performing the __vmalloc() asynchronously in the context of a work queue. This doesn't prevent the possibility of the worker thread from deadlocking. However, the caller can now safely block on a wait queue for the slab allocation to complete. Normally this will occur in a reasonable amount of time and the caller will be woken up when the new slab is available. The objects will then get cached in the per-cpu magazines and everything will proceed as usual. However, if the __vmalloc() deadlocks for the reasons described above, or is just very slow, then the callers on the wait queues will time out. When this rare situation occurs they will attempt to kmalloc() a single minimally sized object using the GFP_NOIO flags. This allocation will not deadlock because kmalloc() will honor the passed flags and the caller will be able to make forward progress. 
As long as forward progress can be maintained then even if the worker thread is deadlocked the critical thread will make progress. This will eventually allow the deadlocked worker thread to complete and normal operation will resume. These emergency allocations will likely be slow since they require contiguous pages. However, their use should be rare so the impact is expected to be minimal. If that turns out not to be the case in practice further optimizations are possible. One additional concern is if these emergency objects are long lived. Right now they are simply tracked on a list which must be walked when an object is freed. If they accumulate on a system and the list grows freeing objects will become more expensive. This could be handled relatively easily by using a hash instead of a list, but that optimization (if needed) is left for a follow up patch. Additionally, these emergency objects could be repacked in to existing slabs as objects are freed if the kmem_cache_set_move() functionality was implemented. See issue https://github.com/zfsonlinux/spl/issues/26 for full details. This work would also help reduce ZFS's memory fragmentation problems. The /proc/spl/kmem/slab file has had two new columns added at the end. The 'emerg' column reports the current number of these emergency objects in use for the cache, and the following 'max' column shows the historical worst case. These values should give us a good idea of how often these objects are needed. Based on these values under real use cases we can tune the default behavior. Lastly, as a side benefit using a single work queue for the slab allocations should reduce cpu contention on the global virtual address space lock. This should manifest itself as reduced cpu usage for the system. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-08-27 12:00:42 -07:00
Brian Behlendorf	c638e9ad04	Remove autotools products Remove all of the generated autotools products from the repository and update the .gitignore files accordingly. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#718	2012-08-27 11:46:23 -07:00
Prakash Surya	45324c7c41	Add kpreempt_[dis|en]able macros in <sys/disp.h> To support preempt enabled kernels in ZFS on Linux, there are a couple places where the ZFS code needs to disable preemption. This change adds the Solaris preempt functions and maps them to the equivalent Linux functions, allowing ZFS to make use of them. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #98	2012-08-24 15:18:38 -07:00
Prakash Surya	08850eddcb	Avoid calling smp_processor_id in spl_magazine_age The spl_magazine_age function had the implied assumption that it will remain on its current cpu through its execution. In order to support preempt enabled kernels, this assumption had to be removed. The spl_kmem_magazine structure now holds the cpu id of the cpu it is local to. This allows spl_magazine_age to use this field when scheduling work to be done by the magazine's local cpu. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #98	2012-08-24 09:43:22 -07:00
Richard Yao	15d0411297	Remove Makefile from non-toplevel .gitignore files When building SPL support into the kernel, ./copy-builtin will copy non-toplevel .gitignore files. These files list /Makefile, which causes git-archive to omit ./module/{spl,splat}/Makefile. The absence of these files results in build failures when SPL is selected. ZFS is unaffected because it puts Makefile in the toplevel .gitignore, which is not copied. We fix SPL by emulating that behavior. Reported-by: Fabio Erculiani <lxnay@gentoo.org> Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #152	2012-08-23 12:49:04 -07:00
Richard Yao	6576a1a70d	Fix incorrect type in spl_kmem_cache_set_move() parameter A preprocessor definition renders this harmless. However, it is a good idea to change this to be consistent. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>	2012-08-01 16:35:18 -07:00
Brian Behlendorf	d503b971f4	Optimize spl_rwsem_is_locked() The spl_rwsem_is_locked() compatibility function has been observed to be a hot spot. The root cause of this is that we must check the rwsem activity under the rwsem->wait_lock to avoid a race. When the lock is busy significant contention can occur. The upstream kernel fix for this race had the insight that by using spin_trylock_irqsave() this contention could be avoided. When the lock is contended it's reasonable to return that it is locked. This change updates the SPL's implementation to be like the upstream kernel. Since the kernel code has been in use for years now this is a low risk change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-07-13 13:07:39 -07:00
Richard Yao	973e8269bd	Constify memory management functions This prevents warnings in ZFS that were caused by changes necessary to support PaX patched kernels. When debugging is enabled, these warnings become build failures. Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #131	2012-07-03 16:07:27 -07:00
Brian Behlendorf	44e406d712	PowerPC Compatibility Usage of get_current() is not supported across all architectures. The correct interface to use is the '#define current' which will map to the appropriate function, usually current_thread_info(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #119	2012-07-02 09:33:09 -07:00
Richard Yao	e0093fea58	Linux 3.4 compat, __clear_close_on_exec replaces FD_CLR torvalds/linux@1dce27c5aa introduced __clear_close_on_exec() as a replacement for FD_CLR. Further commits appear to have removed FD_CLR from the Linux source tree. This causes the following failure: error: implicit declaration of function '__FD_CLR' [-Werror=implicit-function-declaration] To correct this we update the code to use the current __clear_close_on_exec() interface for readability. Then we introduce an autotools check to determine if __clear_close_on_exec() is available. If it isn't then we define some compatibility logic which uses the older FD_CLR() interface. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #124	2012-06-13 16:18:51 -07:00
Brian Behlendorf	38d31a1e57	Remove Solaris module emulation Originally I believed that these interfaces would be needed. However, in practice it turned out that it was more straightforward and maintainable to use the native Linux interfaces. As such, this is all dead code and can be safely removed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #109	2012-05-18 13:57:44 -07:00
Richard Yao	f90096c905	Modify KM_PUSHPAGE to use GFP_NOIO instead of GFP_NOFS The resolution of issue #31 made KM_PUSHPAGE imply GFP_NOFS. This was done to prevent situations where filesystem operations which are holding locks enter direct reclaim and attempt to reacquire those same locks. This clearly will result in a deadlock. This works for datasets which are implemented in terms of filesystem operations. But unfortunately, swapping to a zvol will encounter many of the same deadlocks and GFP_NOFS will not prevent this. As such, it is appropriate to extend KM_PUSHPAGE to use the broader GFP_NOIO mask to handle these non-filesystem cases. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue zfsonlinux/zfs#342 Closes #105	2012-05-07 12:05:27 -07:00
Prakash Surya	cef7605c34	Throttle number of freed slabs based on nr_to_scan Previously, the SPL tried to maintain Solaris semantics by freeing all available (empty) slabs from its slab caches when the shrinker was called. This is not desirable when running on Linux. To make the SPL shrinker more Linux friendly, the actual number of freed slabs from each of the slab caches is now derived from nr_to_scan and skc_slab_objs. Additionally, an accounting bug was fixed in spl_slab_reclaim() which could cause us to reclaim one more slab than requested. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #101	2012-05-07 11:46:15 -07:00
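As a rough illustration of cef7605c34, the number of slabs to free can be derived from the shrinker's `nr_to_scan` request and the number of objects per slab. The commit message does not give the exact formula, so the ceiling division below is an assumption, and `demo_slabs_to_free()` is a hypothetical helper rather than an SPL function.

```c
#include <assert.h>

/*
 * Convert a request to scan nr_to_scan objects into a slab count.
 * Assumption: round up, so any nonzero request frees at least one slab,
 * but never more slabs than are needed to cover the requested objects.
 */
static unsigned long demo_slabs_to_free(unsigned long nr_to_scan,
                                        unsigned long slab_objs)
{
    if (nr_to_scan == 0 || slab_objs == 0)
        return 0;
    return (nr_to_scan + slab_objs - 1) / slab_objs;
}
```

With 32 objects per slab, a request for 128 objects maps to 4 slabs instead of every empty slab in the cache.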
Jorgen Lundman	cb75844e85	Define the needed ISA types for ARM Add the minimum required ISA types to support the ARM architecture. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-05-03 09:56:15 -07:00
Brian Behlendorf	b29012b999	Remove condition variable names Long ago I added support to the spl for condition variable names because I thought they might be needed. It turns out they aren't. In fact the official Solaris cv_init(9F) man page discourages their use in the kernel. cv_init(9F) Parameters name - Descriptive string. This is obsolete and should be NULL. (Non-NULL strings are legal, but they're a waste of kernel memory.) Therefore, I'm removing them from the spl to reclaim this memory and adding an ASSERT() to ensure no new consumers are added which make use of the name. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-06 12:06:19 -07:00
Brian Behlendorf	0835057ee7	Add SPL_META_RELEASE to module load/unload messages Include the SPL_META_RELEASE in the module load/unload messages to more clearly indicate exactly what version of the SPL has been loaded. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-03-23 12:11:50 -07:00
Brian Behlendorf	3c208a5480	Cleanly support debug packages Allow a source rpm to be rebuilt with debugging enabled. This avoids the need to have to manually modify the spec file. By default debugging is still largely disabled. To enable specific debugging features use the following options with rpmbuild. '--with debug' - Enables ASSERTs '--with debug-log' - Enables the internal debug log '--with debug-kmem' - Enables basic memory accounting '--with debug-kmem-tracking' - Enables detailed memory tracking # For example: $ rpmbuild --rebuild --with debug spl-modules-0.6.0-rc6.src.rpm Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-27 14:24:22 -08:00
Brian Behlendorf	feedc43601	Add missing spl_debug_* helpers When building the spl with --disable-debug-log the __SDEBUG() macro and spl_debug_* helper functions were undefined. This change adds the missing functions so the upper layers compiling against the spl don't need to be aware of how the spl was built. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-09 16:41:46 -08:00
Brian Behlendorf	9a8b7a7458	Add basic dynamic kstat support Add the bare minimum functionality to support dynamic kstats. A complete kstat implementation should be done as part of issue #84. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #84	2012-02-02 11:28:00 -08:00
Brian Behlendorf	4b2220f0b9	Add --enable-debug-log configure option Until now the notion of an internal debug logging infrastructure was conflated with enabling ASSERT()s. This patch clarifies things by cleanly breaking the two subsystems apart. The result of this is the following behavior. --enable-debug - Enable/disable code wrapped in ASSERT()s. --disable-debug ASSERT()s are used to check invariants and are never required for correct operation. They are disabled by default because they may impact performance. --enable-debug-log - Enable/disable the debug log infrastructure. --disable-debug-log This infrastructure allows the spl code and its consumer to log messages to an in-kernel log. The granularity of the logging can be controlled by a debug mask. By default the mask disables most debug messages resulting in a negligible performance impact. Because of this the debug log is enabled by default. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-02 11:27:54 -08:00
Brian Behlendorf	a2eda2ff48	Add the release component to headers When the original build system code was added the release component was accidentally omitted from the development header install path. This patch adds the missing path component so it's always clear exactly what release you're compiling against. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-01-18 11:06:26 -08:00
Darik Horn	588d900433	Linux 3.2 compat: rw_semaphore.wait_lock is raw The wait_lock member of the rw_semaphore struct became a raw_spinlock_t in Linux 3.2 at torvalds/linux@ddb6c9b58a. Wrap spin_lock_* function calls in a new spl_rwsem_* interface to ensure type safety if raw_spinlock_t becomes architecture specific, and to satisfy these compiler warnings: warning: passing argument 1 of ‘spinlock_check’ from incompatible pointer type [enabled by default] note: expected ‘struct spinlock_t *’ but argument is of type ‘struct raw_spinlock_t *’ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #76 Closes: zfsonlinux/zfs#463	2012-01-11 16:28:05 -08:00
Brian Behlendorf	5f6c14b1ed	Proxmox VE kernel compat, invalidate_inodes() The Proxmox VE kernel contains a patch which renames the function invalidate_inodes() to invalidate_inodes_check(). In the process it adds a 'check' argument and a '#define invalidate_inodes(x)' compatibility wrapper for legacy callers. Therefore, if either of these functions is exported invalidate_inodes() can be safely used. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #58	2011-12-21 14:29:45 -08:00
Prakash Surya	8f2503e0af	Store copy of tqent_flags prior to servicing task A preallocated taskq_ent_t's tqent_flags must be checked prior to servicing the taskq_ent_t. Once a preallocated taskq entry is serviced, the ownership of the entry is handed back to the caller of taskq_dispatch, thus the entry's contents can potentially be mangled. In particular, this is a problem in the case where a preallocated taskq entry is serviced, and the caller clears its tqent_flags field. Thus, when the function returns and task_done is called, it looks as though the entry is not a preallocated task (when in fact it is a preallocated task). In this situation, task_done will place the preallocated taskq_ent_t structure onto the taskq_t's free list. This is a huge mistake. If the taskq_ent_t is then freed by the caller of taskq_dispatch, the taskq_t's free list will hold a pointer to garbage data. Even worse, if nothing has overwritten the freed memory before the pointer is dereferenced, it may still look as though it points to a valid list_head belonging to a taskq_ent_t structure. Thus, the task entry's flags are now copied prior to servicing the task. This copy is then checked to see if it is a preallocated task, and determine if the entry needs to be passed down to the task_done function. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #71	2011-12-16 16:54:00 -08:00
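The fix in 8f2503e0af amounts to reading the flags before the task function runs and deciding afterwards from the saved copy. The sketch below is a userspace illustration only: `struct demo_ent`, `DEMO_TQENT_FLAG_PREALLOC`, and `demo_service()` are hypothetical names, not the SPL's types or functions.

```c
#include <assert.h>

#define DEMO_TQENT_FLAG_PREALLOC 0x1 /* hypothetical flag value */

/* Hypothetical, simplified task entry. */
struct demo_ent {
    unsigned int flags;
    void (*func)(void *);
    void *arg;
};

/*
 * Service one entry. Returns 1 if the entry was preallocated (it belongs to
 * the dispatcher and must not go onto a free list), 0 otherwise.
 */
static int demo_service(struct demo_ent *t)
{
    unsigned int saved = t->flags; /* copy BEFORE running the task */

    t->func(t->arg);               /* the owner may modify or free *t here */

    /* Decide from the saved copy; t->flags can no longer be trusted. */
    return (saved & DEMO_TQENT_FLAG_PREALLOC) != 0;
}

/* A task that clears its own entry's flags, as in the scenario above. */
static void demo_clear_flags(void *arg)
{
    struct demo_ent *t = arg;

    t->flags = 0;
}
```

Checking `t->flags` after the call would report the entry as not preallocated in this scenario; the saved copy gives the right answer.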
Prakash Surya	e7e5f78e7b	Swap taskq_ent_t with taskqid_t in taskq_thread_t The taskq_t's active thread list is sorted based on its tqt_ent->tqent_id field. The list is kept sorted solely by inserting new taskq_thread_t's in their correct sorted location; no other means is used. This means that once inserted, if a taskq_thread_t's tqt_ent->tqent_id field changes, the list runs the risk of no longer being sorted. Prior to the introduction of the taskq_dispatch_prealloc() interface, this was not a problem as a taskq_ent_t actively being serviced under the old interface should always have a static tqent_id field. Thus, once the taskq_thread_t is added to the taskq_t's active thread list, the taskq_thread_t's tqt_ent->tqent_id field would remain constant. Now, this is no longer the case. Currently, if using the taskq_dispatch_prealloc() interface, any given taskq_ent_t actively being serviced _may_ have its tqent_id value incremented. This happens when the preallocated taskq_ent_t structure is recursively dispatched. Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id field silently modified from under its feet. If this were to happen to a taskq_thread_t on a taskq_t's active thread list, this would compromise the integrity of the order of the list (as the list _may_ no longer be sorted). To get around this, the taskq_thread_t's taskq_ent_t pointer was replaced with its own static copy of the tqent_id. So, as a taskq_ent_t is pulled off of the taskq_t's pending list, a static copy of its tqent_id is made and this copy is used to sort the active thread list. Using a static copy is key in ensuring the integrity of the order of the active thread list. Even if the underlying taskq_ent_t is recursively dispatched (and has its tqent_id modified), this static copy stored inside the taskq_thread_t will remain constant. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #71	2011-12-16 13:26:54 -08:00
Prakash Surya	c2dceb5cd5	Add make rule for building Arch Linux packages Added the necessary build infrastructure for building packages compatible with the Arch Linux distribution. As such, one can now run: $ ./configure $ make pkg # Alternatively, one can run 'make arch' as well on an Arch Linux machine to create two binary packages compatible with the pacman package manager, one for the spl userland utilities and another for the spl kernel modules. The new packages can then be installed by running: # pacman -U $package.pkg.tar.xz In addition, source-only packages suitable for an Arch Linux chroot environment or remote builder can also be built using the 'sarch' make rule. NOTE: Since the source dist tarball is created on the fly from the head of the build tree, its MD5 hash signature will be continually in flux. As a result, the md5sum variable was intentionally omitted from the PKGBUILD files, and the '--skipinteg' makepkg option is used. This may or may not have any serious security implications, as the source tarball is not being downloaded from an outside source. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #68	2011-12-14 16:44:10 -08:00
Prakash Surya	44217f7aad	Implement taskq_dispatch_prealloc() interface This patch implements the taskq_dispatch_prealloc() interface which was introduced by the following illumos-gate commit. It allows for a preallocated taskq_ent_t to be used when dispatching items to a taskq. This eliminates a memory allocation which helps minimize lock contention in the taskq when dispatching functions. commit 5aeb94743e3be0c51e86f73096334611ae3a058e Author: Garrett D'Amore <garrett@nexenta.com> Date: Wed Jul 27 07:13:44 2011 -0700 734 taskq_dispatch_prealloc() desired 943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:57 -08:00
Prakash Surya	2c02b71b14	Replace tq_work_list and tq_threads in taskq_t To lay the groundwork for introducing the taskq_dispatch_prealloc() interface, the tq_work_list and tq_threads fields had to be replaced with new alternatives in the taskq_t structure. The tq_threads field was replaced with tq_thread_list. Rather than storing the pointers to the taskq's kernel threads in an array, they are now stored as a list. In addition to laying the groundwork for the taskq_dispatch_prealloc() interface, this change could also enable taskq threads to be dynamically created and destroyed as threads can now be added and removed to this list relatively easily. The tq_work_list field was replaced with tq_active_list. Instead of keeping a list of taskq_ent_t's which are currently being serviced, a list of taskq_threads currently servicing a taskq_ent_t is kept. This frees up the taskq_ent_t's tqent_list field when it is being serviced (i.e. now when a taskq_ent_t is being serviced, its tqent_list field will be empty). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #65	2011-12-13 16:10:50 -08:00
Prakash Surya	93806f58a6	Fix usage of MUTEX macro in mutex_enter_nested A call site of the MUTEX macro had incorrectly placed its closing parenthesis, causing two parameters to be passed rather than one. This change moves the misplaced parenthesis to fix the typographical error. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #70	2011-12-13 11:04:21 -08:00
Chris Dunlop	791dc876eb	Allow 64-bit timestamps to be set on 64-bit kernels ZFS and 64-bit linux are perfectly capable of dealing with 64-bit timestamps, but ZFS deliberately prevents setting them. Adjust the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit values and instead use the correct values for your kernel build. This effectively allows 64-bit timestamps on 64-bit systems. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #487	2011-12-12 11:06:03 -08:00
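The idea in 791dc876eb, choosing the overflow limits from the build's word size instead of always assuming 32 bits, can be sketched as below. The `DEMO_*` macros and `struct demo_timespec` are hypothetical; the pointer-width test stands in for whatever kernel configuration check the SPL actually uses, which the commit message does not spell out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timespec with an explicitly 64-bit seconds field. */
struct demo_timespec {
    int64_t tv_sec;
    long tv_nsec;
};

/* Assumption: pointer width is a reasonable proxy for a 64-bit build. */
#if UINTPTR_MAX > 0xFFFFFFFFu
#define DEMO_TIME_MAX INT64_MAX
#define DEMO_TIME_MIN INT64_MIN
#else
#define DEMO_TIME_MAX INT32_MAX
#define DEMO_TIME_MIN INT32_MIN
#endif

/* True when the seconds value falls outside the range this build supports. */
#define DEMO_TIMESPEC_OVERFLOW(ts) \
    ((ts)->tv_sec < DEMO_TIME_MIN || (ts)->tv_sec > DEMO_TIME_MAX)
```

On a 64-bit build a timestamp such as 2^40 seconds is accepted, while a 32-bit build still rejects it.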
Brian Behlendorf	1114ae6ae7	Prepend spl_ to all init/fini functions This is a bit of cleanup I'd been meaning to get to for a while to reduce the chance of a type conflict. Well that conflict finally occurred with the kstat_init() function which conflicts with a function in the 2.6.32-6-pve kernel. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #56	2011-11-11 09:18:28 -08:00
Brian Behlendorf	fe71c0e567	Linux 3.1 compat, shrink_*cache_memory As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory functions have been removed. This same task is now accomplished more cleanly with per super block shrinkers. This unfortunately leaves us no easy way to support the dnlc_reduce_cache() function. This support has always been entirely optional. So when no reasonable interface is available allow the dnlc_reduce_cache() function to effectively become a no-op. The downside of this change is that it will prevent the zfs arc meta data limits from being enforced. However, the current zfs implementation in this regard is already flawed and needs to be reworked. If the arc needs to enforce a meta data limit it will need to be extended to coordinate directly with the zpl. This will allow us to drop all this compatibility code and get more fine grained control over the cache management. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #52	2011-11-09 19:36:30 -08:00
Brian Behlendorf	0d0b523728	Linux 3.1 compat, vfs_fsync() Preferentially use the vfs_fsync() function. This function was initially introduced in 2.6.29 and took three arguments. As of 2.6.35 the dentry argument was dropped from the function. For older kernels fall back to using file_fsync() which also took three arguments including the dentry. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #52	2011-11-09 19:36:21 -08:00
Brian Behlendorf	12ff95ff57	Linux 3.1 compat, kern_path_parent() Prior to Linux 3.1 the kern_path_parent symbol was exported for use by kernel modules. As of Linux 3.1 it is no longer easily available. To handle this case the spl will now dynamically look up the address of the missing symbol at module load time. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #52	2011-11-09 16:51:25 -08:00
Gunnar Beutner	f5e76dea03	Cleaned up MUTEX() #define The old define assumed a specific layout of the kmutex_t struct. This patch makes the macro independent from the actual struct layout. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-19 09:59:32 -07:00
Gunnar Beutner	66cdc93b8c	Remove the spinlocks for mutex_enter()/mutex_exit() The m_owner variable is protected by the mutex itself. Reading the variable is guaranteed to be atomic (due to it being a word-sized reference) and ACCESS_ONCE() takes care of read cache effects. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-10-19 09:58:57 -07:00
Gunnar Beutner	3160d4f56b	Fix race condition in mutex_exit() On kernels with CONFIG_DEBUG_MUTEXES mutex_exit() clears the mutex owner after releasing the mutex. This would cause mutex_owner() to return an incorrect owner if another thread managed to lock the mutex before mutex_exit() had a chance to clear the owner. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #167	2011-10-19 09:58:41 -07:00
Gunnar Beutner	763b2f3b57	Fixed invalid resource re-use in file_find() File descriptors are a per-process resource. The same descriptor in different processes can refer to different files. file_find() incorrectly assumed that file descriptors are globally unique. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes ZFS issue #386	2011-10-11 09:51:51 -07:00
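To illustrate the point of 763b2f3b57, a cache lookup for open files has to match on the owning process as well as the descriptor number. The commit message does not describe how the fix was implemented, so the sketch below is an assumption: `struct demo_file` and `demo_file_find()` are hypothetical, and an opaque `task` pointer stands in for the process identity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry: a descriptor plus the process that owns it. */
struct demo_file {
    int fd;
    const void *task; /* opaque identity of the owning process */
};

/*
 * Match on BOTH the descriptor and its owner. Matching on fd alone would
 * return another process's entry whenever the numbers happen to coincide.
 */
static struct demo_file *demo_file_find(struct demo_file *tbl, size_t n,
                                        int fd, const void *task)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].fd == fd && tbl[i].task == task)
            return &tbl[i];
    }
    return NULL;
}
```

Two processes that both hold descriptor 3 then resolve to distinct entries, and a process with no entry gets `NULL` instead of someone else's file.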
Brian Behlendorf	86fd39f354	Linux 2.6.39 compat, mutex owner Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined the kernel stored a thread_info pointer as the mutex owner. From this you could get the pointer of the current task_struct to compare with get_current(). As of Linux 2.6.39 this behavior has changed and now the mutex stores a pointer to the task_struct. This commit detects the type of pointer stored in the mutex and adjusts the mutex_owner() and mutex_owned() functions to perform the correct comparison.	2011-06-24 13:00:08 -07:00
Darik Horn	0d54dcb566	Read the /etc/hostid file directly. Deprecate the /usr/bin/hostid call by reading the /etc/hostid file directly. Add the spl_hostid_path parameter to override the default /etc/hostid path. Rename the set_hostid() function to hostid_exec() to better reflect actual behavior and complement the new hostid_read() function. Use HW_INVALID_HOSTID as the spl_hostid sentinel value because zero seems to be a valid gethostid() result on Linux.	2011-06-24 09:58:03 -07:00
Brian Behlendorf	bf0c60c060	Add linux compatibility tests While the splat tests were originally designed to stress test the Solaris primitives, I am extending them to include some kernel compatibility tests. Certain linux APIs have changed frequently. These tests ensure that added compatibility is working properly and no unnoticed regressions have slipped in. Test 1 and 2 add basic regression tests for shrink_icache_memory and shrink_dcache_memory. These are simply functional tests to ensure we can call these functions safely. Checking for correct behavior is more difficult since other running processes will influence the behavior. However, these functions are provided by the kernel so if we can successfully call them we assume they are working correctly. Test 3 checks that shrinker functions are being registered and called correctly. As of Linux 3.0 the shrinker API has changed four different times so I felt the need to add a trivial test case to ensure each variant works as expected.	2011-06-21 14:02:46 -07:00
Brian Behlendorf	a55bcaad18	Linux 3.0: Shrinker compatibility Update the wrapper macros for the memory shrinker to handle this 4th API change. The callback function now takes a shrink_control structure. This is certainly a step in the right direction but it's annoying to have to accommodate yet another version of the API.	2011-06-21 14:02:39 -07:00
Brian Behlendorf	372c257233	Add TASKQ_NORECLAIM flag It has become necessary to be able to optionally disable direct memory reclaim for certain taskqs. To support this the TASKQ_NORECLAIM flag has been added which sets the PF_MEMALLOC bit for all threads in the taskq.	2011-05-06 15:23:58 -07:00
Brian Behlendorf	c1f95c2b94	Correct MAXUID The uid_t on most systems is in fact an unsigned 32-bit value. This is almost always correct, however you could compile your kernel to use an unsigned 16-bit value for uid_t. In practice I've never encountered a distribution which does this so I'm willing to overlook this corner case for now.	2011-04-29 13:58:45 -07:00
Gunnar Beutner	9d4b7c17a0	Renamed 'struct fid' for NFS Renamed 'struct fid' because its name conflicts with another struct in the Linux kernel headers. The fid_t typedef remains unchanged intentionally.	2011-04-29 12:10:54 -07:00
Brian Behlendorf	d837ae395b	Fix 32-bit MAXOFFSET_T definition The correct definition of MAXOFFSET_T under Solaris is in reality tied to the maximum size of a 'long long' type. With this in mind MAXOFFSET_T is now defined as LLONG_MAX which ensures the correct value is used on both 32-bit and 64-bit systems.	2011-04-22 16:17:13 -07:00
Darik Horn	fa6f7d8f9d	Import spl_hostid as a module parameter. Provide a call_usermodehelper() alternative by letting the hostid be passed as a module parameter like this: $ modprobe spl spl_hostid=0x12345678 Internally change the spl_hostid variable to unsigned long because that is the type that the coreutils /usr/bin/hostid returns. Move the hostid command into GET_HOSTID_CMD for consistency with the similar GET_KALLSYMS_ADDR_CMD invocation. Use argv[0] instead of sh_path for consistency internally and with other Linux drivers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-04-21 09:41:01 -07:00
Brian Behlendorf	3dfc591ac4	Linux 2.6.39 compat, zlib_deflate_workspacesize() The function zlib_deflate_workspacesize() now takes 2 arguments. This was done to avoid always having to allocate the maximum size workspace (268K). The caller can now specify the windowBits and memLevel compression parameters to get a smaller workspace. For our purposes we introduce a spl_zlib_deflate_workspacesize() wrapper which accepts both arguments. When the two argument version of zlib_deflate_workspacesize() is available the arguments are passed through. When it's not we assume the worst case and a maximally sized workspace is used.	2011-04-20 14:39:15 -07:00
Brian Behlendorf	b1cbc4610c	Linux 2.6.39 compat, kern_path_parent() The path_lookup() function has been renamed to kern_path_parent() and the flags argument has been removed. The only behavior now offered is that of LOOKUP_PARENT. The spl already always passed this flag so dropping the flag does not impact us.	2011-04-20 12:30:17 -07:00
Brian Behlendorf	9b0f9079d2	Linux 2.6.39 compat, invalidate_inodes() To resolve a potential filesystem corruption issue a second argument was added to invalidate_inodes(). This argument controls whether dirty inodes are dropped or treated as busy when invalidating a super block. When only the legacy API is available the second argument will be dropped for compatibility.	2011-04-19 09:08:08 -07:00
Brian Behlendorf	e76f4bf11d	Add dnlc_reduce_cache() support Provide the dnlc_reduce_cache() function which attempts to prune cached entries from the dcache and icache. After the entries are pruned any slabs which they may have been using are reaped. Note the API takes a reclaim percentage but we don't have easy access to the total number of cache entries to calculate the reclaim count. However, in practice this doesn't need to be exactly correct. We simply need to reclaim some useful fraction (but not all) of the cache. The caller can determine if more needs to be done.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	83150861e6	Decrease target objects per slab By decreasing the number of target objects per slab we increase the likelihood that a slab can be freed. This reduces the level of fragmentation in the slab which has been observed to be a problem for certain workloads. The penalty for this is that we also decrease the speed at which new objects can be allocated.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	3336e29cc2	Add slab usage summaries to /proc One of the most common things you want to know when looking at the slab is how much memory is being used. This information was available in /proc/spl/kmem/slab but only on a per-slab basis. This commit adds the following /proc/sys/kernel/spl/kmem/slab* entries to make total slab usage easily available at a glance. slab_kmem_total - Total kmem slab size slab_kmem_avail - Alloc'd kmem slab size slab_kmem_max - Max observed kmem slab size slab_vmem_total - Total vmem slab size slab_vmem_avail - Alloc'd vmem slab size slab_vmem_max - Max observed vmem slab size NOTE: The slab_*_max values are expected to over report because they show maximum values since boot, not current values.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	495bd532ab	Linux shrinker compat The Linux shrinker has gone through three API changes since 2.6.22. Rather than force every caller to understand all three APIs this change consolidates the compatibility code into the mm-compat.h header. The caller can then use a single spl provided shrinker API which does the right thing for your kernel. SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask); SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks); spl_register_shrinker(&shrinker_struct); spl_unregister_shrinker(&shrinker_struct); spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);	2011-04-06 20:06:03 -07:00
Brian Behlendorf	91cb1d91a4	Add .va_dentry helper While this extra structure memory does not exist under Solaris it is needed under Linux to pass the dentry. This allows the dentry to be easily instantiated before the inode is unlocked.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	734fcac78d	Add crgetfsuid()/crgetfsgid() helpers Solaris credentials don't have an fsuid/fsgid field but Linux credentials do. To handle this case the Solaris API is being modestly extended to include the crgetfsuid()/crgetfsgid() helper functions. Additionally, because the crget*() helpers are implemented identically regardless of HAVE_CRED_STRUCT they have been moved outside the #ifdef to common code. This simplification means we only have one version of the helper to keep up to date.	2011-03-22 12:18:44 -07:00
Brian Behlendorf	cb255ae572	Remove default GFP_NOFS allocations As originally described in commit `82b8c8fa64` this was done to prevent certain deadlocks from occurring in the system. However, as suspected the price for doing this proved to be too high. The VM is having a hard time effectively reclaiming memory thus we are reverting this change. However, we still need to fundamentally handle the issue. Under Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure a memory allocation will succeed. We leverage this fact and redefine KM_PUSHPAGE to include GFP_NOFS. This ensures that in these common I/O paths we don't trigger additional reclaim. This minimizes the change to the Solaris code.	2011-03-19 14:50:39 -07:00
Brian Behlendorf	6788762766	Linux 2.6.31 compat, include linux/seq_file.h Explicitly include the linux/seq_file.h header in vfs.h. This header is required for the sequence handlers and is included indirectly in newer kernels.	2011-03-07 13:52:00 -08:00
Brian Behlendorf	47995fa691	Remove xvattr support The xvattr support in the spl has always simply consisted of defining a couple structures and a few #defines. This was enough to enable compilation of code which just passed xvattr types around but not enough to effectively manipulate them. This change removes even this minimal support leaving it up to packages which leverage the spl to provide the full xvattr support. By removing it from the spl we ensure no conflict with the higher level packages. This just leaves minimal vnode support for basic manipulation of files. This code does have the proper support functions in the spl and a set of regression tests. Additionally, this change removed the unused 'caller_context_t *' type and replaces it with a 'void *'.	2011-03-02 11:34:46 -08:00
Brian Behlendorf	a4a1e1ecb4	Add TIMESPEC_OVERFLOW helper Add the TIMESPEC_OVERFLOW helper macro to allow easy checking of timespec overflow.	2011-03-02 11:34:43 -08:00
Brian Behlendorf	19c1eb829d	Add zlib regression test A zlib regression test has been added to verify the correct behavior of z_compress_level() and z_uncompress(). The test case simply takes a 128k buffer, it compresses the buffer, it then uncompresses the buffer, and finally it compares the buffers after the transform. If the buffers match then everything is fine and no data was lost. It performs this test for all 9 zlib compression levels.	2011-02-25 16:56:46 -08:00
Brian Behlendorf	5c1967ebe2	Fix zlib compression While portions of the code needed to support z_compress_level() and z_uncompress() were in place, in reality the current implementation was non-functional, it just was compilable. The critical missing component was to setup a workspace for the compress/uncompress stream structures to use. A kmem_cache was added for the workspace area because we require a large chunk of memory. This avoids the need to continually alloc/free this memory and vmap() the pages which is very slow. Several objects will reside in the per-cpu kmem_cache making them quick to acquire and release. A further optimization would be to adjust the implementation to additionally ensure the memory is local to the cpu. Currently that may not be the case.	2011-02-25 16:56:22 -08:00
Brian Behlendorf	5a52a782a0	Use Linux flock struct Rather than defining our own structure which will conflict with Linux's version when building 32-bit, simply setup a typedef to always use the correct Linux version for both 32 and 64-bit builds.	2011-02-23 14:32:15 -08:00
Brian Behlendorf	914b063133	Linux compat 2.6.37, invalidate_inodes() In the 2.6.37 kernel the function invalidate_inodes() is no longer exported for use by modules. This memory management functionality is needed to invalidate the inodes attached to a super block without unmounting the filesystem. Because this function still exists in the kernel and the prototype is available in a common header all we strictly need is the symbol address. The address is obtained using spl_kallsyms_lookup_name() and assigned to the variable invalidate_inodes_fn. Then a #define is used to replace all instances of invalidate_inodes() with a call to the acquired address. All the complexity is hidden behind HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual. Long term we should try to get this, or another, interface made available to modules again.	2011-02-23 12:44:32 -08:00
Brian Behlendorf	d599e4fa79	Block in cv_destroy() on all waiters Previously we would ASSERT in cv_destroy() if it was ever called with active waiters. However, I've now seen several instances in OpenSolaris code where they do the following: cv_broadcast(); cv_destroy(); This leaves no time for active waiters to be woken up and scheduled and we trip the ASSERT. This has not been observed to be an issue on OpenSolaris because their cv_destroy() basically does nothing. They still do run the risk of the memory being free'd after the cv_destroy() and hitting a bad paging request. But in practice this race is so small and unlikely it either doesn't happen, or is so unlikely when it does happen the root cause has not yet been identified. Rather than risk the same issue in our code this change updates cv_destroy() to block until all waiters have been woken and scheduled. This may take some time because each waiter must acquire the mutex. This change may have an impact on performance for frequently created and destroyed condition variables. That however is a price worth paying it avoid crashing your system. If performance issues are observed they can be addressed by the caller.	2011-02-04 14:09:08 -08:00
Brian Behlendorf	0aff071d18	Minor policy interface Simply add the policy function wrappers. They are completely non-functional and always return that everything is OK, but once again they simplify compilation of dependent packages for now. These can/should be removed once the security policy of the dependent application is completely understood and intergrade as appropriate with Linux.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	ef57fb98e4	Add missing headers Dependent packages require the following missing headers to simplify compilation. The headers are basically just stubbed out with minimal content required.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	3fc97f9335	Add VSA_ACE_* and MAX_ACL_ENTRIES defines The following flags are used to get the proper mask when getting and setting ACLs. I'm hopeful this can all largely go away at some point. We also add a define for the maximum number of ACL entries. MAX_ACL_ENTRIES is used as the maximum number of entries for each type.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	e2b25f698c	Add MAXUID define For Linux the maximum uid can vary depending on how your kernel is built. The Linux kernel still can be compiled with 16 bit uids and gids, although I'm not aware of a major distribution which does this (maybe an embedded one?). Given that caveat it is reasonably safe to define the MAXUID as 2147483647.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	5f46a517f1	Add FIGNORECASE define The FIGNORECASE define is now needed, place it with the related flags.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	3e5d3d3285	Add ksid_index_t and ksid_t types Add the ksid_index_t enum and ksid_t type for use. These types are now used by packages which depend on the SPL.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	d700637207	Minimal VFS additions This patch simply removes the place holder vfs_t type and includes some generic Linux VFS headers. It also makes some minor fid_t additions for compatibility.	2011-01-27 16:06:04 -08:00
Brian Behlendorf	647fa73cf3	Remove VN_HOLD/VN_RELE/VOP_PUTPAGE Previously these were defined to noops but rather than give the misleading impression that these are actually implemented I'm removing the type entirely for clarity.	2011-01-12 11:38:05 -08:00
Brian Behlendorf	bd6ac72b03	Add a few additional vnode #defines These additional constants now have users in dependent packages.	2011-01-12 11:38:05 -08:00
Brian Behlendorf	dcd9cb5a17	Clean vattr_t and vsecattr_t types Minor cleanup for the vattr_t and vsecattr_t types.	2011-01-12 11:38:04 -08:00
Brian Behlendorf	1b439713f1	FRSYNC Should Use O_SYNC The Solaris FRSYNC maps most logically to the Linux O_SYNC. There is no O_RSYNC on Linux but this wasn't noticed until just recently.	2011-01-12 11:38:04 -08:00
Brian Behlendorf	4295b530ee	Add vn_mode_to_vtype/vn_vtype_to_mode helpers Add simple helpers to convert a vnode->v_type to an inode->i_mode. These should be used sparingly but they are handy to have.	2011-01-12 11:38:04 -08:00
Neependra Khare	3f688a8c38	Add cv_timedwait_interruptible() function The cv_timedwait() function by definition must wait unconditionally for cv_signal()/cv_broadcast() before waking. This causes processes to go in the D state which increases the load average. The load average is the summation of processes in D state and run queue. To avoid this it can be desirable to sleep interruptibly. These processes do not count against the load average but may be woken by a signal. It is up to the caller to determine why the process was woken; it may be for one of three reasons: 1) cv_signal()/cv_broadcast() 2) the timeout expired 3) a signal was received Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-01-11 12:14:48 -08:00
Brian Behlendorf	6bf4d76f47	Linux Compat: inode->i_mutex/i_sem Create spl_inode_lock/spl_inode_unlock compatibility macros to simplify access to the inode mutex/sem. This avoids the need to ugly up the code with the required #define's at every call site. At the moment the SPL only uses this in one place but higher layers can benefit from the macro.	2011-01-11 12:14:48 -08:00
Brian Behlendorf	9fe45dc1ac	Add Thread Specific Data (TSD) Implementation Thread specific data has been implemented using a hash table, this avoids the need to add a member to the task structure and allows maximum portability between kernels. This implementation has been optimized to keep the tsd_set() and tsd_get() times as small as possible. The majority of the entries in the hash table are for specific tsd entries. These entries are hashed by the product of their key and pid because by design the key and pid are guaranteed to be unique. Their product also has the desirable property that it will be uniformly distributed over the hash bins providing neither the pid nor key is zero. Under linux the zero pid is always the init process and thus won't be used, and this implementation is careful never to assign a zero key. By default the hash table is sized to 512 bins which is expected to be sufficient for light to moderate usage of thread specific data. The hash table contains two additional types of entries. The first type of entry is called a 'key' entry and it is added to the hash during tsd_create(). It is used to store the address of the destructor function and it is used as an anchor point. All tsd entries which use the same key will be linked to this entry. This is used during tsd_destroy() to quickly call the destructor function for all tsd associated with the key. The 'key' entry may be looked up with tsd_hash_search() by passing the key you wish to lookup and DTOR_PID constant as the pid. The second type of entry is called a 'pid' entry and it is added to the hash the first time a process sets a key. The 'pid' entry is also used as an anchor and all tsd for the process will be linked to it. This list is used during tsd_exit() to ensure all registered destructors are run for the process. The 'pid' entry may be looked up with tsd_hash_search() by passing the PID_KEY constant as the key, and the process pid. 
Note that tsd_exit() is called by thread_exit() so if you're using the Solaris thread API you should not need to call tsd_exit() directly.	2010-12-07 10:02:32 -08:00
Ricardo M. Correia	c2f997b0b3	Make kmutex_t typesafe in all cases. When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just a typedef for struct mutex. This is generally OK but has the downside that it can allow mistakes such as mutex_lock(&kmutex_var) to pass unnoticed until someone compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which case kmutex_t is a real struct). Note that the correct API to call should have been mutex_enter() rather than mutex_lock(). We prevent these kinds of mistakes by making kmutex_t a real structure with only one field. This makes kmutex_t typesafe and it shouldn't have any impact on the generated assembly code. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 11:25:32 -08:00
Brian Behlendorf	058de03caa	Clear cv->cv_mutex when not in use For debugging purposes the condition variables keep track of the mutex used during a wait. The idea is to validate that all callers always use the same mutex. Unfortunately, we have seen cases where the caller reuses the condition variable with a different mutex but in a way which is known to be safe. My reading of the man pages suggests you should not do this and always cv_destroy()/cv_init() a new mutex. However, there is overhead in doing this and it does appear to be allowed under Solaris. To accommodate this behavior cv_wait_common() and __cv_timedwait() have been modified to clear the associated mutex when the last waiter is dropped. This ensures that while the condition variable is in use the incorrect mutex case is detected. It also allows the condition variable to be safely recycled without requiring the overhead of a cv_destroy()/cv_init() as long as it isn't currently in use. Finally, spin lock cv->cv_lock was removed because it is not required. When the condition variable is used properly the caller will always be holding the mutex so the spin lock is redundant. The lock was originally added because I expected to need to protect more than just the cv->cv_mutex. It turns out that was not the case. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 11:02:34 -08:00
Ned Bass	00ba7ef900	Give ENOTSUP a valid user space error value The ZFS module returns ENOTSUP for several error conditions where an operation is not (yet) supported. The SPL defined ENOTSUP in terms of ENOTSUPP, but that is an internal Linux kernel error code that should not be seen by user programs. As a result the zfs utilities print a confusing error message if an unsupported operation is attempted: internal error: Unknown error 524 Aborted This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with user space. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-10 13:25:49 -08:00
Brian Behlendorf	a50cede388	Linux 2.6.36 compat, wrap RLIM64_INFINITY As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h. This is handled by conditionally defining RLIM64_INFINITY in the SPL only when the kernel does not provide it.	2010-11-09 13:28:55 -08:00
Brian Behlendorf	8294c69bb7	Clear owner after dropping mutex It's important to clear mp->owner after calling mutex_unlock() because when CONFIG_DEBUG_MUTEXES is defined the mutex owner is verified in mutex_unlock(). If we set it to NULL this check fails and the lockdep support is immediately disabled.	2010-11-05 11:52:30 -07:00
Ricardo M. Correia	a68d91d770	atomic_*_nv() functions need to return the new value atomically. A local variable must be used for the return value to avoid a potential race once the spin lock is dropped. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-09-17 16:03:25 -07:00
Brian Behlendorf	a7958f7eef	Support custom build directories One of the neat tricks an autoconf style project is capable of is allowing configuration/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which works slightly differently. This means that changes need to be verified on each of those supported distributions preferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz tar -xzf spl-x.y.z.tar.gz cd spl-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This is something the project has almost supported for a long time but finishing this support should save me lots of time.	2010-09-05 21:49:05 -07:00
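The race described in the atomic_*_nv() entry above can be illustrated with a minimal user-space sketch. It assumes the lock-based fallback style the commit message implies; the names `sketch_atomic32_lock`, `sketch_inc_32_nv()` and `sketch_dec_32_nv()` are hypothetical, and a C11 `atomic_flag` stands in for the kernel spin lock. The point is only that the new value is captured in a local while the lock is held, rather than re-read from `*target` after the lock is dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the spin lock guarding 32-bit atomics. */
static atomic_flag sketch_atomic32_lock = ATOMIC_FLAG_INIT;

static void
sketch_lock(void)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&sketch_atomic32_lock,
	    memory_order_acquire))
		;
}

static void
sketch_unlock(void)
{
	atomic_flag_clear_explicit(&sketch_atomic32_lock,
	    memory_order_release);
}

/*
 * Returning *target after sketch_unlock() would be racy: another thread
 * could modify it in between. Capture the new value in a local instead.
 */
static uint32_t
sketch_inc_32_nv(volatile uint32_t *target)
{
	uint32_t nv;

	assert(target != NULL);
	sketch_lock();
	nv = ++(*target);
	sketch_unlock();

	return (nv);
}

static uint32_t
sketch_dec_32_nv(volatile uint32_t *target)
{
	uint32_t nv;

	assert(target != NULL);
	sketch_lock();
	nv = --(*target);
	sketch_unlock();

	return (nv);
}
```

A caller would use the return value directly, for example as a reference count after an increment, instead of dereferencing the shared variable again.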
Brian Behlendorf	73fc084e92	Move vendor check to spl-build.m4 This check was previously done with a hack in config.guess. However, since a new config.guess is copied in to place when forcing a full autoreconf this change was easily lost and never a good idea. This commit also updates all of the autoconf style support scripts in config.	2010-09-02 16:12:02 -07:00
Brian Behlendorf	8371f981f1	Add list_link_replace() function The list_link_replace() function will swap a new item into the place of an old item in a list. It is the caller's responsibility to ensure all lists involved are locked properly.	2010-08-27 14:23:48 -07:00
Brian Behlendorf	d85e28ad69	Add MUTEX_NOT_HELD() function Simply implement the missing MUTEX_NOT_HELD() function using the !MUTEX_HELD construct.	2010-08-27 14:23:48 -07:00
Brian Behlendorf	2b3543025c	Stub out kmem cache defrag API At some point we are going to need to implement the kmem cache move callbacks to allow for kmem cache defragmentation. This commit simply lays a small part of the API ground work, it does not actually implement any of this feature. This is safe for now because the move callbacks are just an optimization. Even if they are registered we don't ever really have to call them.	2010-08-27 14:23:42 -07:00
Brian Behlendorf	8dbd3fbd5e	Add missing atomic functions These functions were not previously needed so they were not added. Now they are so add the full set. atomic_inc_32_nv() atomic_dec_32_nv() atomic_inc_64_nv() atomic_dec_64_nv()	2010-08-27 13:02:55 -07:00
Li Wei	4be55565fe	Fix stack overflow in vn_rdwr() due to memory reclaim Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp mask we may enter memory reclaim during IO. In this case shrink_slab() entered another file system which is notoriously hungry for stack. This additional stack usage may cause a stack overflow. This patch removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The original mask is then restored at vn_close() time. Hats off to the loop driver which does something similar for the same reason. [...] shrink_slab+0xdc/0x153 try_to_free_pages+0x1da/0x2d7 __alloc_pages+0x1d7/0x2da do_generic_mapping_read+0x2c9/0x36f file_read_actor+0x0/0x145 __generic_file_aio_read+0x14f/0x19b generic_file_aio_read+0x34/0x39 do_sync_read+0xc7/0x104 vfs_read+0xcb/0x171 :spl:vn_rdwr+0x2b8/0x402 :zfs:vdev_file_io_start+0xad/0xe1 [...] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-12 09:34:33 -07:00
Ned Bass	46aa7b3939	Correctly handle rwsem_is_locked() behavior A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was backported to RHEL5 as of kernel 2.6.18-190.el5. Details can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=526092 The race condition was fixed in the kernel by acquiring the semaphore's wait_lock inside rwsem_is_locked(). The SPL worked around the race condition by acquiring the wait_lock before calling that function, but with the fix in place it must not do that. This commit implements an autoconf test to detect whether the fixed version of rwsem_is_locked() is present. The previous version of rwsem_is_locked() was an inline static function while the new version is exported as a symbol which we can check for in module.symvers. Depending on the result we correctly implement the needed compatibility macros for proper spinlock handling. Finally, we do the right thing with spin locks in RW_*_HELD() by using the new compatibility macros. We only acquire the semaphore's wait_lock if it is calling a rwsem_is_locked() that does not itself try to acquire the lock. Some new overhead and a small harmless race is introduced by this change. This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release the wait_lock twice: once for the call to rwsem_is_locked() and once for the call to rw_owner(). This can't be avoided if calling a rwsem_is_locked() that takes the wait_lock, as it will in more recent kernels. The other case which only occurs in legacy kernels could be optimized by taking the lock only once, as was done prior to this commit. However, I decided that the performance gain probably wasn't significant enough to justify the messy special cases required. The function spl_rw_get_owner() was only used to enable the afore-mentioned optimization. Since it is no longer used, I removed it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-10 16:43:00 -07:00
Brian Behlendorf	099dc9c2d2	Add uninstall Makefile targets Extend the Makefiles with an uninstall target to cleanly remove a package which was installed with 'make install'. Additionally, ensure a 'depmod -a' is run as part of the install to update the module dependency information.	2010-07-28 14:55:32 -07:00
Brian Behlendorf	287b2fb117	Add Debian and Slackware style packaging via alien The long term fix for Debian and Slackware style packaging is to add native support for building these packages. Unfortunately, that is a large chunk of work I don't have time for right now. That said it would be nice to have at least basic packages for these distributions. As a quick short/medium term solution I've settled on using alien to convert the RPM packages to DEB or TGZ style packages. The build system has been updated with the following build targets which will first build RPM packages and then convert them as needed to the target package type: make rpm: Create .rpm packages make deb: Create .deb packages make tgz: Create .tgz packages make pkg: Create the right package type for your distribution The solution comes with a lot of caveats and your mileage may vary. But basically the big limitations are that the resulting packages: 1) Will not have the correct dependency information. 2) Will not include the kernel version in the release. 3) Will not handle all differences between distributions. But the resulting packages should be easy to install and remove from your system and take care of running 'depmod -a' and such. As I said at the top this is not the right long term solution. If any of the upstream distribution maintainers want to jump in and help do this right for their distribution I'd love the help.	2010-07-27 15:52:34 -07:00
Brian Behlendorf	10129680f8	Ensure kmem_alloc() and vmem_alloc() never fail The Solaris semantics for kmem_alloc() and vmem_alloc() are that they must never fail when called with KM_SLEEP. They may only fail if called with KM_NOSLEEP otherwise they must block until memory is available. This is quite different from how the Linux memory allocators work, under Linux a memory allocation failure is always possible and must be dealt with. At one point in the past the kmem code did properly implement this behavior, however as the code evolved this behavior was overlooked in places. This patch goes through all three implementations of the kmem/vmem allocation functions and ensures that they will all block in the KM_SLEEP case when memory is not available. They may still fail in the KM_NOSLEEP case in which case the caller is responsible for handling the failure. Special care is taken in vmalloc_nofail() to avoid thrashing the system on the virtual address space spin lock. The down side of course is if you do see a failure here, which is unlikely for 64-bit systems, your allocation will delay for an entire second. Still this is preferable to locking up your system and it is the best we can do given the constraints. Additionally, the code was cleaned up to be much more readable and comments were added to describe the various kmem-debug-* configure options. The default configure options remain: "--enable-debug-kmem --disable-debug-kmem-tracking"	2010-07-26 15:47:55 -07:00
Ricardo M. Correia	15b52c083e	Fix max_ncpus definition. It was being defined as the constant 64 and at first I changed it to be NR_CPUS instead. However, NR_CPUS can be a large value on recent kernels (4096), and this may cause too large kmem allocations to happen. Therefore, now we use num_possible_cpus(), which should return a (typically) small value which represents the maximum number of CPUs that can be brought online in the running hardware (this value is determined at boot time by arch-specific kernel code). Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 15:49:25 -07:00
Ricardo M. Correia	81672c0122	Display DEBUG keyword during module load when --enable-debug is used. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 15:31:03 -07:00
Ricardo M. Correia	9dd5d138b2	Fix bcopy() to allow memory area overlap Under Solaris bcopy() allows overlapping memory areas so we must use memmove() instead of memcpy(). Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 13:48:53 -07:00
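A minimal user-space sketch of the bcopy() fix above, assuming the Solaris argument order of (source, destination, length). The name `sketch_bcopy()` is hypothetical; the only point shown is that `memmove()` tolerates overlapping regions, whereas `memcpy()` has undefined behavior when they overlap.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical wrapper: Solaris-style bcopy(src, dst, len) must allow
 * overlapping areas, so it is built on memmove() rather than memcpy().
 */
static void
sketch_bcopy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	memmove(dst, src, len);
}
```

For example, shifting the first four bytes of a six-byte buffer two positions to the right is a valid call here, because source and destination overlap by two bytes.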
Ricardo M. Correia	22cd0f19b1	Fix compilation error due to undefined ACCESS_ONCE macro. When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes store the owner for debugging purposes, therefore the SPL will enable HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the owner, and this macro is not defined in the RHEL5 kernel, therefore we define it ourselves in include/linux/compiler_compat.h. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 13:47:52 -07:00
Brian Behlendorf	b17edc10a9	Prefix all SPL debug macros with 'S' To avoid conflicts with symbols defined by dependent packages all debugging symbols have been prefixed with a 'S' for SPL. Any dependent package needing to integrate with the SPL debug should include the spl-debug.h header and use the 'S' prefixed macros. They must also build with DEBUG defined.	2010-07-20 13:30:40 -07:00
Brian Behlendorf	55abb0929e	Split <sys/debug.h> header To avoid symbol conflicts with dependent packages the debug header must be split in to several parts. The <sys/debug.h> header now only contains the Solaris macros such as ASSERT and VERIFY. The spl-debug.h header contains the spl specific debugging infrastructure and should be included by any package which needs to use the spl logging. Finally the spl-trace.h header contains internal data structures only used for the log facility and should not be included by anything but spl-debug.c. This way dependent packages can include the standard Solaris headers without picking up any SPL debug macros. However, if the dependent package wants to integrate with the SPL debugging subsystem it can then explicitly include spl-debug.h. Along with this change I have dropped the CHECK_STACK macros because the upstream Linux kernel now has much better stack depth checking built in and we don't need this complexity. Additionally SBUG has been replaced with PANIC and provided as part of the Solaris macro set. While the Solaris version is really panic() that conflicts with the Linux kernel so we'll just have to make do with PANIC. It should rarely be called directly, the preferred usage would be an ASSERT or VERIFY. There's lots of change here but this cleanup was overdue.	2010-07-20 13:29:35 -07:00
Brian Behlendorf	f0ff89fc86	Linux 2.6.35 compat: filp_fsync() dropped 'struct dentry *' The prototype for filp_fsync() drops the unused argument 'struct dentry *'. I've fixed this by adding the needed autoconf check and moving all of those filp related functions to file_compat.h. This will simplify handling any further API changes in the future.	2010-07-14 11:40:55 -07:00
Brian Behlendorf	82b8c8fa64	Proposed fix for low memory ZFS deadlocks Deadlocks in the zvol were observed when one of the ZFS threads performing IO tries to allocate memory while the system is low on memory. The low memory condition causes dirty pages to be synced to the zvol but this can't progress because the original thread is blocked waiting on a memory allocation. Thus we end up deadlocking. A proper solution proposed by Wizeman is to change KM_SLEEP from GFP_KERNEL to GFP_NOFS. This will prevent the memory allocation which is trying to allocate memory from forcing a sync to the zvol in shrink_page_list()->pageout(). The down side to all of this is that we are using a pretty big hammer by changing KM_SLEEP. This change means ALL of the zfs memory allocations will be unable to trigger dirty data to be synced. The caller still should be able to reclaim memory from the various slab caches. We will be totally dependent on other kernel processes which happen to be running and a small number of asynchronous reclaim threads to trigger the reclaim of dirty data pages. This should be OK but I think we may see some slightly longer allocation times when under memory pressure. We shall see.	2010-07-13 21:30:56 -07:00
Brian Behlendorf	a4bfd8ea1b	Add __divdi3(), remove __udivdi3() kernel dependency Up until now no SPL consumer attempted to perform signed 64-bit division so there was no need to support this. That has now changed so I am adding 64-bit division support for 32-bit platforms. The signed implementation is based on the unsigned version. Since there have been several bug reports in the past concerning correct 64-bit division on 32-bit platforms I added some long overdue regression tests. Much to my surprise the unsigned 64-bit division regression tests failed. This was surprising because __udivdi3() was implemented by simply calling div64_u64() which is provided by the kernel. This meant that the linux kernel's 64-bit division algorithm on 32-bit platforms was flawed. After some investigation this turned out to be exactly the case. Because of this I was forced to abandon the kernel helper and instead to fully implement 64-bit division in the spl. There are several published implementations out there on how to do this properly and I settled on one proposed in the book Hacker's Delight. Their proposed algorithm is freely available without restriction and I have just modified it to be linux kernel friendly. The updated implementation now passes all the unsigned and signed regression tests. This should be functional, but not fast, which is good enough for our purposes. If you want fast too I'd strongly suggest you upgrade to a 64-bit platform. I have also reported the kernel bug and we'll see if we can't get it fixed upstream.	2010-07-13 16:44:02 -07:00
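The statement above that the signed implementation is based on the unsigned version can be sketched as follows. This is an illustration under stated assumptions, not the SPL code: `sketch_udiv64()` simply uses the native `/` operator as a placeholder for the Hacker's Delight routine the commit mentions, and both function names are hypothetical. The sketch shows only the sign handling layered on top of an unsigned divide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Placeholder for the unsigned 64-bit divide. The real routine would be
 * a software implementation for 32-bit platforms; here the native
 * operator is used so the sketch stays short.
 */
static uint64_t
sketch_udiv64(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n / d);
}

/*
 * Signed divide built on the unsigned one: divide the magnitudes, then
 * negate the quotient when exactly one operand is negative. Magnitudes
 * are computed in unsigned arithmetic so INT64_MIN does not overflow.
 * As with native division, INT64_MIN / -1 is not representable.
 */
static int64_t
sketch_div64(int64_t n, int64_t d)
{
	int negate = (n < 0) != (d < 0);
	uint64_t un = (n < 0) ? (0 - (uint64_t)n) : (uint64_t)n;
	uint64_t ud = (d < 0) ? (0 - (uint64_t)d) : (uint64_t)d;
	uint64_t q = sketch_udiv64(un, ud);

	return (negate ? (int64_t)(0 - q) : (int64_t)q);
}
```

The quotient truncates toward zero, which matches C's native signed division and gives regression tests a simple oracle on a 64-bit host.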
Ned Bass	f0d8bb26b4	Implementation of the TQ_FRONT flag. Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker threads pull tasks from this high priority queue before the default pending queue. Executing tasks out of FIFO order potentially breaks taskq_lowest_id() if we do not preserve the ordering of the work list by taskqid. Therefore, instead of always appending to the work list, we search for the appropriate place to insert a task. The common case is to append to the list, so we make this operation efficient by searching the work list in reverse order. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-01 10:59:38 -07:00
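The ordered insert described in the TQ_FRONT entry above can be sketched with a plain circular doubly linked list. The type and function names (`sketch_task_t`, `sketch_work_insert()`, `sketch_lowest_id()`) are hypothetical and no locking is shown; the sketch only illustrates why searching from the tail makes the common append case cheap while keeping the list sorted by task id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-list node; the list head is a sentinel node. */
typedef struct sketch_task {
	unsigned long id;
	struct sketch_task *prev;
	struct sketch_task *next;
} sketch_task_t;

static void
sketch_list_init(sketch_task_t *head)
{
	head->id = 0;
	head->prev = head;
	head->next = head;
}

/*
 * Insert t so the list stays sorted by ascending id. The search starts
 * at the tail, so a task with the highest id so far (the common case)
 * is placed after a single comparison.
 */
static void
sketch_work_insert(sketch_task_t *head, sketch_task_t *t)
{
	sketch_task_t *pos = head->prev;

	while (pos != head && pos->id > t->id)
		pos = pos->prev;

	/* Link t immediately after pos. */
	t->prev = pos;
	t->next = pos->next;
	pos->next->prev = t;
	pos->next = t;
}

/* With the list sorted, the lowest outstanding id is the first entry. */
static unsigned long
sketch_lowest_id(const sketch_task_t *head)
{
	assert(head->next != head);
	return (head->next->id);
}
```

If a high-priority task with id 6 starts executing after tasks 5 and 7 are already on the work list, it lands between them and the lowest-id lookup still returns 5.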
Brian Behlendorf	c950d1480d	Only make compiler warnings fatal with --enable-debug While in theory I like the idea of compiler warnings always being fatal, in practice this causes problems when small harmless errors cause build failures for end users. To handle this I've updated the build system such that -Werror is only used when --enable-debug is passed to configure. This is how I always build when developing so I'll catch all build warnings and end users will not get stuck by minor issues.	2010-06-30 17:05:36 -07:00
Brian Behlendorf	6801b7154c	Linux-2.6.33 compat, O_DSYNC flag added Prior to linux-2.6.33 only O_DSYNC semantics were implemented and they used the O_SYNC flag. As of linux-2.6.33 this behavior was properly split in to O_SYNC and O_DSYNC respectively.	2010-06-30 12:49:39 -07:00
Brian Behlendorf	79a3bf130b	Linux-2.6.33 compat, .ctl_name removed from struct ctl_table As of linux-2.6.33 the ctl_name member of the ctl_table struct has been entirely removed. The upstream code has been updated to depend entirely on the procname member. To handle this all references to ctl_name are wrapped in a CTL_NAME macro which simply expands to nothing for newer kernels. Older kernels are supported by having it expand to .ctl_name = X just as before.	2010-06-30 12:49:12 -07:00
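The CTL_NAME wrapping described in the last entry can be sketched in user space as below. The structure `sketch_ctl_table`, the `HAVE_CTL_NAME` guard, the table contents and the lookup helper are all assumptions made for illustration; only the idea that the macro expands to a `.ctl_name = X` initializer on older kernels and to nothing on newer ones comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct ctl_table on old and new kernels. */
struct sketch_ctl_table {
#ifdef HAVE_CTL_NAME
	int ctl_name;		/* present only on older kernels */
#endif
	const char *procname;
};

/* Assumed guard name: expands to an initializer or to nothing. */
#ifdef HAVE_CTL_NAME
#define	CTL_NAME(cname)		.ctl_name = (cname),
#else
#define	CTL_NAME(cname)
#endif

/* The same table source builds with or without the ctl_name member. */
static const struct sketch_ctl_table sketch_table[] = {
	{ CTL_NAME(1) .procname = "debug" },
	{ CTL_NAME(2) .procname = "hostid" },
};

/* Newer kernels identify entries by procname alone. */
static const struct sketch_ctl_table *
sketch_ctl_find(const char *procname)
{
	size_t i;

	assert(procname != NULL);
	for (i = 0; i < sizeof (sketch_table) / sizeof (sketch_table[0]); i++) {
		if (strcmp(sketch_table[i].procname, procname) == 0)
			return (&sketch_table[i]);
	}

	return (NULL);
}
```

Keeping the trailing comma inside the macro is what lets each table row compile unchanged in both configurations.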

496 Commits