mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2025-08-06 15:07:39 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	d837ae395b	Fix 32-bit MAXOFFSET_T definition The correct definition of MAXOFFSET_T under Solaris is in reality tied to the maximum size of a 'long long' type. With this in mind MAXOFFSET_T is now defined as LLONG_MAX which ensures the correct value is used on both 32-bit and 64-bit systems.	2011-04-22 16:17:13 -07:00
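A minimal userspace sketch of the definition described in d837ae395b. The `offset_would_overflow()` helper is hypothetical and only illustrates why a bound tied to `long long` behaves the same on 32-bit and 64-bit builds; it is not code from the SPL.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for the SPL definition described in the commit message. */
#define MAXOFFSET_T LLONG_MAX

/*
 * Hypothetical helper: returns 1 if off + len would exceed MAXOFFSET_T,
 * 0 otherwise. The comparison is rearranged so the check itself cannot
 * overflow a long long.
 */
static int offset_would_overflow(long long off, long long len)
{
    assert(off >= 0 && len >= 0);
    return len > MAXOFFSET_T - off;
}
```

Because `long long` is at least 64 bits wide on every conforming C99 implementation, the limit does not shrink when `long` is only 32 bits.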
Darik Horn	fa6f7d8f9d	Import spl_hostid as a module parameter. Provide a call_usermodehelper() alternative by letting the hostid be passed as a module parameter like this: $ modprobe spl spl_hostid=0x12345678 Internally change the spl_hostid variable to unsigned long because that is the type that the coreutils /usr/bin/hostid returns. Move the hostid command into GET_HOSTID_CMD for consistency with the similar GET_KALLSYMS_ADDR_CMD invocation. Use argv[0] instead of sh_path for consistency internally and with other Linux drivers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-04-21 09:41:01 -07:00
Brian Behlendorf	3dfc591ac4	Linux 2.6.39 compat, zlib_deflate_workspacesize() The function zlib_deflate_workspacesize() now takes 2 arguments. This was done to avoid always having to allocate the maximum size workspace (268K). The caller can now specify the windowBits and memLevel compression parameters to get a smaller workspace. For our purposes we introduce a spl_zlib_deflate_workspacesize() wrapper which accepts both arguments. When the two argument version of zlib_deflate_workspacesize() is available the arguments are passed through. When it's not we assume the worst case and a maximally sized workspace is used.	2011-04-20 14:39:15 -07:00
Brian Behlendorf	e76f4bf11d	Add dnlc_reduce_cache() support Provide the dnlc_reduce_cache() function which attempts to prune cached entries from the dcache and icache. After the entries are pruned any slabs which they may have been using are reaped. Note the API takes a reclaim percentage but we don't have easy access to the total number of cache entries to calculate the reclaim count. However, in practice this doesn't need to be exactly correct. We simply need to reclaim some useful fraction (but not all) of the cache. The caller can determine if more needs to be done.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	83150861e6	Decrease target objects per slab By decreasing the number of target objects per slab we increase the likelihood that a slab can be freed. This reduces the level of fragmentation in the slab which has been observed to be a problem for certain workloads. The penalty for this is that we also decrease the speed at which new objects can be allocated.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	3336e29cc2	Add slab usage summaries to /proc One of the most common things you want to know when looking at the slab is how much memory is being used. This information was available in /proc/spl/kmem/slab but only on a per-slab basis. This commit adds the following /proc/sys/kernel/spl/kmem/slab* entries to make total slab usage easily available at a glance. slab_kmem_total - Total kmem slab size slab_kmem_avail - Alloc'd kmem slab size slab_kmem_max - Max observed kmem slab size slab_vmem_total - Total vmem slab size slab_vmem_avail - Alloc'd vmem slab size slab_vmem_max - Max observed vmem slab size NOTE: The slab_*_max values are expected to over-report because they show maximum values since boot, not current values.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	91cb1d91a4	Add .va_dentry helper While this extra structure memory does not exist under Solaris it is needed under Linux to pass the dentry. This allows the dentry to be easily instantiated before the inode is unlocked.	2011-04-06 20:06:03 -07:00
Brian Behlendorf	734fcac78d	Add crgetfsuid()/crgetfsgid() helpers Solaris credentials don't have an fsuid/fsgid field but Linux credentials do. To handle this case the Solaris API is being modestly extended to include the crgetfsuid()/crgetfsgid() helper functions. Additionally, because the crget*() helpers are implemented identically regardless of HAVE_CRED_STRUCT they have been moved outside the #ifdef to common code. This simplification means we only have one version of the helper to keep up to date.	2011-03-22 12:18:44 -07:00
Brian Behlendorf	cb255ae572	Remove default GFP_NOFS allocations As originally described in commit `82b8c8fa64` this was done to prevent certain deadlocks from occurring in the system. However, as suspected the price for doing this proved to be too high. The VM is having a hard time effectively reclaiming memory thus we are reverting this change. However, we still need to fundamentally handle the issue. Under Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure a memory allocation will succeed. We leverage this fact and redefine KM_PUSHPAGE to include GFP_NOFS. This ensures that in these common I/O paths we don't trigger additional reclaim. This minimizes the change to the Solaris code.	2011-03-19 14:50:39 -07:00
Brian Behlendorf	6788762766	Linux 2.6.31 compat, include linux/seq_file.h Explicitly include the linux/seq_file.h header in vfs.h. This header is required for the sequence handlers and is included indirectly in newer kernels.	2011-03-07 13:52:00 -08:00
Brian Behlendorf	47995fa691	Remove xvattr support The xvattr support in the spl has always simply consisted of defining a couple structures and a few #defines. This was enough to enable compilation of code which just passed xvattr types around but not enough to effectively manipulate them. This change removes even this minimal support leaving it up to packages which leverage the spl to provide the full xvattr support. By removing it from the spl we ensure no conflict with the higher level packages. This just leaves minimal vnode support for basic manipulation of files. This code does have the proper support functions in the spl and a set of regression tests. Additionally, this change removes the unused 'caller_context_t *' type and replaces it with a 'void *'.	2011-03-02 11:34:46 -08:00
Brian Behlendorf	a4a1e1ecb4	Add TIMESPEC_OVERFLOW helper Add the TIMESPEC_OVERFLOW helper macro to allow easy checking of timespec overflow.	2011-03-02 11:34:43 -08:00
Brian Behlendorf	5c1967ebe2	Fix zlib compression While portions of the code needed to support z_compress_level() and z_uncompress() were in place, in reality the current implementation was non-functional; it was merely compilable. The critical missing component was to set up a workspace for the compress/uncompress stream structures to use. A kmem_cache was added for the workspace area because we require a large chunk of memory. This avoids the need to continually alloc/free this memory and vmap() the pages, which is very slow. Several objects will reside in the per-cpu kmem_cache making them quick to acquire and release. A further optimization would be to adjust the implementation to additionally ensure the memory is local to the cpu. Currently that may not be the case.	2011-02-25 16:56:22 -08:00
Brian Behlendorf	5a52a782a0	Use Linux flock struct Rather than defining our own structure, which will conflict with Linux's version when building 32-bit, simply set up a typedef to always use the correct Linux version for both 32 and 64-bit builds.	2011-02-23 14:32:15 -08:00
Brian Behlendorf	d599e4fa79	Block in cv_destroy() on all waiters Previously we would ASSERT in cv_destroy() if it was ever called with active waiters. However, I've now seen several instances in OpenSolaris code where they do the following: cv_broadcast(); cv_destroy(); This leaves no time for active waiters to be woken up and scheduled and we trip the ASSERT. This has not been observed to be an issue on OpenSolaris because their cv_destroy() basically does nothing. They still do run the risk of the memory being free'd after the cv_destroy() and hitting a bad paging request. But in practice this race is so small and unlikely it either doesn't happen, or is so unlikely when it does happen the root cause has not yet been identified. Rather than risk the same issue in our code this change updates cv_destroy() to block until all waiters have been woken and scheduled. This may take some time because each waiter must acquire the mutex. This change may have an impact on performance for frequently created and destroyed condition variables. That however is a price worth paying to avoid crashing your system. If performance issues are observed they can be addressed by the caller.	2011-02-04 14:09:08 -08:00
Brian Behlendorf	0aff071d18	Minor policy interface Simply add the policy function wrappers. They are completely non-functional and always return that everything is OK, but once again they simplify compilation of dependent packages for now. These can/should be removed once the security policy of the dependent application is completely understood and integrated as appropriate with Linux.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	ef57fb98e4	Add missing headers Dependent packages require the following missing headers to simplify compilation. The headers are basically just stubbed out with minimal content required.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	3fc97f9335	Add VSA_ACE_* and MAX_ACL_ENTRIES defines The following flags are used to get the proper mask when getting and setting ACLs. I'm hopeful this can all largely go away at some point. We also add a define for the maximum number of ACL entries. MAX_ACL_ENTRIES is used as the maximum number of entries for each type.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	e2b25f698c	Add MAXUID define For Linux the maximum uid can vary depending on how your kernel is built. The Linux kernel still can be compiled with 16 bit uids and gids, although I'm not aware of a major distribution which does this (maybe an embedded one?). Given that caveat it is reasonably safe to define the MAXUID as 2147483647.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	5f46a517f1	Add FIGNORECASE define The FIGNORECASE define is now needed; place it with the related flags.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	3e5d3d3285	Add ksid_index_t and ksid_t types Add the ksid_index_t enum and ksid_t type for use. These types are now used by packages which depend on the SPL.	2011-01-27 16:06:09 -08:00
Brian Behlendorf	d700637207	Minimal VFS additions This patch simply removes the place holder vfs_t type and includes some generic Linux VFS headers. It also makes some minor fid_t additions for compatibility.	2011-01-27 16:06:04 -08:00
Brian Behlendorf	647fa73cf3	Remove VN_HOLD/VN_RELE/VOP_PUTPAGE Previously these were defined to noops but rather than give the misleading impression that these are actually implemented I'm removing the type entirely for clarity.	2011-01-12 11:38:05 -08:00
Brian Behlendorf	bd6ac72b03	Add a few additional vnode #defines These additional constants now have users in dependent packages.	2011-01-12 11:38:05 -08:00
Brian Behlendorf	dcd9cb5a17	Clean vattr_t and vsecattr_t types Minor cleanup for the vattr_t and vsecattr_t types.	2011-01-12 11:38:04 -08:00
Brian Behlendorf	1b439713f1	FRSYNC Should Use O_SYNC The Solaris FRSYNC maps most logically to the Linux O_SYNC. There is no O_RSYNC on Linux but this wasn't noticed until just recently.	2011-01-12 11:38:04 -08:00
Brian Behlendorf	4295b530ee	Add vn_mode_to_vtype/vn_vtype_to_mode helpers Add simple helpers to convert a vnode->v_type to an inode->i_mode. These should be used sparingly but they are handy to have.	2011-01-12 11:38:04 -08:00
Neependra Khare	3f688a8c38	Add cv_timedwait_interruptible() function The cv_timedwait() function by definition must wait unconditionally for cv_signal()/cv_broadcast() before waking. This causes processes to go in the D state which increases the load average. The load average is the summation of processes in D state and run queue. To avoid this it can be desirable to sleep interruptibly. These processes do not count against the load average but may be woken by a signal. It is up to the caller to determine why the process was woken; it may be for one of three reasons: 1) cv_signal()/cv_broadcast() 2) the timeout expired 3) a signal was received Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2011-01-11 12:14:48 -08:00
Brian Behlendorf	9fe45dc1ac	Add Thread Specific Data (TSD) Implementation Thread specific data has been implemented using a hash table; this avoids the need to add a member to the task structure and allows maximum portability between kernels. This implementation has been optimized to keep the tsd_set() and tsd_get() times as small as possible. The majority of the entries in the hash table are for specific tsd entries. These entries are hashed by the product of their key and pid because by design the key and pid are guaranteed to be unique. Their product also has the desirable property that it will be uniformly distributed over the hash bins providing neither the pid nor key is zero. Under Linux the zero pid is always the init process and thus won't be used, and this implementation is careful never to assign a zero key. By default the hash table is sized to 512 bins which is expected to be sufficient for light to moderate usage of thread specific data. The hash table contains two additional types of entries. The first type of entry is called a 'key' entry and it is added to the hash during tsd_create(). It is used to store the address of the destructor function and it is used as an anchor point. All tsd entries which use the same key will be linked to this entry. This is used during tsd_destroy() to quickly call the destructor function for all tsd associated with the key. The 'key' entry may be looked up with tsd_hash_search() by passing the key you wish to look up and the DTOR_PID constant as the pid. The second type of entry is called a 'pid' entry and it is added to the hash the first time a process sets a key. The 'pid' entry is also used as an anchor and all tsd for the process will be linked to it. This list is used during tsd_exit() to ensure all registered destructors are run for the process. The 'pid' entry may be looked up with tsd_hash_search() by passing the PID_KEY constant as the key, and the process pid. Note that tsd_exit() is called by thread_exit() so if you're using the Solaris thread API you should not need to call tsd_exit() directly.	2010-12-07 10:02:32 -08:00
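The hashing scheme in 9fe45dc1ac can be sketched in userspace roughly as follows. Everything prefixed `sketch_` is hypothetical: the commit message only says that entries are hashed by the product of key and pid into 512 bins, so the multiplicative mixing constant, the chained buckets, and the function signatures here are assumptions, not the SPL's code. The 'key' and 'pid' anchor entries are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_TSD_HASH_BITS 9                        /* 2^9 = 512 bins */
#define SKETCH_TSD_HASH_BINS (1u << SKETCH_TSD_HASH_BITS)

struct sketch_tsd_entry {
    unsigned key;
    unsigned pid;
    void *value;
    struct sketch_tsd_entry *next;                    /* bucket chain */
};

static struct sketch_tsd_entry *sketch_bins[SKETCH_TSD_HASH_BINS];

/* Hash the (key, pid) product; neither factor may be zero. */
static unsigned sketch_tsd_hash(unsigned key, unsigned pid)
{
    uint64_t product;

    assert(key != 0 && pid != 0);
    product = (uint64_t)key * (uint64_t)pid;
    /* Assumed mixing step: multiplicative hash, keep the top 9 bits. */
    return (unsigned)((product * 0x9E3779B97F4A7C15ULL) >>
        (64 - SKETCH_TSD_HASH_BITS));
}

/* Look up the entry for (key, pid), or NULL if none is stored. */
static struct sketch_tsd_entry *sketch_tsd_find(unsigned key, unsigned pid)
{
    struct sketch_tsd_entry *e = sketch_bins[sketch_tsd_hash(key, pid)];

    /* Compare both fields: different pairs can share a product. */
    while (e != NULL && !(e->key == key && e->pid == pid))
        e = e->next;
    return e;
}

/* Store value for (key, pid); returns 0 on success, -1 on failure. */
static int sketch_tsd_set(unsigned key, unsigned pid, void *value)
{
    struct sketch_tsd_entry *e = sketch_tsd_find(key, pid);
    unsigned bin = sketch_tsd_hash(key, pid);

    if (e == NULL) {
        e = malloc(sizeof(*e));
        if (e == NULL)
            return -1;
        e->key = key;
        e->pid = pid;
        e->next = sketch_bins[bin];
        sketch_bins[bin] = e;
    }
    e->value = value;
    return 0;
}

/* Fetch the value for (key, pid), or NULL if it was never set. */
static void *sketch_tsd_get(unsigned key, unsigned pid)
{
    struct sketch_tsd_entry *e = sketch_tsd_find(key, pid);

    return e != NULL ? e->value : NULL;
}
```

Pairs such as (2, 6) and (3, 4) have the same product and therefore land in the same bin, which is why the lookup compares key and pid rather than trusting the hash alone.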
Ricardo M. Correia	c2f997b0b3	Make kmutex_t typesafe in all cases. When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just a typedef for struct mutex. This is generally OK but has the downside that it can let mistakes such as mutex_lock(&kmutex_var) pass by unnoticed until someone compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which case kmutex_t is a real struct). Note that the correct API to call should have been mutex_enter() rather than mutex_lock(). We prevent these kinds of mistakes by making kmutex_t a real structure with only one field. This makes kmutex_t typesafe and it shouldn't have any impact on the generated assembly code. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 11:25:32 -08:00
Brian Behlendorf	058de03caa	Clear cv->cv_mutex when not in use For debugging purposes the condition variables keep track of the mutex used during a wait. The idea is to validate that all callers always use the same mutex. Unfortunately, we have seen cases where the caller reuses the condition variable with a different mutex but in a way which is known to be safe. My reading of the man pages suggests you should not do this and always cv_destroy()/cv_init() a new mutex. However, there is overhead in doing this and it does appear to be allowed under Solaris. To accommodate this behavior cv_wait_common() and __cv_timedwait() have been modified to clear the associated mutex when the last waiter is dropped. This ensures that while the condition variable is in use the incorrect mutex case is detected. It also allows the condition variable to be safely recycled without requiring the overhead of a cv_destroy()/cv_init() as long as it isn't currently in use. Finally, spin lock cv->cv_lock was removed because it is not required. When the condition variable is used properly the caller will always be holding the mutex so the spin lock is redundant. The lock was originally added because I expected to need to protect more than just the cv->cv_mutex. It turns out that was not the case. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-29 11:02:34 -08:00
Ned Bass	00ba7ef900	Give ENOTSUP a valid user space error value The ZFS module returns ENOTSUP for several error conditions where an operation is not (yet) supported. The SPL defined ENOTSUP in terms of ENOTSUPP, but that is an internal Linux kernel error code that should not be seen by user programs. As a result the zfs utilities print a confusing error message if an unsupported operation is attempted: internal error: Unknown error 524 Aborted This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with user space. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-11-10 13:25:49 -08:00
Brian Behlendorf	a50cede388	Linux 2.6.36 compat, wrap RLIM64_INFINITY As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h. This is handled by conditionally defining RLIM64_INFINITY in the SPL only when the kernel does not provide it.	2010-11-09 13:28:55 -08:00
Brian Behlendorf	8294c69bb7	Clear owner after dropping mutex It's important to clear mp->owner after calling mutex_unlock() because when CONFIG_DEBUG_MUTEXES is defined the mutex owner is verified in mutex_unlock(). If we set it to NULL this check fails and the lockdep support is immediately disabled.	2010-11-05 11:52:30 -07:00
Ricardo M. Correia	a68d91d770	atomic_*_*_nv() functions need to return the new value atomically. A local variable must be used for the return value to avoid a potential race once the spin lock is dropped. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-09-17 16:03:25 -07:00
Brian Behlendorf	8371f981f1	Add list_link_replace() function The list_link_replace() function will swap a new item into the place of an old item in a list. It is the caller's responsibility to ensure all lists involved are locked properly.	2010-08-27 14:23:48 -07:00
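The race fixed in a68d91d770 can be illustrated with a userspace sketch. It assumes a lock-based implementation and substitutes a pthread mutex for the kernel spin lock; `sketch_atomic_lock` and `sketch_atomic_inc_32_nv()` are hypothetical names, not the SPL's.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the spin lock that serializes the emulated atomics. */
static pthread_mutex_t sketch_atomic_lock = PTHREAD_MUTEX_INITIALIZER;

/* Increment *target and return the value this call produced. */
static uint32_t sketch_atomic_inc_32_nv(volatile uint32_t *target)
{
    uint32_t nv;

    assert(target != NULL);
    pthread_mutex_lock(&sketch_atomic_lock);
    nv = ++(*target);          /* capture the new value under the lock */
    pthread_mutex_unlock(&sketch_atomic_lock);

    /*
     * Return the local copy. Re-reading *target here would race with
     * other threads that may have modified it after the unlock.
     */
    return nv;
}
```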
Brian Behlendorf	d85e28ad69	Add MUTEX_NOT_HELD() function Simply implement the missing MUTEX_NOT_HELD() function using the !MUTEX_HELD construct.	2010-08-27 14:23:48 -07:00
Brian Behlendorf	2b3543025c	Stub out kmem cache defrag API At some point we are going to need to implement the kmem cache move callbacks to allow for kmem cache defragmentation. This commit simply lays a small part of the API ground work, it does not actually implement any of this feature. This is safe for now because the move callbacks are just an optimization. Even if they are registered we don't ever really have to call them.	2010-08-27 14:23:42 -07:00
Brian Behlendorf	8dbd3fbd5e	Add missing atomic functions These functions were not previously needed so they were not added. Now they are so add the full set. atomic_inc_32_nv() atomic_dec_32_nv() atomic_inc_64_nv() atomic_dec_64_nv()	2010-08-27 13:02:55 -07:00
Li Wei	4be55565fe	Fix stack overflow in vn_rdwr() due to memory reclaim Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp mask we may enter memory reclaim during IO. In this case shrink_slab() entered another file system which is notoriously hungry for stack. This additional stack usage may cause a stack overflow. This patch removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The original mask is then restored at vn_close() time. Hats off to the loop driver which does something similar for the same reason. [...] shrink_slab+0xdc/0x153 try_to_free_pages+0x1da/0x2d7 __alloc_pages+0x1d7/0x2da do_generic_mapping_read+0x2c9/0x36f file_read_actor+0x0/0x145 __generic_file_aio_read+0x14f/0x19b generic_file_aio_read+0x34/0x39 do_sync_read+0xc7/0x104 vfs_read+0xcb/0x171 :spl:vn_rdwr+0x2b8/0x402 :zfs:vdev_file_io_start+0xad/0xe1 [...] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-12 09:34:33 -07:00
Ned Bass	46aa7b3939	Correctly handle rwsem_is_locked() behavior A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was backported to RHEL5 as of kernel 2.6.18-190.el5. Details can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=526092 The race condition was fixed in the kernel by acquiring the semaphore's wait_lock inside rwsem_is_locked(). The SPL worked around the race condition by acquiring the wait_lock before calling that function, but with the fix in place it must not do that. This commit implements an autoconf test to detect whether the fixed version of rwsem_is_locked() is present. The previous version of rwsem_is_locked() was an inline static function while the new version is exported as a symbol which we can check for in module.symvers. Depending on the result we correctly implement the needed compatibility macros for proper spinlock handling. Finally, we do the right thing with spin locks in RW_*_HELD() by using the new compatibility macros. We only acquire the semaphore's wait_lock if it is calling a rwsem_is_locked() that does not itself try to acquire the lock. Some new overhead and a small harmless race is introduced by this change. This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release the wait_lock twice: once for the call to rwsem_is_locked() and once for the call to rw_owner(). This can't be avoided if calling a rwsem_is_locked() that takes the wait_lock, as it will in more recent kernels. The other case which only occurs in legacy kernels could be optimized by taking the lock only once, as was done prior to this commit. However, I decided that the performance gain probably wasn't significant enough to justify the messy special cases required. The function spl_rw_get_owner() was only used to enable the afore-mentioned optimization. Since it is no longer used, I removed it. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-10 16:43:00 -07:00
Brian Behlendorf	10129680f8	Ensure kmem_alloc() and vmem_alloc() never fail The Solaris semantics for kmem_alloc() and vmem_alloc() are that they must never fail when called with KM_SLEEP. They may only fail if called with KM_NOSLEEP otherwise they must block until memory is available. This is quite different from how the Linux memory allocators work, under Linux a memory allocation failure is always possible and must be dealt with. At one point in the past the kmem code did properly implement this behavior, however as the code evolved this behavior was overlooked in places. This patch goes through all three implementations of the kmem/vmem allocation functions and ensures that they will all block in the KM_SLEEP case when memory is not available. They may still fail in the KM_NOSLEEP case in which case the caller is responsible for handling the failure. Special care is taken in vmalloc_nofail() to avoid thrashing the system on the virtual address space spin lock. The down side of course is if you do see a failure here, which is unlikely for 64-bit systems, your allocation will delay for an entire second. Still this is preferable to locking up your system and it is the best we can do given the constraints. Additionally, the code was cleaned up to be much more readable and comments were added to describe the various kmem-debug-* configure options. The default configure options remain: "--enable-debug-kmem --disable-debug-kmem-tracking"	2010-07-26 15:47:55 -07:00
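The KM_SLEEP / KM_NOSLEEP contract described in 10129680f8 amounts to a retry loop around an allocator that is allowed to fail. The sketch below is a userspace illustration under stated assumptions: the flag values, `sketch_flaky_alloc()` (a test allocator that fails a configurable number of times) and `sketch_backoff()` (which only counts calls where real code would sleep) are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_KM_SLEEP   0x0   /* may block, must not fail */
#define SKETCH_KM_NOSLEEP 0x1   /* must not block, may fail */

static int sketch_failures_left;   /* attempts that should still fail */
static int sketch_backoffs;        /* how often the caller had to wait */

/* Demo allocator: fails while sketch_failures_left is positive. */
static void *sketch_flaky_alloc(size_t size)
{
    if (sketch_failures_left > 0) {
        sketch_failures_left--;
        return NULL;
    }
    return malloc(size);
}

/* Demo back-off: real code would sleep here before retrying. */
static void sketch_backoff(void)
{
    sketch_backoffs++;
}

/* Allocate size bytes; only the NOSLEEP case may return NULL. */
static void *sketch_kmem_alloc(size_t size, int flags)
{
    void *ptr;

    assert(size > 0);
    while ((ptr = sketch_flaky_alloc(size)) == NULL) {
        if (flags & SKETCH_KM_NOSLEEP)
            return NULL;        /* caller must handle the failure */
        sketch_backoff();       /* SLEEP case: wait, then try again */
    }
    return ptr;
}
```

Backing off between attempts, rather than spinning, is the same trade-off the commit message describes for vmalloc_nofail(): a failed attempt costs a delay, but the system is not thrashed while memory is unavailable.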
Ricardo M. Correia	15b52c083e	Fix max_ncpus definition. It was being defined as the constant 64 and at first I changed it to be NR_CPUS instead. However, NR_CPUS can be a large value on recent kernels (4096), and this may cause too large kmem allocations to happen. Therefore, now we use num_possible_cpus(), which should return a (typically) small value which represents the maximum number of CPUs that can be brought online in the running hardware (this value is determined at boot time by arch-specific kernel code). Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 15:49:25 -07:00
Ricardo M. Correia	81672c0122	Display DEBUG keyword during module load when --enable-debug is used. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 15:31:03 -07:00
Ricardo M. Correia	9dd5d138b2	Fix bcopy() to allow memory area overlap Under Solaris bcopy() allows overlapping memory areas so we must use memmove() instead of memcpy(). Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 13:48:53 -07:00
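A short sketch of the point made in 9dd5d138b2: a bcopy()-style wrapper has to be built on memmove() for overlapping ranges to be copied correctly. The name `sketch_bcopy()` is hypothetical and chosen to avoid clashing with the libc bcopy().

```c
#include <assert.h>
#include <string.h>

/*
 * bcopy()-style copy: source first, destination second. memmove() is
 * defined for overlapping regions, whereas memcpy() is not.
 */
static void sketch_bcopy(const void *src, void *dest, size_t size)
{
    assert(src != NULL && dest != NULL);
    memmove(dest, src, size);
}
```

For example, shifting the first four bytes of "abcdef" two positions to the right within the same buffer yields "ababcd"; the same call through memcpy() would have undefined behavior.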
Ricardo M. Correia	22cd0f19b1	Fix compilation error due to undefined ACCESS_ONCE macro. When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes store the owner for debugging purposes, therefore the SPL will enable HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the owner, and this macro is not defined in the RHEL5 kernel, therefore we define it ourselves in include/linux/compiler_compat.h. Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-20 13:47:52 -07:00
Brian Behlendorf	55abb0929e	Split <sys/debug.h> header To avoid symbol conflicts with dependent packages the debug header must be split into several parts. The <sys/debug.h> header now only contains the Solaris macros such as ASSERT and VERIFY. The spl-debug.h header contains the spl specific debugging infrastructure and should be included by any package which needs to use the spl logging. Finally the spl-trace.h header contains internal data structures only used for the log facility and should not be included by anything but spl-debug.c. This way dependent packages can include the standard Solaris headers without picking up any SPL debug macros. However, if the dependent package wants to integrate with the SPL debugging subsystem it can then explicitly include spl-debug.h. Along with this change I have dropped the CHECK_STACK macros because the upstream Linux kernel now has much better stack depth checking built in and we don't need this complexity. Additionally SBUG has been replaced with PANIC and provided as part of the Solaris macro set. While the Solaris version is really panic() that conflicts with the Linux kernel so we'll just have to make do with PANIC. It should rarely be called directly, the preferred usage would be an ASSERT or VERIFY. There's lots of change here but this cleanup was overdue.	2010-07-20 13:29:35 -07:00
Brian Behlendorf	82b8c8fa64	Proposed fix for low memory ZFS deadlocks Deadlocks in the zvol were observed when one of the ZFS threads performing IO tries to allocate memory while the system is low on memory. The low memory condition causes dirty pages to be synced to the zvol but this can't progress because the original thread is blocked waiting on a memory allocation. Thus we end up deadlocking. A proper solution proposed by Wizeman is to change KM_SLEEP from GFP_KERNEL to GFP_NOFS. This will prevent the memory allocation which is trying to allocate memory from forcing a sync to the zvol in shrink_page_list()->pageout(). The down side to all of this is that we are using a pretty big hammer by changing KM_SLEEP. This change means ALL of the zfs memory allocations will be unable to trigger dirty data to be synced. The caller still should be able to reclaim memory from the various slab caches. We will be totally dependent on other kernel processes which happen to be running and a small number of asynchronous reclaim threads to trigger the reclaim of dirty data pages. This should be OK but I think we may see some slightly longer allocation times when under memory pressure. We shall see.	2010-07-13 21:30:56 -07:00
Brian Behlendorf	a4bfd8ea1b	Add __divdi3(), remove __udivdi3() kernel dependency Up until now no SPL consumer attempted to perform signed 64-bit division so there was no need to support this. That has now changed so I am adding 64-bit division support for 32-bit platforms. The signed implementation is based on the unsigned version. Since there have been several bug reports in the past concerning correct 64-bit division on 32-bit platforms I added some long overdue regression tests. Much to my surprise the unsigned 64-bit division regression tests failed. This was surprising because __udivdi3() was implemented by simply calling div64_u64() which is provided by the kernel. This meant that the Linux kernel's 64-bit division algorithm on 32-bit platforms was flawed. After some investigation this turned out to be exactly the case. Because of this I was forced to abandon the kernel helper and instead to fully implement 64-bit division in the spl. There are several published implementations out there on how to do this properly and I settled on one proposed in the book Hacker's Delight. Their proposed algorithm is freely available without restriction and I have just modified it to be Linux kernel friendly. The updated implementation now passes all the unsigned and signed regression tests. This should be functional, but not fast, which is good enough for our purposes. If you want fast too I'd strongly suggest you upgrade to a 64-bit platform. I have also reported the kernel bug and we'll see if we can't get it fixed upstream.	2010-07-13 16:44:02 -07:00
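The structure described in a4bfd8ea1b, a signed 64-bit divide layered on an unsigned one, can be sketched as below. This is not the Hacker's Delight routine the commit refers to: for brevity it substitutes plain bit-by-bit restoring (shift-and-subtract) division, which is slower but easy to check. The `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 64-bit division by shift-and-subtract; d must be non-zero. */
static uint64_t sketch_udiv64(uint64_t n, uint64_t d)
{
    uint64_t q = 0, r = 0;
    int i;

    assert(d != 0);
    for (i = 63; i >= 0; i--) {
        /* Remember the bit shifted out of r so large d is handled. */
        uint64_t carry = r >> 63;

        r = (r << 1) | ((n >> i) & 1);
        if (carry || r >= d) {
            r -= d;                 /* wraps correctly when carry is set */
            q |= (uint64_t)1 << i;
        }
    }
    return q;
}

/* Signed division built on the unsigned version (truncates toward 0). */
static int64_t sketch_div64(int64_t n, int64_t d)
{
    /* Take magnitudes in unsigned arithmetic to avoid signed overflow. */
    uint64_t un = n < 0 ? (uint64_t)0 - (uint64_t)n : (uint64_t)n;
    uint64_t ud = d < 0 ? (uint64_t)0 - (uint64_t)d : (uint64_t)d;
    uint64_t uq = sketch_udiv64(un, ud);

    /* The quotient is negative only when the operand signs differ. */
    if ((n < 0) != (d < 0))
        return (int64_t)((uint64_t)0 - uq);
    return (int64_t)uq;
}
```

A regression test in the spirit of the commit would compare these results against native 64-bit division over boundary values such as 0, 1, 2^32 ± 1, and the largest representable operands.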
Ned Bass	f0d8bb26b4	Implementation of the TQ_FRONT flag. Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker threads pull tasks from this high priority queue before the default pending queue. Executing tasks out of FIFO order potentially breaks taskq_lowest_id() if we do not preserve the ordering of the work list by taskqid. Therefore, instead of always appending to the work list, we search for the appropriate place to insert a task. The common case is to append to the list, so we make this operation efficient by searching the work list in reverse order. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-07-01 10:59:38 -07:00
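The ordered insert described in f0d8bb26b4 can be sketched with a circular doubly linked list. The types and names below are hypothetical and far simpler than the real taskq structures; the point is only that searching from the tail makes the common append case cheap while keeping the list sorted by task id.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_task {
    unsigned id;                        /* monotonically assigned task id */
    struct sketch_task *prev, *next;
};

/* Initialize an empty circular list; head is a sentinel, not a task. */
static void sketch_list_init(struct sketch_task *head)
{
    head->id = 0;
    head->prev = head->next = head;
}

/* Insert t so the list stays sorted by ascending id. */
static void sketch_work_insert(struct sketch_task *head, struct sketch_task *t)
{
    struct sketch_task *pos = head->prev;   /* start at the tail */

    assert(t != NULL && t != head);
    /* Walk backwards past every task with a larger id. */
    while (pos != head && pos->id > t->id)
        pos = pos->prev;

    /* Link t immediately after pos. */
    t->prev = pos;
    t->next = pos->next;
    pos->next->prev = t;
    pos->next = t;
}

/* Lowest id on a non-empty list is always at the front. */
static unsigned sketch_lowest_id(const struct sketch_task *head)
{
    assert(head->next != head);
    return head->next->id;
}
```

When tasks arrive in id order the loop body never runs, so the insert is a constant-time append; only a task executed out of FIFO order pays for a backwards walk.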


265 Commits