mirror_zfs/include/sys
Brian Behlendorf e2dcc6e2b8 Emergency slab objects
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs.  The issue is that the Linux kernel does
not honor the flags passed to __vmalloc().  This makes it unsafe
to use in a writeback context.  Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accommodate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code.  This work has been planned
for the 0.7.0 release but in the meantime we want to improve the
SPL slab implementation to accommodate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue.  This doesn't rule out the possibility
of the worker thread deadlocking.  However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available.  The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will time out.  When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flag.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

Because the emergency allocation cannot deadlock, the critical
thread keeps making progress even if the worker thread is deadlocked.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.
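
The control flow described above can be sketched in user-space C.
This is an illustration only, not the SPL code: sketch_cache,
sketch_wait_for_slab() and sketch_alloc() are hypothetical names, the
timed wait on the wait queue is reduced to checking a flag, and
malloc() stands in for both the slab path and kmalloc() with GFP_NOIO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a slab cache; not the SPL structure. */
struct sketch_cache {
	size_t   obj_size;    /* size of one object */
	bool     slab_ready;  /* set once the async worker has added a slab */
	unsigned emerg_cur;   /* emergency objects currently in use */
	unsigned emerg_max;   /* historical worst case */
};

/*
 * Stand-in for blocking on the wait queue with a timeout.  Returns true
 * if the new slab arrived in time, false if the wait timed out.
 */
static bool
sketch_wait_for_slab(const struct sketch_cache *c)
{
	return (c->slab_ready);
}

/* Allocate one object; *emergency reports which path was taken. */
void *
sketch_alloc(struct sketch_cache *c, bool *emergency)
{
	void *obj;

	if (sketch_wait_for_slab(c)) {
		/* Normal path: the object comes from the new slab. */
		*emergency = false;
		return (malloc(c->obj_size));
	}

	/* Timed out: fall back to a single minimally sized object. */
	*emergency = true;
	obj = malloc(c->obj_size);
	if (obj != NULL) {
		c->emerg_cur++;
		if (c->emerg_cur > c->emerg_max)
			c->emerg_max = c->emerg_cur;
	}
	return (obj);
}
```

The caller makes progress on either path; only the source of the
object differs.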

These emergency allocations will likely be slow since they require
contiguous pages.  However, their use should be rare so the impact
is expected to be minimal.  If that turns out not to be the case in
practice, further optimizations are possible.

One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed.  If they accumulate on a system and the list
grows, freeing objects will become more expensive.  This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.
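
To make the cost concrete, here is a minimal user-space sketch of
list-based tracking.  The names sketch_emerg, sketch_emerg_list,
sketch_emerg_track() and sketch_emerg_untrack() are hypothetical and
do not correspond to the SPL implementation.  Every free has to walk
the list to find out whether the object was an emergency allocation,
so the cost is linear in the number of outstanding emergency objects.

```c
#include <stddef.h>
#include <stdlib.h>

/* One tracking node per outstanding emergency object (hypothetical). */
struct sketch_emerg {
	struct sketch_emerg *next;
	void                *obj;
};

struct sketch_emerg_list {
	struct sketch_emerg *head;
	size_t               count;
};

/* Record obj as an emergency object.  Returns 0 on success, -1 on ENOMEM. */
int
sketch_emerg_track(struct sketch_emerg_list *l, void *obj)
{
	struct sketch_emerg *e = malloc(sizeof (*e));

	if (e == NULL)
		return (-1);
	e->obj = obj;
	e->next = l->head;
	l->head = e;
	l->count++;
	return (0);
}

/*
 * Called on free.  Walks the whole list looking for obj.  Returns 1 and
 * drops the node if obj was an emergency object, 0 otherwise.
 */
int
sketch_emerg_untrack(struct sketch_emerg_list *l, void *obj)
{
	struct sketch_emerg **pp;

	for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->obj == obj) {
			struct sketch_emerg *e = *pp;

			*pp = e->next;
			free(e);
			l->count--;
			return (1);
		}
	}
	return (0);
}
```

A hash keyed on the object address would turn the walk in
sketch_emerg_untrack() into an expected constant-time lookup.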

Additionally, these emergency objects could be repacked into existing
slabs as objects are freed if the kmem_cache_set_move() functionality
were implemented.  See issue https://github.com/zfsonlinux/spl/issues/26
for full details.  This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end.  The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case.  These values should give us a good idea
of how often these objects are needed.  Based on these values under
real use cases we can tune the default behavior.
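
A small user-space helper for reading these two values might look like
the sketch below.  It assumes only what is stated above, namely that
'emerg' and 'max' are the last two whitespace-separated columns of a
cache's line.  The function name sketch_parse_emerg() is hypothetical
and the layout of the other columns is deliberately not relied on.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Extract the last two numeric columns ('emerg' and 'max') from one line
 * of /proc/spl/kmem/slab.  Returns 0 on success, -1 if the line is too
 * long, has fewer than two columns, or the columns are not numbers
 * (for example a header line).
 */
int
sketch_parse_emerg(const char *line, unsigned long *emerg, unsigned long *max)
{
	char buf[512];
	char *tok, *prev = NULL, *last = NULL, *end;

	if (strlen(line) >= sizeof (buf))
		return (-1);
	strcpy(buf, line);

	/* Remember the final two tokens on the line. */
	for (tok = strtok(buf, " \t\n"); tok != NULL;
	    tok = strtok(NULL, " \t\n")) {
		prev = last;
		last = tok;
	}
	if (prev == NULL)
		return (-1);

	*emerg = strtoul(prev, &end, 10);
	if (end == prev || *end != '\0')
		return (-1);
	*max = strtoul(last, &end, 10);
	if (end == last || *end != '\0')
		return (-1);
	return (0);
}
```

Sampling these values periodically under a real workload would show
how often the emergency path is taken for each cache.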

Lastly, as a side benefit, using a single work queue for the slab
allocations should reduce CPU contention on the global virtual address
space lock.  This should manifest itself as reduced CPU usage for
the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
fm Remove autotools products 2012-08-27 11:46:23 -07:00
fs Remove autotools products 2012-08-27 11:46:23 -07:00
sysevent Remove autotools products 2012-08-27 11:46:23 -07:00
acl_impl.h Public Release Prep 2010-05-17 15:18:00 -07:00
acl.h Add VSA_ACE_* and MAX_ACL_ENTRIES defines 2011-01-27 16:06:09 -08:00
atomic.h atomic_*_*_nv() functions need to return the new value atomically. 2010-09-17 16:03:25 -07:00
attr.h Public Release Prep 2010-05-17 15:18:00 -07:00
bitmap.h Public Release Prep 2010-05-17 15:18:00 -07:00
bootconf.h Public Release Prep 2010-05-17 15:18:00 -07:00
bootprops.h Stub out additional missing headers 2010-06-11 15:57:25 -07:00
buf.h Public Release Prep 2010-05-17 15:18:00 -07:00
byteorder.h Public Release Prep 2010-05-17 15:18:00 -07:00
callb.h Public Release Prep 2010-05-17 15:18:00 -07:00
cmn_err.h Public Release Prep 2010-05-17 15:18:00 -07:00
compress.h Public Release Prep 2010-05-17 15:18:00 -07:00
condvar.h Remove condition variable names 2012-04-06 12:06:19 -07:00
conf.h Public Release Prep 2010-05-17 15:18:00 -07:00
console.h Public Release Prep 2010-05-17 15:18:00 -07:00
cpupart.h Stub out additional missing headers 2010-06-11 15:57:25 -07:00
cpuvar.h Public Release Prep 2010-05-17 15:18:00 -07:00
crc32.h Public Release Prep 2010-05-17 15:18:00 -07:00
cred.h Add crgetfsuid()/crgetfsgid() helpers 2011-03-22 12:18:44 -07:00
ctype.h Public Release Prep 2010-05-17 15:18:00 -07:00
ddi.h Public Release Prep 2010-05-17 15:18:00 -07:00
debug.h Add --enable-debug-log configure option 2012-02-02 11:27:54 -08:00
dirent.h Public Release Prep 2010-05-17 15:18:00 -07:00
disp.h Add kpreempt_[dis|en]able macros in <sys/disp.h> 2012-08-24 15:18:38 -07:00
dkio.h Public Release Prep 2010-05-17 15:18:00 -07:00
dklabel.h Public Release Prep 2010-05-17 15:18:00 -07:00
dnlc.h Add dnlc_reduce_cache() support 2011-04-06 20:06:03 -07:00
dumphdr.h Public Release Prep 2010-05-17 15:18:00 -07:00
efi_partition.h Public Release Prep 2010-05-17 15:18:00 -07:00
errno.h Public Release Prep 2010-05-17 15:18:00 -07:00
extdirent.h Add missing headers 2011-01-27 16:06:09 -08:00
fcntl.h Use Linux flock struct 2011-02-23 14:32:15 -08:00
file.h Add FIGNORECASE define 2011-01-27 16:06:09 -08:00
idmap.h Add missing headers 2011-01-27 16:06:09 -08:00
int_limits.h Public Release Prep 2010-05-17 15:18:00 -07:00
int_types.h Public Release Prep 2010-05-17 15:18:00 -07:00
inttypes.h Public Release Prep 2010-05-17 15:18:00 -07:00
isa_defs.h Define the needed ISA types for ARM 2012-05-03 09:56:15 -07:00
kidmap.h Add missing headers 2011-01-27 16:06:09 -08:00
kmem.h Emergency slab objects 2012-08-27 12:00:42 -07:00
kobj.h Public Release Prep 2010-05-17 15:18:00 -07:00
kstat.h Add basic dynamic kstat support 2012-02-02 11:28:00 -08:00
list.h Add list_link_replace() function 2010-08-27 14:23:48 -07:00
mkdev.h Public Release Prep 2010-05-17 15:18:00 -07:00
mntent.h Public Release Prep 2010-05-17 15:18:00 -07:00
modctl.h Public Release Prep 2010-05-17 15:18:00 -07:00
mode.h Add vn_mode_to_vtype/vn_vtype_to_mode helpers 2011-01-12 11:38:04 -08:00
mount.h Public Release Prep 2010-05-17 15:18:00 -07:00
mutex.h Fix usage of MUTEX macro in mutex_enter_nested 2011-12-13 11:04:21 -08:00
note.h Public Release Prep 2010-05-17 15:18:00 -07:00
open.h Public Release Prep 2010-05-17 15:18:00 -07:00
param.h Correct MAXUID 2011-04-29 13:58:45 -07:00
pathname.h Public Release Prep 2010-05-17 15:18:00 -07:00
policy.h Minor policy interface 2011-01-27 16:06:09 -08:00
pool.h Stub out additional missing headers 2010-06-11 15:57:25 -07:00
priv_impl.h Stub out additional missing headers 2010-06-11 15:57:25 -07:00
proc.h Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process) 2010-06-11 15:57:25 -07:00
processor.h Public Release Prep 2010-05-17 15:18:00 -07:00
pset.h Stub out additional missing headers 2010-06-11 15:57:25 -07:00
random.h Public Release Prep 2010-05-17 15:18:00 -07:00
refstr.h Public Release Prep 2010-05-17 15:18:00 -07:00
resource.h Public Release Prep 2010-05-17 15:18:00 -07:00
rwlock.h Linux 3.2 compat: rw_semaphore.wait_lock is raw 2012-01-11 16:28:05 -08:00
sdt.h Public Release Prep 2010-05-17 15:18:00 -07:00
sid.h Add ksid_index_t and ksid_t types 2011-01-27 16:06:09 -08:00
signal.h Split <sys/debug.h> header 2010-07-20 13:29:35 -07:00
stat.h Public Release Prep 2010-05-17 15:18:00 -07:00
stropts.h Public Release Prep 2010-05-17 15:18:00 -07:00
sunddi.h Remove Solaris module emulation 2012-05-18 13:57:44 -07:00
sunldi.h Public Release Prep 2010-05-17 15:18:00 -07:00
sysdc.h Stub out additional missing headers 2010-06-11 15:57:25 -07:00
sysevent.h Public Release Prep 2010-05-17 15:18:00 -07:00
sysmacros.h PowerPC Compatibility 2012-07-02 09:33:09 -07:00
systeminfo.h Read the /etc/hostid file directly. 2011-06-24 09:58:03 -07:00
systm.h Public Release Prep 2010-05-17 15:18:00 -07:00
t_lock.h Public Release Prep 2010-05-17 15:18:00 -07:00
taskq.h Store copy of tqent_flags prior to servicing task 2011-12-16 16:54:00 -08:00
thread.h PowerPC Compatibility 2012-07-02 09:33:09 -07:00
time.h Allow 64-bit timestamps to be set on 64-bit kernels 2011-12-12 11:06:03 -08:00
timer.h Minor cleanup and Solaris API additions. 2010-06-11 15:57:25 -07:00
tsd.h Prepend spl_ to all init/fini functions 2011-11-11 09:18:28 -08:00
types32.h Public Release Prep 2010-05-17 15:18:00 -07:00
types.h Linux 2.6.39 compat, zlib_deflate_workspacesize() 2011-04-20 14:39:15 -07:00
u8_textprep.h Public Release Prep 2010-05-17 15:18:00 -07:00
uio.h Add xuio_* structures and typedefs. 2010-06-11 15:57:25 -07:00
unistd.h Public Release Prep 2010-05-17 15:18:00 -07:00
utsname.h Public Release Prep 2010-05-17 15:18:00 -07:00
va_list.h Public Release Prep 2010-05-17 15:18:00 -07:00
varargs.h Public Release Prep 2010-05-17 15:18:00 -07:00
vfs_opreg.h Public Release Prep 2010-05-17 15:18:00 -07:00
vfs.h Renamed 'struct fid' for NFS 2011-04-29 12:10:54 -07:00
vmsystm.h Public Release Prep 2010-05-17 15:18:00 -07:00
vnode.h Linux 3.1 compat, kern_path_parent() 2011-11-09 16:51:25 -08:00
zmod.h Prepend spl_ to all init/fini functions 2011-11-11 09:18:28 -08:00
zone.h Public Release Prep 2010-05-17 15:18:00 -07:00