To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory. The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation. Thus we end
up deadlocking.
A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS. This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().
The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP. This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced. The caller still should be able to reclaim memory from
the various slab caches. We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages. This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.
We shall see.
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
It turns out Solaris incidentally includes kstat.h from kmem.h. As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h. To make like easier for everyone I do the same.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing. This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.
Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations. Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total. To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.
The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available. On 32-bit archs this may
not be true, and actually that's just fine. In that case the kernel will
will never be able to allocate more the 32-bits worth anyway. So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time. To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.
From linux-2.6.31 mm/page_alloc.c:1166
/*
* __GFP_NOFAIL is not to be used in new code.
*
* All __GFP_NOFAIL callers should be fixed so that they
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with
* __GFP_NOFAIL.
*/
WARN_ON_ONCE(order > 1);
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.
min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels. For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it. To summerize:
1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly. This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.
2) --enable-debug-kmem=yes by default. This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload. Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.
3) --enable-debug-kmem-tracking=no by default. This option was added
to provide a configure option to enable to detailed memory allocation
tracking. This support was always there but you had to know where to
turn it on. By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.
4) --enable-debug-kstat removed. After further reflection I can't see
why you would ever really want to turn this support off. It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c. We can now always assume the top level directory
will be there.
5) --enable-debug-callb removed. This never really did anything, it was
put in provisionally because it might have been needed. It turns out
it was not so I am just removing it to prevent confusion.
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address. The function name itself can them become a
define which calls a function pointer. This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel. This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.
This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name(). There are
autoconf checks to detect if the symbol is exported and if so to
use it directly. We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported. Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement. The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.
Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
of $LINUX because Module.symvers is a build product. When
$LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
$LINUX/include search paths to allow proper compilation when
the kernel target build directory is not the source directory.
Because vmem_free() was implemented as a macro using the ','
operator to evaluate both arguments and we performed the free
before evaluating size we would deference the free'd pointer.
To resolve the problem we just invert the ordering and evaluate
size first just as if it was evaluated by the caller when being
passed to this function. This ensure that if the caller is
doing something reckless like performing an assignment as
part of the size argument we still perform it and it simply
doesn't get removed by the macro. Oh course nobody should
be doing this sort of thing, but just in case.
- The previous magazine ageing sceme relied on the on_each_cpu()
function to call spl_magazine_age() on each cpu. It turns out
this could deadlock with do_flush_tlb_all() which also relies
on the IPI based on_each_cpu(). To avoid this problem a per-
magazine delayed work item is created and indepentantly
scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
spl_kmem_cache_t, they were hold overs from previous cleanup.
- struct work_struct work
- struct timer_list timer
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added whichs allows callers of
spl_slab_reclaim() to cap the number of slabs reclaimed.
On Solaris all eligible slabs are always reclaimed, and this
is still the default behavior. However, I suspect that is
not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
slab/object free lists. Soft lockups were observed when
freeing large numbers of vmalloc'd slabs/objets.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
incorrect 'break' to 'continue' to ensure all slabs are
checked.
- spl_cache_age() reworked to avoid a deadlock with
do_flush_tlb_all() which occured because we slept waiting
for completion in spl_cache_age(). To waiting for magazine
reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
back to using for_each_online_cpu() instead of the
spl_on_each_cpu() approach which was of course a bad idea
due to memory allocations which Ricardo pointed out.
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree. These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.
When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered. When a minor is unregistered we use the vnode
interface to unlink the special file.
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
if the older 4 argument version of on_each_cpu() should be
used or the new 3 argument version. The retry argument was
dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers. The old way this
worked was to pass a data point when initialized the workqueue.
The new API assumed the work item is embedding in a structure
and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
type checked in the bit operations. This silences the warnings.
- Updated autogen products and splat tests accordingly
- Added slab work queue task which gradually ages and free's slabs
from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
maximum number of objects without create to large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock. We
now unlink the objects and slabs from the cache itself and attach
them to a private work list. The contents of the list are then
subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performace optimizations by minimize the usage of the large
per-cache skc_lock. This includes the addition of KMC_BIT_REAPING
bit mask which is used to prevent concurrent reaping, and to defer
new slab creation when reaping is occuring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
destroyed, this is used to catch any task accessing the cache while
it is being destroyed.
- Add comments to all the functions and additional comments to try
and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
rigerously stress the cache implementation and look for any problems.
This includes correctness and performance tests.
- Updated portable work queue interfaces
That said when working with a finite resource like memory failure really
is always a possibility. It would be far better longer term if the ZFS
code could be weened off this assumption and properly handle the cases
where an allocation fails. Still I've applied the patch to spl-0.3.4
since this layer is supposed to emulate Solaris as closely as possible.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@164 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
already supprt atomic64_t types.
* spl-07-kmem-cleanup.patch
This moves all the debugging code from sys/kmem.h to spl-kmem.c, because
the huge macros were hard to debug and were bloating functions that
allocated memory. I also fixed some other minor problems, including
32-bit fixes and a reported memory leak which was just due to using the
wrong free function.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@163 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
spl-05-div64.patch
This is a much less intrusive fix for undefined 64-bit division symbols
when compiling the DMU in 32-bit kernels.
* spl-06-atomic64.patch
This is a workaround for 32-bit kernels that don't have atomic64_t.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@162 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
at a time as I audit it. This chunk finishes moving the SPL entirely
off the linux slab on to the SPL implementation. It differs slightly
from the proposed version in that the spl continues to export to
all the Solaris types and functions. These do conflict with the
Linux slab so a module usings these interfaces must not include the
SPL slab if they also intend to use the linux slab. Or they must
explcitly #undef the macros which remap the functioin to their
spl_* equivilants.
A nice side of effect of dropping the entire linux slab is we
don't need to autoconf checks anymore. They kept messing with
the slab API endlessly!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@148 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
based of the spl_kmem_obj_t tacked on the end of each object.
This actually isn't so back because we are now allocing large
chunks for the slab and partitioning it ourselves. So there's
not a ton of wasted space. We may suffer a performance hit
however due to alignment issues.
- Remove remaining depenancies on the linux slab implementation.
We're standing on our own now for better or worse.
- Rework slabs to be either kmem or vmem based. If neither
KMC_VMEM of KMC_KMEM are specified we make a decent guess
about what will work best for their based on the object
size. Additionally we provide a kmem_virt() function caller
can use to see if they have a virtual or physical address.
- Minor fixups in the test suite.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@141 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
cycle count which was costing me overhead. It was hurting
performance pretty badly for heavily used caches. I'm also
thinking the hash may be hurting me as well and it might
be worth sticking a pointer in to a little space after the
alloced object.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@140 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
based by vmalloc()'ed memory. I now alloc a slab which is
roughly 32*spl_obj_size and in this block of memory I place
the slab descriptor, slab object descriptors, and objects
themselves. This greatly reduces vmalloc lock contention.
Still some minor cleanup remains and fine tuning but
it's working pretty well.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@139 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
well for the expected workloads. Improvement in this commit include:
- Added DEBUG_KMEM_TRACKING #define which can optionally be set
when DEBUG_KMEM is defined to do per allocation tracking. This
allows us to get all the lightweight kmem debugging enabled by
default which is pretty light weight, and only when looking
for a memory leak we can briefly enable the per alloc tracking.
- Added set_normalized_timespec() in to SPL to simply using
the timespec() primatives from within a module.
- Added per-spinlock cycle counters to the slab in an attempt
to run down a lock contention issue. The contended lock
was in vmalloc() but I'm going to leave the cycle counters
in place for a little while until I'm convinced there arn't
other locking improvement possible in the slab.
- Added a proc interface to the slab to export per slab
cache statistics to /proc/spl/kmem/slab for analysis.
- Reworked spl_slab_alloc() function to allocate from kmem for
small allocation and vmem for large allocations. This improved
things considerably but futher work is needed.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@138 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
when repopulating it. Plus I fixed a few more suble races in
that part of the code which were catching me. Finally I fixed
a small race in kmem_test8.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@137 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
factor of 10x improvement on SMP system due to reduced lock contention.
This may put me in the ballpark of what is needed. We can still further
improve things on NUMA systems by creating an additional L3 cache per
memory node instead of the current global pool. With luck this won't
be needed. I should also take another look at the locking now that
everything is working. There's a good chance I can tighten it up a
little bit and improve things a little more.
kmem_lock: time (sec) slabs objs hash
kmem_lock: tot/max/calc tot/max/calc size/depth
kmem_lock: 0.000999926 6/6/1 192/192/32 32768/0
kmem_lock: 0.000999926 4/4/2 128/128/64 32768/0
kmem_lock: 0.000999926 4/4/4 128/128/128 32768/0
kmem_lock: 0.000999926 4/4/8 128/128/256 32768/0
kmem_lock: 0.000999926 4/4/16 128/128/512 32768/0
kmem_lock: 0.000999926 4/4/32 128/128/1024 32768/0
kmem_lock: 0.000999926 4/4/64 128/128/2048 32768/0
kmem_lock: 0.000999926 8/8/128 256/256/4096 32768/0
kmem_lock: 0.003999704 24/23/256 768/736/8192 32768/1
kmem_lock: 0.012999038 44/41/512 1408/1312/16384 32768/1
kmem_lock: 0.051996153 96/93/1024 3072/2976/32768 32768/2
kmem_lock: 0.181986536 187/184/2048 5984/5888/65536 32768/3
kmem_lock: 0.655951469 342/339/4096 10944/10848/131072 32768/4
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@136 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
to be overly clever and the context switch when the semaphore was busy
was destroying performance. Converting to a simple spin lock bough me
a factor of 50 or so. That said it's still not good enough. Tests
show bad performance and we are still CPU bound. The logical fix is
I need to implement per-cpu hot caches to minimize the SMP contention.
Linux and Solaris both have this, I was hoping to do without but it
looks like that's not to be.
kmem_lock: time (sec) slabs objs hash
kmem_lock: tot/max/calc tot/max/calc size/depth
kmem_lock: 0.022000000 7/6/64 224/177/2048 32768/1
kmem_lock: 0.039000000 13/13/128 416/404/4096 32768/1
kmem_lock: 0.079000000 23/21/256 736/672/8192 32768/1
kmem_lock: 0.158000000 48/47/512 1536/1504/16384 32768/1
kmem_lock: 0.345000000 105/105/1024 3360/3358/32768 32768/2
kmem_lock: 0.760000000 202/200/2048 6464/6400/65536 32768/3
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@135 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
longer be based on the linux slab but to be its own complete
implementation. The new slab behaves much more like the
Solaris slab than the Linux slab.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@132 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
working on this branch for the next few days I suggested you work
off of the 0.3.1 tag. The following changes are fairly extensive
and are designed to make the SPL compatible with all kernels in
the range of 2.6.18-2.6.25. There were 13 relevant API changes
between these releases and I have added the needed autoconf tests
to check for them. However, this has not all been tested extensively.
I'll sort of the breakage on Fedora Core 9 and RHEL5 this week.
SPL_AC_TYPE_UINTPTR_T
SPL_AC_TYPE_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_DESTROY_INT
SPL_AC_ATOMIC_PANIC_NOTIFIER
SPL_AC_3ARGS_INIT_WORK
SPL_AC_2ARGS_REGISTER_SYSCTL
SPL_AC_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_CREATE_DTOR
SPL_AC_3ARG_KMEM_CACHE_CREATE_CTOR
SPL_AC_SET_SHRINKER
SPL_AC_PATH_IN_NAMEIDATA
SPL_AC_TASK_CURR
SPL_AC_CTL_UNNUMBERED
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@119 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
compiled out when doing performance runs.
- Bite the bullet and fully autoconfize the debug options in the configure
time parameters. By default all the debug support is disable in the core
SPL build, but available to modules which enable it when building against
the SPL. To enable particular SPL debug support use the follow configure
options:
--enable-debug Internal ASSERTs
--enable-debug-kmem Detailed memory accounting
--enable-debug-mutex Detailed mutex tracking
--enable-debug_kstat Kstat info exported to /proc
--enable-debug-callb Additional callb debug
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@111 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
may not fail. To get this behavior I'd added a retry to the shim layer
even though it is abusive to the VM, at least it should prevent the crash.
Additionally I added a proc counter so I can easily check how often this
is happening. It should be fairly rare, but likely will get worse and
worse the longer the machine has been up.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@104 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
- Detailed kmem memory allocation tracking. We can now get on
spl module unload a list of all memory allocations which were
not free'd and where the original alloc was. E.g.
SPL: 15554:632:(spl-kmem.c:442:kmem_fini()) kmem leaked 90/319332 bytes
SPL: 15554:648:(spl-kmem.c:451:kmem_fini()) address size data func:line
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff8100734b68b8 32 0100000001005a5a __spl_mutex_init:70
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff8100734b6148 13 &tl->tl_lock __spl_mutex_init:74
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff81007ac43730 32 0100000001005a5a __spl_mutex_init:70
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff81007ac437d8 13 &tl->tl_lock __spl_mutex_init:74
- Shift to using rwsems in kmem implmentation, to simply locking and
improve concurency.
- Shift to using rwsems in mutex implementation, additionally ensure we
never sleep in the init function if non-zero preempt_count or
interrupts are disabled as can happen in a slab cache ctor/dtor.
- Other minor formating fixes and such.
TODO:
- Finish the vmem memory allocation tracking
- Vet all other SPL primatives for potential sleeping during *_init. I
suspect the rwlock implemenation does this and should be fixes just
like the mutex implemenation.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@95 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
crashes but it's not clear to me yet if these are a problem with
the mutex implementation or ZFSs usage of it.
Minor taskq fixes to add new tasks to the end of the pending list.
Minor enhansements to the debug infrastructure.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@94 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
do not have __GFP_ZERO set. Once the memory is allocated
then zero out the memory if __GFP_ZERO is passed to
__vmem_alloc.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@88 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
- Replacing all BUG_ON()'s with proper ASSERT()'s
- Using ENTRY,EXIT,GOTO, and RETURN macro to instument call paths
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@78 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
it should be dropped to one page but in the short term we should be able
to easily live with 4 page allocations.
Fix the nvlist bug, it turns out the user space side of things were
packing the nvlists correctly as little endian, and the kernel space
side of things due to a missing #define were unpacking them as big endian.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@62 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c