Update the the wrapper macros for the memory shrinker to handle
this 4th API change. The callback function now takes a
shrink_control structure. This is certainly a step in the
right direction but it's annoying to have to accomidate yet
another version of the API.
To resolve a potiential filesystem corruption issue a second
argument was added to invalidate_inodes(). This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block. When only the legacy API is available the second
argument will be dropped for compatibility.
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache. After the entries are
pruned any slabs which they may have been using are reaped.
Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count. However, in practice this doesn't need to be
exactly correct. We simply need to reclaim some useful fraction
(but not all) of the cache. The caller can determine if more
needs to be done.
The Linux shrinker has gone through three API changes since 2.6.22.
Rather than force every caller to understand all three APIs this
change consolidates the compatibility code in to the mm-compat.h
header. The caller then can then use a single spl provided
shrinker API which does the right thing for your kernel.
SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask);
SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks);
spl_register_shrinker(&shrinker_struct);
spl_unregister_shrinker(&&shrinker_struct);
spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);
As part of vmalloc() a __pte_alloc_kernel() allocation may occur. This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur. This reclaim can trigger file IO which
can result in a deadlock. This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.
An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3). This code can be properly autoconf'ed
when the upstream bug is fixed.
1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules. This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.
Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address. The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn. Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address. All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
Long term we should try to get this, or another, interface made
available to modules again.
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set. Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.
A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
This fix was long overdue. Most of the ground work was laid long
ago to include the exact function and line number in the error message
which there was an issue with a memory allocation call. However,
probably due to lack of time at the moment that informatin never
made it in to the error message. This patch fixes that and trys
to standardize the kmem debug messages as well.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Remove the kmem_set_warning() hack used by the kmem-splat regression
tests with a per-allocation flag called __GFP_NOWARN. This matches
the lower level linux flag of similar by slightly different function.
The idea is you can then explicitly set this flag on requests where
you know your breaking the max 8k rule but you need/want to do it
anyway.
This is currently used by the regression tests where we intentionally
push things to the limit but don't want the log noise. Additionally,
we are forced to use it in spl_kmem_cache_create() because by default
NR_CPUS is very large and theres no easy way to handle that.
Finally, I've added a stack_dump() call to the warning when it is
trigger to make to clear exactly where the allocation is taking place.
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
Allowing MAX_ORDER-1 sized allocations for kmem based slabs have
been observed to result in deadlocks. To help prvent this limit
max kmem based slab size to MAX_ORDER-3. Just for the record
callers should not be creating slabs like this, but if they do
we should still handle it as safely as we can.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing. This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.
Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations. Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total. To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.
The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available. On 32-bit archs this may
not be true, and actually that's just fine. In that case the kernel will
will never be able to allocate more the 32-bits worth anyway. So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
The big fix here is the removal of kmalloc() in kv_alloc(). It used
to be true in previous kernels that kmallocs over PAGE_SIZE would
always be pages aligned. This is no longer true atleast in 2.6.31
there are no longer any alignment expectations. Since kv_alloc()
requires the resulting address to be page align we no only either
directly allocate pages in the KMC_KMEM case, or directly call
__vmalloc() both of which will always return a page aligned address.
Additionally, to avoid wasting memory size is always a power of two.
As for cleanup several helper functions were introduced to calculate
the aligned sizes of various data structures. This helps ensure no
case is accidentally missed where the alignment needs to be taken in
to account. The helpers now use P2ROUNDUP_TYPE instead of P2ROUNDUP
which is safer since the type will be explict and we no longer count
on the compiler to auto promote types hopefully as we expected.
Always wnforce minimum (SPL_KMEM_CACHE_ALIGN) and maximum (PAGE_SIZE)
alignment restrictions at cache creation time.
Use SPL_KMEM_CACHE_ALIGN in splat alignment test.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time. To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.
From linux-2.6.31 mm/page_alloc.c:1166
/*
* __GFP_NOFAIL is not to be used in new code.
*
* All __GFP_NOFAIL callers should be fixed so that they
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with
* __GFP_NOFAIL.
*/
WARN_ON_ONCE(order > 1);
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.
min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels. For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it. To summerize:
1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly. This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.
2) --enable-debug-kmem=yes by default. This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload. Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.
3) --enable-debug-kmem-tracking=no by default. This option was added
to provide a configure option to enable to detailed memory allocation
tracking. This support was always there but you had to know where to
turn it on. By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.
4) --enable-debug-kstat removed. After further reflection I can't see
why you would ever really want to turn this support off. It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c. We can now always assume the top level directory
will be there.
5) --enable-debug-callb removed. This never really did anything, it was
put in provisionally because it might have been needed. It turns out
it was not so I am just removing it to prevent confusion.
Basically everything we need to monitor the global memory state of
the system is now cleanly available via global_page_state(). The
problem is that this interface is still fairly recent, and there
has been one change in the page state enum which we need to handle.
These changes basically boil down to the following:
- If global_page_state() is available we should use it. Several
autoconf checks have been added to detect the correct enum names.
- If global_page_state() is not available check to see if
get_zone_counts() symbol is available and use that.
- If the get_zone_counts() symbol is not exported we have no choice
be to dynamically aquire it at load time. This is an absolute
last resort for old kernel which we don't want to patch to
cleanly export the symbol.
- Initial SLES testing uncovered a long standing bug in the debug
tracing. The tcd_for_each() macro expected a NULL to terminate
the trace_data[i] array but this was only ever true due to luck.
All trace_data[] iterators are now properly capped by TCD_TYPE_MAX.
- SPLAT_MAJOR 229 conflicted with a 'hvc' device on my SLES system.
Since this was always an arbitrary choice I picked something else.
- The HAVE_PGDAT_LIST case should set pgdat_list_addr to the value stored
at the address of the memory location returned by kallsyms_lookup_name().
- Prior to 2.6.17 there were no *_pgdat helper functions in mm/mmzone.c.
Instead for_each_zone() operated directly on pgdat_list which may or
may not have been exported depending on how your kernel was compiled.
Now new configure checks determine if you have the helpers or not, and
if the needed symbols are exported. If they are not exported then they
are dynamically aquired at runtime by kallsyms_lookup_name().
- Configure check, the div64_64() function was renamed to
div64_u64() as of 2.6.26.
- Configure check, the global_page_state() fuction was introduced
in 2.6.18 kernels. The earlier 2.6.16 based SLES10 must not try
and use it, thankfully get_zone_counts() is still available.
- To simplify debugging poison all symbols aquired dynamically
using spl_kallsyms_lookup_name() with SYMBOL_POISON.
- Add console messages when the user mode helpers fail.
- spl_kmem_init_globals() use bit shifts instead of division.
- When the monotonic clock is unavailable __gethrtime() must perform
the HZ division as an 'unsigned long long' because the SPL only
implements __udivdi3(), and not __divdi3() for 'long long' division
on 32-bit arches.
In the interests of portability I have added a FC10/i686 box to
my list of development platforms. The hope is this will allow me
to keep current with upstream kernel API changes, and at the same
time ensure I don't accidentally break x86 support. This patch
resolves all remaining issues observed under that environment.
1) SPL_AC_ZONE_STAT_ITEM_FIA autoconf check added. As of 2.6.21
the kernel added a clean API for modules to get the global count
for free, inactive, and active pages. The SPL attempts to detect
if this API is available and directly map spl_global_page_state()
to global_page_state(). If the full API is not available then
spl_global_page_state() is implemented as a thin layer to get
these values via get_zone_counts() if that symbol is available.
2) New kmem:vmem_size regression test added to validate correct
vmem_size() functionality. The test case acquires the current
global vmem state, allocates from the vmem region, then verifies
the allocation is correctly reflected in the vmem_size() stats.
3) Change splat_kmem_cache_thread_test() to always use KMC_KMEM
based memory. On x86 systems with limited virtual address space
failures resulted due to exhaustig the address space. The tests
really need to problem exhausting all memory on the system thus
we need to use the physical address space.
4) Change kmem:slab_lock to cap it's memory usage at availrmem
instead of using the native linux nr_free_pages(). This provides
additional test coverage of the SPL Linux VM integration.
5) Change kmem:slab_overcommit to perform allocation of 256K
instead of 1M. On x86 based systems it is not possible to create
a kmem backed slab with entires of that size. To compensate for
this the number of allocations performed in increased by 4x.
6) Additional autoconf documentation for proposed upstream API
changes to make additional symbols available to modules.
7) Console error messages added when spl_kallsyms_lookup_name()
fails to locate an expected symbol. This causes the module to fail
to load and we need to know exactly which symbol was not available.
An update to the build system to properly support all commonly
used Makefile targets these include:
make all # Build everything
make install # Install everything
make clean # Clean up build products
make distclean # Clean up everything
make dist # Create package tarball
make srpm # Create package source RPM
make rpm # Create package binary RPMs
make tags # Create ctags and etags for everything
Extra care was taken to ensure that the source RPMs are fully
rebuildable against Fedora/RHEL/Chaos kernels. To build binary
RPMs from the source RPM for your system simply run:
rpmbuild --rebuild spl-x.y.z-1.src.rpm
This will produce two binary RPMs with correct 'requires'
dependencies for your kernel. One will contain all spl modules
and support utilities, the other is a devel package for compiling
additional kernel modules which are dependant on the spl.
spl-x.y.z-1_<kernel version>.x86_64.rpm
spl-devel-x.y.2-1_<kernel version>.x86_64.rpm
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address. The function name itself can them become a
define which calls a function pointer. This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel. This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.
This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name(). There are
autoconf checks to detect if the symbol is exported and if so to
use it directly. We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported. Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement. The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.
Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
of $LINUX because Module.symvers is a build product. When
$LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
$LINUX/include search paths to allow proper compilation when
the kernel target build directory is not the source directory.
- The previous magazine ageing sceme relied on the on_each_cpu()
function to call spl_magazine_age() on each cpu. It turns out
this could deadlock with do_flush_tlb_all() which also relies
on the IPI based on_each_cpu(). To avoid this problem a per-
magazine delayed work item is created and indepentantly
scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
spl_kmem_cache_t, they were hold overs from previous cleanup.
- struct work_struct work
- struct timer_list timer
- spl_slab_reclaim() 'continue' changed back to 'break' from commit
37db7d8cf9. The original was correct,
I have added a comment to ensure this does not happen again.
- spl_slab_reclaim() further optimized by moving the destructor call
in spl_slab_free() outside the skc->skc_lock. This minimizes the
length of time the spin lock is held, allows the destructors to
be invoked concurrently for different objects, and as a bonus makes
it safe (although unwise) to sleep in the destructors.
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added whichs allows callers of
spl_slab_reclaim() to cap the number of slabs reclaimed.
On Solaris all eligible slabs are always reclaimed, and this
is still the default behavior. However, I suspect that is
not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
slab/object free lists. Soft lockups were observed when
freeing large numbers of vmalloc'd slabs/objets.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
incorrect 'break' to 'continue' to ensure all slabs are
checked.
- spl_cache_age() reworked to avoid a deadlock with
do_flush_tlb_all() which occured because we slept waiting
for completion in spl_cache_age(). To waiting for magazine
reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
back to using for_each_online_cpu() instead of the
spl_on_each_cpu() approach which was of course a bad idea
due to memory allocations which Ricardo pointed out.
Added support for Solaris swapfs_minfree, and swapfs_reserve tunables.
In additional availrmem is now available and return a reasonable value
which is reasonably analogous to the Solaris meaning. On linux we
return the sun of free and inactive pages since these are all easily
reclaimable.
All tunables are available in /proc/sys/kernel/spl/vm/* and they may
need a little adjusting once we observe the real behavior. Some of
the defaults are mapped to similar linux counterparts, others are
straight from the OpenSolaris defaults.
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree. These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.
When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered. When a minor is unregistered we use the vnode
interface to unlink the special file.
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
if the older 4 argument version of on_each_cpu() should be
used or the new 3 argument version. The retry argument was
dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers. The old way this
worked was to pass a data point when initialized the workqueue.
The new API assumed the work item is embedding in a structure
and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
type checked in the bit operations. This silences the warnings.
- Updated autogen products and splat tests accordingly
- Added slab work queue task which gradually ages and free's slabs
from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
maximum number of objects without create to large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock. We
now unlink the objects and slabs from the cache itself and attach
them to a private work list. The contents of the list are then
subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performace optimizations by minimize the usage of the large
per-cache skc_lock. This includes the addition of KMC_BIT_REAPING
bit mask which is used to prevent concurrent reaping, and to defer
new slab creation when reaping is occuring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
destroyed, this is used to catch any task accessing the cache while
it is being destroyed.
- Add comments to all the functions and additional comments to try
and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
rigerously stress the cache implementation and look for any problems.
This includes correctness and performance tests.
- Updated portable work queue interfaces