This is optional functionality which may or may not be useful to
ZFS when using older kernels. It is never a hard requirement.
Therefore this functionality is being removed from the SPL and
a simpler slimmed down version will be added to ZFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Platforms such as Illumos and FreeBSD have historically provided
global variables which summerize the memory state of a system.
Linux on the otherhand doesn't expose any of this information
to kernel modules and uses entirely different mechanisms for
memory management.
In order to simplify the original ZFS port to Linux these global
variables were emulated by the SPL for the benefit of ZFS. As ZoL
has matured over the years it has moved steadily away from these
interfaces and now no longer depends on them at all.
Therefore, this patch completely removes the global variables
availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree,
and swapfs_reserve. This greatly simplifies the memory management
code and eliminates a common area of confusion.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The get_vmalloc_info() function was used to back the vmem_size()
function. This was always problematic and resulted in brittle
code because the kernel never provided a clean interface for
modules.
However, it turns out that the only caller of this function in
ZFS uses it to determine the total virtual address space size.
This can be determined easily without get_vmalloc_info() so
vmem_size() has been updated to take this approach which allows
us to shed the get_vmalloc_info() dependency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The on_each_cpu() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The mutex_lock_nested() function has been available since Linux 2.6.18.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The inode structure has used i_mutex as its internal locking
primitive since 2.6.16. The compatibility code to check for
the previous semaphore primitive has been removed. However,
the wrapper function itself is being kept because it's entirely
possible this primitive will change again to allow finer grained
locking.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The kmalloc_node() function has been available since Linux 2.6.12.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The uaccess header has been available in the same location since
Linux 2.6.18. There is no longer a need to maintain this
compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The uintptr_t typedef has been available since Linux 2.6.24.
There is no longer a need to maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The atomic64_xchg() and atomic64_cmpxchg() functions have been
available since Linux 2.6.24. There is no longer a need to
maintain this compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many of the time functions had grown overly complex in order to
handle kernel compatibility issues. However, as of Linux 2.6.26
all the required functionality is available. This allows us to
retire numerous configure checks and greatly simplify the time
compatibility wrappers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The fls64() function has been available since Linux 2.6.16 and
it should be used to implemented highbit64(). This allows us
to provide an optimized implementation and simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Support for the CTL_UNNUMBERED sysctl interface was removed in
Linux 2.6.19. There is no longer any reason to maintain this
compatibility code. There also issue any reason to keep around
the CTL_NAME macro and helpers so they have been retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The register_sysctl() interface has been stable since Linux 2.6.21.
There is no longer a need to maintain compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There is no longer a need to wrap this because utsname() is provided
by the kernel and can be called directly. This will require a small
change in the ZFS code because utsname is expected to be a global
structure and not a function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The generic SPL cache shrinkers make the assumption that the
caches only contain VFS cache data and therefore should be scaled
based on vfs_cache_pressure. This is not strictly true and it
should not be assumed.
Removing this tuning should not have any impact on the stock
behavior because vfs_cache_pressure=100 by default. This means
that no scaling will take place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexs.
There is no longer any point in keeping this code so it is being
removed to simplify the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the SPL was originally written it was designed to use the
device_create() and device_destroy() functions. Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.
As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions. This interface
for registering character devices has remained stable, is simple,
and provides everything we need.
Therefore the code has been reworked to use this interface. The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For consistency throughout the code update the SPLAT infrastructure
to use the wrapped mutex interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
New versions of dkms clean up the build directory after installing.
It appears that this was always intended, but had rm -rf "/path/to/build/*"
(note the quotes), which prevented it from working.
Also, the build step is already installing stuff into the directory where
these files go, so installing our stuff there as part of build rather than
install makes sense.
Signed-off-by: Tom Prince <tom.prince@clusterhq.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#399
While running SPLAT on a kernel with CONFIG_DEBUG_ATOMIC_SLEEP
enabled the taskq:front was flagged as a test which might sleep
which in an unsafe context. Specifically, the splat_vprint()
function which internally takes a mutex was being called under
a spin lock. Moving the log function outside the spin lock
cleanly solves this issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The smp_mb__{before,after}_clear_bit functions have been renamed
smp_mb__{before,after}_atomic. Rather than adding a compatibility
function to handle this the code has been updated to use smp_wmb().
This has the advantage of being a stable functionally equivalent
interface. On many architectures smp_mb__after_clear_bit() expands
to smp_wmb(). Others might be able to do something slightly more
efficient but this will be safe and correct on all of them.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#386
Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.
Signed-off-by: stf <s@ctrlc.hu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#382
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#370
Linux kernel 3.17 removes the action function argument from
wait_on_bit(). Add autoconf test and compatibility macro to support
the new interface.
The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule(). This API change
was made to consolidate all of those redundant wait functions.
References: torvalds/linux@7431620
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#378
For small objects the Linux slab allocator should be used to make the most
efficient use of the memory. However, large objects are not supported by
the Linux slab and therefore the SPL implementation is preferred. A cutoff
of 16K was determined to be optimal for architectures using 4K pages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #356Closes#379
Reinstate the correct default behavior of returning the number of objects
in the cache for reclaim. This behavior was disabled in recent releases
to do occasional reports of spinning in shrink_slabs(). Those issues have
been resolved and can no longer can be reproduced. See commit 376dc35.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #358Closes#379
The atomic_swap_32() function maps to atomic_xchg(), and
the atomic_swap_64() function maps to atomic64_xchg().
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#377
Added highbit64() and howmany() which are used in recent upstream
code. Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#363
There have been issues in the past where excessive debug logging
to the console has resulted in significant performance impacts.
In the vast majority of these cases only a few stack traces are
required to diagnose the issue. Therefore, stack traces dumped to
the console will now we limited to 5 every 60s.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#374
Spl's debugging and assertion macros macro used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.
The following macros have been converted to not use do/while(0):
PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL
PANIC has been converted to a wrapper around the new spl_PANIC() function.
The other macros have been converted to use the "&&" operator for the
branch-predicition conditional and also to use spl_PANIC().
The __ASSERT() macro was not touched. It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Set LANG=C before calling 'rpmbuild' to avoid rpmbuild failing on
the translated date string in the changelog.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#306
Running 'yum upgrade spl-dkms' package could appear to work properly
and still leave you with no spl modules installed. This will occur
when only the spl release, and not the version, are incremented.
This may be the case for a fast moving spl-testing repository.
During the upgrade process DKMS will realize that spl-x.y.z is already
installed and remove it. DKMS then correctly builds the new modules
for spl-x.y.z. However, as a final step when the old spl-x.y.z-r is
removed the %preun script runs and removes the newly build modules.
To handle this case the %preun script has been updated to only run
when the installed version exactly matches the full spec file version.
This change also updated ChangeLog section based on the DKMS
reference spec file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating packages in a git repository the release number
can be automatically set by 'git describe'. This normally works
well but if your repository has newer tags which match the form
NAME-VERSION* the release may be incorrectly calculated. To
prevent this the match patten has been restricted to NAME-VERSION.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The correct behavior for all registered shrinkers is to return the
number of objects in their cache. In theory this allows the Linux
VM to balance memory reclaim across all registered caches.
In commit b9b3715 this behavior was disabled in favor of returning
-1 which notifies the VM that no additional objects are available
for reclaim. This was done as a workaround to resolve thrashing
in shrink_slabs() which could occur when memory was low and numerous
core where in reclaim. Unfortunately, this has been observed to
increase the likelihood of OOM events when SPL slab consumers are
responsible for consuming the majority of memory.
Therefore, this patch makes this behavior tunable. Setting the
spl_kmem_cache_reclaim module option to 0x1 will result in the
shrinker only being called once. This is the default behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#358
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL. These include:
1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.
Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator. This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on. However, there are some things we need to be careful of:
1) The Linux slab allocator was never designed to work well with
large objects. Because the SPL slab must still handle this use
case a cut off limit was added to transition from Linux slab
backed objects to kmem or vmem backed slabs.
spl_kmem_cache_slab_limit - Objects less than or equal to this
size in bytes will be backed by the Linux slab. By default
this value is zero which disables the Linux slab functionality.
Reasonable values for this cut off limit are in the range of
4096-16386 bytes.
spl_kmem_cache_kmem_limit - Objects less than or equal to this
size in bytes will be backed by a kmem slab. Objects over this
size will be vmem backed instead. This value defaults to
1/8 a page, or 512 bytes on an x86_64 architecture.
2) Be aware that using the Linux slab may inadvertently introduce
new deadlocks. Care has been taken previously to ensure that
all allocations which occur in the write path use GFP_NOIO.
However, there may be internal allocations performed in the
Linux slab which do not honor these flags. If this is the case
a deadlock may occur.
The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us. And ideally
need to move completely away from using the SPLs slab for large
memory allocations. This patch is a first step.
NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
backed caches but it is not supported for kmem/vmem backed caches.
2) Regardless of the spl_kmem_cache_*_limit settings a cache may
be explicitly set to a given type by passed the KMC_KMEM,
KMC_VMEM, or KMC_SLAB flags during cache creation.
3) The constructors, destructors, and reclaim callbacks are all
functional and will be called regardless of the cache type.
4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
the issues involved in presenting correct object accounting.
Instead they will appear in /proc/slabinfo under the same names.
5) Several kmem SPLAT tests needed to be fixed because they relied
incorrectly on internal kmem slab accounting. With the updated
test cases all the SPLAT tests pass as expected.
6) An autoconf test was added to ensure that the __GFP_COMP flag
was correctly added to the default flags used when allocating
a slab. This is required to ensure all pages in higher order
slabs are properly refcounted, see ae16ed9.
7) When using the SLUB allocator there is no need to attempt to
set the __GFP_COMP flag. This has been the default behavior
for the SLUB since Linux 2.6.25.
8) When using the SLUB it may be desirable to set the slub_nomerge
kernel parameter to prevent caches from being merged.
Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#356
Detect the updated vfs_rename() interface and call it with an
extra flags argument.
References:
torvalds/linux@520c8b1
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
These macro's were exposed to make them available to other
parts of the kernel and modules.
References:
torvalds/linux@6b6350f
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
The problem is described in commit aeeb4e0c0a.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.
The implementation based on assumption that task queues initialization
performed in serial manner.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#336
When using __get_free_pages to get high order memory, only the first page's
_count will set to 1, other's will be 0. When an internal page get passed into
rbd, it will eventully go into tcp_sendpage. There, it will be called with
get_page and put_page, and get freed erroneously when _count jump back to 0.
The solution to this problem is to use compound page. All pages in a
high order compound page share a single _count. So get_page and put_page in
tcp_sendpage will not cause _count jump to 0.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#251
Using the ARM reference simulation (fast model foundation v8) I
cross compiled spl and zfs, to confirm it works on ARMv8 (64 bit
arm architecture, called aarch64 in Linux).
As it is based on previous ARM porting, the resulting patch is
disappointingly small, there was very little to do. The code fixes
the compile issues and has light testing done.
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#351
This behavior is more consistent with the way memory reclaim
is expected to work under Linux.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#349
By default maximal number of objects in slab can't exceed (16*2 - 1) and slab
size can't exceed 32M.
Today's high end servers having couple hundreds of RAM available for ARC may
run into a trouble with virtual memory because of the restriction mentioned
above.
Problem:
Reasons for very high number of virtual memory allocations:
* Real slab size very small relative to the size of the entire RAM
* Slabs allocated on virtual memory and fill entire ARC
The result is very high number of allocated virtual memory ranges (hundreds of
ranges). When virtual memory subsystem manages high number of ranges its
performance become so poor that it freezes from time to time.
Solution:
Number of objects per slab should be increased taking into account maximal
slab size which can also be increased if needed.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#337
When comparing times gotten from ddi_get_lbolt, we have to take account of
wrap around of jiffies. Therefore, we cannot use 't1 < t2'. Instead we should
use 't1 - t2 < 0'.
This patch add ddi_time_after and friends to address this issue. They have
strict type restriction, clock_t for vanilla and int64_t for 64 version, to
prevent type conversion from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#335
This macro makes the compile to spit "mixed definition and code"
warning, I can't find a way to avoid it.
This patch lays some groundwork for the persistent l2arc feature.
See https://www.illumos.org/issues/3525.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#303
There is plenty of compatibility code for a hw_hostid
that isn't used by anything. At the same time, there are apparently
issues with the current hostid logic. coredumb in #zfsonlinux on
freenode reported that Fedora 17 changes its hostid on every boot, which
required force importing his pool. A suggestion by wca was to adopt
FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does
not exist
Adopting FreeBSD's behavior permits us to eliminate plenty of code,
including a userland helper that invokes the system's hostid as a
fallback.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#224
The kernel's kthread_create() function is defined as "..." and there is
no va_list variant at the moment. The task name is pre-formatted into
a local buffer and passed to kthread_create() with fixed arguments.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#347