Solaris credentials don't have an fsuid/fsguid field but Linux
credentials do. To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.
Addititionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code. This simplification
means we only have one version of the helper to keep to to date.
Certain stock kernels (Debian Lenny) are built with zlib_inflate.ko
as a kernel module. To ensure 'make check' works in-tree load this
module before loading the spl module. This is now required for the
zlib splat regression test.
As part of vmalloc() a __pte_alloc_kernel() allocation may occur. This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur. This reclaim can trigger file IO which
can result in a deadlock. This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.
An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3). This code can be properly autoconf'ed
when the upstream bug is fixed.
1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
As originally described in commit 82b8c8fa64
this was done to prevent certain deadlocks from occuring in the system.
However, as suspected the price for doing this proved to be too high.
The VM is having a hard time effectively reclaiming memory thus we are
reverting this change.
However, we still need to fundamentally handle the issue. Under
Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure
a memory allocations will succeed. We leverage this fact and redefine
KM_PUSHPAGE to include GFP_NOFS. This ensures that in these common
I/O path we don't trigger additional reclaim. This minimizes the
change to the Solaris code.
Explicitly include the linux/seq_file.h header in vfs.h. This header
is required for the sequence handlers and is included indirectly in
newer kernels.
Detect early on in configure if the Modules.symvers file is missing.
Without this file there will be build failures later and it's best
to catch this early and provide a useful error. In this case the
most likely problem is the kernel-devel packages are not installed.
It may also be possible that they are using an unbuilt custom kernel
in which case they must build the kernel first.
Until support is added for preemptible kernels detect this at
configure time and make it fatal. Otherwise, it is possible to
have a successful build and kernel modules with flakey behavior.
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines. This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.
This change removes even this minimal support leaving it up
to packages which leverage the spl to prove the full xvattr
support. By removing it from the spl we ensure not conflict
with the higher level packages.
This just leaves minimal vnode support for basical manipulation
of files. This code is does have the proper support functions
in the spl and a set of regression tests.
Additionally, this change removed the unused 'caller_context_t *'
type and replaces it with a 'void *'.
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress. The test case simply takes
a 128k buffer, it compresses the buffer, it them uncompresses the
buffer, and finally it compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
While portions of the code needed to support z_compress_level() and
z_uncompress() where in place. In reality the current implementation
was non-functional, it just was compilable.
The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use. A kmem_cache was
added for the workspace area because we require a large chunk
of memory. This avoids to need to continually alloc/free this
memory and vmap() the pages which is very slow. Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release. A further optimization would be to adjust the
implementation to additional ensure the memory is local to the cpu.
Currently that may not be the case.
Rather than defining our own structure which will conflict with
Linux's version when building 32-bit. Simply setup a typedef
to always use the correct Linux version for both 32 ad 64-bit
builds.
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules. This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.
Because this function still exists in the kernel and the prototype
is available is a common header all we strictly need is the symbol
address. The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn. Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address. All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
Long term we should try to get this, or another, interface made
available to modules again.
Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links. Only if neither of these
links exist fallback to alternate methods for deducing which
kernel to build with. This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.
Roll the version forward to 0.6.0. While no major changes
really warrant this I want to keep the version in step with
ZFS for now which is the only SPL consumer.
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters. However, I've now seen several instances in
OpenSolaris code where they do the following:
cv_broadcast();
cv_destroy();
This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT. This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request. But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.
Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled. This may take some time because each waiter must
acquire the mutex.
This change may have an impact on performance for frequently
created and destroyed condition variables. That however is a price
worth paying it avoid crashing your system. If performance issues
are observed they can be addressed by the caller.
Simply add the policy function wrappers. They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and intergrade
as appropriate with Linux.
Dependent packages require the following missing headers to
simplify compilation. The headers are basically just stubbed
out with minimal content required.
The following flags are use to get the proper mask when getting
and setting ACLs. I'm hopeful this can all largely go away at
some point.
We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
For Linux the maximum uid can vary depending on how your kernel
is built. The Linux kernel still can be compiled with 16 but uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?). Given that caviot it is reasonably
safe to define the MAXUID as 2147483647.
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers. It also makes some minor fid_t
additions for compatibility.
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the object won't be to large and performce is
much better for a kmem cache for them to be kmem backed.
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking. This causes processes
to go in the D state which increases the load average. The load
average is the summation of processes in D state and run queue.
To avoid this it can be desirable to sleep interruptibly. These
processes do not count against the load average but may be woken by
a signal. It is up to the caller to determine why the process
was woken it may be for one of three reasons.
1) cv_signal()/cv_broadcast()
2) the timeout expired
3) a signal was received
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem. This avoids the need to have to ugly
up the code with the required #define's at every call site. At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
To validate the correct behavior of the TSD interfaces it's
important that we add a regression test. This test is designed
to minimally exercise the fundamental TSD behavior, it does not
attempt to validate all potential corner cases.
The test will first create 32 keys via tsd_create() and register
a common destructor. Next 16 wait threads will be created each
of which set/verify a random value for all 32 keys, then block
waiting to be released by the control thread. Meanwhile the
control thread verifies that none of the destructors have been
run prematurely.
The next phase of the test is to create 16 exit threads which
set/verify a random value for all 32 keys. They then immediately
exit. This is is designed to verify tsd_exit() which will be
called via thread_exit(). This must result in all registered
destructors being run and the memory for the tsd being free'd.
After this tsd_destroy() is verified by destroying all 32 keys.
Once again we must see the expected number of destructors run
and the tsd memory free'd. At this point the blocked threads
are released and they exit calling tsd_exit() which should do
very little since all the tsd has already been destroyed.
If this all goes off without a hitch the test passes. To ensure
no memory has been leaked, I have manually verified that after
spl module unload no memory is reported leaked.
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels. This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.
The majority of the entries in the hash table are for specific tsd
entries. These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.
The hash table contains two additional type of entries. They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create(). It is used to store the address of the destructor function
and it is used as an anchor point. All tsd entries which use the same
key will be linked to this entry. This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.
The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key. The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it. This
list is using during tsd_exit() to ensure all registered destructors
are run for the process. The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid. Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
Refresh the autogen.sh products based on the versions which are
installed by default in the GA RHEL6.0 release.
autoconf (GNU Autoconf) 2.63
automake (GNU automake) 1.11.1
ltmain.sh (GNU libtool) 2.2.6b
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.
This is generally OK but has the downside that it can make mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().
We prevent these kind of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For debugging purposes the condition varaibles keep track of the
mutex used during a wait. The idea is to validate that all callers
always use the same mutex. Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe. My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex. However, there is overhead in doing this and it does
appear to be allowed under Solaris.
To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped. This ensures that while the condition variable
is in use the incorrect mutex case is detected. It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.
Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant. The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex. It turns out that was not the case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported. The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs. As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:
internal error: Unknown error 524
Aborted
This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't. So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.
module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
enumerated type ‘krw_type_t’
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock(). If we set it to NULL this check
fails and the lockdep support is immediately disabled.
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These two lines were being rendered incorrectly on the GitHub
site. To fix the issue there needs to be leading whitespace
before each line to ensure each command is rendered on its
own line.
$ ./configure
$ make pkg
The wiki contents have been converted to html and made available
at their new home http://zfsonlinux.org. The wiki has also been
disabled the html pages are now the official documentation.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This is something the project has almost supported for a long time
but finishing this support should save me lots of time.