Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem. This avoids the need to have to ugly
up the code with the required #define's at every call site. At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
To validate the correct behavior of the TSD interfaces it's
important that we add a regression test. This test is designed
to minimally exercise the fundamental TSD behavior, it does not
attempt to validate all potential corner cases.
The test will first create 32 keys via tsd_create() and register
a common destructor. Next 16 wait threads will be created each
of which set/verify a random value for all 32 keys, then block
waiting to be released by the control thread. Meanwhile the
control thread verifies that none of the destructors have been
run prematurely.
The next phase of the test is to create 16 exit threads which
set/verify a random value for all 32 keys. They then immediately
exit. This is is designed to verify tsd_exit() which will be
called via thread_exit(). This must result in all registered
destructors being run and the memory for the tsd being free'd.
After this tsd_destroy() is verified by destroying all 32 keys.
Once again we must see the expected number of destructors run
and the tsd memory free'd. At this point the blocked threads
are released and they exit calling tsd_exit() which should do
very little since all the tsd has already been destroyed.
If this all goes off without a hitch the test passes. To ensure
no memory has been leaked, I have manually verified that after
spl module unload no memory is reported leaked.
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels. This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.
The majority of the entries in the hash table are for specific tsd
entries. These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.
The hash table contains two additional type of entries. They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create(). It is used to store the address of the destructor function
and it is used as an anchor point. All tsd entries which use the same
key will be linked to this entry. This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.
The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key. The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it. This
list is using during tsd_exit() to ensure all registered destructors
are run for the process. The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid. Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
Refresh the autogen.sh products based on the versions which are
installed by default in the GA RHEL6.0 release.
autoconf (GNU Autoconf) 2.63
automake (GNU automake) 1.11.1
ltmain.sh (GNU libtool) 2.2.6b
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.
This is generally OK but has the downside that it can make mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().
We prevent these kind of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For debugging purposes the condition varaibles keep track of the
mutex used during a wait. The idea is to validate that all callers
always use the same mutex. Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe. My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex. However, there is overhead in doing this and it does
appear to be allowed under Solaris.
To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped. This ensures that while the condition variable
is in use the incorrect mutex case is detected. It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.
Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant. The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex. It turns out that was not the case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported. The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs. As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:
internal error: Unknown error 524
Aborted
This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't. So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.
module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
enumerated type ‘krw_type_t’
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock(). If we set it to NULL this check
fails and the lockdep support is immediately disabled.
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These two lines were being rendered incorrectly on the GitHub
site. To fix the issue there needs to be leading whitespace
before each line to ensure each command is rendered on its
own line.
$ ./configure
$ make pkg
The wiki contents have been converted to html and made available
at their new home http://zfsonlinux.org. The wiki has also been
disabled the html pages are now the official documentation.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
This check was previously done with a hack in config.guess.
However, since a new config.guess is copied in to place when
forcing a full autoreconf this change was easily lost and
never a good idea. This commit also updates all of the
autoconf style support scripts in config.
Full update to date build information will stay on the wiki for
now, but there is no harm in adding the bare bones instructions
to the README. They shouldn't change and are a reasonable
quick start.
The list_link_replace() function with swap a new item it to the place
of an old item in a list. It is the callers responsibility to ensure
all lists involved are locked properly.
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
These functions were not previous needed so they were not added.
Now they are so add the full set.
atomic_inc_32_nv()
atomic_dec_32_nv()
atomic_inc_64_nv()
atomic_dec_64_nv()
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO. In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow. This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The
original mask is then restored at vn_close() time. Hats off to the
loop driver which does something similiar for the same reason.
[...]
shrink_slab+0xdc/0x153
try_to_free_pages+0x1da/0x2d7
__alloc_pages+0x1d7/0x2da
do_generic_mapping_read+0x2c9/0x36f
file_read_actor+0x0/0x145
__generic_file_aio_read+0x14f/0x19b
generic_file_aio_read+0x34/0x39
do_sync_read+0xc7/0x104
vfs_read+0xcb/0x171
:spl:vn_rdwr+0x2b8/0x402
:zfs:vdev_file_io_start+0xad/0xe1
[...]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5. Details can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=526092
The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked(). The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.
This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present. The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers. Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.
Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros. We only only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.
Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner(). This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.
The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit. However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.
The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization. Since it is no longer used, I removed it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The RHEL5 2.6.18-194.7.1.el5 kernel added atomic64_cmpxchg to
asm-x86_64/atomic.h. That macro is defined in terms of cmpxchg which
is provided by asm/system.h. However, asm/system.h is not #included by
atomic.h in this kernel nor by the autoconf test for atomic64_cmpxchg, so
the test failed with "implicit declaration of function 'cmpxchg'". This
leads the build system to erroneously conclude that the kernel does not
define atomic64_cmpxchg and enable the built-in definition. This in
turn produces a '"atomic64_cmpxchg" redefined' build warning which is fatal
when building with --enable-debug. This commit fixes this by including
asm/system.h in the autoconf test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.
However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.
One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set. Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.
A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.
Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
The long term fix for Debian and Slackware style packaging is
to add native support for building these packages. Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.
As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages. The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:
make rpm: Create .rpm packages
make deb: Create .deb packages
make tgz: Create .tgz packages
make pkg: Create the right package type for your distribution
The solution comes with lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:
1) Will not have the correct dependency information.
2) Will not not include the kernel version in the release.
3) Will not handle all differences between distributions.
But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
In cmd/splat.c there was a comparison between an __u32 and an int. To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.
In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.
Commit 55abb0929e removed the never
used format1 argument of spl_debug_msg(). That in turn resulted
in some deadcode which should be removed since it's now useless.
It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.
However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause too large kmem allocations to happen.
Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs than can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Solaris bcopy() allows overlapping memory areas so we
must use memmove() instead of memcpy().
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
The threads in the splat atomic:64-bit test share the data structure
atomic_priv_t ap, which lives on the kernel stack of the splat user-space
utility. If splat terminates before the threads, accesses to that memory
location by the other threads become invalid. Splat synchronizes with
the threads with the call:
wait_event_interruptible(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
Apparently, the SIGINT wakes and terminates splat prematurely, so that
GPFs or other bad things happen when the threads subsequently access ap.
This commit prevents this by using the uninterruptible form:
wait_event(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h. This will simplify
handling any further API changes in the future.
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO trys to allocate memory while the system is low
on memory. The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation. Thus we end
up deadlocking.
A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL top GFP_NOFS. This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().
The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP. This change means ALL of the zfs
memory allocations will be until to trigger dirty data to be
synced. The caller still should be able to reclaim memory from
the various slab caches. We will be totally dependent of other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages. This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.
We shall see.
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this. That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.
Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests. Much to my surprise the unsigned
64-bit division regression tests failed.
This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel. This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed. After some investigation this turned out to be exactly
the case.
Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl. There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.
The update implementation now passed all the unsigned and signed
regression tests. This should be functional, but not fast, which is
good enough for out purposes. If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform. I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.