The main complication from the RT patch set is that the RW semaphore
locks change such that read locks on an rwsem can be taken only by
a single thread. All other threads are locked out. This single
thread can take a read lock multiple times though. The underlying
implementation changes to a mutex with an additional read_depth
count.
The implementation can be best understood by inspecting the RT
patch. rwsem_rt.h and rt.c give the best insight into how RT
rwsem works. My implementation for rwsem_tryupgrade is basically
an inversion of rt_downgrade_write found in rt.c. Please see the
comments in the code.
Unfortunately, I have to drop SPLAT rwlock test4 completely as this
test tries to take multiple locks from different threads, which RT
rwsems do not support. Otherwise SPLAT, zconfig.sh, zpios-sanity.sh
and zfs-tests.sh pass on my Debian-testing VM with the kernel
linux-image-4.8.0-1-rt-amd64.
Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes zfsonlinux/zfs#5491
Closes#589Closes#308
For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type. A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro. See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#563
Current rw_tryupgrade does rw_exit and then rw_tryenter(RW_RWITER), and then
does rw_enter(RW_READER) if it fails. This violate the assumption that
rw_tryupgrade should be atomic and could cause extra contention or even lock
inversion.
This patch we implement a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes#554
This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms. It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held. On other platforms the lock is never released during
the upgrade process. This is necessary under Linux because the kernel
does not provide an upgrade function.
There are currently no callers in the ZFS code where this change in
behavior is a problem. In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes#534
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
The behavior of RW_*_HELD was updated because it was not quite right.
It is not sufficient to return non-zero when the lock is help, we must
only do this when the current task in the holder.
This means we need to track the lock owner which is not something
tracked in a Linux semaphore. After some experimentation the
solution I settled on was to embed the Linux semaphore at the start
of a larger krwlock_t structure which includes the owner field.
This maintains good performance and allows us to cleanly intergrate
with the kernel lock analysis tools. My reasons:
1) By placing the Linux semaphore at the start of krwlock_t we can
then simply cast krwlock_t to a rw_semaphore and pass that on to
the linux kernel. This allows us to use '#defines so the preprocessor
can do direct replacement of the Solaris primative with the linux
equivilant. This is important because it then maintains the location
information for each rw_* call point.
2) Additionally, by adding the owner to krwlock_t we can keep this
needed extra information adjacent to the lock itself. This removes
the need for a fancy lookup to get the owner which is optimal for
performance. We can also leverage the existing spin lock in the
semaphore to ensure owner is updated correctly.
3) All helper functions which do not need to strictly be implemented
as a define to preserve location information can be done as a static
inline function.
4) Adding the owner to krwlock_t allows us to remove all memory
allocations done during lock initialization. This is good for all
the obvious reasons, we do give up the ability to specific the lock
name. The Linux profiling tools will stringify the lock name used
in the code via the preprocessor and use that.
Update rwlocks validated on:
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools. This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.
The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named locked. Since the
rwsem was embedded in a wrapper structure the name was always the
same. The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.
The solution was to directly map a krwlock to a Linux rwsem using
a typedef there by eliminating the wrapper structure. This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible. It can only be done by directly accessing some of
the internal data member of the rwsem structure.
For example, the Linux API provides a different function for dropping
a reader vs writer lock. Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is. This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock. Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extracted this
information directly from arch specific lock implementation. This is
all do able, and what I did, but it does complicate things considerably.
The good news is that in addition to the profiling benefits of this
change. We may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.
The only function I was forced to sacrafice was rw_owner() because this
information is simply not stored anywhere in the rwsem. Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.
In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests. Each class of test was extended to provide
more API coverage and to be more rigerous in checking for misbehavior.
This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing. The full
SPLAT regression test suite was run numberous times on all of the following
platforms. This includes various kernels ranging from 2.6.16 to 2.6.29.
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)