Commit 3ee56c292b changed an ENOTSUP return value
in one location to ENOTSUPP to fix user programs seeing an invalid ioctl()
error code. However, use of ENOTSUP is widespread in the zfs module. Instead
of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be
consistent with user space. The changed return value in the above commit is
therefore no longer needed, so this commit reverses it to maintain consistency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't. So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.
module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
enumerated type ‘krw_type_t’
Support for rolling back datasets require a functional ZPL, which we currently
do not have. The zfs command does not check for ZPL support before attempting
a rollback, and in preparation for rolling back a zvol it removes the minor
node of the device. To prevent the zvol device node from disappearing after a
failed rollback operation, this change wraps the zfs_do_rollback() function in
an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL. This is
consistent with the behavior of other ZPL dependent commands such as mount.
The orginal error message observed with this bug was rather confusing:
internal error: Unknown error 524
Aborted
This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but
Linux actually has no such error code. It should instead return EOPNOTSUPP, as
that is how ENOTSUP is defined in user space. With that we would have gotten
the somewhat more helpful message
cannot rollback 'tank/fish': unsupported version
This is rather a moot point with the above changes since we will no longer make
that ioctl call without a ZPL. But, this change updates the error code just in
case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Increasing the default zio_wr_int thread count from 8 to 16 improves
write performence by 13% on large systems. More testing need to be
done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum
of 8 threads.
Linux kernel thread names are expected to be short. This change shortens
the zio thread names to 10 characters leaving a few chracters to append
the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.
On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL
file pointer if the user tries to mount a zvol without a filesystem on it.
This change adds checks to prevent a null pointer dereference.
Closes#73.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped. This was discovered while writing the zfault.sh
tests to validate common failure modes.
This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough. In this case 1024 byte is the
default. If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t. This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error. Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.
The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit. This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.
This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured. Unfortunately,
even with FAILFAST set we may retry and request and not get an
error. This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack. However
by setting the limit at 30s we can log the event even if no error
was returned.
Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram. This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.
None of this code changes the internal behavior of ZFS. Currently
it is simply for reporting excessively long delays.
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
By default the Solaris code does not log speculative or soft io errors
in either 'zpool status' or post an event. Under Linux we don't want
to change the expected behavior of 'zpool status' so these io errors
are still suppressed there.
However, since we do need to know about these events for Linux FMA and
the 'zpool events' interface is new we do post the events. With the
addition of the zio_flags field the posted events now contain enough
information that a user space consumer can identify and discard these
events if it sees fit.
All the upper layers of zfs expect zio->io_error to be positive. I was
careful but I missed one instance in vdev_disk_physio_completion() which
could return a negative error. To ensure all cases are always caught I
had additionally added an ASSERT() to check this before zio_interpret().
Finally, as a debugging aid when zfs is build with --enable-debug all
errors from the backing block devices will be reported to the console
with an error message like this:
ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440
Observed during failure mode testing, dsl_scan_setup_sync() allocates
73920 bytes. This is way over the limit of what is wise to do with a
kmem_alloc() and it should probably be moved to a slab. For now I'm
just flagging it with KM_NODEBUG to quiet the error until this can be
revisited.
This commit fixes a bug in vdev_disk_open() in which the whole_disk property
was getting set to 0 for disk devices, even when it was stored as a 1 when the
zpool was created. The whole_disk property lets us detect when the partition
suffix should be stripped from the device name in CLI output. It is also used
to determine how writeback cache should be set for a device.
When an existing zpool is imported its configuration is read from the vdev
label by user space in zpool_read_label(). The whole_disk property is saved in
the nvlist which gets passed into the kernel, where it in turn gets saved in
the vdev struct in vdev_alloc(). Therefore, this value is available in
vdev_disk_open() and should not be overridden by checking the provided device
path, since that path will likely point to a partition and the check will
return the wrong result.
We also add an ASSERT that the whole_disk property is set. We are not aware of
any cases where vdev_disk_open() should be called with a config that doesn't
have this property set. The ASSERT is there so that when debugging is enabled
we can identify any legitimate cases that we are missing. If we never hit the
ASSERT, we can at some point remove it along with the conditional whole_disk
check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This callback is needed for properly accounting the per-uid and per-gid
space usage. Even if we don't have the ZPL, we still need this callback
in order to have proper on-disk ZPL compatibility and to be able to use
Lustre quotas.
Fortunately, the callback doesn't have any ZPL/VFS dependencies so we
can just move it out of #ifdef HAVE_ZPL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Required for the DB_DNODE_ENTER()/DB_DNODE_EXIT() helpers.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In my machine, dnode_hold_impl() allocates 9992 bytes in DEBUG mode and it
causes a large stream of stack traces in the logs. Instead, use KM_NODEBUG
to quiet down this known large alloc.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz
tar -xzf zfs-x.y.z.tar.gz
cd zfs-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This change also moves many of the include headers from individual
incude/sys directories under the modules directory in to a single
top level include directory. This has the advantage of making
the build rules cleaner and logically it makes a bit more sense.
One of the neat tricks an autoconf style project is capable of
is allow configurion/building in a directory other than the
source directory. The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.
For example, this project is designed to work on various different
Linux distributions each of which work slightly differently. This
means that changes need to verified on each of those supported
distributions perferably before the change is committed to the
public git repo.
Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution. When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.
wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z
------------------------- run concurrently ----------------------
<ubuntu system> <fedora system> <debian system> <rhel6 system>
mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6
cd ubuntu cd fedora cd debian cd rhel6
../configure ../configure ../configure ../configure
make make make make
make check make check make check make check
This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
This topic branch contains all the changes needed to integrate the user
side zfs tools with Linux style devices. Primarily this includes fixing
up the Solaris libefi library to be Linux friendly, and integrating with
the libblkid library which is provided by e2fsprogs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The upstream ZFS code has correctly moved to a faster native sha2
implementation. Unfortunately, under Linux that's going to be a little
problematic so we revert the code to the more portable version contained
in earlier ZFS releases. Using the native sha2 implementation in Linux
is possible but the API is slightly different in kernel version user
space depending on which libraries are used. Ideally, we need a fast
implementation of SHA256 which builds as part of ZFS this shouldn't be
that hard to do but it will take some effort.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This branch contains the majority of the changes required to cleanly
intergrate with Linux style special devices (/dev/zfs). Mainly this
means dropping all the Solaris style callbacks and replacing them
with the Linux equivilants.
This patch also adds the onexit infrastructure needed to track
some minimal state between ioctls. Under Linux it would be easy
to do this simply using the file->private_data. But under Solaris
they apparent need to pass the file descriptor as part of the ioctl
data and then perform a lookup in the kernel. Once again to keep
code change to a minimum I've implemented the Solaris solution.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ZFS update to onnv_141 brought with it support for a
security label attribute called mlslabel. This feature
depends on zones to work correctly and thus I am disabling
it under Linux. Equivilant functionality could be added
at some point in the future.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This topic branch leverages the Solaris style FMA call points
in ZFS to create a user space visible event notification system
under Linux. This new system is called zevent and it unifies
all previous Solaris style ereports and sysevent notifications.
Under this Linux specific scheme when a sysevent or ereport event
occurs an nvlist describing the event is created which looks almost
exactly like a Solaris ereport. These events are queued up in the
kernel when they occur and conditionally logged to the console.
It is then up to a user space application to consume the events
and do whatever it likes with them.
To make this possible the existing /dev/zfs ABI has been extended
with two new ioctls which behave as follows.
* ZFS_IOC_EVENTS_NEXT
Get the next pending event. The kernel will keep track of the last
event consumed by the file descriptor and provide the next one if
available. If no new events are available the ioctl() will block
waiting for the next event. This ioctl may also be called in a
non-blocking mode by setting zc.zc_guid = ZEVENT_NONBLOCK. In the
non-blocking case if no events are available ENOENT will be returned.
It is possible that ESHUTDOWN will be returned if the ioctl() is
called while module unloading is in progress. And finally ENOMEM
may occur if the provided nvlist buffer is not large enough to
contain the entire event.
* ZFS_IOC_EVENTS_CLEAR
Clear are events queued by the kernel. The kernel will keep a fairly
large number of recent events queued, use this ioctl to clear the
in kernel list. This will effect all user space processes consuming
events.
The zpool command has been extended to use this events ABI with the
'events' subcommand. You may run 'zpool events -v' to output a
verbose log of all recent events. This is very similar to the
Solaris 'fmdump -ev' command with the key difference being it also
includes what would be considered sysevents under Solaris. You
may also run in follow mode with the '-f' option. To clear the
in kernel event queue use the '-c' option.
$ sudo cmd/zpool/zpool events -fv
TIME CLASS
May 13 2010 16:31:15.777711000 ereport.fs.zfs.config.sync
class = "ereport.fs.zfs.config.sync"
ena = 0x40982b7897700001
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0xed976600de75dfa6
(end detector)
time = 0x4bec8bc3 0x2e5aed98
pool = "zpios"
pool_guid = 0xed976600de75dfa6
pool_context = 0x0
While the 'zpool events' command is handy for interactive debugging
it is not expected to be the primary consumer of zevents. This ABI
was primarily added to facilitate the addition of a user space
monitoring daemon. This daemon would consume all events posted by
the kernel and based on the type of event perform an action. For
most events simply forwarding them on to syslog is likely enough.
But this interface also cleanly allows for more sophisticated
actions to be taken such as generating an email for a failed drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add autoconf style build infrastructure to the ZFS tree. This
includes autogen.sh, configure.ac, m4 macros, some scripts/*,
and makefiles for all the core ZFS components.
Due to limited stack space recursive functions are frowned upon in
the Linux kernel. However, they often are the most elegant solution
to a problem. The following code preserves the recursive function
traverse_visitbp() but moves the local variables AND function
arguments to the heap to minimize the stack frame size. Enough
space is initially allocated on the stack for 20 levels of recursion.
This change does ugly-up-the-code but it reduces the worst case
usage from roughly 4160 bytes to 960 bytes on x86_64 archs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Implement zio_execute() as a wrapper around the static function
__zio_execute() so that we can force __zio_execute() to be inlined.
This reduces stack overhead which is important because __zio_execute()
is called recursively in several zio code paths. zio_execute() itself
cannot be inlined because it is externally visible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Eliminated local variables pointing to members of the zio struct.
Just refer to the struct members directly. This saved about 32 bytes per
call, but this function can be called recurisvely up to 19 levels deep,
so we potentially save up to 608 bytes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Certain function must never be automatically inlined by gcc because
they are stack heavy or called recursively. This patch flags all
such functions I've found as 'noinline' to prevent gcc from making
the optimization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce kernel stack usage by lzjb_compress() by moving uint16 array
off the stack and on to the heap. The exact performance implications
of this I have not measured but we absolutely need to keep stack
usage to a minimum. If/when this becomes and issue we optimize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Decrease stack usage for various call paths by forcing certain
functions to be inlined. By inlining the functions the overhead
of a new stack frame is removed at the cost of increased code size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To reduce stack overhead this topic branch moves the 128 byte
blkptr_t data strucutre in dsl_scan_visitbp() to the heap.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce stack usage in dsl_deleg_get, gcc flagged it as consuming a
whopping 1040 bytes or potentially 1/4 of a 4K stack. This patch
moves all the large structures and buffer off the stack and on to
the heap. This includes 2 zap_cursor_t structs each 52 bytes in
size, 2 zap_attribute_t structs each 280 bytes in size, and 1
256 byte char array. The total saves on the stack is 880 bytes
after you account for the 5 new pointers added.
Also the source buffer length has been increased from MAXNAMELEN
to MAXNAMELEN+strlen(MOS_DIR_NAME)+1 as described by the comment in
dsl_dir_name(). A buffer overrun may have been possible with the
slightly smaller buffer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Move dsl_dataset_t local variable from the stack to the heap.
This reduces the stack usage of this function from 2048 bytes
to 176 bytes for x84_64 arches.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce stack usage by 276 bytes by moving the snaparg struct from the
stack to the heap. We have limited stack space we must not waste.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit preserves the recursive function dbuf_hold_impl() but moves
the local variables and function arguments to the heap to minimize
the stack frame size. Enough space is initially allocated on the
stack for 20 levels of recursion. This technique was based on commit
34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
traverse_visitbp().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The dnode_move() functionality is only used in the kernel build.
As such we should be careful to wrap all of the related code
with '#ifdef _KERNEL' to avoid gcc warnings about unused code.
Interestingly this looks like an upstream bug as well. If for some
reason we are unable to get a zvols statistics, because perhaps the
zpool is hopelessly corrupt, we would trigger the VERIFY. This
commit adds the proper error handling just to propagate the error
back to user space. Now the user space tools still must handle this
properly but in the worst case the tool will crash or perhaps have
some missing output. That's far far better than crashing the host.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The zio_taskq_dispatch() function may be called at interrupt time
and it is critical that we never sleep.
Additionally, wrap taskq_dispatch() in a while loop because it may
fail. This is non optimal but is OK for now.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Do not use zmod.h in userspace.
This has also been filed with the ZFS team. It makes the userspace
libzpool code use the zlib API, instead of the Solaris-only and
non-standard zmod.h. The zlib API is almost identical and is a de
facto standard, so this is a no-brainer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
If your only going to allow one allocator to be used and it is defined
at compile time there is no point including the others in the build.
This patch could/should be refined for Linux to make the metaslab
configurable at run time. That might be a bit tricky however since
you would need to quiese all IO. Short of that making it configurable
as a module load option would be a reasonable compromise.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove all instances of list handling where the API is not used
and instead list data members are directly accessed. Doing this
sort of thing is bad for portability.
Additionally, ensure that list_link_init() is called on newly
created list nodes. This ensures the node is properly initialized
and does not rely on the assumption that zero'ing the list_node_t
via kmem_zalloc() is the same as proper initialization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Move xiou stat structures from a header to the dmu.c source as is
done with all the other kstat interfaces. This information is local
to dmu.c registered the xuio kstat and should stay that way.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Replace non-fatal assertion with warning. This was being observed
during testing and it should not be fatal.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the linux kernel 'current' is defined to mean the current process
and can never be used as a local variable in a function. Simply
replace all usage of 'current' with 'curr' in this function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The upstream commit cb code had a few bugs:
1) The arguments of the list_move_tail() call in txg_dispatch_callbacks()
were reversed by mistake. This caused the commit callbacks to not be
called at all.
2) ztest had a bug in ztest_dmu_commit_callbacks() where "error" was not
initialized correctly. This seems to have caused the test to always take
the simulated error code path, which made ztest unable to detect whether
commit cbs were being called for transactions that successfuly complete.
3) ztest had another bug in ztest_dmu_commit_callbacks() where the commit
cb threshold was not being compared correctly.
4) The commit cb taskq was using 'max_ncpus * 2' as the maxalloc argument
of taskq_create(), which could have caused unnecessary delays in the txg
sync thread.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix non-c90 compliant code, for the most part these changes
simply deal with where a particular variable is declared.
Under c90 it must alway be done at the very start of a block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation. This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature. This is safe for
now because the move callbacks are just an optimization. Even
if they are registered we don't ever really have to call them.
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO. In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow. This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path. The
original mask is then restored at vn_close() time. Hats off to the
loop driver which does something similiar for the same reason.
[...]
shrink_slab+0xdc/0x153
try_to_free_pages+0x1da/0x2d7
__alloc_pages+0x1d7/0x2da
do_generic_mapping_read+0x2c9/0x36f
file_read_actor+0x0/0x145
__generic_file_aio_read+0x14f/0x19b
generic_file_aio_read+0x34/0x39
do_sync_read+0xc7/0x104
vfs_read+0xcb/0x171
:spl:vn_rdwr+0x2b8/0x402
:zfs:vdev_file_io_start+0xad/0xe1
[...]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.
However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.
One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set. Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.
A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.
Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP. They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available. This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.
At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places. This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available. They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.
Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock. The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.
Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options. The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
In cmd/splat.c there was a comparison between an __u32 and an int. To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.
In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.
Commit 55abb0929e removed the never
used format1 argument of spl_debug_msg(). That in turn resulted
in some deadcode which should be removed since it's now useless.
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros. They must also build with DEBUG defined.
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts. The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY. The spl-debug.h header contain the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging. Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anythign by spl-debug.c.
This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros. However, if
the dependant package want to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.
Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.
Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set. While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make due to PANIC. It should rarely be called
directly, the prefered usage would be an ASSERT or VERIFY.
There's lots of change here but this cleanup was overdue.
The threads in the splat atomic:64-bit test share the data structure
atomic_priv_t ap, which lives on the kernel stack of the splat user-space
utility. If splat terminates before the threads, accesses to that memory
location by the other threads become invalid. Splat synchronizes with
the threads with the call:
wait_event_interruptible(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
Apparently, the SIGINT wakes and terminates splat prematurely, so that
GPFs or other bad things happen when the threads subsequently access ap.
This commit prevents this by using the uninterruptible form:
wait_event(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
The prototype for filp_fsync() drop the unused argument 'stuct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h. This will simplify
handling any further API changes in the future.
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this. That has now
changed so I adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.
Since the have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
over due regression tests. Much to my surprise the unsigned
64-bit division regression tests failed.
This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel. This meant
that the linux kernels 64-bit division algorithm on 32-bit platforms
was flawed. After some investigation this turned out to be exactly
the case.
Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl. There are
several published implementation out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algoritm is freely available without restriction
and I have just modified it to be linux kernel friendly.
The update implementation now passed all the unsigned and signed
regression tests. This should be functional, but not fast, which is
good enough for out purposes. If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform. I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
For some reason when awk invoked by the usermode helper the command
always fails. Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distro I tested
with all had gawk installed instead of awk. Anyway, the simplest
thing to do here is to just make gawk mandatory. I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.
I didn't notice at the time but user_path_dir() was not introduced
at the same time as set_fs_pwd() change. I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25. This means
builds against 2.6.25-2.6.26 kernels were broken.
To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.
Use 3 threads and 8 tasks. Dispatch the final 3 tasks with TQ_FRONT.
The first three tasks keep the worker threads busy while we stuff the
queues. Use msleep() to force a known execution order, assuming
TQ_FRONT is properly honored. Verify that the expected completion
order occurs.
The splat_taskq_test5_order() function may be useful in more than
one test. This commit generalizes it by renaming the function to
splat_taskq_test_order() and adding a name argument instead of
assuming SPLAT_TASKQ_TEST5_NAME as the test name.
The documentation for splat taskq regression test #5 swaps the two required
completion orders in the diagram. This commit corrects the error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
On open() and initialize the buffer with the SPL version string. The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Adds a task queue to receive tasks dispatched with TQ_FRONT. Worker
threads pull tasks from this high priority queue before the default
pending queue.
Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task. The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed. The upstream code has been updated
to depend entirely on the the procname member. To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels. Older kernels are
supported by having it expand to .ctl_name = X just as before.
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat is as volative with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus. This can result in incorrect behavior from
mutex_owned() and mutex_owner(). This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly gaurentees we will
not be accessing stale data.
Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.
Thirdly, check CONFIG_MUTEX_DEBUG when this is defined and the
HAVE_MUTEX_OWNER is defined surprisingly the mutex->owner will
not be cleared on mutex_exit(). When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().
Finally, improve the mutex regression tests. For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread. This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.
As for mutex_owned() regression test additonally verify that
mutex->owner is always cleared on mutex_exit().
The call to wake_up() must be moved under the spin lock because
once we drop the lock 'tp' may no longer be valid because the
creating thread has exited. This basic thread implementation
was correct, this was simply a flaw in the test case.
We might as well have both asprintf() variants. This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
This fix was long overdue. Most of the ground work was laid long
ago to include the exact function and line number in the error message
which there was an issue with a memory allocation call. However,
probably due to lack of time at the moment that informatin never
made it in to the error message. This patch fixes that and trys
to standardize the kmem debug messages as well.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup(). They are all implemented as a thin layer which just calls
their Linux counterparts. As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels. If the kernel does
not provide it then spl-generic implements it.
Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
Under linux the proc.h header is for the /proc filesystem, and under
Solaris the proc/h header if for processes. This patch correctly
moves the Linux proc functionality in a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris. Minor updates were
required to all the call sites where it was included of course.
Running 'zpool create' on a 32-bit machine with an SPL compiled with
gcc 4.4.4 led to a stack overlow. This turned out to be due to some
sort of 'optimization' by gcc:
uint64_t __umoddi3(uint64_t dividend, uint64_t divisor)
{
return dividend - divisor * (dividend / divisor);
}
This code was supposed to be using __udivdi3 to implement /, but gcc
instead implemented it via __umoddi3 itself.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove RW_COUNT() from the rwlock implementation. The idea was that it
could be used as a generic wrapper for getting at the internal state
of a rwlock. While a good idea it's proven problematic to keep it
correct for multiple archs and internal implementation changes. In
short it hasn't been worth the trouble.
With that and simplicity in mind things have been updated to use the
rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD()
functions. As for rw_upgrade() it remains only implemented for
the generic rwsem implemenation. It remains to be determined if its
worth the effort of adding a custom implementation for each arch.
While I may prefer to have the system panic on an SBUG and to get
crash dump for analysis. I suspect most peoples systems are not
configured from crash dump and the best thing to so is to simply
halt the thread and print an error to the console. This way they
have a good chance of actually saving the stack trace and debug log.
Remove the kmem_set_warning() hack used by the kmem-splat regression
tests with a per-allocation flag called __GFP_NOWARN. This matches
the lower level linux flag of similar by slightly different function.
The idea is you can then explicitly set this flag on requests where
you know your breaking the max 8k rule but you need/want to do it
anyway.
This is currently used by the regression tests where we intentionally
push things to the limit but don't want the log noise. Additionally,
we are forced to use it in spl_kmem_cache_create() because by default
NR_CPUS is very large and theres no easy way to handle that.
Finally, I've added a stack_dump() call to the warning when it is
trigger to make to clear exactly where the allocation is taking place.
Using /tmp/ is a preferable default, it can always be overriden
using the module option on a case-by-case basis.
Additionally standardize some log messages based on the same
default log level used by the kernel.
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files. Added
standardized headers to all source file to clearly indicate the
copyright, license, and to give credit where credit is due.
While this does incur slightly more overhead we should be using
do_posix_clock_monotonic_gettime() for gethrtime() as described
by the existing comment.
This is a minor extension to the condition variable API to allow
for reasonable signal handling on Linux. The cv_wait() function by
definition must wait unconditionally for cv_signal()/cv_broadcast()
before waking it. This makes it impossible to woken by a signal
such as SIGTERM. The cv_wait_interruptible() function was added
to handle this case. It behaves identically to cv_wait() with the
exception that it waits interruptibly allowing a signal to wake it
up. This means you do need to be careful and check issig() after
waking.
When dumping a debug log first check that it is safe to create
a new thread and block waiting for it. If we are in an atomic
context or irqs and disabled it is not safe to sleep and we
must write out of the debug log from the current process.
During module init spl_setup()->The vn_set_pwd("/") was failing
with -EFAULT because user_path_dir() and __user_walk() both
expect 'filename' to be a user space address and it's not in
this case. To handle this the data segment size is increased
to to ensure strncpy_from_user() does not fail with -EFAULT.
Additionally, I've added a printk() warning to catch this and
log it to the console if it ever reoccurs. I thought everything
was working properly here because there consequences of this
failing are subtle and usually non-critical.
We need dependent packages to be able to include spl_config.h to
build properly. This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed. This solution
works as long as the spl_config.h is included before your projects
config.h. That turns out to be easier said than done. In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.
To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command. This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file. This eliminates
the problem entirely and makes header safe for inclusion.
Also in this change I have removed the few places in the code where
spl_config.h is included. It is now added to the gcc compile line
to ensure the config results are always available.
Finally, I have also disabled the verbose kernel builds. If you
want them back you can always build with 'make V=1'. Since things
are working now they don't need to be on by default.
Allowing MAX_ORDER-1 sized allocations for kmem based slabs have
been observed to result in deadlocks. To help prvent this limit
max kmem based slab size to MAX_ORDER-3. Just for the record
callers should not be creating slabs like this, but if they do
we should still handle it as safely as we can.
As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handle() prototype. It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.
I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions. It's not pretty but API compat changes rarely are.
Fix panic() string, which was being used as a format string, instead of an already-formatted string.
Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
This test case verifies the correct behavior of taskq_wait_id().
In particular it ensure the the following two cases are handled
properly:
1) Task ids larger than the waited for task id can run and
complete as long as there is an available worker thread.
2) All task ids lower than the waited one must complete before
unblocking even if the waited task id itself has completed.
In the initial version of taskq_lowest_id() the entire pending and
work list was locked under the tq->tq_lock to determine the lowest
outstanding taskqid. At the time this done because I was rushed
and wanted to make sure it was right... fast was secondary. Well now
fast is important too so I carefully thought through the pending
and work list management and convinced myself it is safe and correct
to simply check the first entry. I added a large comment to the source
to explain this. But basically as long as we are careful to ensure the
pending and work list stay sorted this is safe and fast.
The motivation for this chance was that I was observing as much as
10% of the total CPU time go to waiting on the tq->tq_lock when the
pending list was long. This resolves that problems and frees up
that CPU time for something useful.
This regression test could crash in splat_kmem_cache_test_reclaim()
due to a race between the slab relclaim and the normal exiting of
the thread. Specifically, the kct structure could be free'd by
the thread performing the allocations while the reclaim function
was also working on that's threads kct structure. The simplest
fix is to extend the kcp->kcp_lock over the reclaim to prevent
the kct from being freed. A better fix would be to ref count
these structures, but since is just a regression this locking
change is enough. Surprisingly this was only observed commonly
under RHEL5.4 but all platform could have hit this.
I must have been in a hurry when I wrote the vnode regression tests
because the error code handling is not correct. The Solaris vnode
API returns positive errno's, these need to be converted to negative
errno's for Linux before being passed back to user space. Otherwise
the test hardness with report the failure but errno will not be set
with the correct error code.
Additionally tests 3, 4, 6, and 7 may fail in the test file already
exists. To avoid false positives a user mode helper has added to
remove the test files in /tmp/ before running the actual test.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing. This changes
do not really impact the common case I'm expected which is the latest
kernel running on an x86_64 arch.
Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations. Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total. To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.
The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available. On 32-bit archs this may
not be true, and actually that's just fine. In that case the kernel will
will never be able to allocate more the 32-bits worth anyway. So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
The big fix here is the removal of kmalloc() in kv_alloc(). It used
to be true in previous kernels that kmallocs over PAGE_SIZE would
always be pages aligned. This is no longer true atleast in 2.6.31
there are no longer any alignment expectations. Since kv_alloc()
requires the resulting address to be page align we no only either
directly allocate pages in the KMC_KMEM case, or directly call
__vmalloc() both of which will always return a page aligned address.
Additionally, to avoid wasting memory size is always a power of two.
As for cleanup several helper functions were introduced to calculate
the aligned sizes of various data structures. This helps ensure no
case is accidentally missed where the alignment needs to be taken in
to account. The helpers now use P2ROUNDUP_TYPE instead of P2ROUNDUP
which is safer since the type will be explict and we no longer count
on the compiler to auto promote types hopefully as we expected.
Always wnforce minimum (SPL_KMEM_CACHE_ALIGN) and maximum (PAGE_SIZE)
alignment restrictions at cache creation time.
Use SPL_KMEM_CACHE_ALIGN in splat alignment test.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time. To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.
From linux-2.6.31 mm/page_alloc.c:1166
/*
* __GFP_NOFAIL is not to be used in new code.
*
* All __GFP_NOFAIL callers should be fixed so that they
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with
* __GFP_NOFAIL.
*/
WARN_ON_ONCE(order > 1);
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.
min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels. For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it. To summerize:
1) All SPL_AC_DEBUG_* macros were updated to be a more autoconf
friendly. This mainly involved shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.
2) --enable-debug-kmem=yes by default. This simply enabled keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload. Additionally, it
ensure /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.
3) --enable-debug-kmem-tracking=no by default. This option was added
to provide a configure option to enable to detailed memory allocation
tracking. This support was always there but you had to know where to
turn it on. By default this support is disabled because it is known
to badly hurt performence, however it is invaluable when chasing a
memory leak.
4) --enable-debug-kstat removed. After further reflection I can't see
why you would ever really want to turn this support off. It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c. We can now always assume the top level directory
will be there.
5) --enable-debug-callb removed. This never really did anything, it was
put in provisionally because it might have been needed. It turns out
it was not so I am just removing it to prevent confusion.
Previously Solaris style atomic primitives were implemented simply by
wrapping the desired operation in a global spinlock. This was easy to
implement at the time when I wasn't 100% sure I could safely layer the
Solaris atomic primatives on the Linux counterparts. It however was
likely not good for performance.
After more investigation however it does appear the Solaris primitives
can be layered on Linux's fairly safely. The Linux atomic_t type really
just wraps a long so we can simply cast the Solaris unsigned value to
either a atomic_t or atomic64_t. The only lingering problem for both
implementations is that Solaris provides no atomic read function. This
means reading a 64-bit value on a 32-bit arch can (and will) result in
word breaking. I was very concerned about this initially, but upon
further reflection it is a limitation of the Solaris API. So really
we are just being bug-for-bug compatible here.
With this change the default implementation is layered on top of Linux
atomic types. However, because we're assuming a lot about the internal
implementation of those types I've made it easy to fall-back to the
generic approach. Simply build with --enable-atomic_spinlocks if
issues are encountered with the new implementation.
The cmn_err/vcmn_err functions are layered on top of the debug
system which usually expects a newline at the end. However, there
really doesn't need to be a newline there and there in fact should
not be for the CE_CONT case so let's just drop the warning.
Also we make a half-hearted attempt to handle a leading ! which
means only send it to the syslog not the console. In this case
we just send to the the debug logs and not the console.
As of 2.6.25 kobj->k_name was replaced with kobj->name. Some distros
such as RHEL5 (2.6.18) add a patch to prevent this from being a problem
but other older distros such as SLES10 (2.6.16) have not. To avoid
the whole issue I'm updating the code to use kobject_set_name() which
does what I want and has existed all the way back to 2.6.11.
Ricardo has pointed out that under Solaris the cwd is set to '/'
during module load, while under Linux it is set to the callers cwd.
To handle this cleanly I've reworked the module *_init()/_exit()
macros so they call a *_setup()/_cleanup() function when any SPL
dependent module is loaded or unloaded. This gives us a chance to
perform any needed modification of the process, in this case changing
the cwd. It also handily provides a way to avoid creating wrapper
init()/exit() functions because the Solaris and Linux prototypes
differ slightly. All dependent modules should now call the spl
helper macros spl_module_{init,exit}() instead of the native linux
versions.
Unfortunately, it appears that under Linux there has been no consistent
API in the kernel to set the cwd in a module. Because of this I have
had to add more autoconf magic than I'd like. However, what I have
done is correct and has been tested on RHEL5, SLES11, FC11, and CHAOS
kernels.
In addition, I have change the rootdir type from a 'void *' to the
correct 'vnode_t *' type. And I've set rootdir to a non-NULL value.
For a generic explanation of why mutexs needed to be reimplemented
to work with the kernel lock profiling see commits:
e811949a57 and
d28db80fd0
The specific changes made to the mutex implemetation are as follows.
The Linux mutex structure is now directly embedded in the kmutex_t.
This allows a kmutex_t to be directly case to a mutex struct and
passed directly to the Linux primative.
Just like with the rwlocks it is critical that these functions be
implemented as '#defines to ensure the location information is
preserved. The preprocessor can then do a direct replacement of
the Solaris primative with the linux primative.
Just as with the rwlocks we need to track the lock owner. Here
things get a little more interesting because depending on your
kernel version, and how you've built your kernel Linux may already
do this for you. If your running a 2.6.29 or newer kernel on a
SMP system the lock owner will be tracked. This was added to Linux
to support adaptive mutexs, more on that shortly. Alternately, your
kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES
in the kernel build. If neither of the above things is true for
your kernel the kmutex_t type will include and track the lock owner
to ensure correct behavior. This is all handled by a new autoconf
check called SPL_AC_MUTEX_OWNER.
Concerning adaptive mutexs these are a very recent development and
they did not make it in to either the latest FC11 of SLES11 kernels.
Ideally, I'd love to see this kernel change appear in one of these
distros because it does help performance. From Linux kernel commit:
0d66bf6d3514b35eb6897629059443132992dbd7
"Testing with Ingo's test-mutex application...
gave a 345% boost for VFS scalability on my testbox"
However, if you don't want to backport this change yourself you
can still simply export the task_curr() symbol. The kmutex_t
implementation will use this symbol when it's available to
provide it's own adaptive mutexs.
Finally, DEBUG_MUTEX support was removed including the proc handlers.
This was done because now that we are cleanly integrated with the
kernel profiling all this information and much much more is available
in debug kernel builds. This code was now redundant.
Update mutexs validated on:
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
The behavior of RW_*_HELD was updated because it was not quite right.
It is not sufficient to return non-zero when the lock is help, we must
only do this when the current task in the holder.
This means we need to track the lock owner which is not something
tracked in a Linux semaphore. After some experimentation the
solution I settled on was to embed the Linux semaphore at the start
of a larger krwlock_t structure which includes the owner field.
This maintains good performance and allows us to cleanly intergrate
with the kernel lock analysis tools. My reasons:
1) By placing the Linux semaphore at the start of krwlock_t we can
then simply cast krwlock_t to a rw_semaphore and pass that on to
the linux kernel. This allows us to use '#defines so the preprocessor
can do direct replacement of the Solaris primative with the linux
equivilant. This is important because it then maintains the location
information for each rw_* call point.
2) Additionally, by adding the owner to krwlock_t we can keep this
needed extra information adjacent to the lock itself. This removes
the need for a fancy lookup to get the owner which is optimal for
performance. We can also leverage the existing spin lock in the
semaphore to ensure owner is updated correctly.
3) All helper functions which do not need to strictly be implemented
as a define to preserve location information can be done as a static
inline function.
4) Adding the owner to krwlock_t allows us to remove all memory
allocations done during lock initialization. This is good for all
the obvious reasons, we do give up the ability to specific the lock
name. The Linux profiling tools will stringify the lock name used
in the code via the preprocessor and use that.
Update rwlocks validated on:
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools. This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.
The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named locked. Since the
rwsem was embedded in a wrapper structure the name was always the
same. The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.
The solution was to directly map a krwlock to a Linux rwsem using
a typedef there by eliminating the wrapper structure. This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible. It can only be done by directly accessing some of
the internal data member of the rwsem structure.
For example, the Linux API provides a different function for dropping
a reader vs writer lock. Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is. This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock. Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extracted this
information directly from arch specific lock implementation. This is
all do able, and what I did, but it does complicate things considerably.
The good news is that in addition to the profiling benefits of this
change. We may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.
The only function I was forced to sacrafice was rw_owner() because this
information is simply not stored anywhere in the rwsem. Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.
In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests. Each class of test was extended to provide
more API coverage and to be more rigerous in checking for misbehavior.
This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing. The full
SPLAT regression test suite was run numberous times on all of the following
platforms. This includes various kernels ranging from 2.6.16 to 2.6.29.
- SLES10 (ppc64)
- SLES11 (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3 (x86_64)
- RHEL6 (x86_64)
- FC11 (x86_64)
Basically everything we need to monitor the global memory state of
the system is now cleanly available via global_page_state(). The
problem is that this interface is still fairly recent, and there
has been one change in the page state enum which we need to handle.
These changes basically boil down to the following:
- If global_page_state() is available we should use it. Several
autoconf checks have been added to detect the correct enum names.
- If global_page_state() is not available check to see if
get_zone_counts() symbol is available and use that.
- If the get_zone_counts() symbol is not exported we have no choice
be to dynamically aquire it at load time. This is an absolute
last resort for old kernel which we don't want to patch to
cleanly export the symbol.
This interface is going away, and it's not as if most callers actually
use crhold/crfree when working with credentials. So it'll be okay
they we're not taking a reference on the task structure the odds of
it going away while working with a credential and pretty small.
The previous credential implementation simply provided the needed types and
a couple of dummy functions needed. This update correctly ties the basic
Solaris credential API in to one of two Linux kernel APIs.
Prior to 2.6.29 the linux kernel embeded all credentials in the task
structure. For these kernels, we pass around the entire task struct as if
it were the credential, then we use the helper functions to extract the
credential related bits.
As of 2.6.29 a new credential type was added which we can and do fairly
cleanly layer on top of. Once again the helper functions nicely hide
the implementation details from all callers.
Three tests were added to the splat test framework to verify basic
correctness. They should be extended as needed when need credential
functions are added.
The slab_overcommit test case could hang on a system with fragmented
memory because it was creating a kmem based slab with 256K objects.
To avoid this I've removed the KMC_KMEM flag which allows the slab
to decide if it should be kmem or vmem backed based on the object
side. The slab_lock test shares this code and will also be effected.
But the point of these two tests is to stress cache locking and
memory overcommit, the type of slab is not critical. In fact, allowing
the slab to do the default smart thing is preferable.
Simply pass the ioctl on to the normal handler. If the ioctl
helper macros are used correctly this should be safe as they
will handle the packing/unpacking of the data encoded in the
ioctl command. And actually, if the caller does not use the
IO* macros at all, and just passes small values, it will probably
be OK as well. We only get in to trouble if they try and use
the upper 32-bits. Endianness is not really a concern here, we
we are pretty much assumed they user and kernel will match.
used to scale the number of threads based on the number of online
CPUs. As CPUs are added/removed we should rescale the thread
count appropriately, but currently this is only done at create.
- Kernel modules should be built using the LINUX_OBJ Makefiles and
not the LINUX Makefiles to ensure the proper install paths are used.
- Install modules in to addon/spl/
- Ensure no additional kernel module build products are packaged.
- Simplified spl.spec.in which supports RHEL, CHAOS, SLES, FEDORA.
- Properly honor --prefix in build system and rpm spec file.
- Add '--define require_kdir' to spec file to support building
rpms against kernel sources installed in non-default locations.
- Add '--define require_kobj' to spec file to support building
rpms against kernel object installed in non-default locations.
- Stop suppressing errors in autogen.sh script.
- Improved logic to detect missing kernel objects when they are
not located with the source. This is the common case for SLES
as well as in-tree chaos kernel builds and is done to simply
support for multiple arches.
- Moved spl-devel build products to /usr/src/spl-<version>, a
spl symlink is created to reference the last installed version.
- Proper ioctl() 32/64-bit binary compatibility. We need to ensure the
ioctl data itself is always packed the same for 32/64-bit binaries.
Additionally, the correct thing to do is encode this size in bytes
as part of the command using _IOC_SIZE().
- Minor formatting changes to respect the 80 character limit.
- Move all SPLAT_SUBSYSTEM_* defines in to splat-ctl.h.
- Increase SPLAT_SUBSYSTEM_UNKNOWN because we were getting close
to accidentally using it for a real registered subsystem.
- Add compat_ioctl() handler, by default 64-bit SLES systems build 32-bit
ELF binaries. For the 32-bit binaries to pass ioctl information to a
64-bit kernel a compatibility handler needs to be registered. In our
case no additional conversions are needed to convert 32-bit ioctl()
commands to 64-bit commands so we can just call the default handler.
- Initial SLES testing uncovered a long standing bug in the debug
tracing. The tcd_for_each() macro expected a NULL to terminate
the trace_data[i] array but this was only ever true due to luck.
All trace_data[] iterators are now properly capped by TCD_TYPE_MAX.
- SPLAT_MAJOR 229 conflicted with a 'hvc' device on my SLES system.
Since this was always an arbitrary choice I picked something else.
- The HAVE_PGDAT_LIST case should set pgdat_list_addr to the value stored
at the address of the memory location returned by kallsyms_lookup_name().
- Prior to 2.6.17 there were no *_pgdat helper functions in mm/mmzone.c.
Instead for_each_zone() operated directly on pgdat_list which may or
may not have been exported depending on how your kernel was compiled.
Now new configure checks determine if you have the helpers or not, and
if the needed symbols are exported. If they are not exported then they
are dynamically aquired at runtime by kallsyms_lookup_name().
- Enable builds for powerpc ISA type.
- Add DIV_ROUND_UP and roundup macros if unavailable.
- Cast 64-bit values for %lld format string to (long long) to
quiet compile warning.
- Configure check for SLES specific API change to vfs_unlink()
and vfs_rename() which added a 'struct vfsmount *' argument.
This was for something called the linux-security-module, but
it appears that it was never adopted upstream.
- Configure check, the div64_64() function was renamed to
div64_u64() as of 2.6.26.
- Configure check, the global_page_state() fuction was introduced
in 2.6.18 kernels. The earlier 2.6.16 based SLES10 must not try
and use it, thankfully get_zone_counts() is still available.
- To simplify debugging poison all symbols aquired dynamically
using spl_kallsyms_lookup_name() with SYMBOL_POISON.
- Add console messages when the user mode helpers fail.
- spl_kmem_init_globals() use bit shifts instead of division.
- When the monotonic clock is unavailable __gethrtime() must perform
the HZ division as an 'unsigned long long' because the SPL only
implements __udivdi3(), and not __divdi3() for 'long long' division
on 32-bit arches.
We need dependent packages to be able to include spl_config.h so they
can leverage the configure checks the SPL has done. This is important
because several of the spl headers need the results of these checks to
work properly. Unfortunately, the autoheader build product is always
private to a particular build and defined certain common things.
(PACKAGE, VERSION, etc). This prevents other packages which also use
autoheader from being include because the definitions conflict. To
avoid this problem the SPL build system leverage AH_BOTTOM to include
a spl_unconfig.h at the botton of the autoheader build product. This
custom include undefs all known shared symbols to prevent the confict.
This does however mean that those definition are also not availble
to the SPL package either. The SPL package therefore uses the
equivilant SPL_META_* definitions.
In the interests of portability I have added a FC10/i686 box to
my list of development platforms. The hope is this will allow me
to keep current with upstream kernel API changes, and at the same
time ensure I don't accidentally break x86 support. This patch
resolves all remaining issues observed under that environment.
1) SPL_AC_ZONE_STAT_ITEM_FIA autoconf check added. As of 2.6.21
the kernel added a clean API for modules to get the global count
for free, inactive, and active pages. The SPL attempts to detect
if this API is available and directly map spl_global_page_state()
to global_page_state(). If the full API is not available then
spl_global_page_state() is implemented as a thin layer to get
these values via get_zone_counts() if that symbol is available.
2) New kmem:vmem_size regression test added to validate correct
vmem_size() functionality. The test case acquires the current
global vmem state, allocates from the vmem region, then verifies
the allocation is correctly reflected in the vmem_size() stats.
3) Change splat_kmem_cache_thread_test() to always use KMC_KMEM
based memory. On x86 systems with limited virtual address space
failures resulted due to exhaustig the address space. The tests
really need to problem exhausting all memory on the system thus
we need to use the physical address space.
4) Change kmem:slab_lock to cap it's memory usage at availrmem
instead of using the native linux nr_free_pages(). This provides
additional test coverage of the SPL Linux VM integration.
5) Change kmem:slab_overcommit to perform allocation of 256K
instead of 1M. On x86 based systems it is not possible to create
a kmem backed slab with entires of that size. To compensate for
this the number of allocations performed in increased by 4x.
6) Additional autoconf documentation for proposed upstream API
changes to make additional symbols available to modules.
7) Console error messages added when spl_kallsyms_lookup_name()
fails to locate an expected symbol. This causes the module to fail
to load and we need to know exactly which symbol was not available.
I'm very surprised this has not surfaced until now. But the taskq_wait()
implementation work only wait successfully the first time it was called.
Subsequent usage of taskq_wait() on the taskq would not wait.
The issue was caused by tq->tq_lowest_id being set to MAX_INT after the
first wait completed. This caused subsequent waits which check that the
waiting id is less than the lowest taskq id to always succeed. The fix
is to ensure that tq->tq_lowest_id is never set larger than tq->tq_next.id.
Additional fixes which were added to this patch include:
1) Fix a race by placing the taskq_wait_check() in the tq->tq_lock spinlock.
2) taskq_wait() should wait for the largest outstanding id.
3) Multiple spelling corrections.
4) Added taskq wait regression test to validate correct behavior.
Mainly for portability reasons I have rebased the mutex tests on Solaris
taskqs instead of linux work queues. The linux workqueue API changed post
2.6.18 kernels and using task queues avoids having to conditionally detect
which workqueue API to use.
Additionally, this is basically free additional testing for the task queues.
Much to my surprise after updating these test cases they did expose a long
standing bug in the taskq_wait() implementation. This patch does not
address that issue but the followup patch does.
Fixes hostid mismatch which leads to assertion failure when the hostid/hw_serial is a 10-character decimal number:
$ zpool status
pool: lustre
state: ONLINE
lt-zpool: zpool_main.c:3176: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.
zsh: 5262 abort zpool status
An update to the build system to properly support all commonly
used Makefile targets these include:
make all # Build everything
make install # Install everything
make clean # Clean up build products
make distclean # Clean up everything
make dist # Create package tarball
make srpm # Create package source RPM
make rpm # Create package binary RPMs
make tags # Create ctags and etags for everything
Extra care was taken to ensure that the source RPMs are fully
rebuildable against Fedora/RHEL/Chaos kernels. To build binary
RPMs from the source RPM for your system simply run:
rpmbuild --rebuild spl-x.y.z-1.src.rpm
This will produce two binary RPMs with correct 'requires'
dependencies for your kernel. One will contain all spl modules
and support utilities, the other is a devel package for compiling
additional kernel modules which are dependant on the spl.
spl-x.y.z-1_<kernel version>.x86_64.rpm
spl-devel-x.y.2-1_<kernel version>.x86_64.rpm
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address. The function name itself can them become a
define which calls a function pointer. This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel. This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.
This patch also includes a rhel5 kernel patch we exports the needed
symbols so we don't need to use kallsyms_lookup_name(). There are
autoconf checks to detect if the symbol is exported and if so to
use it directly. We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported. Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement. The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.
Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
of $LINUX because Module.symvers is a build product. When
$LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
$LINUX/include search paths to allow proper compilation when
the kernel target build directory is not the source directory.
Minimal support added for the zone_get_hostid() function. Only
global zones are supported therefore this function must be called
with a NULL argumment. Additionally, I've added the HW_HOSTID_LEN
define and updated all instances where a hard coded magic value
of 11 was used; "A good riddance of bad rubbish!"
Accidentally leaked list item li in error path. The fix is to
adjust this error path to ensure the allocated list item which
has not yet been added to the list gets freed. To do this we
simply add a new goto label slightly earlier to use the existing
cleanup logic and minimize the number of unique return points.
This was a false positive the callpath being walked is impossible
because the splat_kmem_cache_test_kcp_alloc() function will ensure
kcp->kcp_kcd[0] is initialized to NULL. However, there is no harm
is making this explicit for the test case so I'm adding a line to
clearly set it to correct the analysis.
This check was originally added to detect double initializations
of mutex types (which it did find). Unfortunately, Coverity is
right that there is a very small chance we could trigger the
assertion by accident because an uninitialized stack variable
happens to contain the mutex magic. This is particularly unlikely
since we do poison the mutexs when destroyed but still possible.
Therefore I'm simply removing the assertion.
- The previous magazine ageing sceme relied on the on_each_cpu()
function to call spl_magazine_age() on each cpu. It turns out
this could deadlock with do_flush_tlb_all() which also relies
on the IPI based on_each_cpu(). To avoid this problem a per-
magazine delayed work item is created and indepentantly
scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
spl_kmem_cache_t, they were hold overs from previous cleanup.
- struct work_struct work
- struct timer_list timer
- spl_slab_reclaim() 'continue' changed back to 'break' from commit
37db7d8cf9. The original was correct,
I have added a comment to ensure this does not happen again.
- spl_slab_reclaim() further optimized by moving the destructor call
in spl_slab_free() outside the skc->skc_lock. This minimizes the
length of time the spin lock is held, allows the destructors to
be invoked concurrently for different objects, and as a bonus makes
it safe (although unwise) to sleep in the destructors.
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added whichs allows callers of
spl_slab_reclaim() to cap the number of slabs reclaimed.
On Solaris all eligible slabs are always reclaimed, and this
is still the default behavior. However, I suspect that is
not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
slab/object free lists. Soft lockups were observed when
freeing large numbers of vmalloc'd slabs/objets.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
incorrect 'break' to 'continue' to ensure all slabs are
checked.
- spl_cache_age() reworked to avoid a deadlock with
do_flush_tlb_all() which occured because we slept waiting
for completion in spl_cache_age(). To waiting for magazine
reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
back to using for_each_online_cpu() instead of the
spl_on_each_cpu() approach which was of course a bad idea
due to memory allocations which Ricardo pointed out.
Added support for Solaris swapfs_minfree, and swapfs_reserve tunables.
In additional availrmem is now available and return a reasonable value
which is reasonably analogous to the Solaris meaning. On linux we
return the sun of free and inactive pages since these are all easily
reclaimable.
All tunables are available in /proc/sys/kernel/spl/vm/* and they may
need a little adjusting once we observe the real behavior. Some of
the defaults are mapped to similar linux counterparts, others are
straight from the OpenSolaris defaults.
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree. These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.
When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means non of the device special files are created.
Because of this I had added a poor mans udev which cause the SPL
to invoke an upcall and create the basic devices when a minor
is registered. When a minor is unregistered we use the vnode
interface to unlink the special file.
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
if the older 4 argument version of on_each_cpu() should be
used or the new 3 argument version. The retry argument was
dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers. The old way this
worked was to pass a data point when initialized the workqueue.
The new API assumed the work item is embedding in a structure
and we us container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
type checked in the bit operations. This silences the warnings.
- Updated autogen products and splat tests accordingly
- Added slab work queue task which gradually ages and free's slabs
from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
maximum number of objects without create to large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock. We
now unlink the objects and slabs from the cache itself and attach
them to a private work list. The contents of the list are then
subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performace optimizations by minimize the usage of the large
per-cache skc_lock. This includes the addition of KMC_BIT_REAPING
bit mask which is used to prevent concurrent reaping, and to defer
new slab creation when reaping is occuring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
destroyed, this is used to catch any task accessing the cache while
it is being destroyed.
- Add comments to all the functions and additional comments to try
and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
rigerously stress the cache implementation and look for any problems.
This includes correctness and performance tests.
- Updated portable work queue interfaces