Add the bare minimum functionality to support dynamic kstats. A
complete kstat implementation should be done as part of issue #84.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s. This patch clarifies things
by cleanly breaking the two subsystem apart. The result of this
is the following behavior.
--enable-debug - Enable/disable code wrapped in ASSERT()s.
--disable-debug ASSERT()s are used to check invariants and
are never required for correct operation.
They are disabled by default because they
may impact performance.
--enable-debug-log - Enable/disable the debug log infrastructure.
--disable-debug-log This infrastructure allows the spl code and
its consumer to log messages to an in-kernel
log. The granularity of the logging can be
controlled by a debug mask. By default the
mask disables most debug messages resulting
in a negligible performance impact. Because
of this the debug log is enabled by default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Historically the internal zfs debug infrastructure has been
scattered throughout the code. Since we expect to start making
more use of this code this patch performs some cleanup.
* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
files. This includes moving the zfs_flags and zfs_recover
variables, plus moving the zfs_panic_recover() function.
* Remove the existing unused functionality in zfs_debug.c and
replace it with code which correctly utilized the spl logging
infrastructure.
* Remove the __dprintf() function from zfs_ioctl.c. This is
dead code, the dprintf() functionality in the kernel relies
on the spl log support.
* Remove dprintf() from hdr_recl(). This wasn't particularly
useful and was missing the required format specifier anyway.
* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When using zfs to back a Lustre filesystem it's advantageous to
to store a fid with the object id in the directory zap. The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.
This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id. The zpl
code will just ignore additional entries. This allows the ZoL
count to mount datasets which are being used as Lustre server
backends.
Once the upstream feature flags support is merged in this change
should be updated to a read-only feature. Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export the zfs_attr_table symbol so it may be used by non-zpl
consumers which are still interested in writing a zpl compatible
dataset (e.g. Lustre).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched. The lock hold time
is reduced by the following changes:
1) Use exclusive threads in the work_waitq
When a single work item is dispatched we only need to wake a single
thread to service it. The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.
2) Conditionally add/remove threads from work waitq
Taskq threads need only add themselves to the work wait queue if
there are no pending work items.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
This reverts commit ec2b41049f.
A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there is no pending work items,
leading to deadlock:
1. taksq thread enables interrupts
2. dispatcher thread runs, queues work item, call wake_up()
3. taskq thread runs, adds self to waitq, sleeps
This could easily happen if an interrupt for an IO completion was
outstanding at the point where the taskq thread reenables interrupts,
just before the call to add_wait_queue_exclusive(). The handler would
run immediately within the race window.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
Since the zpios and potentially other ZFS tests use the
DMU_OST_OTHER type to label their datasets, the zpool and
zfs commands should gracefully handle this type when it is
encountered. This patch modifies the commands' behavior
to ignore any datasets with a dds_type of DMU_OST_OTHER.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#536
This change updates the rpm spec files to have strictly correct
package dependencies. That means a few things:
* The zfs-modules package is now tied to a specific build of
the spl-modules packages based on the kernel version. This
ensures that the correct spl-modules packages will always get
installed and not just the newest.
* The zfs package now requires both the zfs-modules and spl
packages. Thus a 'yum install zfs' will pull in the minimal
set of packages required for a functional system.
* The zfs-devel packages now require the zfs package to be
installed which is normal behavior for -devel packages.
* Remove the redundant distribution release extension. This
is already added once because it is part of the kernel package
release name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the original build system code was added the release
component was accidentally omited from the development header
install path. This patch adds the missing path component so
it's always clear exactly what release your compiling against.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change updates the rpm spec files to have strictly correct
package dependencies. That means a few things:
* Add a dependency to the spl package for the spl-modules package.
This ensures that when running 'yum install spl' that newest
version of the spl-modules will be installed.
* Remove the redundant distribution release extension. This
is already added once because it is part of the kernel package
release name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the original build system code was added the release
component was accidentally omited from the development header
install path. This patch adds the missing path component so
it's always clear exactly what release your compiling against.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit zfsonlinux/zfs@57a4eddc4d
allows the bootfs property to be set on any pool, but does not
accommodate subsequent vdev changes. For example:
# zpool replace rpool /dev/sda /dev/sdb
operation not supported on this type of pool
property 'bootfs' is not supported on EFI labeled devices
For non-Solaris builds, disable the check that emits this error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched. The lock hold time
is reduced by the following changes:
1) Use exclusive threads in the work_waitq
When a single work item is dispatched we only need to wake a single
thread to service it. The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.
2) Conditionally add/remove threads from work waitq outside of tq_lock
Taskq threads need only add themselves to the work wait queue if there
are no pending work items. Furthermore, the add and remove function
calls can be made outside of the taskq lock since the wait queues are
protected from concurrent access by their own spinlocks.
3) Call wake_up() outside of tq->tq_lock
Again, the wait queues are protected by their own spinlock, so the
dispatcher functions can drop tq->tq_lock before calling wake_up().
A new splat test taskq:contention was added in a prior commit to measure
the impact of these changes. The following table summarizes the
results using data from the kernel lock profiler.
tq_lock time %diff Wall clock (s) %diff
original: 39117614.10 0 41.72 0
exclusive threads: 31871483.61 18.5 34.2 18.0
unlocked add/rm waitq: 13794303.90 64.7 16.17 61.2
unlocked wake_up(): 1589172.08 95.9 16.61 60.2
Each row reflects the average result over 5 test runs.
/proc/lock_stats was zeroed out before and collected after each run.
Column 1 is the cumulative hold time in microseconds for tq->tq_lock.
The tests are cumulative; each row reflects the code changes of the
previous rows. %diff is calculated with respect to "original" as
100*(orig-new)/orig.
Although calling wake_up() outside of the taskq lock dramatically
reduced the taskq lock hold time, the test actually took slightly more
wall clock time. This is because the point of contention shifts from
the taskq lock to the wait queue lock. But the change still seems
worthwhile since it removes our taskq implementation as a bottleneck,
assuming the small increase in wall clock time to be statistical
noise.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#32
Add a test designed to generate contention on the taskq spinlock by
using a large number of threads (100) to perform a large number (131072)
of trivial work items from a single queue. This simulates conditions
that may occur with the zio free taskq when a 1TB file is removed from a
ZFS filesystem, for example. This test should always pass. Its purpose
is to provide a benchmark to easily measure the effectiveness of taskq
optimizations using statistics from the kernel lock profiler.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
These libraries, which are an artifact of the ZoL development
process, conflict with packages that are already in distribution:
* libspl: SPL Programming Language
* libavl: AVL for Linux
* libefi: GRUB
And these libraries are potential conflicts:
* libshare: the Linux Mount Manager
* libunicode: Perl and Python
Recompose these five ZoL components into the four libraries that are
conventionally provided by Solaris and FreeBSD systems:
+ libnvpair
+ libuutil
+ libzpool
+ libzfs
This change resolves the name conflict, makes ZoL more compatible
with existing software that uses autotools to detect ZFS, and allows
pkg-zfs to better reflect the official Debian kFreeBSD packaging.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #430
The vdev_is_bootable() restrictions are no longer necessary
with recent GRUB2 code. FreeBSD has implemented the same
change, except that I moved the Solaris comment to be inside
the #ifdef __sun__ block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #317
Apply the same fix to SPL that was applied to ZFS earlier at:
zfsonlinux/zfs@d433c20651
Additionally quote @LINUX_SYMBOLS@ because it is a null substitution
in this configuration, which results in a `[ -f ]` expression that
incorrectly evaluates to true.
# ./configure --with-config=user
# make distclean
Making distclean in module
make[1]: Entering directory `/spl/module'
make -C SUBDIRS=`pwd` clean
make: Entering an unknown directory
make: *** SUBDIRS=/spl/module: No such file or directory. Stop.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As described in Issue #458 and #258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu signficantly reduces contention on the taskq spinlock and
improves throughput.
Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue. It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue. When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it. They then all pile into the work wait queue to wait for
new work items. So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.
Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem. It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread. In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling. So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Issue #458
Originally, the per-file link limit was set to 65536 because the
exact Linux VFS limit was unclear. Internally ZFS is able to
support 64-bit link counts. After a more careful investigation
the limit can be safely raised to 2^31-1.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#514
Unfortunately, Arch's package manager `pacman` shares it's name with a
popular arcade video game. Thus, in order to refrain from executing the
video game when we mean to execute the package manager, SPL_AC_PACMAN is
now only run when $VENDOR is determined to be "arch".
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#517
Unfortunately, Arch's package manager `pacman` shares it's name with a
popular arcade video game. Thus, in order to refrain from executing the
video game when we mean to execute the package manager, ZFS_AC_PACMAN is
now only run when $VENDOR is determined to be "arch".
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#517
Linux supports mounting over non-empty directories by default.
In Solaris this is not the case and -O option is required for
zfs mount to mount a zfs filesystem over a non-empty directory.
For compatibility, I've added support for -O option to mount
zfs filesystems over non-empty directories if the user wants
to, just like in Solaris.
I've defined MS_OVERLAY to record it in the flags variable if
the -O option is supplied. The flags variable passes through
a few functions and its checked before performing the empty
directory check in zfs_mount function. If -O is given, the
check is not performed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#473
Make the indenting in the zpl_xattr.c file consistent with the Sun
coding standard by removing soft tabs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.
This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.
In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigerous by passing the expected
types. This is done to ensure we always properly the detect the
correct form for the security_inode_init_security() API.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#516
The wait_lock member of the rw_semaphore struct became a raw_spinlock_t
in Linux 3.2 at torvalds/linux@ddb6c9b58a.
Wrap spin_lock_* function calls in a new spl_rwsem_* interface to
ensure type safety if raw_spinlock_t becomes architecture specific,
and to satisfy these compiler warnings:
warning: passing argument 1 of ‘spinlock_check’
from incompatible pointer type [enabled by default]
note: expected ‘struct spinlock_t *’
but argument is of type ‘struct raw_spinlock_t *’
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #76Closes: zfsonlinux/zfs#463
Some implementations of `awk` incorrectly parse the \< and \> regex
symbols, so use a `while read` loop and regular globbing instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #259
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block. Prior
to this change there was one shared global shrinker.
The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded. This would cause the VFS to drop
references on a fraction of the dentries in the dcache. The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit. Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.
This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit. The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning. Thus we
can minimize any impact on the caching of other filesystems.
In the context of making this change several other important
issues related to managing the ARC were addressed, they include:
* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback. The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code. This callback can also be used by
other ARC consumers such as Lustre.
arc_add_prune_callback()
arc_remove_prune_callback()
* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity. The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.
* Less aggressively invoke the prune callback. We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes. It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.
* More promptly manage exceeding the arc_meta_limit. When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.
* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache. Remember
this will only occur when the ARC has no other choice. If it
can evict buffers safely without invoking the prune callback
it will.
* This change is also expected to resolve the unexpect collapses
of the ARC cache. This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink(). This effectively shrunk the entire cache
when really we just needed to reclaim meta data.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#466Closes#292
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check(). In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers. Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#58
If the lsb-release package is installed on an Arch Linux distribution,
the configure step will incorrectly detect the running distribution as
Ubuntu. This is a result of both distributions providing an
/etc/lsb-release file, and the Ubuntu VENDOR check being performed
first.
Since the Arch Linux test check's for a file more specific to the Arch
Linux distribution, moving Arch Linux's VENDOR check above Unbuntu's
check provides a quick and easy solution.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
If the lsb-release package is installed on an Arch Linux distribution,
the configure step will incorrectly detect the running distribution as
Ubuntu. This is a result of both distributions providing an
/etc/lsb-release file, and the Ubuntu VENDOR check being performed
first.
Since the Arch Linux test check's for a file more specific to the Arch
Linux distribution, moving Arch Linux's VENDOR check above Unbuntu's
check provides a quick and easy solution.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#72
Regenerating the autotools configuration on Debian and Ubuntu systems
causes compilation to fail with this error message:
cmd/mount_zfs/../../cmd/mount_zfs/mount_zfs.c:403:
undefined reference to `is_selinux_enabled'
In the automake template, set "mount_zfs_LDFLAGS = ... $(LIBSELINUX)"
so that the /sbin/mount.zfs utility is linked to libselinux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit
SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170
Use the new set_nlink() kernel function instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.
In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).
In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.
Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#71
ZoL and all Solaris derivatives allow pool names to contain the colon
and space characters. Update the man page to reflect current behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #438
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.
Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.
Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).
To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on an Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the spl userland utilties and
another for the spl kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be built using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #68
It has been observed that some of the hottest locks are those
of the zio taskqs. Contention on these locks can limit the
rate at which zios are dispatched which limits performance.
This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock. This has the effect
of improving system performance.
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/734
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#482
The splat-taskq test functions were slightly modified to exercise
the new taskq interface in addition to the old interface. If the
old interface passes each of its tests, the new interface is
exercised. Both sub tests (old interface and new interface) must
pass for each test as a whole to pass.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#65
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit. It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq. This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.
commit 5aeb94743e3be0c51e86f73096334611ae3a058e
Author: Garrett D'Amore <garrett@nexenta.com>
Date: Wed Jul 27 07:13:44 2011 -0700
734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
Added another splat taskq test to ensure tasks can be recursively
submitted to a single task queue without issue. When the
taskq_dispatch_prealloc() interface is introduced, this use case
can potentially cause a deadlock if a taskq_ent_t is dispatched
while its tqent_list field is not empty. This _should_ never be
a problem with the existing taskq_dispatch() interface.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.
The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
The spl_task structure was renamed to taskq_ent, and all of
its fields were renamed to have a prefix of 'tqent' rather
than 't'. This was to align with the naming convention which
the ZFS code assumes. Previously these fields were private
so the name never mattered.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
This change adds the neglected SPLAT_TEST_FINI call for the
SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_*
tests.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#64
A call site of the MUTEX macro had incorrectly placed its closing
parenthesis, causing two parameters to be passed rather than one. This
change moves the misplaced parenthesis to fix the typographical error.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#70
ZFS and 64-bit linux are perfectly capable of dealing with 64-bit
timestamps, but ZFS deliberately prevents setting them. Adjust
the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit
values and instead use the correct values for your kernel build.
This effectively allows 64-bit timestamps on 64-bit systems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #487
The zvol_major and zvol_threads module options were being created
with 0 permission bits. This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`. This patch fixes the issue by updating
the permission bits to 0444. For the moment these options must
be read-only because they are used during module initialization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.
To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB. The rational for
this is as follows:
* Desktop Systems (less than 8GB of memory)
Limiting the ARC to 1/2 of memory is desirable for desktop
systems which have highly dynamic memory requirements. For
example, launching your web browser can suddenly result in a
demand for several gigabytes of memory. This memory must be
reclaimed from the ARC cache which can take some time. The
user will experience this reclaim time as a sluggish system
with poor interactive performance. Thus in this case it is
preferable to leave the memory as free and available for
immediate use.
* Server Systems (more than 8GB of memory)
Using all but 4GB of memory for the ARC is preferable for
server systems. These systems often run with minimal user
interaction and have long running daemons with relatively
stable memory demands. These systems will benefit most by
having as much data cached in memory as possible.
These values should work well for most configurations. However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC. This can still be accomplished
by setting the 'zfs_arc_max' module option.
Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation. Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory. How much more memory get's
consumed will be determined by how badly fragmented the slabs are.
In the long term this can be mitigated by slab defragmentation code
which was OpenSolaris solution. Or preferably, using the page cache
to back the ARC under Linux would be even better. See issue #75
for the benefits of more tightly integrating with the page cache.
This change also fixes a issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory. The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken. This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75