DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.
We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.
Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.
To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().
Note that this only works for regular files, not ZVOLs.
This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.
This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:
> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.
We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.
On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.
We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#392
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.
Detailed rationale:
- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
handle writes of any size, so there's no reason to impose a limit. Let the
upper layer decide.
- max_segments and max_segment_size are set to unlimited. zvol_write() will
copy the requests' contents into a dbuf anyway, so the number and size of
the segments are irrelevant. Let the upper layer decide.
- physical_block_size and io_opt are set to the ZVOL's block size. This
has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
the upper layers that writes smaller than the volume's block size will be
slow.
- The NONROT flag is set to indicate this isn't a rotational device.
Although the backing zpool might be composed of rotational devices, the
resulting ZVOL often doesn't exhibit the same behavior due to the COW
mechanisms used by ZFS. Setting this flag will prevent upper layers from
making useless decisions (such as reordering writes) based on incorrect
assumptions about the behavior of the ZVOL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:
WRITE: A normal async write. Device will be plugged.
WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down
the hint that someone will be waiting on this IO
shortly.
WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on
non-volatile media on completion.
In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.
The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.
Also, we must inform the block layer that we actually support FLUSH and FUA.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#374.
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check
to detect the API change and then conditionally define the expected
interface. In either case we are only interested in the zfs_sb_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#549
Add the bare minimum functionality to support dynamic kstats. A
complete kstat implementation should be done as part of issue #84.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s. This patch clarifies things
by cleanly breaking the two subsystem apart. The result of this
is the following behavior.
--enable-debug - Enable/disable code wrapped in ASSERT()s.
--disable-debug ASSERT()s are used to check invariants and
are never required for correct operation.
They are disabled by default because they
may impact performance.
--enable-debug-log - Enable/disable the debug log infrastructure.
--disable-debug-log This infrastructure allows the spl code and
its consumer to log messages to an in-kernel
log. The granularity of the logging can be
controlled by a debug mask. By default the
mask disables most debug messages resulting
in a negligible performance impact. Because
of this the debug log is enabled by default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Historically the internal zfs debug infrastructure has been
scattered throughout the code. Since we expect to start making
more use of this code this patch performs some cleanup.
* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
files. This includes moving the zfs_flags and zfs_recover
variables, plus moving the zfs_panic_recover() function.
* Remove the existing unused functionality in zfs_debug.c and
replace it with code which correctly utilized the spl logging
infrastructure.
* Remove the __dprintf() function from zfs_ioctl.c. This is
dead code, the dprintf() functionality in the kernel relies
on the spl log support.
* Remove dprintf() from hdr_recl(). This wasn't particularly
useful and was missing the required format specifier anyway.
* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When using zfs to back a Lustre filesystem it's advantageous to
to store a fid with the object id in the directory zap. The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.
This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id. The zpl
code will just ignore additional entries. This allows the ZoL
count to mount datasets which are being used as Lustre server
backends.
Once the upstream feature flags support is merged in this change
should be updated to a read-only feature. Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export the zfs_attr_table symbol so it may be used by non-zpl
consumers which are still interested in writing a zpl compatible
dataset (e.g. Lustre).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched. The lock hold time
is reduced by the following changes:
1) Use exclusive threads in the work_waitq
When a single work item is dispatched we only need to wake a single
thread to service it. The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.
2) Conditionally add/remove threads from work waitq
Taskq threads need only add themselves to the work wait queue if
there are no pending work items.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
This reverts commit ec2b41049f.
A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there is no pending work items,
leading to deadlock:
1. taksq thread enables interrupts
2. dispatcher thread runs, queues work item, call wake_up()
3. taskq thread runs, adds self to waitq, sleeps
This could easily happen if an interrupt for an IO completion was
outstanding at the point where the taskq thread reenables interrupts,
just before the call to add_wait_queue_exclusive(). The handler would
run immediately within the race window.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched. The lock hold time
is reduced by the following changes:
1) Use exclusive threads in the work_waitq
When a single work item is dispatched we only need to wake a single
thread to service it. The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.
2) Conditionally add/remove threads from work waitq outside of tq_lock
Taskq threads need only add themselves to the work wait queue if there
are no pending work items. Furthermore, the add and remove function
calls can be made outside of the taskq lock since the wait queues are
protected from concurrent access by their own spinlocks.
3) Call wake_up() outside of tq->tq_lock
Again, the wait queues are protected by their own spinlock, so the
dispatcher functions can drop tq->tq_lock before calling wake_up().
A new splat test taskq:contention was added in a prior commit to measure
the impact of these changes. The following table summarizes the
results using data from the kernel lock profiler.
tq_lock time %diff Wall clock (s) %diff
original: 39117614.10 0 41.72 0
exclusive threads: 31871483.61 18.5 34.2 18.0
unlocked add/rm waitq: 13794303.90 64.7 16.17 61.2
unlocked wake_up(): 1589172.08 95.9 16.61 60.2
Each row reflects the average result over 5 test runs.
/proc/lock_stats was zeroed out before and collected after each run.
Column 1 is the cumulative hold time in microseconds for tq->tq_lock.
The tests are cumulative; each row reflects the code changes of the
previous rows. %diff is calculated with respect to "original" as
100*(orig-new)/orig.
Although calling wake_up() outside of the taskq lock dramatically
reduced the taskq lock hold time, the test actually took slightly more
wall clock time. This is because the point of contention shifts from
the taskq lock to the wait queue lock. But the change still seems
worthwhile since it removes our taskq implementation as a bottleneck,
assuming the small increase in wall clock time to be statistical
noise.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#32
Add a test designed to generate contention on the taskq spinlock by
using a large number of threads (100) to perform a large number (131072)
of trivial work items from a single queue. This simulates conditions
that may occur with the zio free taskq when a 1TB file is removed from a
ZFS filesystem, for example. This test should always pass. Its purpose
is to provide a benchmark to easily measure the effectiveness of taskq
optimizations using statistics from the kernel lock profiler.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
The vdev_is_bootable() restrictions are no longer necessary
with recent GRUB2 code. FreeBSD has implemented the same
change, except that I moved the Solaris comment to be inside
the #ifdef __sun__ block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #317
Apply the same fix to SPL that was applied to ZFS earlier at:
zfsonlinux/zfs@d433c20651
Additionally quote @LINUX_SYMBOLS@ because it is a null substitution
in this configuration, which results in a `[ -f ]` expression that
incorrectly evaluates to true.
# ./configure --with-config=user
# make distclean
Making distclean in module
make[1]: Entering directory `/spl/module'
make -C SUBDIRS=`pwd` clean
make: Entering an unknown directory
make: *** SUBDIRS=/spl/module: No such file or directory. Stop.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As described in Issue #458 and #258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu signficantly reduces contention on the taskq spinlock and
improves throughput.
Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue. It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue. When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it. They then all pile into the work wait queue to wait for
new work items. So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.
Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem. It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread. In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling. So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Issue #458
Make the indenting in the zpl_xattr.c file consistent with the Sun
coding standard by removing soft tabs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.
This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.
In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigerous by passing the expected
types. This is done to ensure we always properly the detect the
correct form for the security_inode_init_security() API.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#516
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block. Prior
to this change there was one shared global shrinker.
The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded. This would cause the VFS to drop
references on a fraction of the dentries in the dcache. The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit. Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.
This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit. The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning. Thus we
can minimize any impact on the caching of other filesystems.
In the context of making this change several other important
issues related to managing the ARC were addressed, they include:
* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback. The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code. This callback can also be used by
other ARC consumers such as Lustre.
arc_add_prune_callback()
arc_remove_prune_callback()
* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity. The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.
* Less aggressively invoke the prune callback. We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes. It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.
* More promptly manage exceeding the arc_meta_limit. When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.
* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache. Remember
this will only occur when the ARC has no other choice. If it
can evict buffers safely without invoking the prune callback
it will.
* This change is also expected to resolve the unexpect collapses
of the ARC cache. This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink(). This effectively shrunk the entire cache
when really we just needed to reclaim meta data.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#466Closes#292
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check(). In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers. Therefore, if either
of these functions are exported invalidate_inodes() can be
safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#58
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit
SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170
Use the new set_nlink() kernel function instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.
In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears it's tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).
In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has over written the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.
Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and
determine if the entry needs to be passed down to the task_done
function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#71
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.
Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.
Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).
To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (as has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
It has been observed that some of the hottest locks are those
of the zio taskqs. Contention on these locks can limit the
rate at which zios are dispatched which limits performance.
This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock. This has the effect
of improving system performance.
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/734
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#482
The splat-taskq test functions were slightly modified to exercise
the new taskq interface in addition to the old interface. If the
old interface passes each of its tests, the new interface is
exercised. Both sub tests (old interface and new interface) must
pass for each test as a whole to pass.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#65
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit. It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq. This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.
commit 5aeb94743e3be0c51e86f73096334611ae3a058e
Author: Garrett D'Amore <garrett@nexenta.com>
Date: Wed Jul 27 07:13:44 2011 -0700
734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
Added another splat taskq test to ensure tasks can be recursively
submitted to a single task queue without issue. When the
taskq_dispatch_prealloc() interface is introduced, this use case
can potentially cause a deadlock if a taskq_ent_t is dispatched
while its tqent_list field is not empty. This _should_ never be
a problem with the existing taskq_dispatch() interface.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added and removed to this list relatively easily.
The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, it's tqent_list field
will be empty).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
The spl_task structure was renamed to taskq_ent, and all of
its fields were renamed to have a prefix of 'tqent' rather
than 't'. This was to align with the naming convention which
the ZFS code assumes. Previously these fields were private
so the name never mattered.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
This change adds the neglected SPLAT_TEST_FINI call for the
SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_*
tests.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#64
The zvol_major and zvol_threads module options were being created
with 0 permission bits. This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`. This patch fixes the issue by updating
the permission bits to 0444. For the moment these options must
be read-only because they are used during module initialization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.
To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB. The rational for
this is as follows:
* Desktop Systems (less than 8GB of memory)
Limiting the ARC to 1/2 of memory is desirable for desktop
systems which have highly dynamic memory requirements. For
example, launching your web browser can suddenly result in a
demand for several gigabytes of memory. This memory must be
reclaimed from the ARC cache which can take some time. The
user will experience this reclaim time as a sluggish system
with poor interactive performance. Thus in this case it is
preferable to leave the memory as free and available for
immediate use.
* Server Systems (more than 8GB of memory)
Using all but 4GB of memory for the ARC is preferable for
server systems. These systems often run with minimal user
interaction and have long running daemons with relatively
stable memory demands. These systems will benefit most by
having as much data cached in memory as possible.
These values should work well for most configurations. However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC. This can still be accomplished
by setting the 'zfs_arc_max' module option.
Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation. Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory. How much more memory get's
consumed will be determined by how badly fragmented the slabs are.
In the long term this can be mitigated by slab defragmentation code
which was OpenSolaris solution. Or preferably, using the page cache
to back the ARC under Linux would be even better. See issue #75
for the benefits of more tightly integrating with the page cache.
This change also fixes a issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory. The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken. This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call. Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.
The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes. This was done simply to be consistent with the Solaris
behavior.
Upon futher reflection I believe this should be allowed under
Linux. The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system. This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#272
The current ZFS implementation stores xattrs on disk using a hidden
directory. In this directory a file name represents the xattr name
and the file contexts are the xattr binary data. This approach is
very flexible and allows for arbitrarily large xattrs. However,
it also suffers from a significant performance penalty. Accessing
a single xattr can requires up to three disk seeks.
1) Lookup the dnode object.
2) Lookup the dnodes's xattr directory object.
3) Lookup the xattr object in the directory.
To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk. When
the xattr is to large to store in the inode then a single external
block is allocated for them. In practice most xattrs are small
and this approach works well.
The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization. When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block. If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.
This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below). When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.
However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations. If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.
----------------------------------------------------------------------
Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
| setxattr | getxattr
bytes | ext4 xfs zfs-dir zfs-sa | ext4 xfs zfs-dir zfs-sa
------+--------------------------------+------------------------------
1 | 2.33 31.88 21.50 4.57 | 2.35 2.64 6.29 2.43
32 | 2.79 30.68 21.98 4.60 | 2.44 2.59 6.78 2.48
256 | 3.25 31.99 21.36 5.92 | 2.32 2.71 6.22 3.14
1024 | 3.30 32.61 22.83 8.45 | 2.40 2.79 6.24 3.27
4096 | 3.57 317.46 22.52 10.73 | 2.78 28.62 6.90 3.94
16384 | n/a 2342.39 34.30 19.20 | n/a 45.44 145.90 7.55
65536 | n/a 2941.39 128.15 131.32* | n/a 141.92 256.85 262.12*
Legend:
* ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir - Directory based xattrs only.
* zfs-sa - Prefer SAs but spill in to directories as needed, a
trailing * indicates overflow in to directories occured.
NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
The splat_taskq_test4_common function was incorrectly referencing
the splat_taskq-test13_func symbol, when it meant to be using the
splat_taskq_test4_func symbol.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#61
While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size. Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.
Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblock in the vdev label
ring buffer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#425
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict. Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#56
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback. In
addition imutex is no longer held by the caller and the callback
is responsible for taking the lock if required.
This commit updates the code to provide a zpl_fsync() function
for the updated API. Implementations for the previous two APIs
are also maintained for compatibility.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#445
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed. This same task is now accomplished
more cleanly with per super block shrinkers. This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.
This support has always been entirely optional. So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.
The downside of this change is that it will prevent the zfs arc
meta data limts from being enforced. However, the current zfs
implementation in this regard is already flawed and needs to
be reworked. If the arc needs to enfore a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules. As of Linux 3.1 it is now longer easily
available. To handle this case the spl will now dynamically
look up address of the missing symbol at module load time.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Fix an unlikely failure cause in zfs_sb_create() which could
leave the dataset owned on error and thus unavailable until
after a reboot. Disown the dataset if SA are expected but
are in fact missing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound. A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.
It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory. By default a kmem_cache will
dynamically determine the type of memory used based on the object
size. For large objects virtual memory is usually preferable
and for small object physical memory is a better choice. See
the spl_slab_alloc() function for a longer discussion on this.
However, there is a certain amount of gray area when defining a
'large' object. For the following caches it turns out they were
just over the line:
* dnode_cache
* zio_cache
* zio_link_cache
* zio_buf_512_cache
* zfs_data_buf_512_cache
Now because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized.
We can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses. This entirely avoids the
need to serialize on the virtual address space lock.
As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block of this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure. It may have already been set when entering
zpl_putpage() in which case it must remain set on exit. In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim. By clearing
it we allow the following NULL deref to potentially occur.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #287
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#431