When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated. The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#119
The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.
This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.
Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#786
FreeBSD #xxx: Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:
# zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot
properties.
Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:
before:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
NOTE: This patch only appears in FreeBSD. If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.
When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.
The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
torvalds/linux@1dce27c5aa introduced
__clear_close_on_exec() as a replacement for FD_CLR. Further commits
appear to have removed FD_CLR from the Linux source tree. This
causes the following failure:
error: implicit declaration of function '__FD_CLR'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current
__clear_close_on_exec() interface for readability. Then we introduce
an autotools check to determine if __clear_close_on_exec() is available.
If it isn't then we define some compatibility logic which used the older
FD_CLR() interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#124
Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test
as potentially unitialized. In practice this will never occur but
to keep gcc happy we initialize the variable to zero.
Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>
zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.
The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#783
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:
error: implicit declaration of function 'd_alloc_root'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current d_make_root()
interface for readability. Then we introduce an autotools check
to determine if d_make_root() is available. If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#776
The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#774
In the module unload path the vm_file_cache was being destroyed
under a spin lock. Because this operation might sleep it was
possible, although very very unlikely, that this could result
in a deadlock.
This issue was indentified by using a Linux debug kernel and
has been fixed by moving the kmem_cache_destroy() out from under
the spin lock. There is no need to lock this operation here.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#771
Correctly implementating 64-bit division for ARM requires more than
just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols.
They are need to be implemented is such a way that the quotient and
remainder and left in specific registers after the division operation
completes. This change updates the wrapper functions to accomplish
this according to the official ARM Run-time ABI.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#706
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straight
forward and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#109
This test is designed to verify that direct reclaim is functioning as
expected. We allocate a large number of objects thus creating a large
number of slabs. We then apply memory pressure and expect that the
direct reclaim path can easily recover those slabs. The registered
reclaim function will free the objects and the slab shrinker will call
it repeatedly until at least a single slab can be freed.
Note it may not be possible to reclaim every last slab via direct reclaim
without a failure because the shrinker_rwsem may be contended. For this
reason, quickly reclaiming 3/4 of the slabs is considered a success.
This should all be possible within 10 seconds. For reference, on a
system with 2G of memory this test takes roughly 0.2 seconds to run.
It may take longer on larger memory systems but should still easily
complete in the alloted 10 seconds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
To minimize the chance of triggering an OOM during direct reclaim.
The kmem caches have been improved to make a best effort to reclaim
at least one slab when a reclaim function is registered. This helps
avoid the case where objects are released but they are spread over
multiple slabs so no memory gets reclaimed.
Care has been taken to avoid deadlocking if the reclaim function
is unable to make forward progress. Additionally, the reclaim
function may be skipped entirely if there are already free slabs
which can be safely reaped.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made. Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations. However, since we are using none
of that infrastructure we're responsible for incrementing this
count.
If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked. If there was forward
progress additional reclaim will be attempted via the page
cache and registerd shrinker until the allocation succeeds.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#107
When memory pressure triggers direct memory reclaim, a slabs age
and delay should not prevent it from being freed. This patch ensures
these values are ignored, allowing an empty slab to be freed in this
code path no matter the value of its age and delay.
This prevents needless scanning of the partial slabs and has been
observed to significantly reduce the total cpu usage. In addition,
it should allow for snappier reclaim under memory pressure.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#102
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.
Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#101
Leverage the existing generic 64-bit division operations which
were originally implemented for x86 to support ARM. All that is
required is to make the symbols available to the linker with the
expected names.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit ce90208cf9. This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels. This issue was filed as #710.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#701
Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#342
As of the removal of the taskq work list made in commit:
commit 2c02b71b14
Author: Prakash Surya <surya1@llnl.gov>
Date: Mon Dec 5 17:32:48 2011 -0800
Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.
the comment above taskq_wait_check has been incorrect. This change is an
attempt at bringing that description more in line with the current
implementation. Essentially, references to the old task work list had to
be updated to reference the new taskq thread active list.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
23bdb07d4e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.
This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#660
Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem. It would attempt to recognize low memory situtions
before they occured and evict data from the cache. It would also
make assessments about if there was enough free memory to perform
a specific operation.
This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available. To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces. While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory. This has manifested
itself in the past as a spinning arc_reclaim_thread.
This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface. That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback. The Linux VM will call this
function when it needs memory. The ARC is then responsible for
attempting to free the requested amount of memory if possible.
Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.
* Removed the hdr_recl() reclaim callback which is redundant
with the broader arc_shrinker_func().
* Reduced arc_grow_retry to 5 seconds from 60. This is now used
internally in the ARC with arc_no_grow to indicate that direct
reclaim was recently performed. This typically indicates a
rapid change in memory demands which the kswapd threads were
unable to keep ahead of. As long as direct reclaim is happening
once every 5 seconds arc growth will be paused to avoid further
contributing to the existing memory pressure. The more common
indirect reclaim paths will not set arc_no_grow.
* arc_shrink() has been extended to take the number of bytes by
which arc_c should be reduced. This allows for a more granual
reduction of the arc target. Since the kernel provides a
reclaim value to the arc_shrinker_func() this value is used
instead of 1<<arc_shrink_shift.
* arc_reclaim_needed() has been removed. It was used to determine
if the system was under memory pressure and relied extensively
on Solaris specific VM interfaces. In most case the new code
just checks arc_no_grow which indicates that within the last
arc_grow_retry seconds direct memory reclaim occurred.
* arc_memory_throttle() has been updated to always include the
amount of evictable memory (arc and page cache) in its free
space calculations. This space is largely available in most
call paths due to direct memory reclaim.
* The Solaris pageout code was also removed to avoid confusion.
It has always been disabled due to proc_pageout being defined
as NULL in the Linux port.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Expose the zfs_mdcomp_disable variable as a module option. This
can be used to disable compression of zfs meta data which is
enabled by default. This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:
crash> bt 4123
PID: 4123 TASK: ffff88062f8c1500 CPU: 6 COMMAND: "l2arc_feed"
0 [ffff88062511d610] schedule at ffffffff814eeee0
1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
2 [ffff88062511d748] mutex_lock at ffffffff814f041b
3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
19 [ffff88062511dee8] kthread at ffffffff81090a86
20 [ffff88062511df48] kernel_thread at ffffffff8100c14a
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling. This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#652
Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions. They allow
us to cleanly register a callback which is called when a
dataset property is modified.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Long ago I added support to the spl for condition variable names
because I thought they might be needed. It turns out they aren't.
In fact the official Solaris cv_init(9F) man page discourages
their use in the kernel.
cv_init(9F)
Parameters
name - Descriptive string. This is obsolete and should be
NULL. (Non-NULL strings are legal, but they're a
waste of kernel memory.)
Therefore, I'm removing them from the spl to reclaim this memory
and adding an ASSERT() to ensure no new consumers are added which
make use of the name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.
This bug can be reproduced using this shell script:
#!/bin/sh
(
while true; do
zfs create -o mountpoint=legacz tank/bar
zfs destroy tank/bar
done
) &
(
while true; do
mount -t zfs tank/bar /mnt
umount /mnt
done
) &
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#639
Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats. This was harmless but
confusing since memory usage could be over reported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging. When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.
This checking is particularly helpful when adding new dmu consumers
like Lustre. However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.
--enable-debug-dmu-tx - Enable/disable validation of each tx as
--disable-debug-dmu-tx it is constructed. By default validation
is disabled due to performance concerns.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.
SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indicate exactly what version of the SPL has
been loaded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used. This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().
In my option there's a kernel bug here too which allows this to
happen. They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#617
Add support for the .zfs control directory. This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required. The bulk of the core
functionality is now all there with the following limitations.
*) The .zfs/snapshot directory automount support requires a 2.6.37
or newer kernel. The exception is RHEL6.2 which has backported
the d_automount patches.
*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
in the .zfs/snapshot directory works as expected. However,
this functionality is only available to root until zfs
delegations are finished.
* mkdir - create a snapshot
* rmdir - destroy a snapshot
* mv - rename a snapshot
The following issues are known defeciences, but we expect them to
be addressed by future commits.
*) Add automount support for kernels older the 2.6.37. This should
be possible using follow_link() which is what Linux did before.
*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
The majority of the ground work for this is complete. However,
finishing this work will require resolving some lingering
integration issues with the Linux NFS kernel server.
*) The .zfs/shares directory exists but no futher smb functionality
has yet been implemented.
Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#173
Add a standard zio constructor and destructor. Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations. However, in this
case none of the operations moved out of zio_create() were really
very expensive.
This change was principly made as a debug patch (and workaround)
for a zio_destroy() race. The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread. This would completely explain the observed symptoms in
the issue report.
This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor. Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.
Once the real root cause is determined and resolved this change
can be safely reverted. Until then this should help workaround
the issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio. This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered. It
may also have caused problems when issues logical I/Os.
Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix). I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.
This reverts commit 2c6d0b1e07.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604
The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated. This NULL was
accidentally missing which could result in a NULL dereference.
Interestingly this issue only manifested itself on certain 32-bit
systems. Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594
Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle. This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold. Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a standard zio constructor and destructor. Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations. However, in this
case none of the operations moved out of zio_create() were really
very expensive.
This change was principly made as a debug patch (and workaround)
for a zio_destroy() race. The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread. This would completely explain the observed symptoms in
the issue report.
This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor. Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.
Once the real root cause is determined and resolved this change
can be safely reverted. Until then this should help workaround
the issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit(). If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made. This is done to
prevent data from leaking in to the syncing txg. As a result
the original dirty spill block will remain cached.
Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache. It's far safer and simpler just to use a private
handle for xattrs. Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.
These forever dirty buffers can be noticed in the arcstats under
the anon_size. On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase. Eventually, if enough dirty data builds up
your system it will appear to hang. This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.
As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513
Keep counters for the various reasons that a thread may end up
in txg_wait_open() waiting on a new txg. This can be useful
when attempting to determine why a particular workload is
under performing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>