We need truncate and remove be in the same tx when doing zfs_rmnode on xattr
dir. Otherwise, if we truncate and crash, we'll end up with inconsistent zap
object on the delete queue. We do this by skipping dmu_free_long_range and let
zfs_znode_delete to do the work.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4114
Issue #4052
Issue #4006
Issue #3018
Issue #2861
During zfs_rmnode on a xattr dir, if the system crash just after
dmu_free_long_range, we would get empty xattr dir in delete queue. This would
cause blkid=0 be passed into zap_get_leaf_byblk when doing zfs_purgedir during
mount, and would try to do rw_enter on a wrong structure and cause system
lockup.
We fix this by returning ENOENT when blkid is zero in zap_get_leaf_byblk.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4114Closes#4052Closes#4006Closes#3018Closes#2861
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock. Resolve this by taking the z_teardown_lock in
all registered xattr callbacks prior to taking the z_xattr_lock.
This ensures the locks are always taken is the same order thus
preventing a deadlock. Note the z_teardown_lock is taken again
in zfs_lookup() and this is safe because the z_teardown lock is
a re-entrant read reader/writer lock.
* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
__zpl_xattr_get
zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro
* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
zfs_resume_fs
zfs_rezget -> Takes zp->z_xattr_lock
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#3943Closes#3969Closes#4121
Commit efc412b updated spa_config_write() for Linux 4.2 kernels to
truncate and overwrite rather than rename the cache file. This is
the correct fix but it should have only been applied for the kernel
build. In user space rename(2) is needed because ztest depends on
the cache file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4129
The do_div() macro expects unsigned types and this is detected in
powerpc implementation of do_div().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#516
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.
This patch uses a mutex type defined in SPL, MUTEX_NOLOCKDEP, to mark
such mutexes when they are initialized. This mutex type causes
attempts to take or release those locks to be wrapped in lockdep_off()
and lockdep_on() calls to silence the dependency checker and allow the
use of lock_stats to examine contention.
For RW locks, it uses an analogous lock type, RW_NOLOCKDEP.
The goal is that these locks are ultimately changed back to type
MUTEX_DEFAULT or RW_DEFAULT, after the locks are annotated to reflect
their relationship (e.g. z_name_lock below) or any real problem with the
lock dependencies are fixed.
Some of the affected locks are:
tc_open_lock:
=============
This is an array of locks, all with same name, which txg_quiesce must
take all of in order to move txg to next state. All default to the same
lockdep class, and so to lockdep appears recursive.
zp->z_name_lock:
================
In zfs_rmdir,
dzp = znode for the directory (input to zfs_dirent_lock)
zp = znode for the entry being removed (output of zfs_dirent_lock)
zfs_rmdir()->zfs_dirent_lock() takes z_name_lock in dzp
zfs_rmdir() takes z_name_lock in zp
Since both dzp and zp are type znode_t, the locks have the same default
class, and lockdep considers it a possible recursive lock attempt.
l->l_rwlock:
============
zap_expand_leaf() sometimes creates two new zap leaf structures, via
these call paths:
zap_deref_leaf()->zap_get_leaf_byblk()->zap_leaf_open()
zap_expand_leaf()->zap_create_leaf()->zap_expand_leaf()->zap_create_leaf()
Because both zap_leaf_open() and zap_create_leaf() initialize
l->l_rwlock in their (separate) leaf structures, the lockdep class is
the same, and the linux kernel believes these might both be the same
lock, and emits a possible recursive lock warning.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3895
Both lock types were introduced in SPL to allow some locks to be
taken/released with linux lockdep turned off. See SPL commit for
details.
Add the new lock types to zfs_context.h to allow user space compilation.
Depends on SPL commit 692ae8d
SPL pull request refs/pull/480/head
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3895
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4110
Original-patch-by: @jgoerzen
Signed-off-by: Benjamin Albrecht <git@albrecht.io>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/pkg-zfs#102
Closes#4029
Correct some misspelled words and grammatical errors, and remove
trailing white space in the man pages.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4115
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.
Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline. This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread. However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.
To resolve this issue zio_buf_alloc_flags() is introduced which
allocation flags to be passed. This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer. Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly. Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.
* z_wr_iss
zio_vdev_io_start
vdev_queue_io -> Takes vq->vq_lock
vdev_queue_io_to_issue
vdev_queue_aggregate
zio_buf_alloc -> Waiting on spl_kmem_cache process
* z_wr_int
zio_vdev_io_done
vdev_queue_io_done
mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss
* txg_sync
spa_sync
dsl_pool_sync
zio_wait -> Waiting on zio being handled by z_wr_int
* spl_kmem_cache
spl_cache_grow_work
kv_alloc
spl_vmalloc
...
evict
zpl_evict_inode
zfs_inactive
dmu_tx_wait
txg_wait_open -> Waiting on txg_sync
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3808Closes#3867
For earlier versions of the kernel with memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and cause deadlock.
Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.
This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#515
Issue zfsonlinux/zfs#4111
There exists a lock inversion between the z_xattr_lock and the
z_teardown_lock. Detect this case and return EBUSY so zfs_resume_fs()
will mark the inode stale and it can be safely revalidated on next
access.
* process-1
zpl_xattr_get -> Takes zp->z_xattr_lock
__zpl_xattr_get
zfs_lookup -> Takes zsb->z_teardown_lock in ZFS_ENTER macro
* process-2
zfs_ioc_recv -> Takes zsb->z_teardown_lock in zfs_suspend_fs()
zfs_resume_fs
zfs_rezget -> Takes zp->z_xattr_lock
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#3969
This patch provides 2 new kstats to display task queues:
/proc/spl/taskqs-all - Display all task queues
/proc/spl/taskqs - Display only "active" task queues
A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.
If the task queue has running threads, displays each thread function's
address (symbolically, if possibly) and its argument.
If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and arguemnt.
If the task queue has any waiters, displays each waiting task's pid.
Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
Since uio now supports bvec, we can convert bio into uio and reuse
dmu_{read,write}_uio. This way, we can remove some duplicate code.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4078
Userspace can freely pass in whatever iovec it feels like, and it's perfectly
legal to pass an iovec which contains a zero length segment. In the current
implementation, uio_prefaultpages would touch an out of bound byte in the
"last byte" logic. While this probably wouldn't cause any critical error, we
would like uio_prefaultpages to be able to continue gracefully.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4078
If a bit were cleared in `bp->blk_birth` such that the txg birth
was now lower than any other txg_birth in the deadlist, then there
will be no entry before this in the tree.
This should be impossible but regardless error handling code has
been added for this case. By default this is left as a fatal case
and the blk_birth is logged. However, setting `zfs_recover=1` will
cause the bp to be placed at the start of the deadlist even though
it contains an invalid blk_birth.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4086Closes#4089
Commit 5f6d0b6 was originally added to gracefully handle block
pointers with a damaged logical size. However, it incorrectly
assumed that all passed arc_done_func_t could handle a NULL
arc_buf_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4069Closes#4080
Commit ca0bf58d to address arcs_mtx contention removed column "index"
from the output of kstats/dbuf.
dbufstat.py was not updated to reflect this, which causes it to crash
when run with -bx
This removes "index" from hardcoded lists of columns.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4096
This reverts commit 2026196230.
It is no longer necessary now that we pass -DDEBUG unconditionally.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4095
Illumos unconditionally builds zdb and ztest with -DDEBUG. This helps
catch bugs and eliminates the need for commits like
2026196230, which changed ASSERTs to
VERIFYs. The following files in the illumos tree show this:
usr/src/cmd/zdb/Makefile.com
usr/src/cmd/ztest/Makefile.com
Given the usefulness of having early failure in these tools, we should
do it too.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4095
While exceptionally unlikely to cause a problem the zfs_snapentry_t
hold should be taken before the dispatch to prevent any possibility
of the task being processed before the hold.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
When a concorrent mount finishes just before calling to
zfsctl_snapshot_ismounted, if we return EISDIR, the VFS will return
with EREMOTE. We should instead just return 0, so VFS may retry and
would likely notice the dentry is alreadly mounted. This will be
inline with when usermode helper return EBUSY.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By changing the zfs_snapshot_lock from a mutex to a rw lock the
zfsctl_lookup_objset() function can be allowed to run concurrently.
This should reduce the latency of fh_to_dentry lookups in ZFS
snapshots which are being accessed over NFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
The zfsctl_snapshot_unmount_delay() function must not be called
from zfsctl_lookup_objset() while it is currently holding the
zfs_snapshot_lock. This will result in a deadlock. It is safe
to call zfsctl_snapshot_unmount_delay_impl() directly because the
function already has a reference on the zfs_snapentry_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#3997
There are cases where it's desirable that auto-mounted snapshots
not expire after a fixed duration. They should be unmounted only
when the filesystem they are a snapshot of is unmounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
This reverts commit 61bbbd9a77 because
older versions of autoconf (2.63) do not support the cross-compile
argument to AC_RUN_IFELSE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #507
This patch only addresses the issues identified by the style checker.
It contains no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The flags argument in spin_lock_irqsave is modified out side of spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for them. This
patch removes it and uses local variable for the flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#506
When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking. This is
a false positive.
One such call chain is as follows, when a taskq needs more threads:
taskq_dispatch->taskq_thread_spawn->taskq_dispatch
The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads. The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock. Without subclassing, lockdep believes these
could potentially be the same lock and complains. A similar case occurs
when taskq_dispatch() then calls task_alloc().
This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:
subclass taskq
TQ_LOCK_DYNAMIC dynamic_taskq
TQ_LOCK_GENERAL any other
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
spl_inode_{lock,unlock} are triggering possible recursive locking
warnings from lockdep. The warning is a false positive.
The lock is used to protect a parent directory during delete/add
operations, used in zfs when writing/removing the cache file. The inode
lock is taken on both the parent inode and the file inode.
VFS provides an enum to subclass the lock. This patch changes the
spin_lock call to _nested version and uses the provided enum.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.
When lockdep detects these conditions, it disables further lock analysis
for all locks. This causes /proc/lock_stats not to reflect full
information about lock contention, even in locks without dependency
issues.
This commit creates a new type of mutex, MUTEX_NOLOCKDEP. This mutex
type causes subsequent attempts to take or release those locks to be
wrapped in lockdep_off() and lockdep_on().
This commit also creates an RW_NOLOCKDEP type analagous to
MUTEX_NOLOCKDEP.
MUTEX_NOLOCKDEP and RW_NOLOCKDEP are also defined in zfs, in a commit to
that repo, for userspace builds.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
The SPL fails to build with some "Configured" kernels (ex. openSUSE
xen Kernel) this change should make same binaries with C compiler
optimization.
Signed-off-by: zgock <zgock@nuc.base.zgock-lab.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#510
For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 of _LP64 would be defined. Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither are defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4048
For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 of _LP64 would be defined. Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither are defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Issue zfsonlinux/zfs#4048
This reverts commit a430c11f0b. Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500
objsetid is not unique across pool, so using it solely as key would cause
panic when automounting two snapshot on different pools with the same
objsetid. We fix this by adding spa pointer as additional key.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3948
Issue #3786
Issue #3887
The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame. This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.
This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#500
While stack size will vary by architecture it has historically defaulted to
8K on x86_64 systems. However, as of Linux 3.15 the default thread stack
size was increased to 16K. These kernels are now the default in most non-
enterprise distributions which means we no longer need to assume 8K stacks.
This patch takes advantage of that fact by appropriately reverting stack
conservation changes which were made to ensure stability. Changes which
may have had a negative impact on performance for certain workloads. This
also has the side effect of bringing the code slightly more in line with
upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4059
Running arcstat.py -x currently throws KeyError due to rmis being
absent, it was removed in commit ca0bf58.
Signed-off-by: cable2999 <cable2999@users.noreply.github.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3931
Continuations should be indented four spaces.
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4062
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5959https://github.com/illumos/illumos-gate/commit/ca0cc39
Porting notes:
illumos code doesn't check for feature_get_refcount() returning
ENOTSUP (which means feature is disabled) in zdb. zfsonlinux added
a check in https://github.com/zfsonlinux/zfs/commit/784652c
due to #3468. The check was reintroduced here.
Ported-by: Witaut Bajaryn <vitaut.bayaryn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3965
Provide a generic interface to prefetch ZAP entries by name. This
functionality is being added for external consumers such as Lustre.
It is based of the existing zap_prefetch_uint64() version which is
used by the deduplication code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#4061
Adding additional entries to the efi conversion array will help prevent
the overwriting of the GPTs of disks with in-use file systems in more
cases. Most notably, this adds partition type 8300 "Linux filesystem"
(0FC63DAF-8483-4772-8E79-3D69D8477DE4), which is often used for ext4 and
btrfs, among others.
This commit itself does nothing to address the underlying problematic
behavior that check_slice() isn't called on partitions of an
unrecognized type, even when they contain a currently mounted file
system.
The additional entries were derived from these two resources:
https://en.wikipedia.org/wiki/GUID_Partition_Tablehttp://sourceforge.net/p/gptfdisk/code/ci/master/tree/parttypes.cc
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4016