3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@ec94d32
https://illumos.org/issues/3581
Notes for Linux port:
Earlier commit 08d08eb reduced contention on this taskq lock by simply
reducing the number of z_fr_iss threads from 100 to one-per-CPU. We
also optimized the taskq implementation in zfsonlinux/spl@3c6ed54.
These changes significantly improved unlink performance to acceptable
levels.
This patch further reduces time spent spinning on this lock by
randomly dispatching the work items over multiple independent task
queues. The Illumos ZFS developers stated that this lock contention
only arose after "3329 spa_sync() spends 10-20% of its time in
spa_free_sync_cb()" was landed. It's not clear if 3329 affects the
Linux port or not. I didn't see spa_free_sync_cb() show up in
oprofile sessions while unlinking large files, but I may just not
have used the right test case.
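For reference, the mechanism is roughly the following (a minimal
sketch, not the actual spa_taskqs_t code; the queue count and the
selection policy here are illustrative):

    /* Spread dispatches across several taskqs so no single tq_lock is hot. */
    #define FREE_TASKQ_COUNT        8               /* illustrative count */
    static taskq_t *free_tqs[FREE_TASKQ_COUNT];

    static void
    free_issue_dispatch(task_func_t func, void *arg)
    {
            /* pick a queue pseudo-randomly; contention divides across them */
            taskq_t *tq = free_tqs[(uint64_t)gethrtime() % FREE_TASKQ_COUNT];
            (void) taskq_dispatch(tq, func, arg, TQ_SLEEP);
    }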
I tested unlinking 1 TB of data with and without the patch and
didn't observe a meaningful difference in elapsed time. However,
oprofile showed that the percent time spent in taskq_thread() was
reduced from about 16% to about 5%. Aside from a possible slight
performance benefit this may be worth landing if only for the sake of
maintaining consistency with upstream.
Ported-by: Ned Bass <bass6@llnl.gov>
Closes #1327
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
References:
illumos/illumos-gate@01f55e48fb
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man
page and help
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@31d7e8fa33
https://www.illumos.org/issues/3306
https://www.illumos.org/issues/3321
The vdev_file.c implementation in this patch diverges significantly
from the upstream version. For consistency with the vdev_disk.c
code the upstream version leverages the Illumos bio interfaces.
This makes sense for Illumos but not for ZoL for two reasons.
1) The vdev_disk.c code in ZoL has been rewritten to use the
Linux block device interfaces which differ significantly
from those in Illumos. Therefore, updating the vdev_file.c
to use the Illumos interfaces doesn't get you consistency
with vdev_disk.c.
2) Using the upstream patch as-is would require implementing
compatibility code for those Solaris block device interfaces
in user and kernel space. That additional complexity could
lead to confusion and doesn't buy us anything.
For these reasons I've opted to simply move the existing vn_rdwr()
call as-is into the taskq function. This has the advantage of being
low risk and easy to understand. Moving the vn_rdwr() call into
its own taskq thread also neatly avoids the possibility of
a stack overflow.
Finally, because of the additional work now being handled by
the free taskq, the number of threads has been increased. The
thread count under Illumos defaults to 100 but was decreased to 2
in commit 08d08e due to contention. We increase it to 8 until
the contention can be addressed by porting Illumos #3581.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1354
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NOTES: This patch has been reworked from the original in the
following ways to accommodate the Linux ZFS implementation:
*) Usage of the cyclic interface was replaced by the delayed taskq
interface. This avoids the need to implement new compatibility
code and allows us to rely on the existing taskq implementation.
*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
because declaring externs in source files as was done in the
original patch is just plain wrong.
*) Instead of panicking the system when the deadman triggers, a
zevent describing the blocked vdev and the first pending I/O is
posted. If the panic behavior is desired, Linux provides other
generic methods to panic the system when threads are observed
to hang.
*) For reference, to delay zios by 30 seconds for testing you can
use zinject as follows: 'zinject -d <vdev> -D30 <pool>'
References:
illumos/illumos-gate@283b84606b
https://www.illumos.org/issues/3246
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1396
A deadlock was accidentally introduced by commit e95853a which
can occur when the system is under memory pressure. What happens
is that while the txg_quiesce thread is holding the tx->tx_cpu
locks it enters memory reclaim. In the context of this memory
reclaim it then issues synchronous I/O to a ZVOL swap device.
Because the txg_quiesce thread is holding the tx->tx_cpu locks
a new txg cannot be opened to handle the I/O. Deadlock.
The fix is straightforward: move the memory allocation outside
the critical region where the tx->tx_cpu locks are held, and for
good measure change the offending allocation to KM_PUSHPAGE to
ensure it never attempts to issue I/O during reclaim.
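Schematically the fix looks as follows (a simplified sketch, not the
actual txg.c diff; the tx_handle_t type here is hypothetical):

    /* Allocate before the tx_cpu locks are taken, and with KM_PUSHPAGE so
     * the allocation itself can never recurse into I/O during reclaim. */
    tx_handle_t *th = kmem_alloc(sizeof (tx_handle_t), KM_PUSHPAGE);

    mutex_enter(&tc->tc_lock);      /* tx_cpu lock */
    /* ... quiesce work: no allocations, hence no reclaim, in here ... */
    mutex_exit(&tc->tc_lock);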
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1274
According to the getxattr(2) man page the ERANGE errno should be
returned when the size of the value buffer is too small to hold the
result. Prior to this patch the implementation would just truncate
the value to size bytes.
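For reference, the expected behavior can be exercised from user space
with something like the following (hypothetical test program, not part
of the patch; it assumes the file has a user.test xattr larger than
one byte):

    #include <stdio.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int
    main(int argc, char **argv)
    {
            char small[1];          /* deliberately too small */
            ssize_t rc;

            if (argc < 2)
                    return (1);

            rc = getxattr(argv[1], "user.test", small, sizeof (small));
            if (rc < 0 && errno == ERANGE)
                    printf("got ERANGE as getxattr(2) requires\n");
            else
                    printf("rc=%zd errno=%s\n", rc, strerror(errno));

            return (0);
    }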
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1408
The zpl_readdir() function shouldn't be registered as part of
the zpl_file_operations table; it must only be part of the
zpl_dir_file_operations table. By removing this callback
the VFS will now correctly return ENOTDIR when calling
getdents() on a file.
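For reference, the behavior can be verified from user space with a
small test program along these lines (hypothetical, not part of the
patch):

    #include <stdio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int
    main(int argc, char **argv)
    {
            char buf[4096];
            int fd;
            long rc;

            if (argc < 2)
                    return (1);

            fd = open(argv[1], O_RDONLY);   /* pass a regular file */
            if (fd < 0)
                    return (1);

            rc = syscall(SYS_getdents64, fd, buf, sizeof (buf));
            if (rc < 0 && errno == ENOTDIR)
                    printf("getdents() on a file correctly returns ENOTDIR\n");
            else
                    printf("rc=%ld errno=%s\n", rc, strerror(errno));

            close(fd);
            return (0);
    }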
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1404
Previous patches have allowed you to set an increased ashift to
avoid doing 512b IO with 4k sector devices. However, it was not
possible to set the ashift lower than the reported physical sector
size even when a smaller logical size was supported. In practice,
there are several cases where setting a lower ashift is useful:
* Most modern drives now correctly report their physical sector
size as 4k. This causes zfs to correctly default to using a 4k
sector size (ashift=12). However, for some usage models this
new default ashift value causes an unacceptable increase in
space usage. Filesystems with many small files may see the
total available space reduced to 30-40% which is unacceptable.
* When replacing a drive in an existing pool which was created
with ashift=9 a modern 4k sector drive cannot be used. The
'zpool replace' command will issue an error that the new drive
has an 'incompatible sector alignment'. However, by allowing
the ashift to be manually specified as a smaller, non-optimal
value, the device may still be safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1381
Closes #1328
Issue #967
Issue #548
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@bda8819455
https://www.illumos.org/issues/3422
https://www.illumos.org/issues/3425
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1390
The assumption in zio_ddt_free() is that ddt_phys_select() must
always find a match. However, if that fails due to a damaged
DDT or some other reason, the code will dereference NULL in
ddt_phys_decref().
While this should never happen it has been observed on various
platforms. The result is that unless you're willing to patch the
ZFS code the pool is inaccessible. Therefore, we're choosing
to more gracefully handle this case rather than leave it fatal.
http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html
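The graceful handling amounts to checking the lookup result before
using it, roughly as follows (a sketch of the idea, not the verbatim
diff):

    ddt_phys_t *ddp = ddt_phys_select(dde, bp);

    /* A damaged DDT may yield no match; skip the decref rather than
     * dereferencing NULL and taking the pool down. */
    if (ddp != NULL)
            ddt_phys_decref(ddp);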
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1308
Enabling metaslab debugging will prevent space maps from being
automatically unloaded. This can significantly increase the
memory footprint but being able to dynamically control this is
helpful for debugging and certain performance testing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The mainline kernel started defining GCC_VERSION with commit
torvalds/linux@3f3f8d2f48.
Unfortunately, LZ4 also defines this macro, but the two
definitions are incompatible. We undefine GCC_VERSION in lz4.c
to handle this.
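The workaround in lz4.c is essentially:

    /* Drop the kernel's GCC_VERSION (from torvalds/linux@3f3f8d2f48) so
     * the LZ4 definition that follows does not conflict with it. */
    #if defined(GCC_VERSION)
    #undef GCC_VERSION
    #endif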
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1339
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices. By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/. When set to
'visible' all zvol snapshots for the dataset will be visible.
This functionality was largely added because when automatic
snapshotting is enabled, large numbers of read-only zvol snapshots
will be created. When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions. This leads
to a variety of issues:
1) The zvol partition tables will be read in the context of
the `modprobe zfs` for automatically imported pools. This
is undesirable and should be done asynchronously, but for
now reducing the number of visible devices helps.
2) Udev expects to be able to complete its work for a new
block device fairly quickly. When many zvol devices are
added at the same time this is no longer true. It can
lead to udev timeouts and missing /dev/zvol links.
3) Simply having lots of devices in /dev/ can be awkward from
a management standpoint. Hiding the devices you're unlikely
to ever use helps with this. Any snapshot device which is
needed can be made visible by changing the snapdev property.
NOTE: This patch changes the default behavior for zvols which
was effectively 'snapdev=visible'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1235
Closes #945
Issue #956
Issue #756
The changes to zvol.c were never merged from the last onnv_147
bulk update. This was because zvol.c was largely rewritten
for Linux making it fairly easy to miss these sorts of changes.
This causes a regression when importing a zpool with zvols
read-only. This does not impact pools which only contain
filesystem datasets.
References:
illumos/illumos-gate@f9af39b
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1332
Closes #1333
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.
Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:
vdev_opts_t 52
zil_replay_func_t 24
zio_compress_info_t 24
zio_checksum_info_t 9
space_map_ops_t 7
arc_byteswap_func_t 5
The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.
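The change itself is mechanical; the effect of constifying a table of
function pointers can be seen in a stand-alone example like this
(illustrative only, the real tables live in the ZFS sources):

    #include <stdio.h>

    /* Marking the ops table 'const' places it in a read-only section,
     * so it is no longer a writeable exploit target. */
    typedef struct demo_ops {
            int (*op_open)(void);
            int (*op_close)(void);
    } demo_ops_t;

    static int demo_open(void)  { return (0); }
    static int demo_close(void) { return (0); }

    static const demo_ops_t demo_ops = {   /* 'const' is the essence of the fix */
            .op_open  = demo_open,
            .op_close = demo_close,
    };

    int
    main(void)
    {
            printf("%d %d\n", demo_ops.op_open(), demo_ops.op_close());
            return (0);
    }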
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1300
The issue with hot spares in ZoL is that it opens all leaf
vdevs exclusively (O_EXCL). On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.
This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable. It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.
To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.
1) Functions which open the device but only read had the
O_EXCL flag removed and were updated to use O_RDONLY.
2) A common holder was added to the vdev disk code. This
allows the ZFS code to internally open the device multiple
times while non-ZFS callers may not; see the sketch after
this list.
3) An exception was added to make_disks() for hot spares when
creating partition tables. For hot spare devices which
are already opened exclusively we skip creating the partition
table because this must already have been done when the disk
was originally added as a hot spare.
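The common-holder idea from item 2 looks roughly like this
(illustrative sketch; the real holder token and call sites live in
vdev_disk.c):

    /* All ZFS opens pass the same holder, so they may stack on one
     * another, while any other exclusive opener still gets EBUSY. */
    static void *zfs_vdev_holder = (void *)0x2f5a4653;  /* arbitrary token */

    struct block_device *bdev = blkdev_get_by_path(path,
        FMODE_READ | FMODE_WRITE | FMODE_EXCL, zfs_vdev_holder);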
Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix. And is_spare() was moved
above make_disks() to avoid adding a forward reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #250
As described by the comment and enforced by the assertion,
v->vdev_wholedisk will never be -1. The wholedisk handling
is performed by the user space utilities. To prevent confusion
this dead code is being removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When vdev_disk.c was implemented for Linux we failed to handle the
reopen case. According to the vdev_reopen() comment leaf vdevs should
not be closed or opened when v->vdev_reopening is set. Under Linux
we would always close and open the device.
This issue was only noticed when a 'zpool scrub' command was run while
the leaf vdev device names in /dev/disk/by-vdev were missing. The
scrub command calls vdev_reopen() which caused the vdevs to be closed
but they couldn't be reopened due to the missing links. The result
was that all the vdevs were marked unavailable and the pool was
halted due to failmode=wait.
This patch adds the missing functionality in a similar fashion
to the Illumos code.
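The pattern mirrors Illumos: both the open and close paths check
v->vdev_reopening and, when it is set, keep the existing device
handle instead of tearing it down (sketch only, details elided):

    static void
    vdev_disk_close(vdev_t *v)
    {
            /* During vdev_reopen() keep the handle; a full close/open
             * could fail if the device node has gone missing. */
            if (v->vdev_reopening)
                    return;
            /* ... regular close path ... */
    }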
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.
Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.
As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.
This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.
Thanks to Richard Kojedzinszky for catching this nasty bug.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1318
The zfs_arc_memory_throttle_disable module option was introduced
by commit 0c5493d470 to resolve a
memory miscalculation which could result in the txg_sync thread
spinning.
When this was first introduced the default behavior was left
unchanged until enough real world usage confirmed there were no
unexpected issues. We've now reached that point. Linux's
direct reclaim is working as expected so we're enabling this
behavior by default.
This helps pave the way to retire the spl_kmem_availrmem()
functionality in the SPL layer. This was the only caller.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #938
A couple of assertions in spa.c were designed to prevent the use of
invalid pool versions. They were written under the assumption
that all valid pools are less than SPA_VERSION. Since feature flags
jumped from 28 to 5000, any numbers in the range 28 to 5000
non-inclusive will fail to trigger them. We switch to the new
SPA_VERSION_IS_SUPPORTED macro to correct this.
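The difference is easy to see in a stand-alone program (the constant
values below reflect sys/fs/zfs.h at the time of this change; the
program itself is just an illustration):

    #include <stdio.h>

    #define SPA_VERSION_INITIAL             1ULL
    #define SPA_VERSION_BEFORE_FEATURES     28ULL
    #define SPA_VERSION_FEATURES            5000ULL
    #define SPA_VERSION                     SPA_VERSION_FEATURES

    #define SPA_VERSION_IS_SUPPORTED(v) \
            (((v) >= SPA_VERSION_INITIAL && \
            (v) <= SPA_VERSION_BEFORE_FEATURES) || \
            ((v) >= SPA_VERSION_FEATURES && (v) <= SPA_VERSION))

    int
    main(void)
    {
            unsigned long long v;

            /* 28 and 5000 are supported; 29..4999 fall in the feature
             * flag gap and are not, yet they would still satisfy a naive
             * "v <= SPA_VERSION" style assertion. */
            for (v = 28; v <= 30; v++)
                    printf("%llu -> %d\n", v, SPA_VERSION_IS_SUPPORTED(v));
            printf("4999 -> %d\n", SPA_VERSION_IS_SUPPORTED(4999ULL));
            printf("5000 -> %d\n", SPA_VERSION_IS_SUPPORTED(5000ULL));
            return (0);
    }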
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1282
It turns out that the Linux VFS doesn't strictly handle all cases
where a component path name exceeds MAXNAMELEN. It does however
appear to correctly handle MAXPATHLEN for us.
The right way to handle this appears to be to add an explicit
check to the zpl_lookup() function. Several in-tree filesystems
handle this case the same way.
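The check amounts to a few lines at the top of zpl_lookup(), along
these lines (a sketch; the exact limit macro and comparison used by
the patch may differ):

    /* The VFS only enforces MAXPATHLEN for us, so reject over-long
     * name components explicitly. */
    if (dentry->d_name.len >= MAXNAMELEN)
            return (ERR_PTR(-ENAMETOOLONG));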
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1279
Two more locations where KM_SLEEP was used in a call which must
use KM_PUSHPAGE were found while using the zpool upgrade command.
See commit b8d06fc for additional details.
Also make a small correction to the comment block above
dsl_dir_open_spa().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1268
Explicitly cast this value to an unsigned long long for 32-bit
systems to inform the compiler that a long type should not be
used. Otherwise we get the following compiler error:
dmu_send.c:376: error: integer constant is too large for
‘long’ type
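The general pattern is to force 64-bit arithmetic explicitly
(illustrative program, not the exact dmu_send.c expression):

    #include <stdio.h>

    int
    main(void)
    {
            /* The cast (or a ULL suffix) keeps the expression from being
             * evaluated in a 32-bit 'long'/'int' type. */
            unsigned long long threshold =
                (unsigned long long)32 * 1024 * 1024 * 1024;

            printf("%llu\n", threshold);    /* 34359738368 */
            return (0);
    }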
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The way in which VirtualBox (ab)uses memory can throw off the
free memory calculation in arc_memory_throttle(). The result is
the txg_sync thread will effectively spin waiting for memory to
be released even though there's lots of memory on the system.
To handle this case I'm adding a zfs_arc_memory_throttle_disable
module option largely for VirtualBox users. Setting this option
disables free memory checks which allows the txg_sync thread to
make progress.
By default this option is disabled to preserve the current
behavior. However, because Linux supports direct memory reclaim
it's doubtful throttling due to perceived memory pressure is ever
a good idea. We should enable this option by default once we've
done enough real-world testing to convince ourselves there aren't
any unexpected side effects.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #938
Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable.
It should have been made available as a module option in the
original patch but was overlooked.
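Exposing it follows the usual pattern for tunables on Linux (sketch;
the variable itself already exists in the ARC code):

    module_param(zfs_disable_dup_eviction, int, 0644);
    MODULE_PARM_DESC(zfs_disable_dup_eviction,
        "Disable duplicate buffer eviction");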
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When a system attribute layout is created an inconsistency may occur
between the system attribute header (sa_hdr_phys_t) size and the
variable-sized attribute count stored in the layout. The inconsistency
results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT
returns false:
SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab())
ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr,
tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) &&
hdr->sa_layout_info == 0)) failed
The bug originates in this snippet from sa_find_sizes().
if (is_var_sz && var_size > 1) {
        if (P2ROUNDUP(hdrsize + sizeof (uint16_t), 8) +
            *total < full_space) {
                hdrsize += sizeof (uint16_t);
This assumes that the current variable-sized attribute will be stored in
the current buffer and accounts for the space needed to store its size
in the sa_hdr_phys_t. However if the next attribute spills over we need
to store a blkptr_t at the end of the bonus buffer to point to the spill
block. If the current attribute is in the way of the blkptr_t then it
too will be relocated into the spill block. But since we've already
accounted for it in the header size we get the inconsistency described
above.
To avoid this, record the index of the last variable-sized attribute
that prompted a hdrsize increase, and reverse the increase if we later
determine that that attribute will be relocated to the spill block.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1250
A rounding discrepancy exists between how sa_build_layouts() and
sa_find_sizes() calculate when the spill block needs to be kicked in.
This results in a narrow size range where sa_build_layouts() believes
there must be a spill block allocated but due to the discrepancy there
isn't. A panic then occurs when the hdl->sa_spill NULL pointer is
dereferenced.
The following reproducer for this bug was isolated:
truncate -s 128m /tmp/tank
zpool create tank /tmp/tank
zfs create -o xattr=sa tank/fish
ln -s `perl -e 'print "z" x 41'` /tank/fish/z
setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z
This test results in roughly the following system attribute (SA)
layout:
176 bytes - "standard" SA's
41 bytes - name of symbolic link target
100 bytes - XDR encoded nvlist for xattr
---
317 bytes - total
Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes()
decides no spill block is needed. But sa_build_layouts() rounds 41 up
to 48 when computing the space requirements so it tries to switch to
the spill block.
Note that we were only able to reproduce this bug using a combination
of symbolic links and the Linux-specific xattr=sa dataset property.
So while this issue is not technically Linux-specific, it may be
difficult or impossible to hit the narrow size range needed to
reproduce it on other platforms.
To fix the discrepancy, round the running total in sa_find_sizes() up
to an 8-byte boundary before accounting for each SA, since this is how
they will be stored in the bonus and (possibly) spill buffers.
To make the intent of the code more clear, explicitly assert key
assumptions about expected alignment of data and whether spill-over
will occur.
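As a stand-alone illustration of the rounding, using the sizes from
the reproducer above (not ZFS code, just the arithmetic):

    #include <stdio.h>

    #define SA_ROUNDUP8(x)  (((x) + 7) & ~7)    /* same effect as P2ROUNDUP(x, 8) */

    int
    main(void)
    {
            int sizes[] = { 176, 41, 100 };     /* standard SAs, symlink, xattr */
            int plain = 0, rounded = 0;
            int i;

            for (i = 0; i < 3; i++) {
                    plain += sizes[i];
                    rounded = SA_ROUNDUP8(rounded) + sizes[i];
            }

            /* 317 fits in DN_MAX_BONUSLEN (320); 324 does not, so the
             * spill block really is required once rounding is applied. */
            printf("unrounded %d, rounded %d\n", plain, rounded);
            return (0);
    }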
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1240
3035 LZ4 compression support in ZFS and GRUB
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>
References:
illumos/illumos-gate@a6f561b4ae
https://www.illumos.org/issues/3035
http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS
This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux. Due to the very limited
stack space in the kernel an lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.
Support for GRUB has been dropped from this patch. That code
is available but those changes will need to be made to the upstream
GRUB package.
Lastly, several hunks of dead code were dropped for clarity. They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.
Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1217
Explicitly set acl details to zero to silence gcc (zfs_acl_node_read
can't be sure zfs_acl_znode_info will set acl_count and aclsize).
Normally suppressing these warnings by setting this to zero at
declaration time is a bad idea but in this instance it's hard to
avoid and should be fairly safe.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1244
Retire the dmu_snapshot_id() function which was introduced in the
initial .zfs control directory implementation. There is already
an existing dsl_dataset_snap_lookup() which does exactly what we
need, and the dmu_snapshot_id() function as implemented is racy.
https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1238
Added d_clear_d_op() helper function which clears some flags and the
registered dentry->d_op table. This is required because d_set_d_op()
issues a warning when the dentry operations table is already set.
For the .zfs control directory to work properly we must be able to
override the default operations table and register custom .d_automount
and .d_revalidate callbacks.
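In outline the helper looks something like this (sketch; the exact
set of DCACHE_OP_* flags cleared may differ):

    static void
    d_clear_d_op(struct dentry *dentry)
    {
            dentry->d_op = NULL;
            dentry->d_flags &= ~(DCACHE_OP_HASH | DCACHE_OP_COMPARE |
                DCACHE_OP_REVALIDATE | DCACHE_OP_DELETE);
    }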
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1230
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock
since that function returns with the lock held on success. All other
callers drop the lock correctly but it seems fzap_cursor_move_to_key()
does not. This may block writers or cause VERIFY failures when the
lock is freed.
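The fix follows the pattern used by the other callers (a sketch of
the shape, with the fzap_cursor_move_to_key() details elided):

    zap_leaf_t *l;
    int err;

    err = zap_deref_leaf(zap, hash, NULL, RW_READER, &l);
    if (err != 0)
            return (err);
    /* ... position the cursor using the leaf ... */
    zap_put_leaf(l);    /* releases l->l_rwlock; this was the missing step */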
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1215
Closes zfsonlinux/spl#143
Closes zfsonlinux/spl#97
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter. In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.
Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.
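The resulting code is simply defensive about the pointer (sketch;
zpl_revalidate_common() is a hypothetical helper standing in for the
rest of the function):

    static int
    zpl_revalidate(struct dentry *dentry, struct nameidata *nd)
    {
            /* lookup_one_len() passes a NULL nameidata; treat that as
             * "no lookup flags". */
            unsigned int flags = (nd ? nd->flags : 0);

            return (zpl_revalidate_common(dentry, flags));
    }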
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1226
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.
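For the newer-kernel path this is a one-line assignment during
superblock setup (sketch; zpl_dentry_operations is ZoL's existing
dentry_operations table, and on older kernels the table is attached
to each dentry during lookup instead):

    sb->s_d_op = &zpl_dentry_operations;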
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1223
Commit 65d56083b4 fixes the lock
inversion between spa_namespace_lock and bdev->bd_mutex but only
for the first user of spa_namespace_lock: dmu_objset_own().
Later spa_namespace_lock gets acquired by dsl_prop_get_integer()
through dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()->
spa_open()->spa_open_common() without this "protection". By
moving the mutex release after this second use, even this
acquisition of the lock is "protected" by the ERESTARTSYS trick.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1220
This reverts commit 53c7411919
effectively reinstating the asynchronous xattr cleanup code.
These Linux changes were reverted because after testing
and careful contemplation I was convinced that due to the
89260a1c8851ce05ea04b23606ba438b271d890 commit they were no
longer required.
Unfortunately, the deadlock described in #1176 was a case
which wasn't considered. At mount zfs_unlinked_drain() can
occur which will unlink a list of znodes in effectively a
random order which isn't safe. The only reason it was safe
to originally revert this change was that we could guarantee
that the VFS would always prune the xattr leaves before the
parents.
Therefore, until we can cleanly resolve this deadlock for
all cases we need to keep this change in spite of the xattr
unlink performance penalty associated with it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1176
Issue #457
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL. The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.
Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
the rollback. Then a new inode with the same inode number would
be created and collide with the existing cached inode. Ideally
this would trigger a VERIFY() but in practice the error wasn't
handled and it would just dereference NULL.
Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.
The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly. If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.
This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled. The inode
will then only be accessible from the cached dentries. Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated. Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.
Special care is taken for negative dentries since they do not
reference any inode. These dentries will be invalidated based
on when they were added to the dentry cache. Entries added
before the last rollback will be invalidated to prevent them
from masking real files in the dataset.
Two nice side effects of this fix are:
* Removes the dependency on spl_invalidate_inodes(); it can now
be safely removed from the SPL when we choose to do so.
* zfs_znode_alloc() no longer requires a dentry to be passed.
This effectively reverts this portion of the code to its
upstream counterpart. The dentry is now instantiated more
correctly in the Linux ZPL layer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #795
Lookups in the snapshot control directory for an existing snapshot
fail with ENOENT if an earlier lookup failed before the snapshot was
created. This is because the earlier lookup causes a negative dentry
to be cached which is never invalidated.
The bug can be reproduced as follows (the second ls should succeed):
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
$ zfs snap tank@s
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
To remedy this, always invalidate cached dentries in the snapshot
control directory. Since these entries never exist on disk there is
no significant performance penalty for the extra lookups.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1192
A misplaced single quote caused the umount command to fail with a
syntax error when unmounting snapshots under the .zfs/snapshot
control directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1210
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@8f0b538d1d
changeset: 13818:e9ad0a945d45
https://www.illumos.org/issues/3189
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2618 arc.c mistypes in the comments
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@fc98fea58e
illumos changeset: 13721:5b51a16a186f
https://www.illumos.org/issues/2618
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process). A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with this
change.
One visible consequence of this change was that processes accessing
automounted snapshots received an ELOOP error because they failed to
wait for zfs.mount to complete.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
This reverts commit 7afcf5b1da which
accidentally introduced a regression with the .zfs snapshot directory.
While the updated code still correctly mounts the requested
snapshot, it updates the vfsmount such that it references the
original dataset vfsmount. The result is that the snapshot
isn't visible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #816
Related to 91579709fc we need to
be very careful about not overrunning the stack in kernel space.
However, in user space we're already allowing slightly larger
stacks so this stack usage optimization is not required there.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>