LLVM's static analyzer reported that we could pass an uninitialized
pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct.
An attempt to repurpose a spare or L2ARC drive from an exported pool
will cause the pool_guid passed to spa_by_guid() to be unintialized
information from the stack. This will cause non-deterministic behavior.
Since there is no reason why we cannot repurpose such disks, we modify
vdev_inuse() to avoid calling spa_by_guid() when they are detected.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
The smp_mb__{before,after}_clear_bit functions have been renamed
smp_mb__{before,after}_atomic. Rather than adding a compatibility
function to handle this the code has been updated to use smp_wmb().
This has the advantage of being a stable functionally equivalent
interface. On many architectures smp_mb__after_clear_bit() expands
to smp_wmb(). Others might be able to do something slightly more
efficient but this will be safe and correct on all of them.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#386
When rolling back a mounted filesystem zfs_suspend() is called
which acquires the z_teardown_inactive_lock. This lock can not
be dropped until the filesystem has been rolled back and resumed
in zfs_resume_fs().
Therefore, we must not call iput() under this lock because it
may result in the inode->evict() handler being called which also
takes this lock. Instead use zfs_iput_async() to ensure dropping
the last reference is deferred and runs in a safe context.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2670
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2). Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.
Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation. It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.
Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.
Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#2619
If a non-ZAP object is passed to zap_lockdir() it will be treated
as a valid ZAP object. This can result in zap_lockdir() attempting
to read what it believes are leaf blocks from invalid disk locations.
The SCSI layer will eventually generate errors for these bogus IOs
but the caller will hang in zap_get_leaf_byblk().
The good news is that is a situation which can not occur unless the
pool has been damaged. The bad news is that there are reports from
both FreeBSD and Solaris of damaged pools. Specifically, there are
normal files in the filesystem which reference another normal file
as their parent.
Since pools like this are known to exist the zap_lockdir() function
has been updated to verify the type of the object. If a non-ZAP
object has been passed it EINVAL will be returned immediately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2597
Issue #2602
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.
ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.
One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:
1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.
2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.
3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.
4. Making an asynchronous write synchronous is non sequitur.
Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.
It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.
Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.
It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:
```
Which operations are cancelable is implementation-defined.
```
http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html
The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.
Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility. We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#223Closes#2373
This commit should prevent a deadlock on dp_config_rwlock when
running `zfs rename` by ensuring zvol_rename_minors() is not
called under this lock.
Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2652.
Closes#2525.
This gives a huge performance improvement in operations with deduped
datasets especially when the bottleneck is the amount of ram
available for zfs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2639
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.
Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2158
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.
Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#370
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations. This
was possible due to the following related changes.
e89260a Directory xattr znodes hold a reference on their parent
6f9548c Fix deadlock in zfs_zget()
0a50679 Add zfs_iput_async() interface
4dd1893 Avoid 128K kmem allocations in mzap_upgrade()
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457Closes#2058Closes#2128Closes#2240
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks. When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457
As originally implemented the mzap_upgrade() function will
perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc().
These large allocations can potentially block indefinitely
if contiguous memory is not available. Since this allocation
is done under the zap->zap_rwlock it can appear as if there is
a deadlock in zap_lockdir(). This is shown below.
The optimal fix for this would be to rework mzap_upgrade()
such that no large allocations are required. This could be
done but it would result in us diverging further from the other
implementations. Therefore I've opted against doing this
unless it becomes absolutely necessary.
Instead mzap_upgrade() has been updated to use zio_buf_alloc()
which can reliably provide buffers of up to SPA_MAXBLOCKSIZE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Close#2580
Linux kernel 3.17 removes the action function argument from
wait_on_bit(). Add autoconf test and compatibility macro to support
the new interface.
The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule(). This API change
was made to consolidate all of those redundant wait functions.
References: torvalds/linux@7431620
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#378
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage. Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.
To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure. All
access must be serialized through the vq_lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2572
For small objects the Linux slab allocator should be used to make the most
efficient use of the memory. However, large objects are not supported by
the Linux slab and therefore the SPL implementation is preferred. A cutoff
of 16K was determined to be optimal for architectures using 4K pages.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #356Closes#379
Reinstate the correct default behavior of returning the number of objects
in the cache for reclaim. This behavior was disabled in recent releases
to do occasional reports of spinning in shrink_slabs(). Those issues have
been resolved and can no longer can be reproduced. See commit 376dc35.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #358Closes#379
The dsl_dataset_rollback_check() function is executed in the
txg_sync context. To prevent a potential deadlock due to direct
memory reclaim it must use KM_PUSHPAGE. This was introduced by
the recent 'zfs bookmark' features, commit da53684.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Dillmann <eric@jave.fr>
Closes#2569
4897 Space accounting mismatch in L2ARC/zpool
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
From the illumos issue tracker:
L2ARC vdev space usage statistics are calculated as the delta
between the maximum and minimum vdev offset ever written to
by the L2ARC fill thread, but do not inform the user of how
much space in between these two offsets is actually taken up by
cached buffers. This fix changes that so that vdev space usage
stats on L2ARC devices accurately track the volume of buffers
stored on them, allowing users to see the exact L2ARC usage in
"zpool iostat -v".
References:
https://www.illumos.org/issues/4897https://github.com/illumos/illumos-gate/commit/3038a2b
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2555
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4390https://github.com/illumos/illumos-gate/commit/7fd05ac
Porting notes:
Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code. This patch should reduce its
stack footprint a bit more.
The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.
The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global. Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2545
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Description from Matt Ahrens's bug report at Delphix:
Add a new zfs property, "redundant_metadata" which can have values
"all" or "most". The default will be "all", which is the current
behavior. Setting to "most" will cause us to only store 1 copy of
level-1 indirect blocks of user data files.
Additional notes:
The new man page section for this property states
"The exact behavior of which metadata blocks
are stored redundantly may change in future releases."
and:
"When set to most, ZFS stores an extra copy of most types of
metadata. This can improve performance of random writes,
because less metadata must be written."
The current implementation is as described above in Matt's blog.
It is controlled by a new global integer
"zfs_redundant_metadata_most_ditto_level", currently initialized
to 2. When "redundant_metadata" is set to "most", only indirect
blocks of the specified level and higher will have additional ditto
blocks created.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2542
Added highbit64() and howmany() which are used in recent upstream
code. Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#363
There have been issues in the past where excessive debug logging
to the console has resulted in significant performance impacts.
In the vast majority of these cases only a few stack traces are
required to diagnose the issue. Therefore, stack traces dumped to
the console will now we limited to 5 every 60s.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#374
As part of the write throttle & i/o schedule performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2488
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488