Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
Currently, for unknown reasons, VOP_CLOSE() is a no-op in userspace.
This causes file descriptor leaks. This is especially problematic with
long ztest runs, since zpool.cache is opened repeatedly and never
closed, resulting in resource exhaustion (EMFILE errors).
This patch fixes the issue by making VOP_CLOSE() do what it is supposed
to do.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
Currently, thread_create(), when called in userspace, creates a
joinable (i.e. not detached thread). This is the pthread default.
Unfortunately, this does not reproduce kthreads behavior (kthreads
are always detached). In addition, this contradicts the original
Solaris code which creates userspace threads in detached mode.
These joinable threads are never joined, which leads to a leakage of
pthread thread objects ("zombie threads"). This in turn results in
excessive ressource consumption, and possible ressource exhaustion in
extreme cases (e.g. long ztest runs).
This patch fixes the issue by creating userspace threads in detached
mode. The only exception is ztest worker threads which are meant to be
joinable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.
Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#906
Currently, mkfs.ext2 on zconfig.sh zvols tries to use a 8K blocksize,
probably because by default zvol exposes an optimal I/O size of 8K.
Unfortunately, a ext2 blocksize of 8K is not supported by the kernel,
so the resulting filesystem is unmountable.
This patch fixes the issue by making sure the blocksize is 4K. We have
to use -F to force it else mkfs.ext2 won't allow us to use a blocksize
smaller than the optimal I/O size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#979
In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook. All it takes there
is to make sure changes are committed to ZIL. Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#969
Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel
doesn't expect and as such we can oops.
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#949Closes#931Closes#789Closes#743Closes#730
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #973
Introduced by commit 44867b6d6e.
We should of course check to ensure best isn't NULL before
attempting to dereference it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#974
The goal of this change is to make 'zpool import' prefer to use
the peristent /dev/mapper or /dev/disk/by-* paths. These are far
preferable to the devices in /dev/ whos names are not persistent
and are determined by the order in which a device is detected.
This patch improves things by changing the default search path from
just to the top level /dev/ directory to (in order):
/dev/disk/by-vdev - Custom rules, use first if they exist
/dev/disk/zpool - Custom rules, use first if they exist
/dev/mapper - Use multipath devices before components
/dev/disk/by-uuid - Single unique entry and persistent
/dev/disk/by-id - May be multiple entries and persistent
/dev/disk/by-path - Encodes physical location and persistent
/dev/disk/by-label - Custom persistent labels
/dev - UNSAFE device names will change
The default search path can be overriden by setting the
ZPOOL_IMPORT_PATH environment variable. This must be a colon
delimited list of paths which are searched for vdevs. If the
'zpool import -d' option is specified only those listed paths
will be searched.
Finally, when multiple paths to the same device are found. If one
of the paths is an exact match for the path used last time to import
the pool it will be used. When there are no exact matches the
prefered path will be determined by the provided search order.
This means you can still import a pool and force specific names by
providing the -d <path> option. And the prefered names will persist
as long as those paths exist on your system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#965
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.
The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#933
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.
The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit. Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#961
Without this fix the zdb printouts of ZIL data blocks look full of FF
due to printf() handling its arguments as int by default.
Here is the output before the fix
TX_WRITE len 4136, txg 1093817, seq 149231
foid 4242, offset 0, length f68
G FFFFFF8EFFFFFF87FFFFFF91FFFFFFCC 1c
FFFFFFAFFFFFFFC9FFFFFFBAZ FFFFFFC3
And the same after the fix
TX_WRITE len 4136, txg 1093817, seq 149231
foid 4242, offset 0, length f68
G 8E8791CC 1cAFC9BAZ C3
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#962
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed. This will result in a
NULL dereference as shown by the stack trace is issue #782.
This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#782
This reverts commit 395350c85d which
accidentally introduced issue #955.
Pools using AF drives which were originally created with a sector
size of 512 bytes will now be correctly detected to have physical
sector size of 4096. This is desirable for a new pool, however for
an existing pool abruptly changing the sector size causes problems.
For this reason, this change is being reverted until the additional
logic can be added to detect the existing pool case. Existing
pools must use the ashift size stored in the label regardless of
what the disk reports. This is critical for compatibility.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #955
Delay executing exportfs command until its results are actually
required.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
spl_config.h.in is a generated file: remove and .gitignore it
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
ztest outputs a message when testing sync=always no matter what the
verbosity level is. There is no point outputting this message for low
verbosity levels.
With this patch the message is only displayed at verbosity level 5 or
above. The result is less output pollution.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#951
The 'zfs destroy' changes in 330d06f disrupted how zvol devices
get removed on ZoL. However, it basically boils down to the
fact that we are no longer reliably calling zvol_remove_minor()
via zfs_ioc_destroy_snaps().
Therefore we add the missing call and handle things similarly
to the existing zfs_unmount_snap() case. Ideally we would check
if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the
right thing as in zfs_ioc_destroy(). However, it looks like
it would be fairly expensive to get the type, and it's harmless
to simply attempt the umount and minor removal.
This is also an issue in the latest FreeBSD and Illumos code.
It was being tracked under the following issue, and we may want
to refresh our code when they settle on what they want to do
about it upstream.
https://www.illumos.org/issues/3170
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #903
Use ZFS dataset fsid guid as a unique file system id, similar to what is
done on Illumos/OpenSolaris.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#888
In 1e33ac1e26, the maximum stack size for
userspace tools was set to 8k to mimic the available kernel stack size.
Unfortunately, due to differences in how the stack is used in userspace
vs kernel space, spurious stack overflows could occur in userspace
tools due to the limited stack size. This is especially true in ztest
when debugging is enabled.
This patch multiplies the userspace stack size by 4, which fixes the
stack overflow issues. This comes at the price of not being able to
catch stack size issues in userspace, but the previous solution proved
unreliable anyway.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#934.
Buffers for the ARC are normally backed by the SPL virtual slab.
However, if memory is low, AND no slab objects are available,
AND a new slab cannot be quickly constructed a new emergency
object will be directly allocated.
These objects can be as large as order 5 on a system with 4k
pages. And because they are allocated with KM_PUSHPAGE, to
avoid a potential deadlock, they are not allowed to initiate I/O
to satisfy the allocation. This can result in the occasional
allocation failure.
However, since these allocations are allowed to block and
perform operations such as memory compaction they will eventually
succeed. Since this is not unexpected (just unlikely) behavior
this patch disables the warning for the allocation failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #465
Commit 858219c makes more sense down below in the 'if (verbose)'
section of the code. Initially, buf and path will never point
to the same location. Once 'path = buf' is set on a raidz vdev,
the code may drop into the verbose section depending on the
verbose flag. In here, using a tmpbuf makes sense since now
'buf == path'.
This issue does not occur in the upstream Solaris code because
their implementations of snprintf() allow for buf and path to
be the same address.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#57
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
Use the bdev_physical_block_size() interface to determine the
minimize write size which can be issued without incurring a
read-modify-write operation. This is used to set the ashift
correctly to prevent a performance penalty when using AF hard
disks.
Unfortunately, this interface isn't entirely reliable because
it's not uncommon for disks to misreport this value. For this
reason you may still need to manually set your ashift with:
zpool create -o ashift=12 ...
The solution to this in the upstream Illumos source was to add
a while list of known offending drives. Maintaining such a list
will be a burden, but it still may be worth doing if we can
detect a large number of these drives. This should be considered
as future work.
Reported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#916
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
Commit e6f290535c added libzpool to
the mount_zfs dependencies. This brought in the nvpair symbols
which are used by libzpool. To resolve this include the libnvpair
library for mount_zfs even though mount_zfs doesn't directly
require any of these symbols.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#926
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback. However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit. There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.
This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds. One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#700Closes#907
mount_zfs depends on libzpool for zfs_prop_written since
330d06f90d. Unfortunately, the Makefile
for mount_zfs has not been modified to reflect this. As a result,
libtool doesn't know about the dependency, which may result in the wrong
libzpool being used during the build (e.g. the libzpool from the system
instead of the libzpool from the build directory).
This patch adds the dependency to fix the issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#909.
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages. If a lock is held over this operation the potential exists to
deadlock the system. To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.
Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread. However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.
This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.
This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected. In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected. When debugging is enabled
any misuse will be treated as a fatal error.
This patch is entirely for debugging. We should be careful to
NOT become dependant on it fixing up the incorrect allocations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests. Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...
1) These buffers are short lived. They are only required for
the life of a single I/O at which point they can be used by
the next I/O.
2) The maximum number of concurrent buffers needed by a vdev is
small. It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.
By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it. This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock. This is
particularly important when memory is already low on the system.
It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation. The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0ccf1f eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated. The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC. That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks. This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.
The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it. This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.
As such, we revert this patch in favor of a proper fix for both issues.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC. Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
The autoconf macro which failed if CONFIG_PREEMPT was set in the kernel
config was removed. With the inclusion of a few previous patches
targeting support for preempt enabled kernels, it is now safe to run
with this kernel config option enabled.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#83
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#718
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries. However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):
ENOTEMPTY
pathname contains entries other than . and .. ; or, pathname has
.. as its final component. POSIX.1-2001 also allows EEXIST for
this condition.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#895