Originally when the ARC prune callback was introduced the idea was
to register a single callback for the ZPL. The ARC could invoke this
call back if it needed the ZPL to drop dentries, inodes, or other
cache objects which might be pinning buffers in the ARC. The ZPL
would iterate over all ZFS super blocks and perform the reclaim.
For the most part this design has worked well but due to limitations
in 2.6.35 and earlier kernels there were some problems. This patch
is designed to address those issues.
1) iterate_supers_type() is not provided by all kernels which makes
it impossible to safely iterate over all zpl_fs_type filesystems in
a single callback. The most straight forward and portable way to
resolve this is to register a callback per-filesystem during mount.
The arc_*_prune_callback() functions have always supported multiple
callbacks so this is functionally a very small change.
2) Commit 050d22b removed the non-portable shrink_dcache_memory()
and shrink_icache_memory() functions and didn't replace them with
equivalent functionality. This meant that for Linux 3.1 and older
kernels the ARC had no mechanism to drop dentries and inodes from
the caches if needed. This patch adds that missing functionality
by calling shrink_dcache_parent() to release dentries which may be
pinning inodes. This will result in all unused cache entries being
dropped which is a bit heavy handed but it's the only interface
available for old kernels.
3) A zpl_drop_inode() callback is registered for kernels older than
2.6.35 which do not support the .evict_inode callback. This ensures
that when the last reference on an inode is dropped it is immediately
removed from the cache. If this isn't done than inode can end up on
the global unused LRU with no mechanism available to ZFS to drop them.
Since the ARC buffers are not dropped the hottest inodes can still
be recreated without performing disk IO.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
The arc_meta_max value should be increased when space it consumed not when
it is returned. This ensure's that arc_meta_max is always up to date.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue #3160
5630 stale bonus buffer in recycled dnode_t leads to data corruption
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5630https://github.com/illumos/illumos-gate/commit/cd485b4
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
5047 don't use atomic_*_nv if you discard the return value
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5047https://github.com/illumos/illumos-gate/commit/640c167
Porting Notes:
Several hunks from the original patch where not specific to ZFS
and thus were dropped.
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3172
Allowing direct reclaim to re-enter the VFS in the zfs_inactive()
call path has historically been problematic for ZoL. Therefore,
in order to avoid an entire class of current and future issues
caused by this PF_FSTRANS is set for all zfs_inactive() callers.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3163
Avoid issuing I/O to the pool when retrieving feature flags information.
Trying to read the ZAPs from disk means that zpool clear would hang if
the pool is suspended and recovery would require a reboot. To keep the
feature stats resident in memory, we hang a cached nvlist off of the
spa. It is built up from disk the first time spa_add_feature_stats() is
called, and refreshed thereafter using the cached feature reference
counts. spa_add_feature_stats() gets called at pool import time so we
can be sure the cached nvlist will be available if the pool is later
suspended.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3082
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible. This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1445
The 'capabilities' argument which was passed to bdi_setup_and_register()
has been removed. File systems should no longer pass BDI_CAP_MAP_COPY.
For our purposes this means there are now three different interfaces
which must be handled. A zpl_bdi_setup_and_register() wrapper function
has been introduced to provide a single interface to the ZPL code.
* 2.6.32 - 2.6.33, bdi_setup_and_register() is not exported.
* 2.6.34 - 3.19, bdi_setup_and_register() takes 3 arguments.
* 4.0 - x.y, bdi_setup_and_register() takes 2 arguments.
I've also taken this opportunity to remove HAVE_BDI because kernels
older then 2.6.32 are no longer supported. All kernels newer than
this will have one of the above interfaces.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#3128
There are regions in the ZFS code where it is desirable to be able
to be set PF_FSTRANS while a specific mutex is held. The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.
1) It would require changes to a significant amount of the ZFS
code. This would complicate applying patches from upstream.
2) It would be easy to accidentally miss a critical region in
the initial patch or to have an future change introduce a
new one.
Both of these concerns can be addressed by using a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit zfsonlinux/spl@9099312 - Merge branch
'kmem-rework'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3050Closes#3055Closes#3062Closes#3132Closes#3142Closes#2983
Pool reference count is NOT checked in spa_export_common()
if the pool has been imported readonly=on, i.e. spa->spa_sync_on
is FALSE. Then zpool export and zfs list may deadlock:
1. Pool A is imported readonly.
2. zpool export A and zfs list are run concurrently.
3. zfs command gets reference on the spa, which holds a dbuf on
on the MOS meta dnode.
4. zpool command grabs spa_namespace_lock, and tries to evict dbufs
of the MOS meta dnode. The dbuf held by zfs command can't be
evicted as its reference count is not 0.
5. zpool command blocks in dnode_special_close() waiting for the
MOS meta dnode reference count to drop to 0, with
spa_namespace_lock held.
6. zfs command tries to get the spa_namespace_lock with a reference
on the spa held, which holds a dbuf on the MOS meta dnode.
7. Now zpool command and zfs command deadlock each other.
Also any further zfs/zpool command will block on spa_namespace_lock
forever.
The fix is to always check pool reference count in spa_export_common(),
no matter whether the pool was imported readonly or not.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2034
Cleanly destroying or exporting a pool requires that the pool
not be suspended. Therefore, set the POOL_CHECK_SUSPENDED flag
for these ioctls so the utilities will output a descriptive
error message rather than block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2878
In the original implementation of the SPL wrappers were provided
for module initialization and cleanup. This was done to abstract
away any compatibility code which might be needed for the SPL.
As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such as minor thing and the wrappers complicate the
code they are being retired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2985
As described in flags section of open(2):
O_APPEND:
The file is opened in append mode. Before each write(2), the
file offset is positioned at the end of the file, as if with
lseek(2). O_APPEND may lead to corrupted files on NFS filesys-
tems if more than one process appends data to a file at once.
This is because NFS does not support appending to a file, so the
client kernel has to simulate it, which can't be done without a
race condition.
This issue was originally overlooked because normally the generic
VFS code handles this for a filesystem. However, because ZFS explictly
registers a zpl_write() function it's responsible for the seek.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3124
When loading the ZFS kernel modules they should not populate the
spa namespace using the cache file. This behavior isn't consistent
with other Linux kernel modules and we need to move away from it.
Removing this makes the whole startup process predictable with four
basic steps which are driven by the init system.
1) modprobe
2) zpool import
3) zfs mount
4) zfs share
This change also helps lay the groundwork for eventually removing
the kobj_* compatibility code on the kernel side. It may need to
be preserved in userspace because libzfs_init() depends on it.
This is why the conditional must be wrapped with an #ifdef _KERNEL.
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2820
When a bad DVA is encountered in metaslab_free_dva() the system
should treat it as fatal. This indicates that somehow a damaged
DVA was written to disk and that should be impossible.
However, we have seen a handful of reports over the years of pools
somehow being damaged in this way. Since this damage can render
otherwise intact pools unimportable, and the consequence of skipping
the bad DVA is only leaked free space, it makes sense to provide
a mechanism to ignore the bad DVA. Setting the zfs_recover=1 module
option will cause the DVA to be ignored which may allow the pool to
be imported.
Since zfs_recover=0 by default any pool attempting to free a bad DVA
will treat it as a fatal error preserving the current behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3099
Issue #3090
Issue #2720
dmu_snapshot_list_next stores the index of the next snapshot entry to the offp
argument, which zpl_snapdir_iterate then uses for the dir_emit. This
result in an off-by-one error. Therefore a temporary variable should be
used.
This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2930
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code. They were introduced as a workaround
to avoid issue #2523. Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches. Today
this code works well and there's no compelling reason to keep
this functionality. In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
struct access f->f_dentry->d_inode was replaced by accessor function
file_inode(f)
Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3084
Several of the nvlist functions may perform allocations larger than
the 32k warning threshold. Convert them to use vmem_alloc() so the
best allocator is used.
Commit efcd79a retired KM_NODEBUG which was used to suppress large
allocation warnings. Concurrently the large allocation warning threshold
was increased from 8k to 32k. The goal was to identify the remaining
locations, such as this one, where the allocation can be larger than
32k. This patch is expected fine tuning resulting for the kmem-rework
changes, see commit 6e9710f.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3057Closes#3079Closes#3081
Normally when importing a pool the space maps for all top level
vdevs are read from disk. The space maps will be required latter
when an allocation is performed and free blocks need to be located.
However, if the pool is imported readonly then we are guaranteed
that no allocations can occur. In this case the space maps need
not be loaded.. A similar argument can be made for the DTLs
(dirty time logs).
Because a pool import will fail if the space maps cannot be read.
The ability to safely ignore them makes it more likely that a
damaged pool can be imported readonly to recover its contents.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2831
5311 traverse_dnode may report success when it should not
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://github.com/illumos/illumos-gate/commit/2a89c2chttps://www.illumos.org/issues/5311
Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2970
The functions sa_find_sizes() and sa_build_layout() fail to account
for the additional 2 bytes of SA header space when calculating whether
a variable size attribute might spill over. They may consequently
determine that an attribute will fit in the bonus buffer along with a
spill block pointer, when in reality the attribute would be partially
overwritten by the spill block pointer if spill over occurs. This also
causes an inconsistency between the SA header size and the number of
variable size attributes in the layout, tripping an assertion when
debugging is on. The following reproducer demonstrates the problem.
ln -s $(perl -e 'print "z" x 20') file
setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file
Even though sa_find_sizes() computes the index of the attribute where
spill-over will occur, sa_build_layouts() discards the result and
recomputes it itself. As it turns out, both functions get it wrong.
Since this computation is awkward and, as history has shown, easy to
screw up, let's just do it in one place. This patch fixes the bug in
sa_find_sizes() and updates sa_build_layout() to use the result
computed there.
Also improve the comments in sa_find_sizes().
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3070
When a dbuf is in the DB_EVICTING state it may no longer be on the
dn_dbufs list. In which case it's unsafe to call DB_DNODE_ENTER.
Therefore, any dbuf which is found in this safe must be skipped.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2553Closes#2495
Commit 7b2d78a046 fixed some improper uses
of snprintf(), however, in __dbuf_stats_hash_table_data() the return
value of snprintf is propagated to the caller. This caused spurious
ENOMEM errors when reading the dbufs kstat.
This commit causes the actual number of characters written to be returned.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3072
Commit log from FreeBSD:
We have observed that arc_release() can be called concurrently with a
l2arc in-flight write. Also, we have observed that arc_hdr_destroy()
can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE
flag in similar circumstances.
Previously the l2arc headers would be freed while leaking their
associated compression buffers. Now the buffers are placed on
l2arc_free_on_write list for delayed freeing. This is similar to
what was already done to arc buffers that were supposed to be freed
concurrently with in-flight writes of those buffers.
In addition to fixing the discovered leaks this change also adds
some protective code to assert that a compression buffer associated
with a l2arc header is never leaked.
A new kstat l2_cdata_free_on_write is added. It keeps a count
of delayed compression buffer frees which previously would have
been leaks.
Tested by: Vitalij Satanivskij <satan@ukr.net> et al
Requested by: many
MFC after: 2 weeks
Sponsored by: HybridCluster / ClusterHQ
References:
https://illumos.org/issues/5222https://github.com/freebsd/freebsd/commit/b98f85dhttp://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781http://lists.open-zfs.org/pipermail/developer/2014-January/000455.htmlhttp://lists.open-zfs.org/pipermail/developer/2014-February/000523.html
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3029
The zil_itx_create() function uses the vmem_alloc() allocator for
its buffers because when logging a write that buffer may be as large
as 64K. This is non-optimal because we may need to allocate many of
of these buffers and this interface has the potential to be slow.
Instead, use zio_data_buf_alloc() which is specifically designed to
be able to efficiently allocate a wide range of buffer sizes.
In addition, do some cleanup and use the zil_itx_destroy() function
to always free an itx structure. This way we're always sure the
right allocation functions are used. Notice that in the current
code kmem_free() and vmem_free() were both used. This happened to
work because these wrappers map to the same internal SPL function.
This was identified as a potential problem when a low-end memory
constrained system began logging the following warnings. There
was no deadlock here just repeated allocation failures resulting
in increased latency.
Possible memory allocation deadlock: size=65792 lflags=0x42d0
Pid: 20118, comm: kvm Tainted: P O 3.2.0-0.bpo.4-amd64
Call Trace:
[<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl]
[<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl]
[<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs]
[<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs]
[<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs]
[<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs]
[<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde
[<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125
[<ffffffff81109055>] ? sys_pwritev+0x55/0x9a
[<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#3059
Thank to commit a4430fce69 we're
now correctly returning EROFS when opening a zvol on a read-only
pool. Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().
In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1343
Rather than ASSERT when for some reason the readonly property of
a zvol can't be read cleanly handle the failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1343
The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR(). The write index, "j" is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch remove the increment for SA_REMOVE operations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3028
An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_asprintf(). It is not clear that switching to
kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_asprintf() is cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.
It also brings this function almost back in sync with Illumos. This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2791
Issue #2781
The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.
This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2975
The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations. Instead a small dedicated
cache of preallocated SA buffers was kept.
This solution was viable while the maximum block size was limited
to 128K. But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc(). However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.
Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.
This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit 86dd0fd added preallocated I/O buffers. This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos. A
deadlock in this situation is no longer possible.
However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky. Either case is acceptable
here because we can safely abort the aggregation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings
us back in line with upstream. In some cases this means simply
swapping the flags back. For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.
The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory. This is again the
same as upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate. The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.
A careful reader will notice that not all callers have been changed
to vmem_alloc(). Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k. This is desirable because it minimizes the need
for Linux specific code changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim. This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction. For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range(). Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read(). The page must be unlocked before taking
the range lock. This patch corrects that issue.
In addition, because the locking rules here are subtle a block comment
has been added clearly explaining why the ordering here is critical.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976
Add a table describing the debugging flags that can be set in the zfs_flags
module parameter. Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.
zfs_flags (int)
Set additional debugging flags. The following flags may be
bitwise-or'd together.
+-------------------------------------------------------+
|Value Symbolic Name |
| Description |
+-------------------------------------------------------+
| 1 ZFS_DEBUG_DPRINTF |
| Enable dprintf entries in the debug log. |
+-------------------------------------------------------+
| 2 ZFS_DEBUG_DBUF_VERIFY * |
| Enable extra dbuf verifications. |
+-------------------------------------------------------+
| 4 ZFS_DEBUG_DNODE_VERIFY * |
| Enable extra dnode verifications. |
+-------------------------------------------------------+
| 8 ZFS_DEBUG_SNAPNAMES |
| Enable snapshot name verification. |
+-------------------------------------------------------+
| 16 ZFS_DEBUG_MODIFY |
| Check for illegally modified ARC buffers. |
+-------------------------------------------------------+
| 32 ZFS_DEBUG_SPA |
| Enable spa_dbgmsg entries in the debug log. |
+-------------------------------------------------------+
| 64 ZFS_DEBUG_ZIO_FREE |
| Enable verification of block frees. |
+-------------------------------------------------------+
| 128 ZFS_DEBUG_HISTOGRAM_VERIFY |
| Enable extra spacemap histogram verifications. |
+-------------------------------------------------------+
* Requires debug build.
Default value: 0.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2988
Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate
typedef declarations with the same type. The trace.h header contains
some typedefs to avoid 'unknown type' errors for C files that haven't
declared the type in question. But this causes build failures for C
files that have already declared the type. Newer versions of GCC (e.g.
v4.6) allow duplicate typedefs with the same type unless pedantic error
checking is in force. To support the older versions we need to remove
the duplicate typedefs.
Removal of the typedefs means we can't built tracepoints code using
those types unless the required headers have been included. To
facilitate this, all tracepoint event declarations have been moved out
of trace.h into separate headers. Each new header is explicitly included
from the C file that uses the events defined therein. The trace.h header
is still indirectly included form zfs_context.h and provides the
implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces.
This makes those interfaces readily available throughout the code base.
The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also
still provided by trace.h, so it is a prerequisite for the other
trace_*.h headers.
These new Linux implementation-specific headers do introduce a small
divergence from upstream ZFS in several core C files, but this should
not present a significant maintenance burden.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
There exists a lock inversions involving the zfs range lock and the
individual page writeback bits which can result in a deadlock. To
prevent this we must always manipulate the writeback bit while
holding the range lock. The exact deadlock is as follows:
------ Process A ------ ------ Process B ------
zpl_writepages zpl_fallocate
write_cache_pages zpl_fallocate_common
zpl_putpage zfs_space
zfs_putpage (set bit) zfs_freesp
zfs_range_lock (wait on lock) zfs_free_range (take lock)
[has not yet initiated I/O, truncate_inode_pages_range
the bit will not be cleared] wait_on_page_writeback (wait on bit)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976
Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty and inode. This
will result in a write which will correctly cause a kernel panic
because these filesystem are (and must be) immutable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2812
Mark the error handling branch as unlikely() because the current
kernel interface can never return NULL. However, we want to keep
the error handling in case this behavior changes in the futre.
Plus fix a small style issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#2703
Inclusion of SPL compatibility headers was moved out of the public
header sys/types.h to avoid conflicts with external packages. Include a
few compatiblity headers explicitly to cope with that change. Also,
sort some linux-specific inclusions alphabetically.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2898
Fix a few cases where null-byte termination of strings was done
unnecessarily or incorrectly.
- The snprintf() function always produces a null-byte terminated string
for non-negative return values, so it is not necessary to write out a
null-byte as a separate step.
- Also, it is unsafe to use the return value of snprintf() as an offset
for placing a null-byte, because if the output was truncated the return
value is the number of bytes that _would_ have been written had enough
space been available. Therefore the return value may index beyond the
array boundaries.
- Finally, snprintf() accounts for the null-byte when limiting its output
size, so there is no need to pass it a size parameter that is one less
than the buffer size.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2875
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.
In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query
the pool status during the hung import would not return as the import
holds the pool lock.
The only way to import such a pool would be to specify -o readonly=on
to the zpool import.
zdb -bb <pool> can be used to check for "deferred free" size which is
where this lost space will be counted.
References:
https://github.com/freebsd/freebsd/commit/48431b7http://svnweb.freebsd.org/base?view=revision&revision=273158https://reviews.csiden.org/r/132/
Porting notes:
This issue was filed as illumos 5347 and a more comprehensive fix is
under review. Once that change is finalized it will be integrated, in
the meanwhile the FreeBSD fix has been merged to prevent the issue.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2896
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written. This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.
The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:
setfattr -n user.test -v very_very_very..._long_value <file>
setfattr -n user.test -v short_value <file>
The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2663Closes#2700Closes#2701Closes#2717Closes#2863Closes#2884
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also node the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth paramter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix a few dprintf format specifiers that disagreed with their argument
types. These came to light as compiler errors when converting dprintf
to use the Linux trace buffer. Previously this wasn't a problem,
presumably because the SPL debug logging uses vsnprintf which must
perform automatic type conversion.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a new file named arc_impl.h and move a few internal
ARC structure definitions into this file. This is
needed in order to allow the Linux tracepoint functions to grub
around in the internals of these structures.
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Due to evidence of contention both the buf_hash_table and the
dbuf_hash_table sizes have been increased from 256 to 8192.
This increase in hash table size adds approximating 0.5M to
our fixed memory footprint. This relatively small increase
is not expected to cause problems even on low memory machines.
This footprint will also become dynamic when the persistent
L2ARC support is finalized. In the meanwhile, this small
change significantly reduces contention for certain workloads.
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes#1291
These symbols are needed by consumers (i.e. Lustre) who wish to
integrate with the ZIL. In addition the zil_rollback_destroy()
prototype was removed because the implementation of this function
was removed long ago.
Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2892
When building zfs modules with kernel, compiled from deb.src, the
packaging process ends up installing the modules in the wrong place.
Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2822
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects. It also requires the return value of @count_objects to
return the number of objects actually freed whereas the previous @shrink
callback returned the number of remaining freeable objects.
This patch adds support for the new @scan_objects return value semantics.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#2837
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
space map_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5164https://www.illumos.org/issues/5165https://github.com/illumos/illumos-gate/commit/b1be289
Porting Notes:
The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.
The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz. The upstream commit incorrectly
used space_map_blksize.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2697
4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4958https://github.com/illumos/illumos-gate/commit/2a104a5
Porting notes:
Keep the ZIO_FLAG_FASTWRITE define. This is for a feature present
in Linux but not yet in *BSD.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2697
The general strategy used by ZFS to verify that blocks are valid is
to checksum everything. This has the advantage of being extremely
robust and generically applicable regardless of the contents of
the block. If a blocks checksum is valid then its contents are
trusted by the higher layers.
This system works exceptionally well as long as bad data is never
written with a valid checksum. If this does somehow occur due to
a software bug or a memory bit-flip on a non-ECC system it may
result in kernel panic.
One such place where this could occur is if somehow the logical
size stored in a block pointer exceeds the maximum block size.
This will result in an attempt to allocate a buffer greater than
the maximum block size causing a system panic.
To prevent this from happening the arc_read() function has been
updated to detect this specific case. If a block pointer with an
invalid logical size is passed it will treat the block as if it
contained a checksum error.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2678
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp(). Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose. Rather than emulating those interfaces we remove the
redundant checks.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2804
5162 zfs recv should use loaned arc buffer to avoid copy
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5162https://github.com/illumos/illumos-gate/commit/8a90470
Porting notes:
Fix spelling error 's/arena/area/' in dmu.c.
In restore_write() declare bonus and abuf at the top of the function.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2696
When a clone is created of a snapshot that has been marked for
deferred destroy (with "zfs destroy -d"), the clone "inherits" the
defer_destroy flag from the origin, and any snapshots of the clone
"inherit" the defer_destroy flag from the clone. This causes a strange
situation where the clone's snapshots are marked for defer_destroy but
they have no holds or clones. If the clone's snapshot gets a hold or
clone, which is then deleted, we will honor the incorrectly-set
defer_destroy flag and delete the snapshot!
Steps to reproduce:
* zpool create test c1t1d0
* zfs create test/fs
* zfs snapshot test/fs@a
* zfs clone test/fs@a test/clone
* zfs destroy -d test/fs@a
* zfs clone test/fs@a test/clone2
* zfs snapshot test/clone2@a
* zfs hold hld test/clone2@a
* zfs release hld test/clone2@a
* zfs list -r -t all test
<test/clone2@a has been destroyed>
We noticed that this causes dcenter to get very confused, because it
treats snapshots that are marked defer_destroy as not existing. So it
won't see any snapshots of the clone that's marked defer_destroy.
5150 - zfs clone of a defer_destroy snapshot causes strangeness
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/projects/illumos-gate//issues/5150https://github.com/illumos/illumos-gate/commit/42fcb65
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2690
Restore_object should not use two transactions to restore an object:
* one transaction is used for dmu_object_claim
* another transaction is used to set compression, checksum and most
importantly bonus data
* furthermore dmu_object_reclaim internally uses multiple transactions
* dmu_free_long_range frees chunks in separate transactions
* dnode_reallocate is executed in a distinct transaction
The fact the dnode_allocate/dnode_reallocate are executed in one
transaction and bonus (re-)population is executed in a different
transaction may lead to violation of ZFS consistency assertions if the
transactions are assigned to different transaction groups. Also, if
the first transaction group is successfully written to a permanent
storage, but the second transaction is lost, then an invalid dnode may
be created on the stable storage.
3693 restore_object uses at least two transactions to restore an object
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <andriy.gapon@hybridcluster.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Original authors: Matthew Ahrens and Andriy Gapon
References:
https://www.illumos.org/issues/3693https://github.com/illumos/illumos-gate/commit/e77d42e
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2689
In zfs_acl_chown_setattr(), the zfs_mode_comput() function is used to
create a traditional mode value based on an ACL. If no ACL exists, this
processing shouldn't be done. Problems caused by this were most evident
on version 4 filesystems which not only don't have system attributes,
and also frequently have empty ACLs. On such filesystems, performing a
chown() operation could have the effect of dirtying the mode bits in
memory but not on the file system as follows:
# create a file with typical mode of 664
echo test > test
chown anyuser test
ls -l test
and the mode will show up as all zeroes. Unmounting/mounting and/or
exporting/importing the filesystem will reveal the proper mode again.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1264
Reviewed by Matthew Ahrens <mahrens@delphix.com>
Reviewed by Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://github.com/illumos/illumos-gate/commit/b8289d2https://www.illumos.org/issues/3756
Porting notes:
The static function zfs_prop_activate_feature() was removed because
this change removes the only caller. The function was not removed
from Illumos but instead left as dead code. However, to keep gcc
happy it was removed from Linux and may be easily restored if needed.
Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1540
The new zpl_aio_write() and zpl_aio_read() functions use kmem_alloc()
to allocate enough memory to hold the vectorized IO. While this
allocation will be small it's been observed in practice to sometimes
slightly exceed the 8K warning threshold by a few kilobytes.
Therefore, the KM_NODEBUG flag has been added to suppress warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2774
When selecting a mirror child it's possible that map allocated by
vdev_mirror_map_allc() contains a NULL for the child vdev. In
this case the child should be skipped and the read issues to
another member of the mirror.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1744
Modify the code to use the utsname() kernel function rather than
a global variable. This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2). This means that it
will behave consistently in both contexts.
This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions. And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.
Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
This functionality is optional and until Linux 3.0, which
provided per-filesystem shinkers, they was never a reasonable
interface. Therefore, this functionality is being dropped
for earlier kernels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
When ZPIOS was originally written it was designed to use the
device_create() and device_destroy() functions. Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.
As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions. This interface
for registering character devices has remained stable, is simple,
and provides everything we need.
Therefore the code has been reworked to use this interface. The
higher level ZFS code has always depended on these same interfaces
so this is also as a step towards minimizing our kernel dependencies.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
This is a debug patch designed to ensure an error code is logged
to the console when this VERIFY() is hit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #1440
Commit e022864 introduced a regression for kernels which are built
with CONFIG_DEBUG_PREEMPT. The use of CPU_SEQID in a preemptible
context causes zio_nowait() to trigger the BUG. Since CPU_SEQID
is simply being used as a random index the usage here is safe. To
resolve the issue preempt is disable while calling CPU_SEQID.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2769
5176 lock contention on godfather zio
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5176https://github.com/illumos/illumos-gate/commit/6f834bc
Porting notes:
Under Linux max_ncpus is defined as num_possible_cpus(). This is
largest number of cpu ids which might be available during the life
time of the system boot. This value can be larger than the number
of present cpus if CONFIG_HOTPLUG_CPU is defined.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2711
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.
26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.
This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
As an attempt to perform the page truncation more optimally, the
hole-punching support added in 223df0161f
truncated performed the operation in two steps: first, sub-page "stubs"
were zeroed under the range lock in zfs_free_range() using the new
zfs_zero_partial_page() function and then the whole pages were truncated
within zfs_freesp(). This left a window of opportunity during which
the full pages could be touched.
This patch closes the window by moving the whole-page truncation into
zfs_free_range() under the range lock.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2733
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5138https://github.com/illumos/illumos-gate/commit/af3465d
Porting notes:
Because support for exposing a uint64_t parameter wasn't added
until v3.17-rc1 the zfs_free_max_blocks variable has been declared
as a unsigned long. This is already far larger than required and
it allows us to avoid additional autoconf compatibility code.
The default value has been set to 100,000 on Linux instead of
ULONG_MAX which is used on Illumos. This was done to limit the
number of outstanding IOs in the system when snapshots are destroyed.
This helps ensure individual TXG sync times are kept reasonable and
memory isn't wasted managing a huge backlog of outstanding IOs.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2675Closes#2581
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4753https://github.com/illumos/illumos-gate/commit/73527f4
Comments by Matt Ahrens from the issue tracker:
When a sync task is waiting for a txg to complete, we should hurry
it along by increasing the number of outstanding async writes
(i.e. make vdev_queue_max_async_writes() return a larger number).
Initially we might just have a tunable for "minimum async writes
while a synctask is waiting" and set it to 3.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2716
5139 SEEK_HOLE failed to report a hole at end of file
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5139https://github.com/illumos/illumos-gate/commit/0fbc0cd
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2714
LLVM's static analyzer reported that we could pass an uninitialized
pool_guid to spa_by_guid() in vdev_inuse(). Upon review, it is correct.
An attempt to repurpose a spare or L2ARC drive from an exported pool
will cause the pool_guid passed to spa_by_guid() to be unintialized
information from the stack. This will cause non-deterministic behavior.
Since there is no reason why we cannot repurpose such disks, we modify
vdev_inuse() to avoid calling spa_by_guid() when they are detected.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
5161 add tunable for number of metaslabs per vdev
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5161https://github.com/illumos/illumos-gate/commit/bf3e216
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2698
5177 remove dead code from dsl_scan.c
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5177https://github.com/illumos/illumos-gate/commit/5f37736
Porting notes:
The local variable 'buf' was removed from dsl_scan_visitbp().
This wasn't part of the original patch but it should have been.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2712
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/projects/illumos-gate//issues/5140https://github.com/illumos/illumos-gate/commit/2243853
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2676
When rolling back a mounted filesystem zfs_suspend() is called
which acquires the z_teardown_inactive_lock. This lock can not
be dropped until the filesystem has been rolled back and resumed
in zfs_resume_fs().
Therefore, we must not call iput() under this lock because it
may result in the inode->evict() handler being called which also
takes this lock. Instead use zfs_iput_async() to ensure dropping
the last reference is deferred and runs in a safe context.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2670
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2). Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.
Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation. It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.
Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.
Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#2619
5117 space map reallocation can cause corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/projects/illumos-gate/issues/5117https://github.com/illumos/illumos-gate/commit/e503a68
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2662
If a non-ZAP object is passed to zap_lockdir() it will be treated
as a valid ZAP object. This can result in zap_lockdir() attempting
to read what it believes are leaf blocks from invalid disk locations.
The SCSI layer will eventually generate errors for these bogus IOs
but the caller will hang in zap_get_leaf_byblk().
The good news is that is a situation which can not occur unless the
pool has been damaged. The bad news is that there are reports from
both FreeBSD and Solaris of damaged pools. Specifically, there are
normal files in the filesystem which reference another normal file
as their parent.
Since pools like this are known to exist the zap_lockdir() function
has been updated to verify the type of the object. If a non-ZAP
object has been passed it EINVAL will be returned immediately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2597
Issue #2602
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.
ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.
One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:
1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.
2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.
3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.
4. Making an asynchronous write synchronous is non sequitur.
Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.
It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.
Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.
It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:
```
Which operations are cancelable is implementation-defined.
```
http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html
The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.
Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility. We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#223Closes#2373
This commit should prevent a deadlock on dp_config_rwlock when
running `zfs rename` by ensuring zvol_rename_minors() is not
called under this lock.
Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2652.
Closes#2525.
This gives a huge performance improvement in operations with deduped
datasets especially when the bottleneck is the amount of ram
available for zfs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2639
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.
Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4970https://www.illumos.org/issues/4971https://www.illumos.org/issues/4972https://www.illumos.org/issues/4973https://www.illumos.org/issues/4974https://github.com/illumos/illumos-gate/commit/e42d205
Notes:
This set of patches adds a set of tunable parameters for the
"extreme rewind" mode of pool import which allows control over
the traversal performed during such an import.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2598
5034 ARC's buf_hash_table is too small
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5034https://github.com/illumos/illumos-gate/commit/63e911b
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2615
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2158
4631 zvol_get_stats triggering too many reads
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4631https://github.com/illumos/illumos-gate/commit/bbfa8ea
Ported-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2612Closes#2480
Illumos 4982 added code to metaslab_fragmentation() to proactively update
space maps when the spacemap_histogram feature is enabled. This should
only happen when the pool is writeable.
References:
https://www.illumos.org/issues/4982https://github.com/illumos/illumos-gate/commit/2e4c998
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2595
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4976https://www.illumos.org/issues/4978https://www.illumos.org/issues/4979https://www.illumos.org/issues/4980https://www.illumos.org/issues/4981https://www.illumos.org/issues/4982https://www.illumos.org/issues/4983https://www.illumos.org/issues/4984https://github.com/illumos/illumos-gate/commit/2e4c998
Notes:
The "zdb -M" option has been re-tasked to display the new metaslab
fragmentation metric and the new "zdb -I" option is used to control
the maximum number of in-flight I/Os.
The new fragmentation metric is derived from the space map histogram
which has been rolled up to the vdev and pool level and is presented
to the user via "zpool list".
Add a number of module parameters related to the new metaslab weighting
logic.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2595
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.
Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations. This
was possible due to the following related changes.
e89260a Directory xattr znodes hold a reference on their parent
6f9548c Fix deadlock in zfs_zget()
0a50679 Add zfs_iput_async() interface
4dd1893 Avoid 128K kmem allocations in mzap_upgrade()
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457Closes#2058Closes#2128Closes#2240
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks. When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457
As originally implemented the mzap_upgrade() function will
perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc().
These large allocations can potentially block indefinitely
if contiguous memory is not available. Since this allocation
is done under the zap->zap_rwlock it can appear as if there is
a deadlock in zap_lockdir(). This is shown below.
The optimal fix for this would be to rework mzap_upgrade()
such that no large allocations are required. This could be
done but it would result in us diverging further from the other
implementations. Therefore I've opted against doing this
unless it becomes absolutely necessary.
Instead mzap_upgrade() has been updated to use zio_buf_alloc()
which can reliably provide buffers of up to SPA_MAXBLOCKSIZE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Close#2580
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage. Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.
To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure. All
access must be serialized through the vq_lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2572
The dsl_dataset_rollback_check() function is executed in the
txg_sync context. To prevent a potential deadlock due to direct
memory reclaim it must use KM_PUSHPAGE. This was introduced by
the recent 'zfs bookmark' features, commit da53684.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Dillmann <eric@jave.fr>
Closes#2569
4914 zfs on-disk bookmark structure should be named *_phys_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4914https://github.com/illumos/illumos-gate/commit/7802d7b
Porting notes:
There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated. This should reduce the likelihood of further
problems like issue #2094 from occurring.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2558
4881 zfs send performance degradation when embedded block pointers
are encountered
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4881https://github.com/illumos/illumos-gate/commit/06315b7
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2547
4897 Space accounting mismatch in L2ARC/zpool
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
From the illumos issue tracker:
L2ARC vdev space usage statistics are calculated as the delta
between the maximum and minimum vdev offset ever written to
by the L2ARC fill thread, but do not inform the user of how
much space in between these two offsets is actually taken up by
cached buffers. This fix changes that so that vdev space usage
stats on L2ARC devices accurately track the volume of buffers
stored on them, allowing users to see the exact L2ARC usage in
"zpool iostat -v".
References:
https://www.illumos.org/issues/4897https://github.com/illumos/illumos-gate/commit/3038a2b
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2555
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4390https://github.com/illumos/illumos-gate/commit/7fd05ac
Porting notes:
Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code. This patch should reduce its
stack footprint a bit more.
The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.
The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global. Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2545
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4757https://www.illumos.org/issues/4913https://github.com/illumos/illumos-gate/commit/5d7b4d4
Porting notes:
For compatibility with the fastpath code the zio_done() function
needed to be updated. Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2544
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Description from Matt Ahrens's bug report at Delphix:
Add a new zfs property, "redundant_metadata" which can have values
"all" or "most". The default will be "all", which is the current
behavior. Setting to "most" will cause us to only store 1 copy of
level-1 indirect blocks of user data files.
Additional notes:
The new man page section for this property states
"The exact behavior of which metadata blocks
are stored redundantly may change in future releases."
and:
"When set to most, ZFS stores an extra copy of most types of
metadata. This can improve performance of random writes,
because less metadata must be written."
The current implementation is as described above in Matt's blog.
It is controlled by a new global integer
"zfs_redundant_metadata_most_ditto_level", currently initialized
to 2. When "redundant_metadata" is set to "most", only indirect
blocks of the specified level and higher will have additional ditto
blocks created.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2542
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4754https://www.illumos.org/issues/4755https://github.com/illumos/illumos-gate/commit/b6240e8
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2533
4374 dn_free_ranges should use range_tree_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4374https://github.com/illumos/illumos-gate/commit/bf16b11
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2531
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a
References:
https://www.illumos.org/issues/4370https://www.illumos.org/issues/4371https://github.com/illumos/illumos-gate/commit/43466aa
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2529
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a
References:
https://www.illumos.org/issues/4171https://www.illumos.org/issues/4172https://github.com/illumos/illumos-gate/commit/2acef22
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2528
As part of the write throttle & i/o schedule performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2488
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.
References:
https://www.illumos.org/issues/4756https://github.com/illumos/illumos-gate/commit/30beaff
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4730https://github.com/illumos/illumos-gate/commit/be08211
Porting notes:
Under ZFSonlinux, one of the effects of not destroying the taskq is that
zdb would never exit (due to the SPL taskq implementation).
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
This changes moves the called to metaslab_group_alloc_update() to the
metaslab_sync_reassess() function. The original placement of the call
within metaslab_sync_done() appears to have been a simple mistake,
introduced by ac72fac3ea.
This aligns us more closely to the upstream illumos code base.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot. If they do somehow get
dirtied an attempt will make made to write them to disk. In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2405
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4168https://www.illumos.org/issues/4169https://www.illumos.org/issues/4170https://github.com/illumos/illumos-gate/commit/7fdd916
Porting notes:
Of particular interest when troubleshooting corrupted pools, the
commonly-used "zdb -e" operation may perform verbatim imports and
furthermore, it will soon have direct support for verbatim imports via
a new "-V" option. The 4169 fix eliminates a common segfault case in
which spa_history_log_version() tries to access an un-opened dsl_pool_t.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2451Closes#2283Closes#2467
The parameter was added as illumos issue 4081 which was committed to
zfsonlinux in ac72fac3ea. This patch
documents the parameter and allows for it to be set as a module parameter.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2483
If the smaller of 2 different sized child vdev's of a mirrored vdev is
detached, and the pool has the autoexpand property set to off, as the
remaining larger vdev is promoted to a top level vdev it fails to retain
the asize of the original top level mirror vdev and therefore partially
autoexpands.
This partially autoexpanded state leaves the new vdev too large to
re-mirror by adding the smaller vdev back in, and the pool fails to
utilize the space until next imported.
If the autoexpand property is set to on, the child vdev grows
in size after it has been promoted to a top level vdev as expected.
This commit causes the remaining child mirror to retain the asize of its
old parent mirror vdev if the autoexpand property is set to off,
this allows the smaller vdev to be re-added if required the vdev
can then be told to expand if required by the usual using zpool online -e.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#1208
4936 lz4 could theoretically overflow a pointer with a certain input
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported by: Tim Chase <tim@chase2k.com>
References:
https://illumos.org/issues/4936https://github.com/illumos/illumos-gate/commit/58d0718
Porting notes:
This fixes the widely-reported "20-year-old vulnerability" in
LZO/LZ4 implementations which inherited said bug from the reference
implementation.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2429
Added several comments regarding the removal of real_LZ4_uncompress()
which exists in the reference implementation but has been removed here
since it's not used.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This reverts commit 91579709fc which
limited the asynchronous dispatch to kernel space. We want to do
this for two reasons:
1) While we have slightly more headroom in user space excessively
deep stacks have been observed while running ztest, see #2293.
2) Removing this conditional makes the pipeline behave consistently
regardless of if it's executing in kernel space or user space.
This way we're more likely to uncover subtle issues with ztest.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2384
We should not override the default memory type of the kmem cache. This
was done previously to force certain objects which were slightly over
object size limit cut off in to KMC_KMEM caches for better performance.
The zfsonlinux/spl#356 patch slightly increases the default cut off
from 511 bytes 1024 bytes for x86_64. This means there is long longer
a need to override the default for the caches. And since the default
values are now being used the new spl_kmem_cache_slab_limit and
spl_kmem_cache_kmem_limit tunables will apply to all kmem caches.
The following is a list of caches that will be impacted:
| object size | forced type | default type
----------------- | ------------- | ------------- | --------------
dnode_t | 936 bytes | KMC_KMEM | KMC_KMEM
zio_cache | 1104 bytes | *KMC_KMEM | *KMC_VMEM
zio_link_cache | 48 bytes | KMC_KMEM | KMC_KMEM
zio_vdev_cache | 131088 bytes | KMC_VMEM | KMC_VMEM
zio_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM
zio_data_buf_512 | 512 bytes | KMC_KMEM | KMC_KMEM
zio_buf_1024 | 1024 bytes | KMC_KMEM | KMC_KMEM
zio_data_buf_1024 | 1024 bytes | +KMC_VMEM | +KMC_KMEM
* Cache memory type will change from KMC_KMEM to KMC_VMEM.
+ Cache memory type will change from KMC_VMEM to KMC_KMEM.
This patch removes another slight point of divergence between ZoL
and Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#2337
For consistency with disk vdevs honor the zfs_nocacheflush tunable.
This setting is available primarily for debugging and performance
analysis.
Signed-off-by: HC <mmttdebbcc@yahoo.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2336
In the case where a variable-sized SA overlaps the spill block pointer and
a new variable-sized SA is being added, the header size was improperly
calculated to include the to-be-moved SA. This problem could be
reproduced when xattr=sa enabled as follows:
ln -s $(perl -e 'print "x" x 120') blah
setfattr -n security.selinux -v blahblah -h blah
The symlink is large enough to interfere with the spill block pointer and
has a typical SA registration as follows (shown in modified "zdb -dddd"
<SA attr layout obj> format):
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ]
Adding the SA xattr will attempt to extend the registration to:
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ]
but since the ZPL_SYMLINK SA interferes with the spill block pointer, it
must also be moved to the spill block which will have a registration of:
[ ZPL_SYMLINK ZPL_DXATTR ]
This commit updates extra_hdrsize when this condition occurs, allowing
hdrsize to be subsequently decreased appropriately.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2214
Issue #2228
Issue #2316
Issue #2343
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.
This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.
The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.
Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2301
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.
We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2270
The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail. However, the NULL dereference at
clearly shows that under certain circumstances it is possible. Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.
BUG: unable to handle kernel NULL pointer dereference at 00000570
crash> struct -o vdev
struct vdev {
[0] uint64_t vdev_id;
... ...
[1376] uint64_t vdev_deflate_ratio;
Given that this can happen this patch add the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com>
Closes#1707Closes#1987Closes#1891
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When fetching property values of snapshots, a check against the head
dataset type must be performed. Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".
This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes. It also did not allow for the display of
"volsize" of a snapshot of a volume.
This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly. This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2265
A minor style issue was accidentally introduced by aa7d06a.
This change resolves that style problem.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Today the metaslab_debug logic performs two tasks:
- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync
This change provides knobs for each of these independently.
References:
https://illumos.org/issues/4101https://github.com/illumos/illumos-gate/commit/0713e23
Notes:
1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2227
When the system attributes (SAs) for an object exceed what can
can be stored in the bonus area of a dnode a spill block is
allocated. These spill blocks are currently considered data
blocks. However, they should be accounted for as meta data
because they are effectively an extension of the dnode.
While this may seem like a minor accounting issue it has broader
implications. The key thing to be aware of is that each spill
block will hold a reference on its parent dnode. The dnode in
turn holds a reference on its dbuf in the dnode object. This
means that a single 512 byte data buffer for a spill block can
pin over 16k of meta data. This is analogous to the small file
situation described in 2b13331 where a relatively small number
of data buffer can cause the ARC to exceed the meta limit.
However, unlike the small file case a spill block can legitimately
be considered meta data. By changing the spill block to meta data
they will now be dropped from the cache when the meta limit is
reached. This then allows the dnodes and dbufs which the spill
block was pinning to be released.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#2294
As described in the comment above dmu_tx_assign() this function must
only fail if the pool is out of space. If for some other reason the
TX cannot be assigned (such as memory pressure) ERESTART must be
returned. Alternately, EAGAIN could be returned to inject a delay
but that isn't required because the caller will block on the condition
variable waiting for the next TXG.
/*
* Assign tx to a transaction group. txg_how can be one of:
*
* (1) TXG_WAIT. If the current open txg is full, waits until there's
* a new one. This should be used when you're not holding locks.
* It will only fail if we're truly out of space (or over quota).
* ...
*/
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2287
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().
Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method. That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.
References:
https://bugs.gentoo.org/show_bug.cgi?id=483516
Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1691
Some network related block device uses tcp_sendpage, which doesn't
behave well when using 0-count page. Add assertion to catch them.
This has a runtime dependency on:
zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2277
Endianness detection in LZ4 is broken in user-space builds. This
bug corrupts compressed data and manifests itself in several ztest
failures. When LZ4 was originally ported to Illumos ZFS, the proper
checks for Linux were stripped out. The Linux port then inherited
the remaining detection code that works on Illumos but not on Linux.
The current LZ4 endianness check misuses the condition
defined(__BIG_ENDIAN) to indicate a big-endian system. On Linux
__BIG_ENDIAN is defined uncondtionally in the user-space header
/usr/include/endian.h, regardless of the endianness of the system.
The kernel does not use this header, so only user-space builds are
affected.
While we could fix this by restoring the upstream LZ4 endianness
detection code, reliable checks already exist in
libspl/include/sys/isa_defs.h. This change uses the libspl results
to replace the word-size and endianness checks in LZ4, simplifying
the code and reducing duplication.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#1963Fixes#1964Fixes#1965
Due to an asymmetry in the kmem accounting a memory leak was being
reported when it was only an accounting issue. All memory allocated
with kmem_alloc() must be released with kmem_free() or it will not
be properly accounted for.
In this case the code used strfree() to release the memory allocated
by kmem_alloc(). Presumably this was done because the size of the
memory region wasn't available when the memory needed to be freed.
To resolve this issue the code has been updated to use strdup() instead
of kmem_alloc() to allocate the memory. Like strfree(), strdup() is
not integrated with the memory accounting. This means we can use
strfree() to release it like Illumos.
SPL: kmem leaked 10/4368729 bytes
address size data func:line
ffff880067e9aa40 10 ZZZZZZZZZZ zfsdev_ioctl:5655
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#2262
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2142
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
After the dmu_req_copy change, bi_io_vecs are not touched, so this is no
longer needed.
This reverts commit e26ade5101.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it
can continue in subsequent passes. However, after the immutable biovec changes
in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how
many bytes are already copied and it will skip to the right spot accordingly.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted. If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources. Unfortunately, this introduced another deadlock.
INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
Tainted: P O 3.13.6 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u24:6 D ffff88107fcd2e80 0 891 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
Call Trace:
[<ffffffff813dd1d4>] schedule+0x24/0x70
[<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[<ffffffff811608c5>] ilookup+0x65/0xd0
[<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[<ffffffff811071ec>] do_writepages+0x1c/0x40
[<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[<ffffffff8106072c>] process_one_work+0x16c/0x490
[<ffffffff810613f3>] worker_thread+0x113/0x390
[<ffffffff81066edf>] kthread+0xdf/0x100
This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks. Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used. This gives us the information that ilookup() provided without
the risk of a deadlock.
Alternately, this race could be closed by registering an sops->drop_inode()
callback. The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.
To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev. This covers both io and checksum zevents but
may be used but other scripts.
In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:
vdev_spare_paths - List of vdev paths for all hot spares.
vdev_spare_guids - List of vdev guids for all hot spares.
vdev_read_errors - Read errors for the problematic vdev
vdev_write_errors - Write errors for the problematic vdev
vdev_cksum_errors - Checksum errors for the problematic vdev.
By default the required hot spare scripts are installed but this
functionality is disabled. To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.
These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system. It also requires that the
autoexpand policy be ported from Illumos, see:
https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c
Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location. When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor. Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor. When the file descriptor is
closed the kernel state tracking the cursor is destroyed.
This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID. When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents. By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2189
Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,
This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66, which
ported the Illumos 3744 changes.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2156
4088 use after free in arc_release()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4088illumos/illumos-gate@ccc22e1304
From the illumos issue:
A race-induced use after free occurs in arc_release() where the
ARC header is used outside the critical section protected by the
hash_lock.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2162
The spa_label_features nvlist is used in the sync context during pool
version upgrade.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2168
These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context. Both a blocking and non-blocking
version are available to callers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign(). This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.
One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded. However, this case should not often arise in practice,
because most callers currently use one of these forms:
dmu_tx_assign(tx, TXG_WAIT);
dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed. Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.
Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram. Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1915
The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007.
Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead. To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io(). This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io(). To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().
Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696
A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.
- The sync task restructuring commit
13fe019870 removed the code
that bumpted dmu_tx_quota. The counter is now bumped in two
cases, instead of just the one case as before (after the result
of dsl_dataset_check_quota call). The second case is where
we check the requested reservation against the actual pool size,
as this is an implicit quota of sorts.
- The write throttle restructuring commit
e8b96c6007 makes dmu_tx_how and
dmu_tx_inflight obsolete, so they are removed.
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1914
Userland tools such as blkid, grub2-probe and zdb will go through the
buffer cache. However, ZFS uses on submit_bio() to bypass the buffer
cache when performing IO operations on vdevs for efficiency purposes.
This permits the on-disk state and buffer cache to fall out of
synchronization. That causes seemingly random failures when tools
reading stale metadata from the buffer cache try to access references to
data that is no longer there.
A particularly bad failure this causes involves grub2-probe, which is
used by grub2-mkconfig. Ordinarily, a rootfs might be called
rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe,
grub2-mkconfig will generate a configuration file containing
/ROOT/gentoo, which omits the pool name and causes a boot failure.
This is avoidable by calling invalidate_bdev() on each flush, which is a
simple way to ensure that all non-dirty pages are wiped. Since userland
tools rarely access vdevs directly, this should be a fancy noop >99.999%
of the time and have little impact on IO. We could have tried a finer
grained approach for the rare instances in which the vdevs are accessed
frequently by userland. However, that would require consideration of
corner cases and it is not worth the effort.
Memory-wise, it would have been better to use a Linux kernel API hook to
disable the buffer cache on such devices, but it provides us no way of
doing that, so we opt for this approach instead. We should revisit that
idea in the future when higher priority issues have been tackled.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2150
4574 get_clones_stat does not call zap_count in non-debug kernel
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/4574illumos/illumos-gate@03d1795fa6
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2147
The length (number of integers) argument passed to zap_lookup was wrong;
likely as a result of performing stack-reduction on the function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2141
Remove recursion from dsl_dir_willuse_space() to reduce stack usage.
Issues with stack overflow were observed in zfs recv of zvols,
likelihood of an overflow is proportional to the depth of the dataset
as dsl_dir_willuse_space() recurses to parent datasets.
Signed-off-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2069
Unfortunately, this change is an cheap attempt to work around a
pathological workload for the ARC. A "real" solution still needs to be
fleshed out, so this patch is intended to alleviate the situation in the
meantime. Let me try and describe the problem..
Data buffers residing in the dbuf hash table (dbuf cache) will keep a
hold on their respective dnode, this dnode will in turn keep a hold on
its backing dbuf (the physical block of the dnode object backing it).
Since the dnode has a hold on its backing dbuf, the arc buffer for this
dbuf is unevictable. What this essentially boils down to, "data" buffers
have the potential to pin "metadata" in the arc (as a result of these
dnode object buffers being unevictable).
This scenario becomes a real problem when the workload consists of many
small files (e.g. creating millions of 4K files). With this workload,
the arc's "arc_meta_used" space get filled up with buffers for any
resident directories as well as buffers for the objset's dnode object.
Once the "arc_meta_limit" is reached, the directory buffers will be
evicted and only the unevictable dnode object buffers will reside. If
the workload is simply creating new small files, these dnode object
buffers will never even be needed again, whereas the directory buffers
will be used constantly until the creates move to a new directory.
If "arc_c" and "arc_meta_limit" are sized appropriately, this
situation wont occur. This is because as the data buffers accumulate,
"arc_size" will eventually approach "arc_c" (before "arc_meta_used"
reaches "arc_meta_limit"); at that point the data buffers will be
evicted, which releases the hold on the dnode, which releases the hold
on the dnode object's dbuf, which allows that buffer to be evicted from
the arc in preference to more "useful" metadata.
So, to side step the issue, we simply need to ensure "arc_size" reaches
"arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to
pick a proper limit, we have to do some math.
To make things a little easier to follow, it is assumed that there will
only be a single data buffer per file (which is probably always the case
for "small" files anyways).
Based on the current internals of the arc, if N files residing in the
dbuf cache all pin a single dnode buffer (i.e. their dnodes all share
the same physical dnode object block), then the following amount of
"arc_meta_used" space will be consumed:
- 16K for the dnode object's block - [ 16384 bytes]
- N * sizeof(dnode_t) -------------- [ N * 928 bytes]
- (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes]
- (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes]
- (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes]
To simplify, these N files will pin the following amount of
"arc_meta_used" space as unevictable:
Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280)
Pinned "arc_meta_used" bytes = 17000 + N * 1544
This pinned space is regardless of the size of the files, and is only
dependent on the number of pinned dnodes sharing a physical block
(i.e. N). For example, 32 512b files sharing a single dnode object
block would consume the same "arc_meta_used" space as 32 4K files
sharing a single dnode object block.
Now, given a files size of S, we can determine the total amount of
space that will be consumed in the arc:
Total = 17000 + N * 1544 + S * N
^^^^^^^^^^^^^^^^ ^^^^^
metadata data
So, given these formulas, we can generate a table which states the ratio
of pinned metadata to total arc (meta + data) using different values of
N (number of pinned dnodes per pinned physical dnode block) and S (size
of the file).
File Sizes (S)
| 512 | 1024 | 2048 | 4096 | 8192 | 16384 |
---+----------+----------+----------+----------+----------+----------+
1 | 0.973132 | 0.947670 | 0.900544 | 0.819081 | 0.693597 | 0.530921 |
2 | 0.951497 | 0.907481 | 0.830632 | 0.710325 | 0.550779 | 0.380051 |
N 4 | 0.918807 | 0.849809 | 0.738842 | 0.585844 | 0.414271 | 0.261250 |
8 | 0.877541 | 0.781803 | 0.641770 | 0.472505 | 0.309333 | 0.182965 |
16 | 0.835819 | 0.717945 | 0.559996 | 0.388885 | 0.241376 | 0.137253 |
32 | 0.802106 | 0.669597 | 0.503304 | 0.336277 | 0.202123 | 0.112423 |
As you can see, if we wanted to support the absolute worst case of 1
dnode per physical dnode block and 512b files, we would have to set the
"arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At
that point, it essentially defeats the purpose of having an
"arc_meta_limit" at all.
This patch changes the default value of "arc_meta_limit" to be 75% of
"arc_c_max", which should be good enough for "most" workloads (I think).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.
To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
When the arc is at it's size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from over stepping
it's bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer; unless it's at the
"metadata" limit. In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.
This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".
Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.
In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
This reverts commit c11a12bc3b.
Out of memory events were fixed by reverting this patch.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, it's removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm disabling by default
until its usefulness is better understood.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).
What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constancy drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how it's removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:
commit 302f753f16
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Tue Mar 13 14:29:16 2012 -0700
Integrate ARC more tightly with Linux
Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel start calling the arc's shrinkers.
This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.
In addition, a minimum value of arc_p is attempted to be maintained,
similar to the way a minimum arc_p value is maintained in arc_adapt().
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
Decrease the mimimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.
1) Large systems with over a 1TB of memory are being deployed
and reserving 1/32 of this memory (32GB) as the mimimum
requirement is overkill.
2) Tiny systems like the raspberry pi may only have 256MB of
memory in which case 64MB is far too large.
The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose. If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue #2110
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit. Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.
The additional field has been removed by the previous commit. However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied. This will return the pools to a state in which they
may be imported by other implementations.
The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted. A message similar to one of the following means your
pool must be scrubbed to restore compatibility.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #1 detected.
action: The pool can be imported using its name or numeric identifier,
however there is a compatibility issue which should be corrected
by running 'zpool scrub'
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
$ zpool status
pool: zol-0.6.2-173
state: ONLINE
scan: pool compatibility issue detected.
see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...
If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported. Further advice on how to proceed will be
provided by the error message as follows.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #2 detected.
action: The pool can not be imported with this version of ZFS due to an
active asynchronous destroy. Revert to an earlier version and
allow the destroy to complete before updating.
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:
zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w
Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.
The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys. Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.
This patch does have one limitation that should be mentioned. It will not
detect errata #2 on a pool unless errata #1 is also present. It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.
End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL. The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root. This will display a non-zero
value for any pool with an active asynchronous destroy.
Lastly, it is expected that no user data has been lost as a result of
this erratum.
Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool. These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate. The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation. Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.
To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.
To add a new errata the following changes must be made:
* A new errata identifier must be assigned by adding a new enum value
to the zpool_errata_t type. New enums must be added to the end to
preserve the existing ordering.
* Code must be added to detect the issue. This does not strictly
need to be done at pool import time but doing so will make the
errata visible in 'zpool import' as well as 'zpool status'. Once
detected the spa->spa_errata member should be set to the new enum.
* If possible code should be added to clear the spa->spa_errata member
once the errata has been resolved.
* The show_import() and status_callback() functions must be updated
to include an informational message describing the errata. This
should include an action message describing what an administrator
should do to address the errata.
* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
updated to describe the errata. This space can be used to provide
as much additional information as needed to fully describe the errata.
A link to this documentation will be automatically generated in the
output of 'zpool import' and 'zpool status'.
Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
Commit 1421c89142 added a field to
zbookmark_t that unintentinoally caused a disk format change. This
negatively affected backward compatibility and platform portability.
Therefore, this field is being removed.
The function that field permitted is left unimplemented until a later
patch that will reimplement the field in a way that does not affect the
disk format.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
Various errors can occur when registering property callbacks. As the
author's comments indicate, the code is very paranoid about preserving
the first-seen error when registering callbacks. This patch causes an
error seen while registering the "relatime" callback to not clobber a
previously-seen error.
Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2117
Commit e0b0ca9 accidentally corrupted the l2_asize displayed in
arcstats. This was caused by changing the l2arc_buf_hdr.b_asize
member from an int to uint32_t type. There are places in the
code where this field is cast to a uint64_t resulting in the
b_hits member being treated as part of b_asize.
To resolve the issue the type has been changed to a uint64_t,
and the b_hits member is placed after the enum to prevent the
size of the structure from increasing.
This is a good example of exactly why it's a bad idea to use
ambiguous types (int) in these structures.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1990
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Refences:
https://www.illumos.org/issues/4188illumos/illumos-gate@bb411a08b0
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2091
Add the "relatime" property. When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access. This behavior
is compatible with the Linux "relatime" mount option.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2064Closes#1917
When transitioning current open TXG into QUIESCE state and opening
a new one txg_quiesce() calls gethrtime():
- to mark the birth time of the new TXG
- to record the SPA txg history kstat
- implicitely inside spa_txg_history_add()
These timestamps are practically the same, so that the first one
can be used instead of the other two. The only visible difference
is that inside spa_txg_history_add() the time spent in kmem_zalloc()
will be counted towards the opened TXG.
Since at this point the new TXG already exists (tx->tx_open_txg
has been already incremented) it is actually a correct accounting.
In any case this extra work is only happening when spa_txg_history
kstat is activated (i.e. zfs_txg_history > 0) and doesn't affect
the normal processing in any way.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Issue #2075
In several cases when digging into kstats we can found two txgs
in SYNC state, e.g.
txg birth state nreserved nread nwritten ...
985452 258127184872561 C 0 373948416 2376272384 ...
985453 258129016180616 C 0 378173440 28793344 ...
985454 258129016271523 S 0 0 0 ...
985455 258130864245986 S 0 0 0 ...
985456 258130867458851 O 0 0 0 ...
However only first txg (985454) is really syncing at this moment.
The other one (985455) marked as SYNCED is actually in a post-QUIESCED
state and waiting to start sync. So, the new TXG_STATE_WAIT_FOR_SYNC
state between TXG_STATE_QUIESCED and TXG_STATE_SYNCED was added to
reveal this situation.
txg birth state nreserved nread nwritten ...
1086896 235261068743969 C 0 163577856 8437248 ...
1086897 235262870830801 C 0 280625152 822594048 ...
1086898 235264172219064 S 0 0 0 ...
1086899 235264936134407 W 0 0 0 ...
1086900 235264936296156 O 0 0 0 ...
Signed-off-by: Igor Lvovsky <ilvovsky@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2075
From the comment in the commit:
Some ZFS implementations (ZEVO) create neither a ZNODE_ACL nor a DACL_ACES
SA in which case ENOENT is returned from zfs_acl_node_read() when the
SA can't be located. Allow chown/chgrp to succeed in these cases rather
than returning an error that makes no sense in the context of the caller.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfs-osx/zfs#86
Closes#1911Closes#2029
The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on. In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1928
Under Linux the zvol_set_volsize() function was originally written
to use dmu_objset_hold()/dmu_objset_rele(). Subsequently, the
dmu_objset_own()/dmu_objset_disown() interfaces were added but
the ZVOL code wasn't updated to take advantage of them.
This was never an issue but after the dsl_pool_config changes
the code now takes the config lock twice. The cleanest solution
is to shift to using dmu_objset_own() which takes a long hold
on the dataset and does not hold the dsl pool lock.
This patch also slightly restructures the existing code such
that it more closely resembles the upstream Illumos code.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2039
It's unsafe to drain the iput taskq while holding the z_teardown_lock
as a writer. This is because when the last reference on an inode is
dropped it may still have pages which need to be written to disk.
This will be done through zpl_writepages which will acquire the
z_teardown_lock as a reader in ZFS_ENTER. Therefore, if we're
holding the lock as a writer in zfs_sb_teardown the unmount will
deadlock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#1988
This change was proposed for Sparc but it's not clear to me
why it's required. Proper support exists in the lz4 code to
detect the endianness and the required builtins are available
for gcc. Still I'm including the patch because it will only
impact Sparc and it may resolve a case which hasn't occured
to me.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
On Sparc sp->blksize will be a 64-bit value which is then cast
incorrectly to a 32-bit value. For big endian systems this
results in an incorrect value for sp->blksize. To resolve the
problem local variables of the correct size are used and then
assigned to sp->blksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
The mis-aligned memory accesses in nvpair_native_embedded() and
nvpair_native_embedded_array() will cause a 'Bus Error' for
architectures such as Sparc which not fully byte addressible.
To avoid this issue care is taken to avoid dereferencing the
potentially mis-aligned packed nvlist_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
When accessing the zp->z_mode through the SA bulk interface we
expect that 64-bits are available to hold the result. However,
on 32-bit platforms mode_t will only be 32-bits so we cannot
pass it to SA_ADD_BULK_ATTR(). Instead a local uint64_t variable
must be used and the result assigned to zp->z_mode.
This went unnoticed on 32-bit little endian platforms because
the bytes happen to end up in the correct 32-bits. But on big
endian platforms like Sparc the zp->z_mode will always end up
set to zero.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
Back the allocations for ddt tables+entries and l2arc headers with
kmem caches. This will reduce the cost of allocating these commonly
used structures and allow for greater visibility of them through the
/proc/spl/kmem/slab interface.
Signed-off-by: John Layman <jlayman@sagecloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1893
Move the libzfs_fini() after the zpool_log_history() call so the
ZPOOL_HIST_CMD entry can get written.
Fix the handling of saved_poolname in zfsdev_ioctl()
which was broken as part of the stack-reduction work in
a168788053.
Since ZoL destroys the TSD data in which the previously successful
ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY
ioctl has a very important restriction: it can only successfully write
a long entry following a successful ioctl() if no intervening vops have
been performed. Some of zfs subcommands do perform intervening vops and
to do the logging themselves. At the moment, the "create" and "clone"
subcommands have been modified appropriately.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1998
During the update process in sa_modify_attrs(), the sizes of existing
variably-sized SA entries are obtained from sa_lengths[]. The case where
a variably-sized SA was being replaced neglected to increment the index
into sa_lengths[], so subsequent variable-length SAs would be rewritten
with the wrong length. This patch adds the missing increment operation
so all variably-sized SA entries are stored with their correct lengths.
Previously, a size-changing update of a variably-sized SA that occurred
when there were other variably-sized SAs in the bonus buffer would cause
the subsequent SAs to be corrupted. The most common case in which this
would occur is when a mode change caused the ZPL_DACL_ACES entry to
change size when a ZPL_DXATTR (SA xattr) entry already existed.
The following sequence would have caused a failure when xattr=sa was in
force and would corrupt the bonus buffer:
open(filename, O_WRONLY | O_CREAT, 0600);
...
lsetxattr(filename, ...); /* create xattr SA */
chmod(filename, 0650); /* enlarges the ACL */
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1978
This change is related to commit 81eaf15 which ensured the correct
allocation handlers were installed for nvlist_alloc(). The nvlist
functions nvlist_dup(), nvlist_pack(), and nvlist_unpack() suffer
from the same issue and have been updated accordingly.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1937
Four new dataset properties have been added to support SELinux. They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8). When one of
these properties is set to something other than 'none'. That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.
For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:
$ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media
This will ensure the filesystem is automatically mounted with that
rootcontext. It is equivalent to manually specifying the rootcontext
with the -o option like this:
$ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media
By default all four contexts are set to 'none'. Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#1504
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written. Others were
caused when the common code was slightly adjusted for Linux.
This patch contains no functional changes. It only refreshes
the code to conform to style guide.
Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request. The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1821
The comment in zfs_close states that "Under Linux the zfs_close() hook
is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close
is associated with every successful struct file creation/deletion, which
should always be balanced.
Here is an example of what's wrong:
Process A B
open(O_SYNC)
z_sync_cnt = 1
open(O_SYNC)
z_sync_cnt = 2
close()
z_sync_cnt = 0
So z_sync_cnt is 0 even if B still has the file with O_SYNC.
Also moves the generic_file_open call before zfs_open to ensure that in
the case generic_file_open fails z_sync_cnt is not incremented. This
is safe because generic_file_open has no side effects.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1962
Update zvol.c to conform to the style guidelines, verified by
running cstyle.pl on the source file. This patch contains
no functional changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #1821
In order to minimize any future disruption caused by the addition
and removal /dev/zfs ioctls this patch makes the following changes.
1) Sync ZoL's ioctl ordering such that it matches Illumos. For
historic reasons the ZFS_IOC_DESTROY_SNAPS and ZFS_IOC_POOL_REGUID
ioctls were out of order.
2) Move Linux and FreeBSD specific ioctls in to their own reserved
ranges. This allows us to preserve the existing ordering when
new ioctls are added by either Illumos or FreeBSD. When an
ioctl is no longer needed it should be retired in place.
This change alters the ZFS user/kernel ABI so make sure you rebuild
both your user and kernel modules. However, it should allow for a
much stabler interface going forward.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1973
Early versions of ZFS coordinated the creation and destruction
of device minors from userspace. This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel. This significantly simplified the code.
However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux. This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel. Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes. This
will minimize, but not eliminate, the disruption to end users.
Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code. While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.
NOTES:
1) This patch introduces one subtle change in behavior which
could not be easily avoided. Prior to this change callers
of 'zfs create -V ...' were guaranteed that upon exit the
/dev/zvol/ block device link would be created or an error
returned. That's no longer the case. The utilities will no
longer block waiting for the symlink to be created. Callers
are now responsible for blocking, this is why a 'udev_wait'
call was added to the 'label' function in scripts/common.sh.
2) The read-only behavior of a ZVOL now solely depends on if
the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant
policy setting in the gendisk structure was removed. This
both simplifies the code and allows us to safely leverage
set_disk_ro() to issue a KOBJ_CHANGE uevent. See the
comment in the code for futher details on this.
3) Because __zvol_create_minor() and zvol_alloc() may now be
called in a sync task they must use KM_PUSHPAGE.
References:
illumos/illumos-gate@681d9761e8
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#1969
4121 vdev_label_init should treat request as succeeded when pool
is read only
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/4121illumos/illumos-gate@973c78e94b
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1863
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime. However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.
This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.
It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.
The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949
The MAX when initializing arc_c_max doesn't make any sense because
it hasn't been set anywhere before. Though, arc_c_max should be
implicitly set to zero when initializing arc_stats, so the MAX
doesn't make any difference.
The MAX was mistakenly left if place when the Illumos default
values were changed for Linux.
Signed-off-by: david.chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1941
This reverts commit 6a7c0ccca4.
A proper fix for Issue #1648 was landed under Issue #1890, so this is no
longer needed.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1648
Under the right conditions sa_find_sizes() will compute an incorrect
size of the system attribute (SA) header. This causes a failed assertion
when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead
to corruption of SA data.
The bug presents itself when there are more than two variable-length SAs
of just the right size to fit in the bonus buffer of a dnode. The
existing logic fails to account for the SA header space needed to store
the sizes of all the variable-length SAs.
A reproducer was possible on Linux by setting the xattr=sa dataset
property and storing xattrs on symbolic links (Issue #1648). Note the
corrupt link target name:
$ zfs set xattr=sa tank/fish
$ cd /tank/fish
$ ln -fs 12345678901234567 link
$ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
$ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
$ ls -l link
lrwxrwxrwx 1 root root 17 Dec 6 15:40 link -> 90123456701234567
Commit 6a7c0ccca4 worked around this bug
by forcing xattr's on symlinks to be stored in directory format. This
change implements a proper fix, so the workaround can now be reverted.
The reference link below contains a reproducer for FreeBSD.
References:
http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1890
When creating a dataset with ZoL a zsb->z_shares_dir ZAP object
will not be created because shares are unimplemented. Instead ZoL
just sets zsb->z_shares_dir to zero to indicate there are no shares.
However, if you import a pool which was created with a different
ZFS implementation then the shares ZAP object may exist. Code was
added to handle this case but it clearly wasn't sufficiently tested
with other ZFS pools.
There was a bug in the zpl_shares_getattr() function which passed
the wrong inode to zfs_getattr_fast() for the case where are shares
ZAP object does exist. This causes an EIO to be returned to stat64()
which in turn causes 'zfs diff' to fail.
This fix is the pass the correct inode after a sucessful zfs_zget().
Additionally, only put away the references if we were able to get one.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Booker <https://github.com/gbooker>
Signed-off-by: timemaster67 <https://github.com/timemaster67>
Closes#1426Closes#481
Use the standard Linux MODULE_VERSION macro to expose the installed
zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions.
This will also automatically add a checksum of the .c files and
headers in "srcversion". See:
/sys/module/zavl/version
/sys/module/zavl/srcversion
/sys/module/znvpair/version
/sys/module/znvpair/srcversion
/sys/module/zunicode/version
/sys/module/zunicode/srcversion
/sys/module/zcommon/version
/sys/module/zcommon/srcversion
/sys/module/zfs/version
/sys/module/zfs/srcversion
/sys/module/zpios/version
/sys/module/zpios/srcversion
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1923
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045illumos/illumos-gate@69962b5647
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1913
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.
Porting Notes:
zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4347illumos/illumos-gate@e722410c49
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As part of zfs_sb_teardown() there is an assertion that all inodes
which are part of the zsb->z_all_znodes list have at least one
reference on them. This is always true for the standard unmount
case but there are two other cases where it is not strictly true.
* zfs_ioc_rollback() - This is the most common case and it results
from the fact that we aren't unmounting the filesystem. During a
normal unmount the MS_ACTIVE flag will be cleared on the super block
causing iput_final() to evict the inode when its reference count
drops to zero. However, during a rollback MS_ACTIVE remains set
since we're rolling back a live filesystem and need to preserve the
existing super block. This allows inodes with a zero reference count
to stay in the cache thereby violating the assertion.
* destroy_inode() / zfs_sb_teardown() - There exists a small race
between dropping the last reference on an inode and removing it from
the zsb->z_all_znodes list. This is unlikely to occur but could also
trigger the assertion which is incorrect. The inode may safely have
a zero reference count in this case.
Since allowing a zero reference count on the inode is expected and
safe for both of these cases the simplest thing to do is remove the
ASSERT. This code is only enabled for default builds so removing
this entirely is a very safe change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#1417Closes#1536
This should hopefully catch the rest of the allocations in the
user hold/release processing that were missed by commit
65c67ea86e.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1852Closes#1855
Currently, using msync() results in the following code path:
sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage
In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.
This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.
The solution implemented here is composed of two parts:
- I added a new callback system to the ZIL, which allows the caller to
be notified when its ITX gets written to stable storage. One nice
thing is that the callback is called not only in zil_commit() but
in zil_sync() as well, which means that the caller doesn't have to
care whether the write ended up in the ZIL or the DMU: it will get
notified as soon as it's safe, period. This is an improvement over
dmu_tx_callback_register() that was used previously, which only
supports DMU writes. The rationale for this change is to allow
zpl_putpage() to be notified when a ZIL commit is completed without
having to block on zil_commit() itself.
- zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
from issuing ZIL commits. zpl_writepages() will issue the commit
itself instead of relying on zpl_putpage() to do it, thus nicely
batching the writes. Note, however, that we still have to call
write_cache_pages() again in SYNC mode because there is an edge case
documented in the implementation of write_cache_pages() whereas it
will not give us all dirty pages when running in non-SYNC mode. Thus
we need to run it at least once in SYNC mode to make sure we honor
persistency guarantees. This only happens when the pages are
modified at the same time msync() is running, which should be rare.
In most cases there won't be any additional pages and this second
call will do nothing.
Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1849Closes#907
Because ZFS bypasses the page cache we don't inherit per-task I/O
accounting for free. However, the Linux kernel does provide helper
functions allow us to perform our own accounting. These are most
commonly used for direct IO which also bypasses the page cache, but
they can be used for the common read/write call paths as well.
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#313Closes#1275
Under Linux this restriction does not apply because we have access
to all the required devices.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1631
Properly initialize SELinux xattrs for all inode types. The
initial implementation accidentally only did this for files.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1832
During pool import stack overflows may still occur due to the
potentially deep recursion of traverse_visitbp(). This is most
likely to occur when additional layers are added to the block
device stack such as DM multipath. To minimize the stack usage
for this call path the following changes were made:
1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions
to prevent them from being inlined by gcc. This reduced the
stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and
vdev_mirror_io_start() from 144 to 128 bytes.
2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved
from the stack to the heap. This reduced the stack usage of
zfsdev_ioctl() from 368 to 112 bytes.
3) The major saving came from slimming down traverse_visitbp() from
from 224 to 144 bytes. Since this function is called recursively
the 80 bytes saved per invokation adds up. The following changes
were made:
a) The 'hard' local variable was replaced by a TD_HARD() macro.
b) The 'pd' local variable was replaced by 'td->td_pfd' references.
c) The zbookmark_t was moved to the heap. This does cost us an
additional memory allocation per recursion by that cost should
still be minimal. The cost could be further reduced by adding
a dedicated zbookmark_t slab cache.
d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were
restructured to use the minimum amount of stack. This includes
removing the 'cbp' local variable.
Overall for the offending use case roughly 1584 of total stack space
has been saved. This is enough to avoid overflowing the stack on
stock kernels with 8k stacks. See #1778 for additional details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1778
Commit 95fd54a1c5 restructured the
hold/release processing and moved some of the work into the sync task.
A number of nvlist allocations now need to use KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1852Closes#1855
The Illumos #3875 patch reverted a part of ZoL's 7b3e34b which added
special-case error handling for zfs_rezget(). The error handling dealt
with the case in which an all-ones object number ended up being passed
to dnode_hold() and causing an EINVAL to be returned from zfs_rezget().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1859Closes#1861
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently. Only one of the mounts will
succeed and the other will fail. The failed mounts will cause an EREMOTE
to be propagated back to the application.
This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY. The zfs code detects this condition and
treats it as if the mount had succeeded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1819
The required Posix ACL interfaces are only available for kernels
with CONFIG_FS_POSIX_ACL defined. Therefore, only enable Posix
ACL support for these kernels. All major distribution kernels
enable CONFIG_FS_POSIX_ACL by default.
If your kernel does not support Posix ACLs the following warning
will be printed at ZFS module load time.
"ZFS: Posix ACLs disabled by kernel"
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1825
A couple of kmem_alloc() allocations were using KM_SLEEP in
the sync thread context. These were accidentally introduced
by the recent set of Illumos patches. The solution is to
switch to KM_PUSHPAGE.
dsl_dataset_promote_sync() -> promote_hold() -> snaplist_make() ->
kmem_alloc(sizeof (*snap), KM_SLEEP);
dsl_dataset_user_hold_sync() -> dsl_onexit_hold_cleanup() ->
kmem_alloc(sizeof (*ca), KM_SLEEP)
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3995 Memory leak of compressed buffers in l2arc_write_done
References:
https://illumos.org/issues/3995
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1688
Issue #1775
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4168https://www.illumos.org/issues/4169https://www.illumos.org/issues/4170illumos/illumos-gate@7fdd916c47
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/4082illumos/illumos-gate@5253393b09
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
4046 dsl_dataset_t ds_dir->dd_lock is highly contended
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4046illumos/illumos-gate@b62969f868
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. This commit removed dsl_dataset_namelen in Illumos, but that
appears to have been removed from ZFSOnLinux in an earlier commit.
4047 panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4047illumos/illumos-gate@713d6c2088
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. The exported symbol dmu_free_object() was renamed to
dmu_free_long_object() in Illumos.
3996 want a libzfs_core API to rollback to latest snapshot
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3996illumos/illumos-gate@a7027df17f
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3956 ::vdev -r should work with pipelines
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3956https://www.illumos.org/issues/3957https://www.illumos.org/issues/3958https://www.illumos.org/issues/3959https://www.illumos.org/issues/3960https://www.illumos.org/issues/3961https://www.illumos.org/issues/3962illumos/illumos-gate@b4952e17e8
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:
1. zfs_dbgmsg_print() is only used in userland. Since we do not have
mdb on Linux, it does not make sense to make it available in the
kernel. This means that a build failure will occur if any future
kernel patch depends on it. However, that is unlikely given that
this functionality was added to support zdb.
2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels.
This preserves the existing behavior of minimal noise when running
with -V, and -VV.
3. In vdev_config_generate() the call to nvlist_alloc() was not
changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in
the txg_sync context.
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3949https://www.illumos.org/issues/3950https://www.illumos.org/issues/3951https://www.illumos.org/issues/3952illumos/illumos-gate@2c1e2b4414
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. The deadman thread was removed from ztest during the original
port because it depended on Solaris thr_create() interface.
This functionality should be reintroduced using the more
portable pthreads.
3973 zfs_ioc_rename alters passed in zc->zc_name
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3973illumos/illumos-gate@a0c1127b14
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3834 incremental replication of 'holey' file systems is slow
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3834illumos/illumos-gate@ca48f36f20
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3836 zio_free() can be processed immediately in the common case
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/3836illumos/illumos-gate@9cb154a3c9
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3112 ztest does not honor ZFS_DEBUG
3113 ztest should use watchpoints to protect frozen arc bufs
3114 some leaked nvlists in zfsdev_ioctl
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Amdur <Matt.Amdur@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
References:
https://www.illumos.org/issues/3112https://www.illumos.org/issues/3113https://www.illumos.org/issues/3114illumos/illumos-gate@cd1c8b85eb
The /proc/self/cmd watchpoint interface is specific to Solaris.
Therefore, the #3113 implementation was reworked to use the more
portable mprotect(2) system call. When the pages are watched they
are marked read-only for protection. Any write to the protected
address range immediately trigger a SIGSEGV. The pages are marked
writable again when they are unwatched.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
3236 zio nop-write
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@80901aea8ehttps://www.illumos.org/issues/3236
Porting Notes
1. This patch is being merged dispite an increased instance of
https://www.illumos.org/issues/3113 being triggered by ztest.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
3875 panic in zfs_root() after failed rollback
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/3875illumos/illumos-gate@91948b51b8
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3888 zfs recv -F should destroy any snapshots created since
the incremental source
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3888illumos/illumos-gate@34f2f8cf94
Porting notes:
1. Commit 1fde1e3720 wrapped a
declaration in dsl_dataset_modified_since_lastsnap in ASSERTV().
The ASSERTV() and local variable have been removed to avoid an
unused variable warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Richard Yao <ryao@gentoo.org>
Issue #1775
3894 zfs should not allow snapshot of inconsistent dataset
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/3894illumos/illumos-gate@ca48f36f20
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3829illumos/illumos-gate@bb6e70758d
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3740illumos/illumos-gate@a7a845e4bf
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. 13fe019870 introduced a merge conflict
in dsl_dataset_user_release_tmp where some variables were moved
outside of the preprocessor directive.
2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
because this commit refactors the code, adding a new KM_SLEEP
allocation. It is not clear to me whether this should be converted
to KM_PUSHPAGE.
3. We had a merge conflict in libzfs_sendrecv.c because of copyright
notices.
4. Several small C99 compatibility fixed were made.
3744 zfs shouldn't ignore errors unmounting snapshots
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3744illumos/illumos-gate@fc7a6e3fef
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. There is no clear way to distinguish between a failure when we
tried to unmount the snapdir of a zvol (which does not exist)
and the failure when we try to unmount a snapdir of a dataset,
so the changes to zfs_unmount_snap() were dropped in favor of
an altered Linux function that unconditionally returns 0.
3743 zfs needs a refcount audit
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3743illumos/illumos-gate@b287be1ba8
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3742illumos/illumos-gate@f717074149
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. The change to zfs_vfsops.c was dropped because it involves
zfs_mount_label_policy, which does not exist in the Linux port.
3741 zfs needs better comments
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3741illumos/illumos-gate@3e30c24aee
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3699 zfs hold or release of a non-existent snapshot does not output error
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Shrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/3699https://www.illumos.org/issues/3739illumos/illumos-gate@013023d4ed
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3582illumos/illumos-gate@0689f76
Ported by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
References:
illumos/illumos-gate@44bffe012c
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
References:
illumos/illumos-gate@d39ee142a9
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. This commit was so old that only two lines applied to the modern
code base.
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/3642https://www.illumos.org/issues/3643illumos/illumos-gate@4a92375985
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting Notes:
1. The alignment assumptions for the tx_cpu structure assume that
a kmutex_t is 8 bytes. This isn't true under Linux but tc_pad[]
was adjusted anyway for consistency since this structure was
never carefully aligned in ZoL. If careful alignment does impact
performance significantly this should be reworked to be portable.
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3645https://www.illumos.org/issues/3692illumos/illumos-gate@de8d9cff56
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1792
Issue #1775
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3598illumos/illumos-gate@be6fd75a69
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. include/sys/zfs_context.h has been modified to render some new
macros inert until dtrace is available on Linux.
2. Linux-specific changes have been adapted to use SET_ERROR().
3. I'm NOT happy about this change. It does nothing but ugly
up the code under Linux. Unfortunately we need to take it to
avoid more merge conflicts in the future. -Brian
3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3517illumos/illumos-gate@efb4a871d8
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely
3871 GCC 4.5.3 does not like issue 3604 patch
Reviewed by: Henrik Mattson <henrik.mattson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/3603https://www.illumos.org/issues/3604https://www.illumos.org/issues/3871illumos/illumos-gate@d04756377d
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Note that the patch from Illumos issue 3871 is not accepted into Illumos
at the time of this writing. It is something that I wrote when porting
this. Documentation is in the Illumos issue.
3588 provide zfs properties for logical (uncompressed) space
used and referenced
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3588illumos/illumos-gate@77372cb0f3
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/3578https://www.illumos.org/issues/3579illumos/illumos-gate@9eb57f7f3f
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3561 arc_meta_limit should be exposed via kstats
3116 zpool reguid may log negative guids to internal SPA history
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3561https://www.illumos.org/issues/3116illumos/illumos-gate@20128a0826
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
1. The spa change was accidentally included in the libzfs_core merge.
2. "Add missing arcstats" (1834f2d8b7)
already implemented these kstats a few years ago.
3537 want pool io kstats
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Sa?o Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
http://www.illumos.org/issues/3537illumos/illumos-gate@c3a6601
Ported by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
1. The patch was restructured to take advantage of the existing
spa statistics infrastructure. To accomplish this the kstat
was moved in to spa->io_stats and the init/destroy code moved
to spa_stats.c.
2. The I/O kstat was simply named <pool> which conflicted with the
pool directory we had already created. Therefore it was renamed
to <pool>/io
3. An update handler was added to allow the kstat to be zeroed.
3522 zfs module should not allow uninitialized variables
Reviewed by: Sebastien Roy <seb@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3522illumos/illumos-gate@d5285cae91
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:
1. ZFSOnLinux had already addressed many of these issues because of
its use of -Wall. However, the manner in which they were addressed
differed. The illumos fixes replace the ones previously made in
ZFSOnLinux to reduce code differences.
2. Part of the upstream patch made a small change to arc.c that might
address zfsonlinux/zfs#1334.
3. The initialization of aclsize in zfs_log_create() differs because
vsecp is a NULL pointer on ZFSOnLinux.
4. The changes to zfs_register_callbacks() were dropped because it
has diverged and needs to be resynced.
This resolves merge conflicts when merging Illumos #3588 and Illumos #4047.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Modifying the length of a string returned by strdup() is incorrect
because strfree() is allowed to use strlen() to determine which slab
cache was used to do the allocation.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
The resolution of a merge conflict when merging Illumos #3464 caused us
to invert the order couple of function calls in zio_free_sync() versus
what they are in Illumos.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
This was accidentally removed by overzealous commenting.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems. Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set. The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.
By default Posix ACLs are disabled but they may be enabled with
the new 'aclmode=noacl|posixacl' property. Set the property to
'posixacl' to enable them. If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.
This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).
http://www.tuxera.com/community/posix-test-suite/
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#170
Attempting to remove an xattr from a file which does not contain
any directory based xattrs would result in the xattr directory
being created. This behavior is non-optimal because it results
in write operations to the pool in addition to the expected error
being returned.
To prevent this the CREATE_XATTR_DIR flag is only passed in
zpl_xattr_set_dir() when setting a non-NULL xattr value. In
addition, zpl_xattr_set() is updated similarly such that it will
return immediately if passed an xattr name which doesn't exist
and a NULL value.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #170
This does the following:
1. It creates a uint8_t type value, which is initialized to DT_DIR on
dot directories and ZFS_DIRENT_TYPE(zap.za_first_integer) otherwise.
This resolves a regression where we return unintialized values as the
directory entry type on dot directories. This was accidentally
introduced by commit 8170d28126.
2. It restructures zfs_readdir() code to use `uint64_t offset` like
Illumos instead of `loff_t *pos`. This resolves a regression where
negative ZAP cursors were treated as if they were dot directories.
3. It restructures the function to more closely match the structure of
zfs_readdir() on Illumos and removes the unused variable outcount, which
was only used on Illumos.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1750
Currently there is no mechanism to inspect which dbufs are being
cached by the system. There are some coarse counters in arcstats
by they only give a rough idea of what's being cached. This patch
aims to improve the current situation by adding a new dbufs kstat.
When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash. For each dbuf it will dump detailed information
about the buffer. It will also dump additional information about
the referenced arc buffer and its related dnode. This provides a
more complete view in to exactly what is being cached.
With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working. For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming. Or a
similar list could be generated based on dnode type. Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
This change adds a new kstat to gain some visibility into the
amount of time spent in each call to dmu_tx_assign. A histogram
is exported via the new dmu_tx_assign file. The information
contained in this histogram is the frequency dmu_tx_assign
took to complete given an interval range.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is an attempt to add visibility in to how txgs are being
formed on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to txg. These entries are then exported through the kstat
interface, which can then be interpreted in userspace.
For each txg, the following information is exported:
* Unique txg number (uint64_t)
* The time the txd was born (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted)
* The number of reserved bytes for the txg (uint64_t)
* The number of bytes read during the txg (uint64_t)
* The number of bytes written during the txg (uint64_t)
* The number of read operations during the txg (uint64_t)
* The number of write operations during the txg (uint64_t)
* The time the txg was closed (hrtime_t)
* The time the txg was quiesced (hrtime_t)
* The time the txg was synced (hrtime_t)
Note that while the raw kstat now stores relative hrtimes for the
open, quiesce, and sync times. Those relative times are used to
calculate how long each state took and these deltas and printed by
output handlers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict
whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the
kernel's .config. We rename the symbols to avoid the conflict.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1789
The semantics introduced by the restructured sync task of illumos
3464 require this lock when calling dmu_snapshot_list_next().
The pool is locked/unlocked for each iteration to reduce the
chance of long-running locks.
This was accidentally missed when doing the original port because
ZoL's control directory code is Linux-specific and is in a
different file than in illumos.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1785
3552 condensing one space map burns 3 seconds of CPU in spa_sync()
thread (fix race condition)
References:
https://www.illumos.org/issues/3552illumos/illumos-gate@03f8c36688
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:
This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@e51be06697, which
ported the Illumos 3552 changes. This fix was added to upstream
rather quickly, but at the time of the port, no one spotted it and
the race was rare enough that it passed our regression tests. I
discovered this when comparing our metaslab.c to the illumos
metaslab.c.
Without this change it is possible for metaslab_group_alloc() to
consume a large amount of cpu time. Since this occurs under a
mutex in a rcu critical section the kernel will log this to the
console as a self-detected cpu stall as follows:
INFO: rcu_sched self-detected stall on CPU { 0}
(t=60000 jiffies g=11431890 c=11431889 q=18271)
Closes#1687Closes#1720Closes#1731Closes#1747
These are needed by consumers (i.e. Lustre) who wish to use the
dsl_prop_register() interface to register callbacks when pool
properties of interest change. This interface requires that the
DSL pool configuration lock is held when called.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1762
When building the spl with --enable-debug-kmem-tracking a memory
leak is detected in log_internal(). This happens to be a false
positive because the memory was freed using strfree() instead of
kmem_free(). All kmem_alloc()'s must be released with kmem_free()
to ensure correct accounting.
SPL: kmem leaked 135/5641311 bytes
address size data func:line
ffff8800cba7cd80 135 ZZZZZZZZZZZZZZZZ log_internal:456
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The recent sync task restructuring in 13fe019 introduced several
new symbols which should be exported for use by consumers such
as Lustre.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Some ZFS errors such as certain snapshot failures can occur in
the sync task context. Because they may require additional memory
allocations, the initial nvlist must be allocated with KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737
A handful of allocations now occur in the sync path and need
to use KM_PUSHPAGE. These were introduced by commit 13fe019.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737
The spa_deadman() and spa_sync() functions can both be run in the
spa_sync context and therefore should use TQ_PUSHPAGE instead of
TQ_SLEEP.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1734Closes#1749
Locking mutex &vq->vq_lock in vdev_mirror_pending is unneeded:
* no data is modified
* only vq_pending_tree is read
* in case garbage is returned (eg. vq_pending_tree being updated
while the read is made) the worst case would be that a single
read could be queued on a mirror side which more busy than thought
The benefit of this change is streamlining of the code path since
it is taken for *every* mirror member on *every* read.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1739
dataset_remove_clones_key does recursion, so if the recursion goes
deep it can overrun the linux kernel stack size of 8KB. I have seen
this happen in the actual deployment, and subsequently confirmed it by
running a test workload on a custom-built kernel that uses 32KB stack.
See the following stack trace as an example of the case where it would
have run over the 8KB stack kernel:
Depth Size Location (42 entries)
----- ---- --------
0) 11192 72 __kmalloc+0x2e/0x240
1) 11120 144 kmem_alloc_debug+0x20e/0x500
2) 10976 72 dbuf_hold_impl+0x4a/0xa0
3) 10904 120 dbuf_prefetch+0xd3/0x280
4) 10784 80 dmu_zfetch_dofetch.isra.5+0x10f/0x180
5) 10704 240 dmu_zfetch+0x5f7/0x10e0
6) 10464 168 dbuf_read+0x71e/0x8f0
7) 10296 104 dnode_hold_impl+0x1ee/0x620
8) 10192 16 dnode_hold+0x19/0x20
9) 10176 88 dmu_buf_hold+0x42/0x1b0
10) 10088 144 zap_lockdir+0x48/0x730
11) 9944 128 zap_cursor_retrieve+0x1c4/0x2f0
12) 9816 392 dsl_dataset_remove_clones_key.isra.14+0xab/0x190
13) 9424 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
14) 9032 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
15) 8640 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
16) 8248 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
17) 7856 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
18) 7464 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
19) 7072 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
20) 6680 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
21) 6288 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
22) 5896 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
23) 5504 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
24) 5112 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
25) 4720 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
26) 4328 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
27) 3936 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
28) 3544 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
29) 3152 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
30) 2760 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
31) 2368 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
32) 1976 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
33) 1584 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
34) 1192 232 dsl_dataset_destroy_sync+0x311/0xf60
35) 960 72 dsl_sync_task_group_sync+0x12f/0x230
36) 888 168 dsl_pool_sync+0x48b/0x5c0
37) 720 184 spa_sync+0x417/0xb00
38) 536 184 txg_sync_thread+0x325/0x5b0
39) 352 48 thread_generic_wrapper+0x7a/0x90
40) 304 128 kthread+0xc0/0xd0
41) 176 176 ret_from_fork+0x7c/0xb0
This change reduces the stack usage in dsl_dataset_remove_clones_key
by allocating structures in heap, not in stack. This is not a fundamental
fix, as one can create an arbitrary large data set that runs over any
fixed size stack, but this will make the problem far less likely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Closes#1726
The zpl_mknod() function was incorrectly negating its return value.
This doesn't cause any problems in the success case, but it does
prevent us from returning the correct error code for a failure.
The implementation of this function is now consistent with all
the other zpl_* functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1717
When compiling on an ARM device using gcc 4.7.3 several variables
in the zfs_obj_to_path_impl() function were flagged as uninitialized.
To resolve the warnings explicitly initialize them to zero.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1716
After the restructuring in 13fe019 The 'zfs rename' command will
result in a KM_SLEEP being called in the sync context. This may
deadlock due to reclaim so it was changed to KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1711
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/2882https://www.illumos.org/issues/2883https://www.illumos.org/issues/2900illumos/illumos-gate@4445fffbbb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1293
Porting notes:
WARNING: This patch changes the user/kernel ABI. That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules. Ensure you load the matching kernel
modules from master after updating the utilities. Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:
$ zpool list
failed to read pool configuration: bad address
no pools available
$ zfs list
no datasets available
Add zvol minor device creation to the new zfs_snapshot_nvl function.
Remove the logging of the "release" operation in
dsl_dataset_user_release_sync(). The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos. This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).
Squash some "may be used uninitialized" warning/erorrs.
Fix some printf format warnings for %lld and %llu.
Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.
Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
There is currently a subtle bug in the SA implementation which
can crop up which prevents us from safely using multiple variable
length SAs in one object.
Fortunately, the only existing use case for this are symlinks with
SA based xattrs. Therefore, until the root cause in the SA code
can be identified and fixed we prevent adding SA xattrs to symlinks.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1468
This reverts commit fadd0c4da1 which
introduced a regression in honoring the meta limit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Close#1660
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.
To handle this the code was reworked to use the new ->iterate
interface. Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.
Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions. Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.
Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function. The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names. Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1653
Issue #1591
Because we need to be more frugal about our stack usage under
Linux. The __zio_execute() function was modified to re-dispatch
zios to a ZIO_TASKQ_ISSUE thread when we're in a context which
is known to be stack heavy. Those two contexts are the sync
thread and what ever thread is performing spa initialization.
Unfortunately, this change introduced an unlikely bug which can
result in a zio being re-dispatched indefinitely and never being
executed. If during spa initialization we handle a zio with
ZIO_PRIORITY_NOW it will be moved to the high priority queue.
When __zio_execute() is called again for the zio it will mis-
interpret the context and re-dispatch it again. The system
will get stuck spinning re-dispatching the zio and making no
forward progress.
To fix this rare issue __zio_execute() has been updated not
to re-dispatch zios on either the ZIO_TASKQ_ISSUE or
ZIO_TASKQ_ISSUE_HIGH task queues.
In practice this issue was rarely reported and can usually
be fixed by rebooting the system and importing the pool again.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1455
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
http://www.illumos.org/issues/3618illumos/illumos-gate@c55e05cb35
Notes on porting to ZFS on Linux:
The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.
One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:
zfs_vdev_time_shift = 6
to
zfs_vdev_time_shift = 29
The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.
(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)
Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).
Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1643
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.
We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1589
When the meta limit is exceeded the ARC evicts some meta data
buffers from the mfu+mru lists. Unfortunately, for meta data
heavy workloads it's possible for these buffers to accumulate
on the ghost lists if arc_c doesn't exceed arc_size.
To handle this case arc_adjust_meta() has been entended to
explicitly evict meta data buffers from the ghost lists in
proportion to what was evicted from the mfu+mru lists.
If this is insufficient we request that the VFS release
some inodes and dentries. This will result in the release
of some dnodes which are counted as 'other' metadata.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The default behavior of arc_evict_ghost() is to start by evicting
data buffers. Then only if the requested number of bytes to evict
cannot be satisfied by data buffers move on to meta data buffers.
This is ideal for honoring arc_c since it's preferable to keep the
meta data cached. However, if we're trying to free memory from the
arc to honor the meta limit it's a problem because we will need to
discard all the data to get to the meta data.
To avoid this issue the arc_evict_ghost() is now passed a fourth
argumented describing which buffer type to start with. The
arc_evict() function already behaves exactly like this for a
same reason so this is consistent with the existing code.
All existing callers have been updated to pass ARC_BUFC_DATA so
this patch introduces no functional change. New callers may
pass ARC_BUFC_METADATA to skip immediately to evicting meta
data leaving the normal data untouched.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3137 L2ARC compression
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@aad02571bchttps://www.illumos.org/issues/3137http://wiki.illumos.org/display/illumos/L2ARC+Compression
Notes for Linux port:
A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set. This allows the legacy behavior
to be preserved.
Ported by: James H <james@kagisoft.co.uk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1379
This is analogous to SPL commit zfsonlinux/spl@b9b3715. While
we don't have clear evidence of systems getting caught here
indefinately like in the SPL this ensures that it will never
happen.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1579
zfs_readdir() is used by getdents(), which provides a list of all files
in directory, their types and an offset that be used by llseek() to seek
to the next directory entry.
On Solaris, the first two directory entries "." and ".." respectively
have offsets 1 and 2 on ZFS while the other files have rather large
numbers. Currently, ZFSOnLinux is giving "." offset 0 and all other
entries large numbers. The first entry's next entry offset points to
itself, which causes software that uses llseek() in conjunction with
getdents() for filesystem navigation to enter an infinite loop. The
offsets used for each directory entry are filesystem specific on all
platforms, so we can fix this by adopting the Solaris behavior.
Also, we currently report each directory entry as having type 0 (???).
This is not wrong, but we can do better. getdents() on Solaris does not
appear to provide this information, but it does on Linux and Mac OS X
do. ZFS provides easy access to type information in zfs_readdir(), so
this patch provides this as well.
Reported-by: Andrey <andrey@kudinov.su>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1624
3639 zpool.cache should skip over readonly pools
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
illumos/illumos-gate@fb02ae0252https://www.illumos.org/issues/3639
Normally we don't list pools that are imported read-only in the cache
file, however you can accidentally get one into the cache file by
importing and exporting a read-write pool while a read-only pool is
imported:
$ zpool import -o readonly test1
$ zpool import test2
$ zpool export test2
$ zdb -C
This is a problem because if the machine reboots we import all pools in
the cache file as read-write.
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the property atime=on is set operations which only access
and inode do cause an atime update. However, it turns out that
dirty inodes with updated atimes are only written to disk when
the inodes get evicted from the cache. Somewhat surprisingly
the source suggests that this isn't a ZoL specific issue.
This behavior may in part explain why zfs's reclaim logic has
been observed to be slow. When reclaiming inodes its likely
that they have a dirty atime which will force a write to disk.
Obviously we don't want to force a write to disk for every
atime update, these needs to be batched. The right way to
do this is to fully implement the .dirty_inode and .write_inode
callbacks. However, to do that right requires proper unification
of some fields in the znode/inode. Then we could just mark the
inode dirty and leave it to the VFS to call .write_inode
periodically.
Until that work gets done we have to settle for some middle
ground. The simplest and safest thing we can do for now is
to write the dirty inode on last close. This should prevent
the majority of inodes in the cache from having dirty atimes
and not drastically increase the number of writes.
Some rudimentally testing to show how long it takes to drop
500,000 inodes from the cache shows promising results. This
is as expected because we're no longer do lots of IO as part
of the eviction, it was done earlier during the close.
w/out patch: ~30s to drop 500,000 inodes with drop_caches.
with patch: ~3s to drop 500,000 inodes with drop_caches.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The dmu_prefetch, dmu_free_long_range, dmu_free_object,
dmu_prealloc, dmu_write_policy, and dmu_sync symbols have
been exported so they may be used by other modules.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
dmu_tx_hold_object_impl can return NULL on error. Check for this
condition prior to dereferencing pointer. This can only occur if
the passed object was invalid or unallocated.
Signed-off-by: Nathaniel Clark <Nathaniel.Clark@misrule.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1610
The code involving b_thawed appears to be dead, so lets discard it.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
These functions are used in neither Illumos nor ZFSOnLinux. They appear
to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove
them.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
We declare zio_alloc_arena using extern, but it does not appear to exist
anywhere in the code. This permits undefined behavior, so lets remove
it.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
The l2arc module options can be made safely writable. This allows
the options to be changed without unloading/loading the modules.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These days modern SSDs can efficiently service concurrent reads
and writes. When this flag was added that wasn't really the
case for a variety of SSD controllers. But now we can set the
default value to take advantage of this parallelism and only
disable this as needed for specific troublesome hardware.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Based on the comments in arc.c we know that buffers can exist both
in arc and l2arc, under this circumstance both arc_buf_hdr_t and
l2arc_buf_hdr_t will be allocated. However the current logic only
cares for memory that l2arc_buf_hdr takes up when the buffer's
state transfers from or to arc_l2c_only. This will cause obvious
deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr)
is larger than ZOL's. We can implement the calcuation in the
following simple way:
1. When allocate a l2arc_buf_hdr_t we add its memory consumption
instantly and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
the buffer only stays in l2arc we should also add the memory
its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to
arcstat_l2_hdr_size since we already concerned with L2HDR_SIZE
in step 1 and the same for transfering arc bufs from l2arc only
state.
The testbox has 2 4-core Intel Xeon CPUs(2.13GHz), with 16GB memory
and tests were set upped in the following way:
1. Fdisked a SATA disk into two partitions, one partition for zpool
storage and the other one was used as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc related statistics
in /proc/spl/kstat/zfs/arcstats. After the reading ended the
l2_hdr_size and l2_size were shown like this:
l2_size 4 4403780608
l2_hdr_size 4 0
which was weird.
4. After applying this patch and reran step 1-3, the results were
as following:
l2_size 4 4306443264
l2_hdr_size 4 535600
these numbers made sense, on 64-bit systems the
sizeof(l2arc_buf_hdr_t) is 16 bytes. Assue all blocks cached by
l2arc are 128KB, so 535600/16*128*1024=4387635200, since not all
blocks are equal-sized, the theoretical result will be a little
bigger, as we can see.
Since I'm familiar with systemtap instrumentation tool I used it to
examine what had happened. The script looked like this:
probe module("zfs").function("arc_chage_state")
{
if ($new_state == $arc_l2_only)
printf("change arc buf to arc_l2_only\n")
}
It will print out some information each time we call funciton
arc_chage_state if the argument new_state is arc_l2_only. I
gathered the trace logs and found that none of the arc bufs ran
into arc state arc_l2_only when the tests was running, this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2_only when the pool or the filesystem was offlined.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The iterate_supers_type() function which was introduced in the
3.0 kernel was supposed to provide a safe way to call an arbitrary
function on all super blocks of a specific type. Unfortunately,
because a list_head was used a bug was introduced which made it
possible for iterate_supers_type() to get stuck spinning on a
super block which was just deactivated.
This can occur because when the list head is removed from the
fs_supers list it is reinitialized to point to itself. If the
iterate_supers_type() function happened to be processing the
removed list_head it will get stuck spinning on that list_head.
The bug was fixed in the 3.3 kernel by converting the list_head
to an hlist_node. However, to resolve the issue for existing
3.0 - 3.2 kernels we detect when a list_head is used. Then to
prevent the spinning from occurring the .next pointer is set to
the fs_supers list_head which ensures the iterate_supers_type()
function will always terminate.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1045Closes#861Closes#790
During mount a filesystem dataset would have the MS_RDONLY bit
incorrectly cleared even if the entire pool was read-only.
There is existing to code to handle this case but it was being run
before the property callbacks were registered. To resolve the
issue we move this read-only code after the callback registration.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1338
It is possible for an automounted snapshot which is expiring to
deadlock with a manual unmount of the snapshot. This can occur
because taskq_cancel_id() will block if the task is currently
executing until it completes. But it will never complete because
zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which
zfsctl_expire_snapshot() must acquire.
---------------------- z_unmount/0:2153 ---------------------
mutex_lock <blocking on zsb->z_ctldir_lock>
zfsctl_unmount_snapshot
zfsctl_expire_snapshot
taskq_thread
------------------------- zfs:10690 -------------------------
taskq_wait_id <waiting for z_unmount to exit>
taskq_cancel_id
__zfsctl_unmount_snapshot
zfsctl_unmount_snapshot <takes zsb->z_ctldir_lock>
zfs_unmount_snap
zfs_ioc_destroy_snaps_nvl
zfsdev_ioctl
do_vfs_ioctl
We resolve the deadlock by dropping the zsb->z_ctldir_lock before
calling __zfsctl_unmount_snapshot(). The lock is only there to
prevent concurrent modification to the zsb->z_ctldir_snaps AVL
tree. Moreover, we're careful to remove the zfs_snapentry_t from
the AVL tree before dropping the lock which ensures no other tasks
can find it. On failure it's added back to the tree.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes#1527
The read bandwidth of an N-way mirror can by increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.
The existing algorthm selects a perferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror. It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them. In practice this results in the
leaf vdevs being under utilized.
Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO. This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror. Faster vdevs will be sent more work and
the mirror performance will not be limitted by the slowest drive.
In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching stratagy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T microseconds. Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.
The testing results show a significant performance improvement
for all four workloads tested. The workloads were generated
using the fio benchmark and are as follows.
1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).
| Pristine | With 1461 |
| Sequential Random | Sequential Random |
| 1MB 4KB 1MB 4KB | 1MB 4KB 1MB 4KB |
| MB/s MB/s IO/s IO/s | MB/s MB/s IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped | 226 243 11 304 | 222 255 11 299 |
2 2-Way Mirror | 302 324 16 534 | 433 448 23 571 |
2 3-Way Mirror | 429 458 24 714 | 648 648 41 808 |
2 4-Way Mirror | 562 601 36 849 | 816 828 82 926 |
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1461
This change adds a new kstat to gain some visibility into the amount of
time spent in each call to dmu_tx_assign. A histogram is exported via
a new dmu_tx_assign_histogram-$POOLNAME file. The information contained
in this histogram is the frequency dmu_tx_assign took to complete given
an interval range. For example, given the below histogram file:
$ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank
12 1 0x01 32 1536 19792068076691 20516481514522
name type data
1 us 4 859
2 us 4 252
4 us 4 171
8 us 4 2
16 us 4 0
32 us 4 2
64 us 4 0
128 us 4 0
256 us 4 0
512 us 4 0
1024 us 4 0
2048 us 4 0
4096 us 4 0
8192 us 4 0
16384 us 4 0
32768 us 4 1
65536 us 4 1
131072 us 4 1
262144 us 4 4
524288 us 4 0
1048576 us 4 0
2097152 us 4 0
4194304 us 4 0
8388608 us 4 0
16777216 us 4 0
33554432 us 4 0
67108864 us 4 0
134217728 us 4 0
268435456 us 4 0
536870912 us 4 0
1073741824 us 4 0
2147483648 us 4 0
one can see most calls to dmu_tx_assign completed in 32us or less, but a
few outliers did not. Specifically, 4 of the calls took between 262144us
and 131072us. This information is difficult, if not impossible, to gather
without this change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1584
In the event that a pool gets suspended log this information to
the console. This is critical information and we want to make
sure it gets logged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1555
To avoid a potential deadlock when using a zvol as a swap
device prevent vdev_disk_io_flush() from performing IO during
the bio_alloc().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1508
When we remove references of arc bufs in the arc_anon state we
needn't take its header's hash_lock, so postpone it to where we
really need it to avoid unnecessary invocations of function buf_hash.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1557
There are times when it is desirable for zfs to not automatically
populate the spa namespace at module load time using the pools
in the /etc/zfs/zpool.cache file. The zfs_autoimport_disable
module option has been added to control this behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #330
Linux kernel commit torvalds/linux@db2a144 changed the return type
of block_device_operations->release() to void. Detect the expected
prototype and defined our callout accordingly.
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1494
One of the side effects of calling zvol_create_minors() in
zvol_init() is that all pools listed in the cache file will
be opened. Depending on the state and contents of your pool
this operation can take a considerable length of time.
Doing this at load time is undesirable because the kernel
is holding a global module lock. This prevents other modules
from loading and can serialize an otherwise parallel boot
process. Doing this after module inititialization also
reduces the chances of accidentally introducing a race
during module init.
To ensure that /dev/zvol/<pool>/<dataset> devices are
still automatically created after the module load completes
a udev rules has been added. When udev notices that the
/dev/zfs device has been create the 'zpool list' command
will be run. This then will cause all the pools listed
in the zpool.cache file to be opened.
Because this process in now driven asynchronously by udev
there is the risk of problems in downstream distributions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #756
Issue #1020
Issue #1234
The following error will occur on some (possibly all) kernels
because blk_init_queue() will try to take the spinlock before
we initialize it.
BUG: spinlock bad magic on CPU#0, zpool/4054
lock: 0xffff88021a73de60, .magic: 00000000,
.owner: <none>/-1, .owner_cpu: 0
Pid: 4054, comm: zpool Not tainted 3.9.3 #11
Call Trace:
[<ffffffff81478ef8>] spin_dump+0x8c/0x91
[<ffffffff81478f1e>] spin_bug+0x21/0x26
[<ffffffff812da097>] do_raw_spin_lock+0x127/0x130
[<ffffffff8147d851>] _raw_spin_lock_irq+0x21/0x30
[<ffffffff812c2c1e>] cfq_init_queue+0x1fe/0x350
[<ffffffff812aacb8>] elevator_init+0x78/0x140
[<ffffffff812b2677>] blk_init_allocated_queue+0x87/0xb0
[<ffffffff812b26d5>] blk_init_queue_node+0x35/0x70
[<ffffffff812b271e>] blk_init_queue+0xe/0x10
[<ffffffff8125211b>] __zvol_create_minor+0x24b/0x620
[<ffffffff81253264>] zvol_create_minors_cb+0x24/0x30
[<ffffffff811bd9ca>] dmu_objset_find_spa+0xea/0x510
[<ffffffff811bda71>] dmu_objset_find_spa+0x191/0x510
[<ffffffff81253ea2>] zvol_create_minors+0x92/0x180
[<ffffffff811f8d80>] spa_open_common+0x250/0x380
[<ffffffff811f8ece>] spa_open+0xe/0x10
[<ffffffff8122817e>] pool_status_check.part.22+0x1e/0x80
[<ffffffff81228a55>] zfsdev_ioctl+0x155/0x190
[<ffffffff8116a695>] do_vfs_ioctl+0x325/0x5a0
[<ffffffff8116a950>] sys_ioctl+0x40/0x80
[<ffffffff814812c9>] ? do_page_fault+0x9/0x10
[<ffffffff81483929>] system_call_fastpath+0x16/0x1b
zd0: unknown partition table
We fix this by calling spin_lock_init before blk_init_queue.
The manner in which zvol_init() initializes structures is
suspectible to a race between initialization and a probe on
a zvol. We reorganize zvol_init() to prevent that.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There is an extremely odd bug that causes zvols to fail to appear on
some systems, but not others. Recently, I was able to consistently
reproduce this issue over a period of 1 month. The issue disappeared
after I applied this change from FreeBSD.
This is from FreeBSD's pool version 28 import, which occurred in
revision 219089.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #441
Issue #599
3122 zfs destroy filesystem should prefetch blocks
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@b4709335aahttps://www.illumos.org/issues/3122
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1565
Commit 55d85d5a8c (backport of
the upstream changes) replaced three hardcoded constants:
#define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass */
#define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
#define SYNC_PASS_REWRITE 1 /* rewrite new bps after this pass */
with a tunable parameters:
int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */
int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
int zfs_sync_pass_rewrite = 2; /* rewrite new bps starting in this pass */
This commit makes these tunables available as module parameters
in Linux. They should only be used for performance analysis
because changing them can result in subtle and pathological
performance problems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1562
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.
Tested with xfstests 285 and 286. All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.
Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.
An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1384
This patch restores the zfs_holey() function from OpenSolaris.
This was removed by commit 3558fd7 because it wasn't clear we
had a use for it in ZoL. However, this functionality is a
prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #1384
By definitition these allocations will never fail. For
consistency with the rest of the code remove this dead error
handling code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1558
Fix a pair of conditions in which a concurrent umount can cause
NULL pointer dereferences:
* zfs_sb_teardown - prevent a NULL dereference by not calling
dmu_objset_pool with a null z_os.
* zfs_resume_fs - don't try to unmount with a null z_os. This
change makes the ZoL code more consistent
with both Illumos and FreeBSD.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1543
Previous commit 7ef5e54e2e caused
module probe failure on 32-bit systems, dmesg showed
Unknown symbol __moddi3
This was caused by the modulo operation 'gethrtime() % tqs->stqs_count'
in the committed code. Instead of implementing __moddi3 for all 32-bit
systems, Behlendorf advised we can just cast the return value of
gethrtime() into a uint64_t, since gethrtime does not return negative
value on all circumstances we need not care about the potential overflow.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1551
Until these hooks are fully implemented return the expected
-EOPNOTSUPP error to indicate they are not functional. This
allows test suites such as xfstests to cleanly skip testing
this functionality until it's implemented.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #229
The non-blocking allocation handlers in nvlist_alloc() would be
mistakenly assigned if any flags other than KM_SLEEP were passed.
This meant that nvlists allocated with KM_PUSHPUSH or other KM_*
debug flags were effectively always using atomic allocations.
While these failures were unlikely it could lead to assertions
because KM_PUSHPAGE allocations in particular are guaranteed to
succeed or block. They must never fail.
Since the existing API does not allow us to pass allocation
flags to the private allocators the cleanest thing to do is to
add a KM_PUSHPAGE allocator.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/spl#249
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@6e6d5868f5https://www.illumos.org/issues/3805
ZFS should proactively evict freed blocks from the cache.
On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk. We were wasting about half the
system's RAM (252GB) on blocks that have been freed.
Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:
1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself). The secondary control is to
evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks. Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB. However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless. Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1503
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@16a4a80742https://www.illumos.org/issues/3552https://www.illumos.org/issues/3564
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1513
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
argument is zero
Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>
References:
illumos/illumos-gate@fb09f5aad4https://illumos.org/issues/3006
Requires:
zfsonlinux/spl@1c6d149feb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1509
When SA xattrs are enabled only fallback to checking the directory
xattrs when the name is not found as a SA xattr. Otherwise, the SA
error which should be returned to the caller is overwritten by the
directory xattr errors. Positive return values indicating success
will also be immediately returned.
In the case of #1437 the ERANGE error was being correctly returned
by zpl_xattr_get_sa() only to be overridden with ENOENT which was
returned by the subsequent unnessisary call to zpl_xattr_get_dir().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1437
As a part of scrub/resilver tuning zfs_scrub_limit fell out of use,
but the definition of the variable remained in place.
Moreover various guides still (misleadingly) mention it as a way
to influence resilver/scrub behavior.
This commit removes its finally.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1444
The assertions in ddt_phys_decref and ddt_sync_entry cast ddp->ddp_refcnt
from uint64_t to int64_t, with a reference count bigger than 2^63, e.g. the
reference count of zero blocks commonly available in spare files, we may
mistakenly hit these assertations, so drop the type conversions here.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1436
The vn_rdwr() function performs I/O by calling the vfs_write() or
vfs_read() functions. These functions reside just below the system
call layer and the expectation is they have almost the entire 8k of
stack space to work with. In fact, certain layered configurations
such as ext+lvm+md+multipath require the majority of this stack to
avoid stack overflows.
To avoid this posibility the vn_rdwr() call in dump_bytes() has been
moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be
performed with the majority of the stack space available. This ends
up being very similiar to as if the I/O were issued via sys_write()
or sys_read().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1399Closes#1423
3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@ec94d32https://illumos.org/issues/3581
Notes for Linux port:
Earlier commit 08d08eb reduced contention on this taskq lock by simply
reducing the number of z_fr_iss threads from 100 to one-per-CPU. We
also optimized the taskq implementation in zfsonlinux/spl@3c6ed54.
These changes significantly improved unlink performance to acceptable
levels.
This patch further reduces time spent spinning on this lock by
randomly dispatching the work items over multiple independent task
queues. The Illumos ZFS developers stated that this lock contention
only arose after "3329 spa_sync() spends 10-20% of its time in
spa_free_sync_cb()" was landed. It's not clear if 3329 affects the
Linux port or not. I didn't see spa_free_sync_cb() show up in
oprofile sessions while unlinking large files, but I may just not
have used the right test case.
I tested unlinking a 1 TB of data with and without the patch and
didn't observe a meaningful difference in elapsed time. However,
oprofile showed that the percent time spent in taskq_thread() was
reduced from about 16% to about 5%. Aside from a possible slight
performance benefit this may be worth landing if only for the sake of
maintaining consistency with upstream.
Ported-by: Ned Bass <bass6@llnl.gov>
Closes#1327
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
References:
illumos/illumos-gate@01f55e48fbhttps://www.illumos.org/issues/3329https://www.illumos.org/issues/3330https://www.illumos.org/issues/3331https://www.illumos.org/issues/3335
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man
page and help
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@31d7e8fa33https://www.illumos.org/issues/3306https://www.illumos.org/issues/3321
The vdev_file.c implementation in this patch diverges significantly
from the upstream version. For consistenty with the vdev_disk.c
code the upstream version leverages the Illumos bio interfaces.
This makes sense for Illumos but not for ZoL for two reasons.
1) The vdev_disk.c code in ZoL has been rewritten to use the
Linux block device interfaces which differ significantly
from those in Illumos. Therefore, updating the vdev_file.c
to use the Illumos interfaces doesn't get you consistency
with vdev_disk.c.
2) Using the upstream patch as is would requiring implementing
compatibility code for those Solaris block device interfaces
in user and kernel space. That additional complexity could
lead to confusion and doesn't buy us anything.
For these reasons I've opted to simply move the existing vn_rdwr()
as is in to the taskq function. This has the advantage of being
low risk and easy to understand. Moving the vn_rdwr() function
in to its own taskq thread also neatly avoids the possibility of
a stack overflow.
Finally, because of the additional work which is being handled by
the free taskq the number of threads has been increased. The
thread count under Illumos defaults to 100 but was decreased to 2
in commit 08d08e due to contention. We increase it to 8 until
the contention can be address by porting Illumos #3581.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1354
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation
*) Usage of the cyclic interface was replaced by the delayed taskq
interface. This avoids the need to implement new compatibility
code and allows us to rely on the existing taskq implementation.
*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
because declaring externs in source files as was done in the
original patch is just plain wrong.
*) Instead of panicing the system when the deadman triggers a
zevent describing the blocked vdev and the first pending I/O
is posted. If the panic behavior is desired Linux provides
other generic methods to panic the system when threads are
observed to hang.
*) For reference, to delay zios by 30 seconds for testing you can
use zinject as follows: 'zinject -d <vdev> -D30 <pool>'
References:
illumos/illumos-gate@283b84606bhttps://www.illumos.org/issues/3246
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1396
A deadlock was accidentally introduced by commit e95853a which
can occur when the system is under memory pressure. What happens
is that while the txg_quiesce thread is holding the tx->tx_cpu
locks it enters memory reclaim. In the context of this memory
reclaim it then issues synchronous I/O to a ZVOL swap device.
Because the txg_quiesce thread is holding the tx->tx_cpu locks
a new txg cannot be opened to handle the I/O. Deadlock.
The fix is straight forward. Move the memory allocation outside
the critical region where the tx->tx_cpu locks are held. And for
good measure change the offending allocation to KM_PUSHPAGE to
ensure it never attempts to issue I/O during reclaim.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1274
According to the getxattr(2) man page the ERANGE errno should be
returned when the size of the value buffer is to small to hold the
result. Prior to this patch the implementation would just truncate
the value to size bytes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1408
The zpl_readdir() function shouldn't be registered as part of
the zpl_file_operations table, it must only be part of the
zpl_dir_file_operations table. By removing this callback
the VFS will now correctly return ENOTDIR when calling
getdents() on a file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1404
Previous patches have allowed you to set an increased ashift to
avoid doing 512b IO with 4k sector devices. However, it was not
possible to set the ashift lower than the reported physical sector
size even when a smaller logical size was supported. In practice,
there are several cases where settong a lower ashift is useful:
* Most modern drives now correctly report their physical sector
size as 4k. This causes zfs to correctly default to using a 4k
sector size (ashift=12). However, for some usage models this
new default ashift value causes an unacceptable increase in
space usage. Filesystems with many small files may see the
total available space reduced to 30-40% which is unacceptable.
* When replacing a drive in an existing pool which was created
with ashift=9 a modern 4k sector drive cannot be used. The
'zpool replace' command will issue an error that the new drive
has an 'incompatible sector alignment'. However, by allowing
the ashift to be manual specified as smaller, non-optimal,
value the device may still be safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1381Closes#1328
Issue #967
Issue #548
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@bda8819455https://www.illumos.org/issues/3422https://www.illumos.org/issues/3425
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1390
The assumption in zio_ddt_free() is that ddt_phys_select() must
always find a match. However, if that fails due to a damaged
DDT or some other reason the code will NULL dereference in
ddt_phys_decref().
While this should never happen it has been observed on various
platforms. The result is that unless your willing to patch the
ZFS code the pool is inaccessible. Therefore, we're choosing
to more gracefully handle this case rather than leave it fatal.
http://mail.opensolaris.org/pipermail/zfs-discuss/2012-February/050972.html
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1308
Enabling metaslab debugging will prevent space maps from being
automatically unloaded. This can significantly increase the
memory footprint but being able to dynamically control this is
helpful for debugging and certain performance testing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The mainline kernel started defining GCC_VERSION with commit
torvalds/linux@3f3f8d2f48.
Unfortunately, LZ4 also defines this macro, but the two
defintions are incompatible. We undefine GCC_VERSION in lz4.c
to handle this.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1339
A few files still refer to @behlendorf's private fork on
github. Use the primary web site URL instead. Two typos
are also corrected.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Provide a mechanism to control the directory name the modules
are installed in. The kernel privdes INSTALL_MOD_DIR for
this but it was hardcoded to be 'addon/zfs'.
Add a KMODDIR variable which can be passed to 'make install'
to override the default directory name. While we're here
change the default from 'addon/zfs' to 'extra' which is the
kernel.org default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices. By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/. When set to
'visible' all zvol snapshots for the dataset will be visible.
This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created. When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions. This leads
to a variety of issues:
1) The zvol partition tables will be read in the context of
the `modprobe zfs` for automatically imported pools. This
is undesirable and should be done asynchronously, but for
now reducing the number of visible devices helps.
2) Udev expects to be able to complete its work for a new
block devices fairly quickly. When many zvol devices are
added at the same time this is no longer be true. It can
lead to udev timeouts and missing /dev/zvol links.
3) Simply having lots of devices in /dev/ can be aukward from
a management standpoint. Hidding the devices your unlikely
to ever use helps with this. Any snapshot device which is
needed can be made visible by changing the snapdev property.
NOTE: This patch changes the default behavior for zvols which
was effectively 'snapdev=visible'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1235Closes#945
Issue #956
Issue #756
The changes to zvol.c were never merged from the last onnv_147
bulk update. This was because zvol.c was largely rewritten
for Linux making it fairly easy to miss these sorts of changes.
This causes a regression when importing a zpool with zvols
read-only. This does not impact pool which only contain
filesystem datasets.
References:
illumos/illumos-gate@f9af39b
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1332Closes#1333
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.
Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:
vdev_opts_t 52
zil_replay_func_t 24
zio_compress_info_t 24
zio_checksum_info_t 9
space_map_ops_t 7
arc_byteswap_func_t 5
The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1300
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL). On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.
This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable. It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.
To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.
1) Functions which open the device but only read had the
O_EXCL flag removed and were updated to use O_RDONLY.
2) A common holder was added to the vdev disk code. This
allow the ZFS code to internally open the device multiple
times but non-ZFS callers may not.
3) An exception was added to make_disks() for hot spare when
creating partition tables. For hot spare devices which
are already opened exclusively we skip creating the partition
table because this must already have been done when the disk
was originally added as a hot spare.
Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix. And is_spare() was moved
above make_disks() to avoid adding a forward reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#250
As described by the comment and enforced the by assertion the
v->vdev_wholedisk will never be -1. The wholedisk handling
is performed by the user space utilities. To prevent confusion
this dead code is being removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When vdev_disk.c was implemented for Linux we failed to handle the
reopen case. According to the vdev_reopen() comment leaf vdevs should
not be closed or opened when v->vdev_reopening is set. Under Linux
we would always close and open the device.
This issue was only noticed when a 'zpool scrub' command was run while
the leaf vdev device names in /dev/disk/by-vdev were missing. The
scrub command calls vdev_reopen() which caused the vdevs to be closed
but they couldn't be reopened due to the missing links. The result
was that all the vdevs were marked unavailable and the pool was
halted due to failmode=wait.
This patch adds the missing functionality in a similiar fashion to
to the Illumos code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.
Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.
As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.
This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.
Thanks to Richard Kojedzinszky for catching this nasty bug.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1318
The zfs_arc_memory_throttle_disable module option was introduced
by commit 0c5493d470 to resolve a
memory miscalculation which could result in the txg_sync thread
spinning.
When this was first introduced the default behavior was left
unchanged until enough real world usage confirmed there were no
unexpected issues. We've now reached that point. Linux's
direct reclaim is working as expected so we're enabling this
behavior by default.
This helps pave the way to retire the spl_kmem_availrmem()
functionality in the SPL layer. This was the only caller.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #938
A couple of assertions in spa.c were designed to prevent the use of
invalid pool versions. They were written under the assumption
that all valid pools are less than SPA_VERSION. Since feature flags
jumped from 28 to 5000, any numbers in the range 28 to 5000
non-inclusive will fail to trigger them. We switch to the new
SPA_VERSION_IS_SUPPORTED macro to correct this.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1282
It turns out that the Linux VFS doesn't strictly handle all cases
where a component path name exceeds MAXNAMELEN. It does however
appear to correctly handle MAXPATHLEN for us.
The right way to handle this appears to be to add an explicit
check to the zpl_lookup() function. Several in-tree filesystems
handle this case the same way.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1279
Two more locations where KM_SLEEP was used in a call which must
use KM_PUSHPAGE were found while using the zpool upgrade command.
See commit b8d06fc for additional details.
Also make a small correction to the comment block above
dsl_dir_open_spa().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1268
Explicitly case this value to an unsigned long long for 32-bit
systems to inform the compiler that a long type should not be
used. Otherwise we get the following compiler error:
dmu_send.c:376: error: integer constant is too large for
‘long’ type
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The way in which virtual box ab(uses) memory can throw off the
free memory calculation in arc_memory_throttle(). The result is
the txg_sync thread will effectively spin waiting for memory to
be released even though there's lots of memory on the system.
To handle this case I'm adding a zfs_arc_memory_throttle_disable
module option largely for virtual box users. Setting this option
disables free memory checks which allows the txg_sync thread to
make progress.
By default this option is disabled to preserve the current
behavior. However, because Linux supports direct memory reclaim
it's doubtful throttling due to perceived memory pressure is ever
a good idea. We should enable this option by default once we've
done enough real world testing to convince ourselve there aren't
any unexpected side effects.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#938
Commit 1eb5bfa introduced a new zfs_disable_dup_eviction tunable.
It should have been made available as a module option in the
original patch but was overlooked.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When a system attribute layout is created an inconsistency may occur
between the system attribute header (sa_hdr_phys_t) size and the
variable-sized attribute count stored in the layout. The inconsistency
results in the following failed assertion when SA_HDR_SIZE_MATCH_LAYOUT
returns false:
SPLError: 11315:0:(sa.c:1541:sa_find_idx_tab())
ASSERTION((IS_SA_BONUSTYPE(bonustype) && SA_HDR_SIZE_MATCH_LAYOUT(hdr,
tb)) || !IS_SA_BONUSTYPE(bonustype) || (IS_SA_BONUSTYPE(bonustype) &&
hdr->sa_layout_info == 0)) failed
The bug originates in this snippet from sa_find_sizes().
if (is_var_sz && var_size > 1) {
if (P2ROUNDUP(hdrsize + sizeof (uint16_t),
*total < full_space) {
hdrsize += sizeof (uint16_t);
This assumes that the current variable-sized attribute will be stored in
the current buffer and accounts for the space needed to store its size
in the sa_hdr_phys_t. However if the next attribute spills over we need
to store a blkptr_t at the end of the bonus buffer to point to the spill
block. If the current attribute is in the way of the blkptr_t then it
too will be relocated into the spill block. But since we've already
accounted for it in the header size we get the inconsistency described
above.
To avoid this, record the index of the last variable-sized attribute
that prompted a hdrsize increase, and reverse the increase if we later
determine that that attribute will be relocated to the spill block.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1250
A rounding discrepancy exists between how sa_build_layouts() and
sa_find_sizes() calculate when the spill block needs to be kicked in.
This results in a narrow size range where sa_build_layouts() believes
there must be a spill block allocated but due to the discrepancy there
isn't. A panic then occurs when the hdl->sa_spill NULL pointer is
dereferenced.
The following reproducer for this bug was isolated:
truncate -s 128m /tmp/tank
zpool create tank /tmp/tank
zfs create -o xattr=sa tank/fish
ln -s `perl -e 'print "z" x 41'` /tank/fish/z
setfattr -hn trusted.foo -v`perl -e 'print "z"x45'` /tank/fish/z
This test results in roughly the following system attribute (SA)
layout:
176 bytes - "standard" SA's
41 bytes - name of symbolic link target
100 bytes - XDR encoded nvlist for xattr
---
317 bytes - total
Because 317 is less than DN_MAX_BONUSLEN (320), sa_find_sizes()
decides no spill block is needed. But sa_build_layouts() rounds 41 up
to 48 when computing the space requirements so it tries to switch to
the spill block.
Note that we were only able to reproduce this bug using a combination
of symbolic links and the Linux-specific xattr=sa dataset property.
So while this issue is not technically Linux-specific, it may be
difficult or impossible to hit the narrow size range needed to
reproduce it on other platforms.
To fix the discrepancy, round the running total in sa_find_sizes() up
to an 8-byte boundary before accounting for each SA, since this is how
they will be stored in the bonus and (possibly) spill buffers.
To make the intent of the code more clear, explicitly assert key
assumptions about expected alignment of data and whether spill-over
will occur.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1240
3035 LZ4 compression support in ZFS and GRUB
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>
References:
illumos/illumos-gate@a6f561b4aehttps://www.illumos.org/issues/3035http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS
This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux. Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.
Support for GRUB has been dropped from this patch. That code
is available but those changes will need to made to the upstream
GRUB package.
Lastly, several hunks of dead code were dropped for clarity. They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.
Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1217
Explicitly set acl details to zero to silence gcc (zfs_acl_node_read
can't be sure zfs_acl_znode_info will set acl_count and aclsize).
Normally suppressing these warnings by setting this to zero at
declaration time is a bad idea but in this instance it's hard to
avoid and should be fairly safe.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1244
Retire the dmu_snapshot_id() function which was introduced in the
initial .zfs control directory implementation. There is already
an existing dsl_dataset_snap_lookup() which does exactly what we
need, and the dmu_snapshot_id() function as implemented is racy.
https://github.com/zfsonlinux/zfs/issues/1215#issuecomment-12579879
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1238
Added d_clear_d_op() helper function which clears some flags and the
registered dentry->d_op table. This is required because d_set_d_op()
issues a warning when the dentry operations table is already set.
For the .zfs control directory to work properly we must be able to
override the default operations table and register custom .d_automount
and .d_revalidate callbacks.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1230
Callers of zap_deref_leaf() must be careful to drop leaf->l_rwlock
since that function returns with the lock held on success. All other
callers drop the lock correctly but it seems fzap_cursor_move_to_key()
does not. This may block writers or cause VERIFY failures when the
lock is freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1215Closeszfsonlinux/spl#143Closeszfsonlinux/spl#97
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter. In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.
Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1226
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1223
Commit 65d56083b4 fixes the lock
inversion between spa_namespace_lock and bdev->bd_mutex but only
for the first user of spa_namespace_lock: dmu_objset_own().
Later spa_namespace_lock gets acquired by dsl_prop_get_integer()
though dsl_prop_get()->dsl_dataset_hold()->dsl_dir_open_spa()->
spa_open()->spa_open_common() without this "protection". By
moving the mutex release after this second use, even this
acquisition of the lock is "protected" by the ERESTARTSYS trick.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1220
This reverts commit 53c7411919
effectively reinstating the asynchronous xattr cleanup code.
These Linux changes were reverted because after testing
and careful contemplation I was convinced that due to the
89260a1c8851ce05ea04b23606ba438b271d890 commit they were no
longer required.
Unfortunately, the deadlock described in #1176 was a case
which wasn't considered. At mount zfs_unlinked_drain() can
occur which will unlink a list of znodes in effectively a
random order which isn't safe. The only reason it was safe
to originally revert this change was the we could guarantee
that the VFS would always prune the xattr leaves before the
parents.
Therefore, until we can cleanly resolve this deadlock for
all cases we need to keep this change in spite of the xattr
unlink performance penalty associated with it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1176
Issue #457
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL. The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.
Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back. Then a new inode with the same inode number would
be create and collide with the existing cached inode. Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.
Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.
The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly. If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.
This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled. The inode
will then only be accessible from the cached dentries. Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated. Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.
Special care is taken for negative dentries since they do not
reference any inode. These dentires will be invalidate based
on when they were added to the dentry cache. Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.
Two nice side effects of this fix are:
* Removes the dependency on spl_invalidate_inodes(), it can now
be safely removed from the SPL when we choose to do so.
* zfs_znode_alloc() no longer requires a dentry to be passed.
This effectively reverts this portition of the code to its
upstream counterpart. The dentry is not instantiated more
correctly in the Linux ZPL layer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#795
Lookups in the snapshot control directory for an existing snapshot
fail with ENOENT if an earlier lookup failed before the snapshot was
created. This is because the earlier lookup causes a negative dentry
to be cached which is never invalidated.
The bug can be reproduced as follows (the second ls should succeed):
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
$ zfs snap tank@s
$ ls /tank/.zfs/snapshot/s
ls: cannot access /tank/.zfs/snapshot/s: No such file or directory
To remedy this, always invalidate cached dentries in the snapshot
control directory. Since these entries never exist on disk there is
no significant performance penalty for the extra lookups.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1192
A misplaced single quote caused the umount command to fail with a
syntax error when unmounting snapshots under the .zfs/snapshot
control directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1210
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@8f0b538d1d
changeset: 13818:e9ad0a945d45
https://www.illumos.org/issues/3189
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2618 arc.c mistypes in the comments
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@fc98fea58e
illumos changeset: 13721:5b51a16a186f
https://www.illumos.org/issues/2618
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process). A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with this
change.
One visible consequence of this change was that processes accessing
automounted snapshots received an ELOOP error because they failed to
wait for zfs.mount to complete.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#816
This reverts commit 7afcf5b1da which
accidentally introduced a regression with the .zfs snapshot directory.
While the updated code still does correctly mount the requested
snapshot. It updates the vfsmount such that it references the
original dataset vfsmount. The result is that the snapshot itself
isn't visible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #816
Related to 91579709fc we need to
be very careful about not overrunning the stack in kernel space.
However, in user space we're already allowing slightly larger
stacks so this stack usage optimization is not required there.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To save valuable stack all zio's were made asynchronous when in the
tgx_sync_thread context or during pool initialization. See commit
2fac4c2 for the original patch and motivation.
Unfortuantely, the changes to dsl_pool_sync_context() made by the
feature flags broke this logic causing in __zio_execute() to dispatch
itself infinitely when called during pool initialization. This
commit refines the existing logic to specificly target only the two
cases we care about.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@25345e4666https://www.illumos.org/issues/3349
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@57221772c3https://www.illumos.org/issues/2762
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@dfbb943217
illumos changeset: 13777:b1e53580146d
https://www.illumos.org/issues/3090https://www.illumos.org/issues/3102
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#939
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@53089ab7c8illumos/illumos-gate@ad135b5d64
illumos changeset: 13700:2889e2596bd6
https://www.illumos.org/issues/2619https://www.illumos.org/issues/2747
NOTE: The grub specific changes were not ported. This change
must be made to the Linux grub packages.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c
} else {
int i;
ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
/* the blkptrs we are losing better be unallocated */
for (i = dn->dn_next_nblkptr[txgoff];
i < dnp->dn_nblkptr; i++)
ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
This usage is in fact safe, since the ASSERT ensures the index does
not exceed to maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning. To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body. Since a dnode contains
at most 3 block pointers this overhead should be negligible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#950
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several 100 seconds.
The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device. This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.
This change isn't in any way harmful but it does conflict with
a required change to properly account from I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely on io_schedule() via cv_wait_io().
The additional debugging information which was added to the
delay event has been intentionally left in place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock. But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.
To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock(). Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.
When it is contended we risk a lock inversion if we were to
block waiting for the lock. Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again. This process can be repeated safely until both
locks are acquired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#612
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit(). This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.
However, under Linux I'm not seeing the same contention that
was originally described. Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD. If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.
This just leaves one small ZFS TSD consumer. If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/spl#174
The current state of udev and devicer-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:
$ ls -d /dev/dm-*
/dev/dm-0
/dev/dm-1
/dev/dm-2
/dev/dm-3
it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:
$ ls -d /dev/sd*
/dev/sda
/dev/sda1
/dev/sdb
/dev/sdb1
Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.
As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.
This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1149
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented. This prevented
a zpool from use a zvol for its slog device.
This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c. Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system. If
they match we know the device is a zvol.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1131
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed. That change
accidentally resulted in unexpected ctime updates which upset tools
like git. That was always a horrible hack and I'm happy it will
never make it in to a tagged release.
The right fix is something I initially resisted doing because I
was worried about the additional overhead. However, in hindsight
the overhead isn't as bad as I feared.
This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied. We leverage this
callback to keep the znode SAs strictly in sync with the inode.
However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime. This will cover the callpath of most concern to us.
->filemap_page_mkwrite->file_update_time->update_time->
mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764Closes#1140
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#816
Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers. While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.
I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel. But frankly
the compatibility code to use this kernel interface has always
been problematic.
Therefore, this patch abandons direct use the of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1053
When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning. See commit
b8d06fca08 for additional details.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1138
This reverts commit b00131d43c which
is no longer needed due to e89260a1c8.
This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently. This effectively prevents
the deadlock condition from ever occuring.
Therefore we can safely revert back to the upstream synchronous
cleanup code. This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#457
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().
But more importantly than preserving the exact time add the missing
call to sa_bulk_update(). This ensures that the znode modifications
are written to disk as part of the transaction. Without this the
inode may mistaken rollback to the previous on-disk znode state.
Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode. This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.
Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent. Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it. Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.
This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.
The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX. Otherwise we
run the risk of deadlocking on the open create TX.
Ideally the security xattr should be fully constructed
before the new inode is unlocked. However, doing so would
require far more extensive changes to ZFS.
This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#671
Add the missing error handling to load_nvlist(). There's no good
reason this needs to be fatal. All callers of load_nvlist() do
correctly handle an error condition and it is preferable that an
error be returned. This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1120
Due to the slightly increased size of the ZFS super block
caused by 30315d2 there are now allocation warnings. The
allocation size is still small (just over 8k) and super
blocks are rarely allocated so we suppress the warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1101
If zvol_alloc() fails zv will be set to NULL and dereferenced
in out_dmu_objset_disown. To avoid this entirely the zv->objset
line is moved up in to the success block.
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1109
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Refererces to Illumos issue:
https://www.illumos.org/issues/2671
This patch has been slightly modified from the upstream Illumos
version. In the upstream implementation a warning message is
logged to the console. To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.
The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool. Depending on your workload
this may impact pool performance. Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.
NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.
Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#955
Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().
Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record. The mmap
case which relies on zfs_putpage() was not taking the range
lock. This code was accidentally dropped when the function
was rewritten for the Linux VFS.
This patch adds the required range locking to zfs_putpage().
It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upsteam
code and there's no harm in being extra careful here.
Original-patch-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#541
When using a zvol to back a btrfs filesystem the btrfs mount
would hang. This was due to the bio completion callback used
in btrfs assuming that lower level drivers would never modify
the bio->bi_io_vecs after they were submitted via bio_submit().
If they are modified btrfs will miscalculate which pages need
to be unlocked resulting in a hang.
It's worth mentioning that other file systems such as ext[234]
and xfs work fine because they do not make the same assumption
in the bio completion callback.
The most straight forward way to fix the issue is to present
the semantics expected by btrfs. This is done by cloning the
bios attached to each request and then using the clones bvecs
to perform the required accounting. The clones are freed after
each read/write and the original unmodified bios are linked back
in to the request.
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#469
There have been reports of ZFS deadlocking due to what appears to
be a lost IO. This patch addes some debugging to determine the
exact state of the IO which neither 1) completed, 2) failed, or
3) timed out after zio_delay_max (30) seconds.
This information will be logged using the ZFS FMA infrastructure
as a 'delay' event and posted to the internal zevent log. By
default the last 64 events will be kept in the log but the limit
is configurable via the zfs_zevent_len_max module option.
To dump the contents of the log use the 'zpool events -v' command
and look for the resource.fs.zfs.delay event. It will include
various information about the pool, vdev, and zio which may shed
some light on the issue.
In the context of this change the 120 second kernel blocked thread
watchdog has been disabled for synchronous IOs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #930
Create a kstat file which contains useful statistics about the
last N txgs processed. This can be helpful when analyzing pool
performance. The new KSTAT_TYPE_TXG type was added for this
purpose and it tracks the following statistics per-txg.
txg - Unique txg number
state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
birth; - Creation time
nread - Bytes read
nwritten; - Bytes written
reads - IOPs read
writes - IOPs write
open_time; - Length in nanoseconds the txg was open
quiesce_time - Length in nanoseconds the txg was quiescing
sync_time; - Length in nanoseconds the txg was syncing
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The interface for the ddt_zap_count() function assumes it can
never fail. However, internally ddt_zap_count() is implemented
with zap_count() which can potentially fail. Now because there
was no way to return the error to the caller a VERIFY was used
to ensure this case never happens.
Unfortunately, it has been observed that pools can be damaged in
such a way that zap_count() fails. The result is that the pool can
not be imported without hitting the VERIFY and crashing the system.
This patch reworks ddt_object_count() so the error can be safely
caught and returned to the caller. This allows a pool which has
be damaged in this way to be safely rewound for import.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#910
This reverts commit a5c20e2a0a which
accidentally introduced a regression for real 4k sector devices.
See issue #1065 for details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1065
The following warning was originally added to provide visibility
in to how often a dio gets heavily fragmented in to over 16 bios.
This can happen due to constraints imposed by the block device
and may have a negitive impact on performance but is otherwise
harmless. To prevent needless confusion and worry the message
has been removed.
kernel: WARNING: Resized bio's/dio to 32
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When automounting a snapshot in the .zfs/snapshot directory
make sure to quote both the dataset name and the mount point.
This ensures that if either component contains spaces, which
are allowed, they get handled correctly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1027
In the current code, logbias=throughput implies the following:
1) All synchronous writes are logged in indirect mode.
2) The slog is not used.
(1) makes sense because it avoids writing the data twice, which is
obviously a good thing when the user wants maximum pool throughput.
(2), however, is a surprising decision. Considering all writes are
indirect, the log record doesn't contain the actual data, only pointers
to DMU blocks. As a result, log records written in logbias=throughput
mode are quite small, and as such, it doesn't make any sense to write
them to the main pool since slogs are usually optimized for small
synchronous writes.
In fact, the current behavior is actually harmful for performance,
because log blocks and data blocks from dmu_sync() seldom have the same
allocation size and as a result are usually allocated from different
metaslabs. This means that if a spindle has to write both log blocks and
DMU blocks (which is likely to happen under heavy load), it will have to
seek between the two. Allocating the log blocks from the slog pool
instead of the main pool avoids these unnecessary seeks.
This commit makes ZFS use the slog on datasets with logbias=throughput.
Real-life performance testing shows a 50% synchronous write performance
increase with some large commit sizes, and no negative effect in other
cases.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
Currently, ZIL blocks are spread over vdevs using hint block pointers
managed by the ZIL commit code and passed to metaslab_alloc(). Spreading
log blocks accross vdevs is important for performance: indeed, using
mutliple disks in parallel decreases the ZIL commit latency, which is
the main performance metric for synchronous writes. However, the current
implementation suffers from the following issues:
1) It would be best if the ZIL module was not aware of such low-level
details. They should be handled by the ZIO and metaslab modules;
2) Because the hint block pointer is managed per log, simultaneous
commits from multiple logs might use the same vdevs at the same time,
which is inefficient;
3) Because dmu_write() does not honor the block pointer hint, indirect
writes are not spread.
The naive solution of rotating the metaslab rotor each time a block is
allocated for the ZIL or dmu_sync() doesn't work in practice because the
first ZIL block to be written is actually allocated during the previous
commit. Consequently, when metaslab_alloc() decides the vdev for this
block, it will do so while a bunch of other allocations are happening at
the same time (from dmu_sync() and other ZILs). This means the vdev for
this block is chosen more or less at random. When the next commit
happens, there is a high chance (especially when the number of blocks
per commit is slightly less than the number of the disks) that one disk
will have to write two blocks (with a potential seek) while other disks
are sitting idle, which defeats spreading and increases the commit
latency.
This commit introduces a new concept in the metaslab allocator:
fastwrites. Basically, each top-level vdev maintains a counter
indicating the number of synchronous writes (from dmu_sync() and the
ZIL) which have been allocated but not yet completed. When the metaslab
is called with the FASTWRITE flag, it will choose the vdev with the
least amount of pending synchronous writes. If there are multiple vdevs
with the same value, the first matching vdev (starting from the rotor)
is used. Once metaslab_alloc() has decided which vdev the block is
allocated to, it updates the fastwrite counter for this vdev.
The rationale goes like this: when an allocation is done with
FASTWRITE, it "reserves" the vdev until the data is written. Until then,
all future allocations will naturally avoid this vdev, even after a full
rotation of the rotor. As a result, pending synchronous writes at a
given point in time will be nicely spread over all vdevs. This contrasts
with the previous algorithm, which is based on the implicit assumption
that blocks are written instantaneously after they're allocated.
metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to
manually increase or decrease fastwrite counters, respectively. They
should be used with caution, as there is no per-BP tracking of fastwrite
information, so leaks and "double-unmarks" are possible. There is,
however, an assert in the vdev teardown code which will fire if the
fastwrite counters are not zero when the pool is exported or the vdev
removed. Note that as stated above, marking is also done implictly by
metaslab_alloc().
ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to
the metaslab when allocating (assuming ZIO does the allocation, which is
only true in the case of dmu_sync). This flag will also trigger an
unmark when zio_done() fires.
A side-effect of the new algorithm is that when a ZIL stops being used,
its last block can stay in the pending state (allocated but not yet
written) for a long time, polluting the fastwrite counters. To avoid
that, I've implemented a somewhat crude but working solution which
unmarks these pending blocks in zil_sync(), thus guaranteeing that
linguering fastwrites will get pruned at each sync event.
The best performance improvements are observed with pools using a large
number of top-level vdevs and heavy synchronous write workflows
(especially indirect writes and concurrent writes from multiple ZILs).
Real-life testing shows a 200% to 300% performance increase with
indirect writes and various commit sizes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
The following incorrect usage of cv_broadcast() was caught by
code inspection. The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following incorrect usage of cv_signal and cv_broadcast()
was caught by code inspection. The cv_signal and cv_broadcast()
functions must be called under the associated mutex to preventing
racing with cv_wait().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following incorrect usage of cv_broadcast() was caught by
code inspection. The cv_broadcast() function must be called
under the associated mutex to preventing racing with cv_wait().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1038
Prevent users from setting the zfs_vdev_aggregation_limit tuning
larger than SPA_MAXBLOCKSIZE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#520
Otherwise it will cause zpl_shares_lookup() to return a invalid
pointer when an error occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes#626#885#947#977
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create. Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup. Instead
only the inamedata->flags are passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.
ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
The .write_super callback was removed the the super_operations
structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e.
All file systems are now expected to self manage writing any dirty
state assoicated with their super block.
ZFS never made use of this callback so it can simply be removed
from the super_operations structure.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
Currently, the size of read and write requests on vdevs is aligned
according to the vdev's ashift, allocating a new ZIO buffer and padding
if need be.
This makes sense for write requests to prevent read/modify/write if the
write happens to be smaller than the device's internal block size.
For reads however, the rationale is less clear. It seems that the
original code aligns reads because, on Solaris, device drivers will
outright refuse unaligned requests.
We don't have that issue on Linux. Indeed, Linux block devices are able
to accept requests of any size, and take care of alignment issues
themselves.
As a result, there's no point in enforcing alignment for read requests
on Linux. This is a nice optimization opportunity for two reasons:
- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks
the path that the additional, useless padding data has to travel.
For example, when using 4k-sector drives that lie about their sector
size, using 512b read requests instead of 4k means that there will
be less data traveling down the ATA/SCSI interface, even though the
drive actually reads 4k from the platter.
The only exception is raidz, because raidz needs to read the whole
allocated block for parity.
This patch removes alignment enforcement for read requests, except on
raidz. Note that we also remove an assertion that checks that we're
aligning a top-level vdev I/O, because that's not the case anymore for
repair writes that results from failed reads.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1022
There are currently three vmem_size() consumers all of which are
part of the ARC implemention. However, since the expected behavior
of the Linux and Solaris virtual memory subsystems are so different
the behavior in each of these instances needs to be reevaluated.
* arc_evict_needed() - This is actually dead code. Arena support
was never added to the SPL and zio_arena is always NULL. This
support isn't needed so we simply remove this dead code.
* arc_memory_throttle() - On Solaris where virtual memory constitutes
almost all of the address space we can reasonably expect there to be
a fairly large amount free. However, on Linux by default we only
have about 100MB total and that's heavily used by the ARC. So the
expectation on Linux is that this will usually be a small value.
Therefore we remove the vmem_size() check for i386 systems because
the expectation is that it will be less than the zfs_write_limit_max.
* arc_init() - Here vmem_size() is used to initially size the ARC.
Since the ARC is currently backed by the virtual address space it
makes sense to use this as a limit on the ARC for 32-bit systems.
This code can be removed when the ARC is backed by the page cache.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#831
Allow the zfs_txg_timeout variable to be dynamically tuned at run
time. By pulling it down out of the variable declaration it will
be evaluted each time through the loop.
The zfs_txg_timeout variable is now declared extern in a the common
sys/txg.h header rather than locally in dsl_scan.c. This prevents
potential type mismatches if the global variable needs to be used
elsewhere.
Move the module_param() code in to the same source file where
zfs_txg_timeout is declared. This is the most logical location.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit c409e4647f introduced a
number of module parameters. This required several types to be
changed to accomidate the required module parameters Linux macros.
Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart. If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.
The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.
We correct that by changing the extern declaration to use the unsigned
long type and move these extern definitions in to the common arc.h
header. This should make ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.
Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#749
zfs_immediate_write_sz variable is a tunable, but lacks proper
module_param() instrumentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1032
Term 'transaction group' is commonly abbreviated as txg in ZFS sources.
There are some places (Linux specific MODULE_PARAM_DESC() macros)
where it is incorrectly spelled as 'tgx'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1030
It doesn't make sense for a zvol to use the default system I/O
scheduler because it is a virtual device. Therefore, we change
the default scheduler to 'noop' for zvols provided that the
elevator_change() function is available. This interface has
been available since Linux 2.6.36 and appears in the RHEL 6.x
kernels.
We deliberately do not implement the method for older kernels
because it was racy and could result in system crashes. It's
better to simply manually tune the scheduler for these kernels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1017
Currently, when processing DISCARD requests, zvol_discard() calls
dmu_free_long_range() with the precise offset and size of the request.
Unfortunately, this is not optimal for requests that are not aligned to
the zvol block boundaries. Indeed, in the case of an unaligned range,
dnode_free_range() will zero out the unaligned parts. Not only is this
useless since we are not freeing any space by doing so, it is also slow
because it translates to a read-modify-write operation.
This patch fixes the issue by rounding up the discard start offset to
the next volume block boundary, and rounding down the discard end
offset to the previous volume block boundary.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1010
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fc for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1002
illumos/illumos-gate@2e2c135528
Illumos changeset: 13780:6da32a929222
3100 zvol rename fails with EBUSY when dirty
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>
Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#995
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.
Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#906
In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook. All it takes there
is to make sure changes are committed to ZIL. Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#969
Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel
doesn't expect and as such we can oops.
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#949Closes#931Closes#789Closes#743Closes#730
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #973
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.
The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#933
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.
The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit. Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#961
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed. This will result in a
NULL dereference as shown by the stack trace is issue #782.
This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#782
The 'zfs destroy' changes in 330d06f disrupted how zvol devices
get removed on ZoL. However, it basically boils down to the
fact that we are no longer reliably calling zvol_remove_minor()
via zfs_ioc_destroy_snaps().
Therefore we add the missing call and handle things similarly
to the existing zfs_unmount_snap() case. Ideally we would check
if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the
right thing as in zfs_ioc_destroy(). However, it looks like
it would be fairly expensive to get the type, and it's harmless
to simply attempt the umount and minor removal.
This is also an issue in the latest FreeBSD and Illumos code.
It was being tracked under the following issue, and we may want
to refresh our code when they settle on what they want to do
about it upstream.
https://www.illumos.org/issues/3170
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #903
Use ZFS dataset fsid guid as a unique file system id, similar to what is
done on Illumos/OpenSolaris.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#888
Buffers for the ARC are normally backed by the SPL virtual slab.
However, if memory is low, AND no slab objects are available,
AND a new slab cannot be quickly constructed a new emergency
object will be directly allocated.
These objects can be as large as order 5 on a system with 4k
pages. And because they are allocated with KM_PUSHPAGE, to
avoid a potential deadlock, they are not allowed to initiate I/O
to satisfy the allocation. This can result in the occasional
allocation failure.
However, since these allocations are allowed to block and
perform operations such as memory compaction they will eventually
succeed. Since this is not unexpected (just unlikely) behavior
this patch disables the warning for the allocation failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #465
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback. However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit. There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.
This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds. One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#700Closes#907
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages. If a lock is held over this operation the potential exists to
deadlock the system. To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.
Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread. However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.
This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.
This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected. In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected. When debugging is enabled
any misuse will be treated as a fatal error.
This patch is entirely for debugging. We should be careful to
NOT become dependant on it fixing up the incorrect allocations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests. Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...
1) These buffers are short lived. They are only required for
the life of a single I/O at which point they can be used by
the next I/O.
2) The maximum number of concurrent buffers needed by a vdev is
small. It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.
By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it. This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock. This is
particularly important when memory is already low on the system.
It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation. The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0ccf1f eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated. The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC. That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks. This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.
The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it. This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.
As such, we revert this patch in favor of a proper fix for both issues.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC. Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries. However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):
ENOTEMPTY
pathname contains entries other than . and .. ; or, pathname has
.. as its final component. POSIX.1-2001 also allows EEXIST for
this condition.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#895
When calling sa_update() and friends it is possible that a spill
buffer will be needed to accomidate the update. When this happens
a hold is taken on the new dbuf and that hold must be released
before calling dmu_tx_commit(). Failing to release the hold will
cause a copy of the dbuf to be made in dbuf_sync_leaf(). This is
done to ensure further updates to the dbuf never sneak in to the
syncing txg.
This could be left to the sa_update() caller. But then the caller
would need to be aware of this internal SA implementation detail.
It is therefore preferable to handle this all internally in the
SA implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#503Closes#513
This reverts commit ec2626ad3f which
caused consistency problems between the shared and private handles.
Reverting this change should resolve issues #709 and #727. It
will also reintroduce an arc_anon memory leak which is addressed
by the next commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#709Closes#727
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
While I'd like to remove the various pragmas in module/zfs/dbuf.c.
There are consumers such as Lustre which still depend on dmu_buf_*
versions of the symbols. Until all consumers can be converted to
use only the dbuf_* names leave this symbol exported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When mutex debugging is enabled in your kernel the increased
size of the mutex structures can push the zfs_sb_t type beyond
the 8k warning threshold. This isn't harmful so we suppress
the warning for this case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#628
Export these symbols so they may be used by other ZFS consumers
besides the ZPL.
Remove three stale prototype definites from dbuf.h. The actual
implementations of these functions were removed/renamed long ago.
It would be good in the long term to remove the existing pragmas
we inherited from Solaris and simply use the dbuf_* names.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1693
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#678
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.
In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).
With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#862
The number of blocks that can be discarded in one BLKDISCARD ioctl on a
zvol is currently unlimited. Some applications, such as mkfs, discard
the whole volume at once and they use the maximum possible discard size
to do that. As a result, several gigabytes discard requests are not
uncommon.
Unfortunately, if a large amount of data is allocated in the zvol, ZFS
can be quite slow to process discard requests. This is especially true
if the volblocksize is low (e.g. the 8K default). As a result, very
large discard requests can take a very long time (seconds to minutes
under heavy load) to complete. This can cause a number of problems, most
notably if the zvol is accessed remotely (e.g. via iSCSI), in which case
the client has a high probability of timing out on the request.
This patch solves the issue by adding a new tunable module parameter:
zvol_max_discard_blocks. This indicates the maximum possible range, in
zvol blocks, of one discard operation. It is set by default to 16384
blocks, which appears to be a good tradeoff. Using the default
volblocksize of 8K this is equivalent to 128 MB. When using the maximum
volblocksize of 128K this is equivalent to 2 GB.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#858
1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
destroying multiple snapshots
1708 adjust size of zpool history data
References:
https://www.illumos.org/issues/1644https://www.illumos.org/issues/1645https://www.illumos.org/issues/1646https://www.illumos.org/issues/1647https://www.illumos.org/issues/1708
This commit modifies the user to kernel space ioctl ABI. Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated. This change has reordered
all of the new ioctl()s to the end of the list. This should
help minimize this issue in the future.
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#826Closes#664
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building ZFS as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the zfs source package.
To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the zfs
source package.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict(). This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.
However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics. This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#784
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes. This interface used to
take the child dentry and a bool describing if the parent is needed.
NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
The .zfs control directory implementation currently relies on
the fact that there is a direct 1:1 mapping from an object id
to its inode number. This works well as long as the system
uses a 64-bit value to store the inode number.
Unfortunately, the Linux kernel defines the inode number as
an 'unsigned long' type. This means that for 32-bit systems
will only have 32-bit inode numbers but we still have 64-bit
object ids.
This problem is particularly acute for the .zfs directories
which leverage those upper 32-bits. This is done to avoid
conflicting with object ids which are allocated monotonically
starting from 0. This is likely to also be a problem for
datasets on 32-bit systems with more than ~2 billion files.
The right long term fix must remove the simple 1:1 mapping.
Until that's done the only safe thing to do is to disable the
.zfs directory on 32-bit systems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the missing error handling to ddt_object_load(). There's no
good reason this needs to be fatal. It is preferable that an
error be returned. This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The '__attribute__((always_inline))' does not strictly imply
'inline'. Newer versions of gcc detect this misuse and issue
the following warning. Including the missing 'inline' resolves
the build warning.
./module/zfs/dsl_scan.c:758:1:error: always_inline function
might not be inlinable [-Werror=attributes]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a
dialect of C that relies on a GCC plugin. In particular, struct
file_operations has been marked do_const in the PaX/GRSecurity dialect,
which causes GCC to consider all instances of it as const. This caused
failures in the autotools checks and the ZFS source code.
To address this, we modify the autotools checks to take into account
differences between the PaX C dialect and the regular C dialect. We also
modify struct zfs_acl's z_ops member to be a pointer to a function
pointer table. Lastly, we modify zpl_put_link() to address a PaX change
to the function prototype of nd_get_link(). This avoids compiler errors
in the PaX/GRSecurity dialect.
Note that the change in zpl_put_link() causes a warning that becomes a
build failure when debugging is enabled. Fixing that warning requires
ryao/spl@5ca50ef459.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#484
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.
This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.
Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.
This change also depends on API changes which are available in 2.6.37
and latter kernels. The build system has been updated to detect this,
but there is no compatibility mode for older kernels. This means that
online expansion will NOT be available in older kernels. However, it
will still be possible to expand the vdev offline.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#808
1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>
References:
https://www.illumos.org/issues/1949https://www.illumos.org/issues/1953
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#665
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1748
This commit modifies the user to kernel space ioctl ABI. Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated. If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#665
When the ddt_zap_lookup() function was updated to dynamically
allocate memory for the cbuf variable, to save stack space, the
'csize <= sizeof (cbuf)' assertion was not updated. The result
of this was that the size of the pointer was being used in the
comparison rather than the buffer size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
The performance of the ZIL is usually the main bottleneck when dealing with
synchronous, write-heavy workloads (e.g. databases). Understanding the
behavior of the ZIL is required to diagnose performance issues for these
workloads, and to tune ZIL parameters (like zil_slog_limit) accordingly.
This commit adds a new kstat page dedicated to the ZIL with some counters
which, hopefully, scheds some light into what the ZIL is doing, and how it is
doing it.
Currently, these statistics are available in /proc/spl/kstat/zfs/zil.
A description of the fields can be found in zil.h.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#786
FreeBSD #xxx: Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:
# zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot
properties.
Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:
before:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
NOTE: This patch only appears in FreeBSD. If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
ZoL can create more zvols at runtime than can be configured during
system start, which hangs the init stack at reboot.
When a slow system has more than a few hundred zvols, udev will
fork bomb during system start and spend too much time in device
detection routines, so upstart kills it.
The zfs_inhibit_dev option allows an affected system to be rescued
by skipping /dev/zd* creation and thereby avoiding the udev
overload. All zvols are made inaccessible if this option is set, but
the `zfs destroy` and `zfs send` commands still work, and ZFS
filesystems can be mounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zil_slog_limit specifies the maximum commit size to be written to the separate
log device. Larger commits bypass the separate log device and go directly to
the data devices.
The optimal value for zil_slog_limit directly depends on the latency and
throughput characteristics of both the separate log device and the data disks.
Small synchronous writes are faster on low-latency separate log devices (e.g.
SSDs) whereas large synchronous writes are faster on high-latency data disks
(e.g. spindles) because of higher throughput, especially with a large array.
The point is, the line between "small" and "large" synchronous writes in this
scenario is heavily dependent on the hardware used. That's why it should be
made configurable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#783
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:
error: implicit declaration of function 'd_alloc_root'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current d_make_root()
interface for readability. Then we introduce an autotools check
to determine if d_make_root() is available. If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#776
The logbias option is not taken into account when writing to ZVOLs. We fix
that by using the same logic as in the zfs filesystem write code
(see zfs_log.c).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#774
This reverts commit ce90208cf9. This
change was observed to cause problems when using a zvol to back a VM
under 2.6.32.59 kernels. This issue was filed as #710.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #342
Issue #710
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#701
Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#342
23bdb07d4e updated the ARC memory limits
to be 1/2 of memory or all but 4GB. Unfortunately, these values assume
zero internal fragmentation in the SLUB allocator, when in reality, the
internal fragmentation could be as high as 50%, effectively doubling
memory usage. This poses clear safety issues, because it permits the
size of ARC to exceed system memory.
This patch changes this so that the default value of arc_c_max is always
1/2 of system memory. This effectively limits the ARC to the memory that
the system has physically installed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#660
Under Solaris the ARC was designed to stay one step ahead of the
VM subsystem. It would attempt to recognize low memory situtions
before they occured and evict data from the cache. It would also
make assessments about if there was enough free memory to perform
a specific operation.
This was all possible because Solaris exposes a fairly decent
view of the memory state of the system to other kernel threads.
Linux on the other hand does not make this information easily
available. To avoid extensive modifications to the ARC the SPL
attempts to provide these same interfaces. While this works it
is not ideal and problems can arise when the ARC and Linux have
different ideas about when your out of memory. This has manifested
itself in the past as a spinning arc_reclaim_thread.
This patch abandons the emulated Solaris interfaces in favor of
the prefered Linux interface. That means moving the bulk of the
memory reclaim logic out of the arc_reclaim_thread and in to the
evict driven shrinker callback. The Linux VM will call this
function when it needs memory. The ARC is then responsible for
attempting to free the requested amount of memory if possible.
Several interfaces have been modified to accomidate this approach,
however the basic user space implementation remains the same.
The following changes almost exclusively just apply to the kernel
implementation.
* Removed the hdr_recl() reclaim callback which is redundant
with the broader arc_shrinker_func().
* Reduced arc_grow_retry to 5 seconds from 60. This is now used
internally in the ARC with arc_no_grow to indicate that direct
reclaim was recently performed. This typically indicates a
rapid change in memory demands which the kswapd threads were
unable to keep ahead of. As long as direct reclaim is happening
once every 5 seconds arc growth will be paused to avoid further
contributing to the existing memory pressure. The more common
indirect reclaim paths will not set arc_no_grow.
* arc_shrink() has been extended to take the number of bytes by
which arc_c should be reduced. This allows for a more granual
reduction of the arc target. Since the kernel provides a
reclaim value to the arc_shrinker_func() this value is used
instead of 1<<arc_shrink_shift.
* arc_reclaim_needed() has been removed. It was used to determine
if the system was under memory pressure and relied extensively
on Solaris specific VM interfaces. In most case the new code
just checks arc_no_grow which indicates that within the last
arc_grow_retry seconds direct memory reclaim occurred.
* arc_memory_throttle() has been updated to always include the
amount of evictable memory (arc and page cache) in its free
space calculations. This space is largely available in most
call paths due to direct memory reclaim.
* The Solaris pageout code was also removed to avoid confusion.
It has always been disabled due to proc_pageout being defined
as NULL in the Linux port.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Expose the zfs_mdcomp_disable variable as a module option. This
can be used to disable compression of zfs meta data which is
enabled by default. This shouldn't need to be tuned but for
most workloads, however there may be very specific instances
where it makes sense to trade disk capacity for extra cpu cycles.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
Refererces to Illumos issue:
https://www.illumos.org/issues/1909
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#680
There is potential for deadlock in the l2arc_feed thread if KM_PUSHPAGE
is not used for the allocations made in l2arc_write_buffers.
Specifically, if KM_PUSHPAGE is not used for these allocations, it is
possible for reclaim to be triggered which can cause the l2arc_feed
thread to deadlock itself on the ARC_mru mutex. An example of this is
demonstrated in the following backtrace of the l2arc_feed thread:
crash> bt 4123
PID: 4123 TASK: ffff88062f8c1500 CPU: 6 COMMAND: "l2arc_feed"
0 [ffff88062511d610] schedule at ffffffff814eeee0
1 [ffff88062511d6d8] __mutex_lock_slowpath at ffffffff814f057e
2 [ffff88062511d748] mutex_lock at ffffffff814f041b
3 [ffff88062511d768] arc_evict at ffffffffa05130ca [zfs]
4 [ffff88062511d858] arc_adjust at ffffffffa05139a9 [zfs]
5 [ffff88062511d878] arc_shrink at ffffffffa0513a95 [zfs]
6 [ffff88062511d898] arc_kmem_reap_now at ffffffffa0513be8 [zfs]
7 [ffff88062511d8c8] arc_shrinker_func at ffffffffa0513ccc [zfs]
8 [ffff88062511d8f8] shrink_slab at ffffffff8112a17a
9 [ffff88062511d958] do_try_to_free_pages at ffffffff8112bfdf
10 [ffff88062511d9e8] try_to_free_pages at ffffffff8112c3ed
11 [ffff88062511da98] __alloc_pages_nodemask at ffffffff8112431d
12 [ffff88062511dbb8] kmem_getpages at ffffffff8115e632
13 [ffff88062511dbe8] fallback_alloc at ffffffff8115f24a
14 [ffff88062511dc68] ____cache_alloc_node at ffffffff8115efc9
15 [ffff88062511dcc8] __kmalloc at ffffffff8115fbf9
16 [ffff88062511dd18] kmem_alloc_debug at ffffffffa047b8cb [spl]
17 [ffff88062511dda8] l2arc_feed_thread at ffffffffa0511e71 [zfs]
18 [ffff88062511dea8] thread_generic_wrapper at ffffffffa047d1a1 [spl]
19 [ffff88062511dee8] kthread at ffffffff81090a86
20 [ffff88062511df48] kernel_thread at ffffffff8100c14a
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
1356 zfs dataset prefetch code not working
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/1346https://www.illumos.org/issues/1356
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#647
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/1475
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#648
1952 memory leak when adding a file-based l2arc device
1954 leak in ZFS from metaslab_group_create and zfs_ereport_checksum
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
References to Illumos issues:
https://www.illumos.org/issues/1951https://www.illumos.org/issues/1952https://www.illumos.org/issues/1954
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#650
This change appears to be exclusive to SmartOS. It is not present in
illumos-gate but it just adds some needed error handling. This is
clearly preferable to simply ASSERTING which is what would occur
prior to the patch.
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#652
vdev_tsd can be NULL for certain vdev states.
At least in userland testing with ztest.
References to Illumos issue:
https://www.illumos.org/issues/1680
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#655
Principly these symbols were exported to get access to the
dsl_prop_register/dsl_prop_unregister functions. They allow
us to cleanly register a callback which is called when a
dataset property is modified.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When zpl_fill_super -> zfs_domount fails (e.g. because the dataset
was destroyed before it could be successfully mounted) the subsequent
call to zpl_kill_sb -> zfs_preumount would derefence a NULL pointer.
This bug can be reproduced using this shell script:
#!/bin/sh
(
while true; do
zfs create -o mountpoint=legacz tank/bar
zfs destroy tank/bar
done
) &
(
while true; do
mount -t zfs tank/bar /mnt
umount /mnt
done
) &
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#639
Due to a typo the mru ghost lists stats were accidentally being
exposed as the mfu ghost list stats. This was harmless but
confusing since memory usage could be over reported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging. When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.
This checking is particularly helpful when adding new dmu consumers
like Lustre. However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.
--enable-debug-dmu-tx - Enable/disable validation of each tx as
--disable-debug-dmu-tx it is constructed. By default validation
is disabled due to performance concerns.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following assertion is good to validate the correctness of
new DMU consumers, but it doesn't quite provide enough information.
Slightly rework the assertion so that when it is hit the actual
offending values will be included in the output.
SPLError: 4787:0:(dmu_tx.c:828:dmu_tx_dirty_buf())
ASSERTION(dn == NULL || dn->dn_assigned_txg == tx->tx_txg) failed
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Include the ZFS_META_RELEASE in the module load/unload messages
to more clearly indidcate exactly what version of ZFS has been
loaded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Because the .zfs ctldir inodes are not backed by physical storage
they use a different create path which was not properly accounting
for them as used. This could result in ->nr_cached_objects()
returning 0 and cause a divide by zero error in prune_super().
In my option there's a kernel bug here too which allows this to
happen. They should either be checking for 0 or adding +1 like
they correctly do earlier in the function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#617
Add support for the .zfs control directory. This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required. The bulk of the core
functionality is now all there with the following limitations.
*) The .zfs/snapshot directory automount support requires a 2.6.37
or newer kernel. The exception is RHEL6.2 which has backported
the d_automount patches.
*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
in the .zfs/snapshot directory works as expected. However,
this functionality is only available to root until zfs
delegations are finished.
* mkdir - create a snapshot
* rmdir - destroy a snapshot
* mv - rename a snapshot
The following issues are known defeciences, but we expect them to
be addressed by future commits.
*) Add automount support for kernels older the 2.6.37. This should
be possible using follow_link() which is what Linux did before.
*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
The majority of the ground work for this is complete. However,
finishing this work will require resolving some lingering
integration issues with the Linux NFS kernel server.
*) The .zfs/shares directory exists but no futher smb functionality
has yet been implemented.
Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#173
Add a standard zio constructor and destructor. Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations. However, in this
case none of the operations moved out of zio_create() were really
very expensive.
This change was principly made as a debug patch (and workaround)
for a zio_destroy() race. The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread. This would completely explain the observed symptoms in
the issue report.
This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor. Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.
Once the real root cause is determined and resolved this change
can be safely reverted. Until then this should help workaround
the issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
This patch was slightly flawed and allowed for zio->io_logical
to potentially not be reinitialized for a new zio. This could
lead to assertion failures in specific cases when debugging is
enabled (--enable-debug) and I/O errors are encountered. It
may also have caused problems when issues logical I/Os.
Since we want to make sure this workaround can be easily removed
in the future (when we have the real fix). I'm reverting this
change and applying a new version of the patch which includes
the zio->io_logical fix.
This reverts commit 2c6d0b1e07.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #602
Issue #604
The xattr_resolve_name() helper function expects the registered
list of xattr handlers to be NULL terminated. This NULL was
accidentally missing which could result in a NULL dereference.
Interestingly this issue only manifested itself on certain 32-bit
systems. Presumably on 64-bit kernels we just always happen to
get lucky and the memory following the structure is zeroed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #594
Add a SA interface which allows us to release the spill block
from a SA handle without destroying the handle. This is useful
because we can then ensure that a copy of the dirty spill block
is not made at sync time due to the extra hold. Susequent calls
to sa_update() or sa_lookup() with transparently refetch the
spill block dbuf from the ARC hash.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a standard zio constructor and destructor. Normally, this is
done to reduce to cost of allocating a new structure by reducing
expensive operations such as memory allocations. However, in this
case none of the operations moved out of zio_create() were really
very expensive.
This change was principly made as a debug patch (and workaround)
for a zio_destroy() race. The is good evidence that zio_create()
is reinitializing a mutex which is really still in use by another
thread. This would completely explain the observed symptoms in
the issue report.
This patch doesn't fix the root cause of the race, but it should
make it less likely by only initializing the mutex once in the
constructor. Also, this particular flaw might have gone unnoticed
in other zfs implementations due to the specific implementation
details of Linux ticket spinlocks.
Once the real root cause is determined and resolved this change
can be safely reverted. Until then this should help workaround
the issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #496
A private SA handle must be used to ensure we can drop the dbuf
hold on the spill block prior to calling dmu_tx_commit(). If we
call dmu_tx_commit() before sa_handle_destroy(), then our hold
will trigger a copy of the dbuf to be made. This is done to
prevent data from leaking in to the syncing txg. As a result
the original dirty spill block will remain cached.
Additionally, relying on the shared zp->z_sa_hdl is unsafe in
the xattr context because the znode may be asynchronously dropped
from the cache. It's far safer and simpler just to use a private
handle for xattrs. Plus any additional overhead is offset by
the avoidance of the previously mentioned memory copy.
These forever dirty buffers can be noticed in the arcstats under
the anon_size. On a quiescent system the value should be zero.
Without this fix and a SA xattr write workload you will see
anon_size increase. Eventually, if enough dirty data builds up
your system it will appear to hang. This occurs because the dmu
won't allow new txs to be assigned until that dirty data is
flushed, and it won't be because it's not part of an assigned tx.
As an aside, I typically see anon_size lurk around 16k so I think
there is another place in the code which needs a similar fix.
However, this value doesn't grow over time so it isn't critical.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #503
Issue #513
Keep counters for the various reasons that a thread may end up
in txg_wait_open() waiting on a new txg. This can be useful
when attempting to determine why a particular workload is
under performing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To ensure the arc is behaving properly we need greater visibility
in to exactly how it's managing the systems memory. This patch
takes one step in that direction be adding the current arc_state_t
for the anon, mru, mru_ghost, mfu, and mfs_ghost lists. The l2
arc_state_t is already well represented in the arcstats.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export additional symbols to make use of the DMU's zero-copy
API. This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
New SSDs are now available which use an internal 8k block size.
To make sure ZFS can get the maximum performance out of these
devices we're increasing the maximum ashift to 13 (8KB).
This value is still small enough that we can fit 16 uberblocks
in the vdev ring label. However, I don't want to increase this
any futher or it will limit the ability the safely roll back a
pool to recover it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#565
Exported the required symbols to make use of the DMU's zero-copy
API. This allows external modules to move data in to and out of
the ARC without incurring the cost of a memory copy.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When configuring the spl debug log support use the provided wrapper
functions. This ensures that if --disable-debug-log was used when
buiding the spl the functions will have no effect.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.
We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.
Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.
To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().
Note that this only works for regular files, not ZVOLs.
This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.
This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:
> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.
We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.
On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.
We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#392
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.
Detailed rationale:
- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
handle writes of any size, so there's no reason to impose a limit. Let the
upper layer decide.
- max_segments and max_segment_size are set to unlimited. zvol_write() will
copy the requests' contents into a dbuf anyway, so the number and size of
the segments are irrelevant. Let the upper layer decide.
- physical_block_size and io_opt are set to the ZVOL's block size. This
has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
the upper layers that writes smaller than the volume's block size will be
slow.
- The NONROT flag is set to indicate this isn't a rotational device.
Although the backing zpool might be composed of rotational devices, the
resulting ZVOL often doesn't exhibit the same behavior due to the COW
mechanisms used by ZFS. Setting this flag will prevent upper layers from
making useless decisions (such as reordering writes) based on incorrect
assumptions about the behavior of the ZVOL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:
WRITE: A normal async write. Device will be plugged.
WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down
the hint that someone will be waiting on this IO
shortly.
WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on
non-volatile media on completion.
In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.
The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.
Also, we must inform the block layer that we actually support FLUSH and FUA.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#374.
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check
to detect the API change and then conditionally define the expected
interface. In either case we are only interested in the zfs_sb_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#549
Historically the internal zfs debug infrastructure has been
scattered throughout the code. Since we expect to start making
more use of this code this patch performs some cleanup.
* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
files. This includes moving the zfs_flags and zfs_recover
variables, plus moving the zfs_panic_recover() function.
* Remove the existing unused functionality in zfs_debug.c and
replace it with code which correctly utilized the spl logging
infrastructure.
* Remove the __dprintf() function from zfs_ioctl.c. This is
dead code, the dprintf() functionality in the kernel relies
on the spl log support.
* Remove dprintf() from hdr_recl(). This wasn't particularly
useful and was missing the required format specifier anyway.
* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When using zfs to back a Lustre filesystem it's advantageous to
to store a fid with the object id in the directory zap. The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.
This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id. The zpl
code will just ignore additional entries. This allows the ZoL
count to mount datasets which are being used as Lustre server
backends.
Once the upstream feature flags support is merged in this change
should be updated to a read-only feature. Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export the zfs_attr_table symbol so it may be used by non-zpl
consumers which are still interested in writing a zpl compatible
dataset (e.g. Lustre).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The vdev_is_bootable() restrictions are no longer necessary
with recent GRUB2 code. FreeBSD has implemented the same
change, except that I moved the Solaris comment to be inside
the #ifdef __sun__ block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #317
As described in Issue #458 and #258, unlinking large amounts of data
can cause the threads in the zio free wait queue to start spinning.
Reducing the number of z_fr_iss threads from a fixed value of 100 to 1
per cpu signficantly reduces contention on the taskq spinlock and
improves throughput.
Instrumenting the taskq code showed that __taskq_dispatch() can spend
a long time holding tq->tq_lock if there are a large number of threads
in the queue. It turns out the time spent in wake_up() scales
linearly with the number of threads in the queue. When a large number
of short work items are dispatched, as seems to be the case with
unlink, the worker threads drain the queue faster than the dispatcher
can fill it. They then all pile into the work wait queue to wait for
new work items. So if 100 threads are in the queue, wake_up() takes
about 100 times as long, and the woken threads have to spin until the
dispatcher releases the lock.
Reducing the number of threads helps with the symptoms, but doesn't
get to the root of the problem. It would seem that wake_up()
shouldn't scale linearly in time with queue depth, particularly if we
are only trying to wake up one thread. In that vein, I tried making
all of the waiting processes exclusive to prevent the scheduler from
iterating over the entire list, but I still saw the linear time
scaling. So further investigation is needed, but in the meantime
reducing the thread count is an easy workaround.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Issue #458
Make the indenting in the zpl_xattr.c file consistent with the Sun
coding standard by removing soft tabs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The security_inode_init_security() API has been changed to include
a filesystem specific callback to write security extended attributes.
This was done to support the initialization of multiple LSM xattrs
and the EVM xattr.
This change updates the code to use the new API when it's available.
Otherwise it falls back to the previous implementation.
In addition, the ZFS_AC_KERNEL_6ARGS_SECURITY_INODE_INIT_SECURITY
autoconf test has been made more rigerous by passing the expected
types. This is done to ensure we always properly the detect the
correct form for the security_inode_init_security() API.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#516
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block. Prior
to this change there was one shared global shrinker.
The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded. This would cause the VFS to drop
references on a fraction of the dentries in the dcache. The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit. Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.
This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit. The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning. Thus we
can minimize any impact on the caching of other filesystems.
In the context of making this change several other important
issues related to managing the ARC were addressed, they include:
* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback. The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code. This callback can also be used by
other ARC consumers such as Lustre.
arc_add_prune_callback()
arc_remove_prune_callback()
* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity. The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.
* Less aggressively invoke the prune callback. We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes. It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.
* More promptly manage exceeding the arc_meta_limit. When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.
* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache. Remember
this will only occur when the ARC has no other choice. If it
can evict buffers safely without invoking the prune callback
it will.
* This change is also expected to resolve the unexpect collapses
of the ARC cache. This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink(). This effectively shrunk the entire cache
when really we just needed to reclaim meta data.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#466Closes#292
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit
SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170
Use the new set_nlink() kernel function instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
It has been observed that some of the hottest locks are those
of the zio taskqs. Contention on these locks can limit the
rate at which zios are dispatched which limits performance.
This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock. This has the effect
of improving system performance.
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/734
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#482
The zvol_major and zvol_threads module options were being created
with 0 permission bits. This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`. This patch fixes the issue by updating
the permission bits to 0444. For the moment these options must
be read-only because they are used during module initialization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.
To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB. The rational for
this is as follows:
* Desktop Systems (less than 8GB of memory)
Limiting the ARC to 1/2 of memory is desirable for desktop
systems which have highly dynamic memory requirements. For
example, launching your web browser can suddenly result in a
demand for several gigabytes of memory. This memory must be
reclaimed from the ARC cache which can take some time. The
user will experience this reclaim time as a sluggish system
with poor interactive performance. Thus in this case it is
preferable to leave the memory as free and available for
immediate use.
* Server Systems (more than 8GB of memory)
Using all but 4GB of memory for the ARC is preferable for
server systems. These systems often run with minimal user
interaction and have long running daemons with relatively
stable memory demands. These systems will benefit most by
having as much data cached in memory as possible.
These values should work well for most configurations. However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC. This can still be accomplished
by setting the 'zfs_arc_max' module option.
Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation. Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory. How much more memory get's
consumed will be determined by how badly fragmented the slabs are.
In the long term this can be mitigated by slab defragmentation code
which was OpenSolaris solution. Or preferably, using the page cache
to back the ARC under Linux would be even better. See issue #75
for the benefits of more tightly integrating with the page cache.
This change also fixes a issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory. The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken. This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call. Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.
The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes. This was done simply to be consistent with the Solaris
behavior.
Upon futher reflection I believe this should be allowed under
Linux. The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system. This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#272
The current ZFS implementation stores xattrs on disk using a hidden
directory. In this directory a file name represents the xattr name
and the file contexts are the xattr binary data. This approach is
very flexible and allows for arbitrarily large xattrs. However,
it also suffers from a significant performance penalty. Accessing
a single xattr can requires up to three disk seeks.
1) Lookup the dnode object.
2) Lookup the dnodes's xattr directory object.
3) Lookup the xattr object in the directory.
To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk. When
the xattr is to large to store in the inode then a single external
block is allocated for them. In practice most xattrs are small
and this approach works well.
The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization. When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block. If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.
This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below). When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.
However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations. If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.
----------------------------------------------------------------------
Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
| setxattr | getxattr
bytes | ext4 xfs zfs-dir zfs-sa | ext4 xfs zfs-dir zfs-sa
------+--------------------------------+------------------------------
1 | 2.33 31.88 21.50 4.57 | 2.35 2.64 6.29 2.43
32 | 2.79 30.68 21.98 4.60 | 2.44 2.59 6.78 2.48
256 | 3.25 31.99 21.36 5.92 | 2.32 2.71 6.22 3.14
1024 | 3.30 32.61 22.83 8.45 | 2.40 2.79 6.24 3.27
4096 | 3.57 317.46 22.52 10.73 | 2.78 28.62 6.90 3.94
16384 | n/a 2342.39 34.30 19.20 | n/a 45.44 145.90 7.55
65536 | n/a 2941.39 128.15 131.32* | n/a 141.92 256.85 262.12*
Legend:
* ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir - Directory based xattrs only.
* zfs-sa - Prefer SAs but spill in to directories as needed, a
trailing * indicates overflow in to directories occured.
NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size. Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.
Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblock in the vdev label
ring buffer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#425
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback. In
addition imutex is no longer held by the caller and the callback
is responsible for taking the lock if required.
This commit updates the code to provide a zpl_fsync() function
for the updated API. Implementations for the previous two APIs
are also maintained for compatibility.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#445
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Fix an unlikely failure cause in zfs_sb_create() which could
leave the dataset owned on error and thus unavailable until
after a reboot. Disown the dataset if SA are expected but
are in fact missing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound. A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.
It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory. By default a kmem_cache will
dynamically determine the type of memory used based on the object
size. For large objects virtual memory is usually preferable
and for small object physical memory is a better choice. See
the spl_slab_alloc() function for a longer discussion on this.
However, there is a certain amount of gray area when defining a
'large' object. For the following caches it turns out they were
just over the line:
* dnode_cache
* zio_cache
* zio_link_cache
* zio_buf_512_cache
* zfs_data_buf_512_cache
Now because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized.
We can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses. This entirely avoids the
need to serialize on the virtual address space lock.
As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block of this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure. It may have already been set when entering
zpl_putpage() in which case it must remain set on exit. In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim. By clearing
it we allow the following NULL deref to potentially occur.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #287
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#431
When calculating space needed for SA_BONUS buffers, hdrsize is
always rounded up to next 8-aligned boundary. However, in two places
the round up was done against sum of 'total' plus hdrsize. On the
other hand, hdrsize increments by 4 each time, which means in certain
conditions, we would end up returning with will_spill == 0 and
(total + hdrsize) larger than full_space, leading to a failed
assertion because it's invalid for dmu_set_bonus.
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/1661
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#426
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline. Change these error messages
to use the zfsonlinux.org mirror instead.
This commit depends on:
zfsonlinux/zfsonlinux.github.com@8e10ead3dc
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Register the setattr/getattr callbacks for symlinks. Without these
the generic inode_setattr() and generic_fillattr() functions will
be used. In the setattr case this will only result in the inode being
updated in memory, the dirty_inode callback would also normally run
but none is registered for zfs.
The straight forward fix is to set the setattr/getattr callbacks
for symlinks so they are handled just like files and directories.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#412
An incomplete guid_to_ds_map would cause restore_write_byref() to fail
while receiving a de-duplicated backup stream.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D`Amore <garrett@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/755
- https://github.com/illumos/illumos-gate/commit/ec5cf9d53a
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#372
Export all symbols already marked extern in the zfs_vfsops.h
header. Several non-static symbols have also been added to
the header and exportewd. This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.
Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base. All other zfsvfs_* functions have already been renamed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export all the symbols for the system attribute (SA) API. This
allows external module to cleanly manipulate the SAs associated
with a dnode. Documention for the SA API can be found in the
module/zfs/sa.c source.
This change also removes the zfs_sa_uprade_pre, and
zfs_sa_uprade_post prototypes. The functions themselves were
dropped some time ago.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Due to the confusion in Linux statfs between f_frsize and f_bsize
the blocks counts were changed to be in units of z_max_blksize
instead of SPA_MINBLOCKSIZE as it is on other platforms.
However, the free files calculation in zfs_statvfs() is limited by
the free blocks count, since each dnode consumes one block/sector.
This provided a reasonable estimate of free inodes, but on Linux
this meant that the free inodes count was underestimated by a large
amount, since 256 512-byte dnodes can fit into a 128kB block, and
more if the max blocksize is increased to 1MB or larger.
Also, the use of SPA_MINBLOCKSIZE is semantically incorrect since
DNODE_SIZE may change to a value other than SPA_MINBLOCKSIZE and
may even change per dataset, and devices with large sectors setting
ashift will also use a larger blocksize.
Correct the f_ffree calculation to use (availbytes >> DNODE_SHIFT)
to more accurately compute the maximum number of dnodes that can
be created.
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#413Closes#400
Export all the symbols for the ZAP API. This allows external modules
to cleanly interface with ZAP type objects. Previously only a subset
of the functionality was exposed. Documention for the ZAP API can be
found in the sys/zap.h header.
This change also removes a duplicate zap_increment_int() prototype.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Suppress the warning for this large kmem_alloc() because it is not
that far over the warning threshhold (8k) and it is short lived.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Caught by code inspection, the variable zsb was referenced after
being freed. Move the kmem_free() to the end of the function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This warning was accidentally introduced by commit
f3ab88d646 which updated the
.readpages() implementation. The fix is to simply cast
the helper function to the appropriate type when passed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Unlike the .readpage() callback which is passed a single locked page
to be populated. The .readpages() callback is passed a list of unlocked
pages which are all marked for read-ahead (PG_readahead set). It is
the responsibly of .readpages() to ensure to pages are properly locked
before being populated.
Prior to this change the requested read-ahead pages would be updated
outside of the page lock which is unsafe. The unlocked pages would then
be unlocked again which is harmless but should have been immediately
detected as bug. Unfortunately, newer kernels failed detect this issue
because the check is done with a VM_BUG_ON which is disabled by default.
Luckily, the old Debian Lenny 2.6.26 kernel caught this because it
simply uses a BUG_ON.
The straight forward fix for this is to update the .readpages() callback
to use the read_cache_pages() helper function. The helper function will
ensure that each page in the list is properly locked before it is passed
to the .readpage() callback. In addition resolving the bug, this results
in a nice simplification of the existing code.
The downside to this change is that instead of passing one large read
request to the dmu multiple smaller ones are submitted. All of these
requests however are marked for readahead so the lower layers should
issue a large I/O regardless. Thus most of the request should hit the
ARC cache.
Futher optimization of this code can be done in the future is a perform
analysis determines it to be worthwhile. But for the moment, it is
preferable that code be correct and understandable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#355
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct. In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache. This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.
Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds. However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.
This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean. When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction. The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.
This process is normally entirely asynchronous and will be repeated
for every dirty page. This may initially sound inefficient but most
of these pages will end up in a few txgs. That means when they are
eventually written to disk they should be nicely batched. However,
there is room for improvement. It may still be desirable to batch
up the pages in to larger writes for the dmu. This would reduce
the number of callbacks and small 4k buffer required by the ARC.
Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set. Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The function txg_delay() is used to delay txg (transaction group)
threads in ZFS. The timeout value for this function is calculated
using:
int timeout = ddi_get_lbolt() + ticks;
Later, the actual wait is performed:
while (ddi_get_lbolt() < timeout &&
tx->tx_syncing_txg < txg-1 && !txg_stalled(dp))
(void) cv_timedwait(&tx->tx_quiesce_more_cv, &tx->tx_sync_lock,
timeout - ddi_get_lbolt());
The ddi_get_lbolt() function returns current uptime in clock ticks
and is typed as clock_t. The clock_t type on 64-bit architectures
is int64_t.
The "timeout" variable will overflow depending on the tick frequency
(e.g. for 1000 it will overflow in 28.855 days). This will make the
expression "ddi_get_lbolt() < timeout" always false - txg threads will
not be delayed anymore at all. This leads to a slowdown in ZFS writes.
The attached patch initializes timeout as clock_t to match the return
value of ddi_get_lbolt().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #352
Prior to revision 11314 if a user was recursively destroying
snapshots of a dataset the target dataset was not required to
exist. The zfs_secpolicy_destroy_snaps() function introduced
the security check on the target dataset, so since then if the
target dataset does not exist, the recursive destroy is not
performed. Before 11314, only a delete permission check on
the snapshot's master dataset was performed.
Steps to reproduce:
zfs create pool/a
zfs snapshot pool/a@s1
zfs destroy -r pool@s1
Therefore I suggest to fallback to the old security check, if
the target snapshot does not exist and continue with the destroy.
References to Illumos issue and patch:
- https://www.illumos.org/issues/1043
- https://www.illumos.org/attachments/217/recursive_dataset_destroy.patch
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place. There is a very
good description of the issue and fix in Illumus #883.
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Add a "REFRATIO" property, which is the compression ratio based on
data referenced. For snapshots, this is the same as COMPRESSRATIO,
but for filesystems/volumes, the COMPRESSRATIO is based on the
data "USED" (ie, includes blocks in children, but not blocks
shared with the origin).
This is needed to figure out how much space a filesystem would
use if it were not compressed (ignoring snapshots).
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Mark Musante <Mark.Musante@oracle.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/1092
- https://github.com/illumos/illumos-gate/commit/187d6ac08a
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full. It should instead fail quickly on the full LUNs and
move onto devices which have more availability.
Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Note that with the current ZFS code, it turns out that the vdev
cache is not helpful, and in some cases actually harmful. It
is better if we disable this. Once some time has passed, we
should actually remove this to simplify the code. For now we
just disable it by setting the zfs_vdev_cache_size to zero.
Note that Solaris 11 has made these same changes.
References to Illumos issue and patch:
- https://www.illumos.org/issues/175
- https://github.com/illumos/illumos-gate/commit/b68a40a845
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Hypothesis about what's going on here.
At some time in the past, something, i.e. dnode_reallocate()
calls one of:
dbuf_rm_spill(dn, tx);
These will do:
dbuf_rm_spill(dnode_t *dn, dmu_tx_t *tx)
dbuf_free_range(dn, DMU_SPILL_BLKID, DMU_SPILL_BLKID, tx)
dbuf_undirty(db, tx)
Currently dbuf_undirty can leave a spill block in dn_dirty_records[],
(it having been put there previously by dbuf_dirty) and free it.
Sometime later, dbuf_sync_list trips over this reference to free'd
(and typically reused) memory.
Also, dbuf_undirty can call dnode_clear_range with a bogus
block ID. It needs to test for DMU_SPILL_BLKID, similar to
how dnode_clear_range is called in dbuf_dirty().
References to Illumos issue and patch:
- https://www.illumos.org/issues/764
- https://github.com/illumos/illumos-gate/commit/3f2366c2bb
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Mark.Maybe@oracle.com
Reviewed by: Albert Lee <trisk@nexenta.com
Approved by: Garrett D'Amore <garrett@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Update two kmem_alloc()'s in dbuf_dirty() to use KM_PUSHPAGE.
Because these functions are called from txg_sync_thread we
must ensure they don't reenter the zfs filesystem code via
the .writepage callback. This would result in a deadlock.
This deadlock is rare and has only been observed once under
an abusive mmap() write workload.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies. This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.
Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.
This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code. There have been minor
modifications to accomidate the Linux VFS but that is all.
This code fixes issue #116 by bootstraping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory. This ensures it is created properly with
support for system attributes. Previously it wasn't which
is why it behaved differently that all other directories
when modified.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#116
Newly created files were always being created with the fsuid/fsgid
in the current users credentials. This is correct except in the
case when the parent directory sets the 'setgit' bit. In this
case according to posix the newly created file/directory should
inherit the gid of the parent directory. Additionally, in the
case of a subdirectory it should also inherit the 'setgit' bit.
Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#262
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels. This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against. This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.
The fix for this issue is to only remove extraneous build products
when DESTDIR is set. This almost exclusively indicates we are
building packages and installed the build products in to a temporary
staging location. Additionally, limit the removal the unneeded
build products to the target kernel version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#328
Disable the normal reclaim path for zpl_putpage(). This ensures that
all memory allocations under this call path will never enter direct
reclaim. If this were to happen the VM might try to write out
additional pages by calling zpl_putpage() again resulting in a
deadlock.
This sitution is typically handled in Linux by marking each offending
allocation GFP_NOFS. However, since much of the code used is common
it makes more sense to use PF_MEMALLOC to flag the entire call tree.
Alternately, the code could be updated to pass the needed allocation
flags but that's a more invasive change.
The following example of the above described deadlock was triggered
by test 074 in the xfstest suite.
Call Trace:
[<ffffffff814dcdb2>] down_write+0x32/0x40
[<ffffffffa05af6e4>] dnode_new_blkid+0x94/0x2d0 [zfs]
[<ffffffffa0597d66>] dbuf_dirty+0x556/0x750 [zfs]
[<ffffffffa05987d1>] dmu_buf_will_dirty+0x81/0xd0 [zfs]
[<ffffffffa059ee70>] dmu_write+0x90/0x170 [zfs]
[<ffffffffa0611afe>] zfs_putpage+0x2ce/0x360 [zfs]
[<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
[<ffffffffa06287b2>] zpl_writepage+0x12/0x20 [zfs]
[<ffffffff8115f907>] writeout+0xa7/0xd0
[<ffffffff8115fa6b>] move_to_new_page+0x13b/0x170
[<ffffffff8115fed4>] migrate_pages+0x434/0x4c0
[<ffffffff811559ab>] compact_zone+0x4fb/0x780
[<ffffffff81155ed1>] compact_zone_order+0xa1/0xe0
[<ffffffff8115602c>] try_to_compact_pages+0x11c/0x190
[<ffffffff811200bb>] __alloc_pages_nodemask+0x5eb/0x8b0
[<ffffffff8115464a>] alloc_pages_current+0xaa/0x110
[<ffffffff8111e36e>] __get_free_pages+0xe/0x50
[<ffffffffa03f0e2f>] kv_alloc+0x3f/0xb0 [spl]
[<ffffffffa03f11d9>] spl_kmem_cache_alloc+0x339/0x660 [spl]
[<ffffffffa05950b3>] dbuf_create+0x43/0x370 [zfs]
[<ffffffffa0596fb1>] __dbuf_hold_impl+0x241/0x480 [zfs]
[<ffffffffa0597276>] dbuf_hold_impl+0x86/0xc0 [zfs]
[<ffffffffa05977ff>] dbuf_hold_level+0x1f/0x30 [zfs]
[<ffffffffa05a9dde>] dmu_tx_check_ioerr+0x4e/0x110 [zfs]
[<ffffffffa05aa1f9>] dmu_tx_count_write+0x359/0x6f0 [zfs]
[<ffffffffa05aa5df>] dmu_tx_hold_write+0x4f/0x70 [zfs]
[<ffffffffa0611a6d>] zfs_putpage+0x23d/0x360 [zfs]
[<ffffffffa062875e>] zpl_putpage+0x1e/0x60 [zfs]
[<ffffffff811221f9>] write_cache_pages+0x1c9/0x4a0
[<ffffffffa0628738>] zpl_writepages+0x18/0x20 [zfs]
[<ffffffff81122521>] do_writepages+0x21/0x40
[<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
[<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
[<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
[<ffffffff8119c6cb>] wb_do_writeback+0xbb/0x240
[<ffffffff811308ea>] bdi_forker_task+0x6a/0x310
[<ffffffff8108ddf6>] kthread+0x96/0xa0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#327
When modifing overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages. Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.
The most straight forward fix is to simply not take the zfs range
lock in the mmap(2) case. The individual pages will still be locked
thus serializing access. Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do. This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.
This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks. It's up to each filesystem
to perform its own file locking. Under Linux the VFS provides many
of these services.
It may be possible/desirable at a latter date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now its safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
This commit fixes a regression which was accidentally introduced by
the Linux 2.6.39 compatibility chanages. As part of these changes
instead of holding an active reference on the namepsace (which is
no longer posible) a reference is taken on the super block. This
reference ensures the super block remains valid while it is in use.
To handle the unlikely race condition of the filesystem being
unmounted concurrently with the start of a 'zfs send/recv' the
code was updated to only take the super block reference when there
was an existing reference. This indicates that the filesystem is
active and in use.
Unfortunately, in the 'zfs recv' case this is not the case. The
newly created dataset will not have a super block without an
active reference which results in the 'dataset is busy' error.
The most straight forward fix for this is to simply update the
code to always take the reference even when it's zero. This
may expose us to very very unlikely concurrent umount/send/recv
case but the consequences of that are minor.
Closes#319
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper. However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date. Unfortunately, this isn't
the case for the blksize, blocks, and atime fields. At the
moment the authoritative values are still stored in the znode.
This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode. At some
latter date we should be able to strictly use the inode values
and further improve performance.
The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex. This overhead is
unavoidable until the inode is kept strictly up to date. The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros. These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation. However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.
=================== Performance Tests ========================
This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop. The test results
show the zfs stat(2) performance is now only 22% slower than
ext4. This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.
filesystem | test-1 test-2 test-3 | average | times-ext4
--------------+-------------------------+---------+-----------
ext4 | 7.785s 7.899s 7.284s | 7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat | 9.224s 9.398s 9.485s | 9.369s | 1.223x
The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files. The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache. As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.
A little surprisingly the zfs cold cache performance is better
than ext4. This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times. By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.
filesystem | cold | hot
--------------+---------+--------
ext4 | 13.318s | 1.040s
zfs-0.6.0-rc4 | 4.982s | 1.762s
zfs-faststat | 4.933s | 1.345s
The performance of the L2ARC can be tweaked by a number of tunables, which
may be necessary for different workloads:
l2arc_write_max max write bytes per interval
l2arc_write_boost extra write bytes during device warmup
l2arc_noprefetch skip caching prefetched buffers
l2arc_headroom number of max device writes to precache
l2arc_feed_secs seconds between L2ARC writing
l2arc_feed_min_ms min feed interval in milliseconds
l2arc_feed_again turbo L2ARC warmup
l2arc_norw no reads during writes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#316
The remaining code that is guarded by HAVE_SHARE ifdefs is related to the
.zfs/shares functionality which is currently not available on Linux.
On Solaris the .zfs/shares directory can be used to set permissions for
SMB shares.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The sharenfs and sharesmb properties depend on the libshare library
to export datasets via NFS and SMB. This commit implements the base
libshare functionality as well as support for managing NFS shares.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Linux you may only disable USER xattrs. The SECURITY,
SYSTEM, and TRUSTED xattr namespaces must always be available
if xattrs are supported by the filesystem. The enforcement
of USER xattrs is performed in the zpl_xattr_user_* handlers.
Under Solaris there is only a single xattr namespace which
is managed globally.
The Linux kernel already has support for mandatory locking. This
change just replaces the Solaris mandatory locking calls with the
Linux equivilants. In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS. However, for now we'll leave it in place even
if it is redundant just in case we missed something.
The original patch to update the code to support mandatory locking
was done by Rohan Puri. This patch is an updated version which is
compatible with the previous mount option handling changes.
Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#222Closes#253
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
Under Linux the VFS handles virtually all of the mmap() access
checks. Filesystem specific checks are left to be handled in
the .mmap() hook and normally there arn't any.
However, ZFS provides a few attributes which can influence the
mmap behavior and should be honored. Note, currently the code
to modify these attributes has not been implemented under Linux.
* ZFS_IMMUTABLE | ZFS_READONLY | ZFS_APPENDONLY: when any of these
attributes are set a file may not be mmaped with write access.
* ZFS_AV_QUARANTINED: when set a file file may not be mmaped with
read or exec access.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The following functions were required for the OpenSolaris mmap
implementation. Because the Linux VFS does most the most heavy
lifting for us they are not required and are being removed to
keep the code clean and easy to understand.
* zfs_null_putapage()
* zfs_frlock()
* zfs_no_putpage()
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Enable zfs_getpage, zfs_fillpage, zfs_putpage, zfs_putapage functions.
The functions have been modified to make them Linux friendly.
ZFS uses these functions to read/write the mmapped pages. Using them
from readpage/writepage results in clear code. The patch also adds
readpages and writepages interface functions to read/write list of
pages in one function call.
The code change handles the first mmap optimization mentioned on
https://github.com/behlendorf/zfs/issues/225
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
Issue #255
According to Linux kernel commit 2c27c65e, using truncate_setsize in
setattr simplifies the code. Therefore, the patch replaces the call
to vmtruncate() with truncate_setsize().
zfs_setattr uses zfs_freesp to free the disk space belonging to the
file. As truncate_setsize may release the page cache and flushing
the dirty data to disk, it must be called before the zfs_freesp.
Suggested-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255
To accomindate the updated Linux 3.0 shrinker API the spl
shrinker compatibility code was updated. Unfortunately, this
couldn't be done cleanly without slightly adjusting the comapt
API. See spl commit a55bcaad18.
This commit updates the ZFS code to use the slightly modified
API. You must use the latest SPL if your building ZFS.
The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:
1. File is created.
2. xattr is created for that file (behind the scenes a xattr
directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
inode from that directory are evicted by prune_icache();
prune_icache() acquires a lock on both inodes before it
calls ->evict() on the inodes
When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.
This can be reliably reproduced like this:
$ touch test
$ attr -s a -V b test
$ rm test
$ echo 3 > /proc/sys/vm/drop_caches
This patch fixes the deadlock by moving the zfs_purgedir() call to
zfs_unlinked_drain(). Instead zfs_rmnode() now checks whether the
xattr dir is empty and leaves the xattr dir in the unlinked set if
it finds any xattrs.
To ensure zfs_unlinked_drain() never accesses a stale super block
zfsvfs_teardown() has been update to block until the iput taskq
has been drained. This avoids a potential race where a file with
an xattr directory is removed and the file system is immediately
unmounted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#266
iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
for us after zfs_zinactive(), thus making sure that the inode is
properly cleaned up.
The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
double-free.
Fixes#282
Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
suffer from bad write performance when ashift is not configured
correctly. This is caused by the disk not reporting its actual sector
size, but a sector size of 512 bytes. The drive may behave this way
for compatibility reasons. For example, the WDC WD20EARS disks are
known to exhibit this behavior.
When creating a zpool, ZFS takes that wrong sector size and sets the
"ashift" property accordingly (to 9: 1<<9=512), whereas it should be
set to 12 for 4k sectors (1<<12=4096).
This patch allows an adminstrator to manual specify the known correct
ashift size at 'zpool create' time. This can significantly improve
performance in certain cases. However, it will have an impact on your
total pool capacity. See the updated ashift property description
in the zpool.8 man page for additional details.
Valid values for the ashift property range from 9 to 17 (512B-128KB).
Additionally, you may set the ashift to 0 if you wish to auto-detect
the sector size based on what the disk reports, this is the default
behavior. The most common ashift values are 9 and 12.
Example:
zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
Closes#280
Original-patch-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER. This was done
to allow richer semantics to be expressed to the block layer.
It is the block layers responsibility to choose the correct way
to implement these semantics.
This change simply updates the bio's to use the new kernel API
which should be absolutely safe. However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.
Closes#281
Stack usage for ddt_class_contains() reduced from 524 bytes to 68
bytes. This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
bytes. This large stack allocation significantly contributed to
the likelyhood of a stack overflow when scrubbing/resilvering
dedup pools.
This abomination is no longer required because the zio's issued
during this recursive call path will now be handled asynchronously
by the taskq thread pool.
This reverts commit 6656bf5621.
The majority of the recursive operations performed by the dsl
are done either in the context of the tgx_sync_thread or during
pool import. It is these recursive operations which contribute
greatly to the stack depth. When this recursion is coupled with
a synchronous I/O in the same context overflow becomes possible.
Previously to handle this case I have focused on keeping the
individual stack frames as light as possible. This is a good
idea as long as it can be done in a way which doesn't overly
complicate the code. However, there is a better solution.
If we treat all zio's issued by the tgx_sync_thread as async then
we can use the tgx_sync_thread stack for the recursive parts, and
the zio_* threads for the I/O parts. This effectively doubles our
available stack space with the only drawback being a small delay
to schedule the I/O. However, in practice the scheduling time
is so much smaller than the actual I/O time this isn't an issue.
Another benefit of making the zio async is that the zio pipeline
is now parallel. That should mean for CPU intensive pipelines
such as compression or dedup performance may be improved.
With this change in place the worst case stack usage observed so
far is 6902 bytes. This is still higher than I'd like but
significantly improved. Additional changes to specific functions
should improve this further. This change allows us to revent
commit 6656bf5 which did some horrible things to the recursive
traverse_visitbp() callpath in the name of saving stack.
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux. While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.
sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
cannot create 'large-sector': one or more devices is currently unavailable
1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors. Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful. To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes. The higher
levels of ZFS want the value in bytes anyway so this is cleaner.
2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors. The existing code would scale
the byte offset by the logical sector size. Until now this was
always 512 so it never caused problems. Trying a 4K sector drive
clearly exposed the issue. The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.
With these changes I'm now able to create ZFS pools using 4K
sector drives. No issues were observed during fairly extensive
testing. This is also a low risk change if your using 512b
sectors devices because none of the logic changes.
Closes#256
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size. In practice this works out
to exactly 27200 bytes. Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented. So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.
However, one exception is quota handling which does require the
credential. Now that proper credentials are supported we can
safely start passing the callers credential. This is also an
initial step towards fully implemented the zfs secpolicy.
Normally when the arc_shrinker_func() function is called the return
value should be:
>=0 - To indicate the number of freeable objects in the cache, or
-1 - To indicate this cache should be skipped
However, when the shrinker callback is called with 'nr_to_scan' equal
to zero. The caller simply wants the number of freeable objects in
the cache and we must never return -1. This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.
This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected. As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.
Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min. This is done to prevent the Linux page cache from
completely crowding out the ARC. This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.
Closes#218Closes#243
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux. The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped. This differs from Solaris where zfs_close()
is called for each close.
Closes#237
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute. While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets. Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.
Closes#216
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.
->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()
It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches. To avoid
this we are forced to disable all reclaim for these threads. It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.
Closes#232
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels. But basically there are three cases we need to
consider.
Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.
Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
For once it looks like we've gotten lucky. The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely. Since the dentry is still passed in both cases
this is possible. The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.
Closes#230
The default buffer size when requesting history is 128k. This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc(). This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visable
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values. This was done
for the znode and for most of the values in the inode, but not all
of them. This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated. This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.
Closes#116Closes#146Closes#164
This change fixes a kernel panic which would occur when resizing
a dataset which was not open. The objset_t stored in the
zvol_state_t will be set to NULL when the block device is closed.
To avoid this issue we pass the correct objset_t as the third arg.
The code has also been updated to correctly notify the kernel
when the block device capacity changes. For 2.6.28 and newer
kernels the capacity change will be immediately detected. For
earlier kernels the capacity change will be detected when the
device is next opened. This is a known limitation of older
kernels.
Online ext3 resize test case passes on 2.6.28+ kernels:
$ dd if=/dev/zero of=/tmp/zvol bs=1M count=1 seek=1023
$ zpool create tank /tmp/zvol
$ zfs create -V 500M tank/zd0
$ mkfs.ext3 /dev/zd0
$ mkdir /mnt/zd0
$ mount /dev/zd0 /mnt/zd0
$ df -h /mnt/zd0
$ zfs set volsize=800M tank/zd0
$ resize2fs /dev/zd0
$ df -h /mnt/zd0
Original-patch-by: Fajar A. Nugraha <github@fajar.net>
Closes#68Closes#84
The vdev_metaslab_init() function has been observed to allocate
larger than 8k chunks. However, they are not much larger than 8k
and it does this infrequently so it is allowed and the warning is
supressed.
The dsl_scan_visit() function is a little heavy weight taking 464
bytes on the stack. This can be easily reduced for little cost by
moving zap_cursor_t and zap_attribute_t off the stack and on to the
heap. After this change dsl_scan_visit() has been reduced in size
by 320 bytes.
This change was made to reduce stack usage in the dsl_scan_sync()
callpath which is recursive and has been observed to overflow the
stack.
Issue #174
This function is called recursively so everything possible must be
done to limit its stack consumption. The dprintf_bp() debugging
function adds 30 bytes of local variables to the function we cannot
afford. By commenting out this debugging we save 30 bytes per
recursion and depths of 13 are not uncommon. This yeilds a total
stack saving of 390 bytes on our 8k stack.
Issue #174
The recursive call chain dsl_scan_visitbp() -> dsl_scan_recurse() ->
dsl_scan_visitdnode() -> dsl_scan_visitbp has been observed to consume
considerable stack resulting in a stack overflow (>8k). The cleanest
way I see to fix this with minimal impact to the existing flow of
code, and with the fewest performance concerns, is to always inline
dsl_scan_recurse() and dsl_scan_visitdnode(). While this will increase
the function size of dsl_scan_visitbp(), by 4660 bytes, it also reduces
the stack requirements by removing the function call overhead.
Issue #174
It's possible for a zvol_write thread to enter direct memory reclaim
while holding open a transaction group. This results in the system
attempting to write out data to the disk to free memory. Unfortunately,
this can't succeed because the the thread doing reclaim is holding open
the txg which must be closed to be synced to disk. To prevent this
the offending allocation is marked KM_PUSHPAGE which will prevent it
from attempting writeback.
Closes#191
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev. This was caused an improperly formatted
user mode helper command.
This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.
Closes#169
This change ensures the ARC meta-data limits are enforced. Without
this enforcement meta-data can grow to consume all of the ARC cache
pushing out data and hurting performance. The cache is aggressively
reclaimed but this is a soft and not a hard limit. The cache may
exceed the set limit briefly before being brought under control.
By default 25% of the ARC capacity can be used for meta-data. This
limit can be tuned by setting the 'zfs_arc_meta_limit' module option.
Once this limit is exceeded meta-data reclaim will occur in 3 percent
chunks, or may be tuned using 'arc_reduce_dnlc_percent'.
Closes#193
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#180
As part of zfs_ioc_recv() a zfs_cmd_t is allocated in the kernel
which is 17808 bytes in size. This sort of thing in general should
be avoided. However, since this should be an infrequent event for
now we allow it and simply suppress the warning with the KM_NODEBUG
flag. This can be revisited latter if/when it becomes an issue.
Closes#178
If the attribute's new value was shorter than the old one the old
code would leave parts of the old value in the xattr znode.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#203
Without this we may mistakenly believe we have a dentry and try to
d_instantiate() it. This will result in the following BUG. It's
important to note that while the xattr directory has an inode
assoicated with it we never create a dentry for it.
kernel BUG at fs/dcache.c:1418!
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#202
When compiling ZFS in user space gcc-4.6.0 correctly identifies
the variable 'os' as being set but never used. This generates a
warning and a build failure when using --enable-debug. However,
the code is correct we only want to use 'os' for the kernel space
builds. To suppress the warning the call was wrapped with a
VERIFY() which has the nice side effect of ensuring the 'os'
actually never is NULL. This was observed under Fedora 15.
module/zfs/dsl_pool.c: In function ‘dsl_pool_create’:
module/zfs/dsl_pool.c:229:12: error: variable ‘os’ set but not used
[-Werror=unused-but-set-variable]
Update code to use the spl_invalidate_inodes() wrapper. This hides
some of the complexity of determining if invalidate_inodes() was
exported, and if so what is its prototype. The second argument
of spl_invalidate_inodes() determined the behavior of how dirty
inodes are handled. By passing a zero we are indicated that we
want those inodes to be treated as busy and skipped.
The .sync_fs fix as applied did not use the updated SPL credential
API. This broke builds on Debian Lenny, this change applies the
needed fix to use the portable API. The original credential changes
are part of commit 81e97e2187.
Disable the normal reclaim path for the txg_sync thread. This
ensures the thread will never enter dmu_tx_assign() which can
otherwise occur due to direct reclaim. If this is allowed to
happen the system can deadlock. Direct reclaim call path:
->shrink_icache_memory->prune_icache->dispose_list->
clear_inode->zpl_clear_inode->zfs_inactive->dmu_tx_assign
Under OpenSolaris all memory reclaim is done asyncronously. Under
Linux memory reclaim is done asynchronously _and_ synchronously.
When a process allocates memory with GFP_KERNEL it explicitly allows
the kernel to do reclaim on its behalf to satify the allocation.
If that GFP_KERNEL allocation fails the kernel may take more drastic
measures to reclaim the memory such as killing user space processes.
This was observed to happen with ZFS because the ARC could consume
a large fraction of the system memory but no synchronous reclaim
could be performed on it. The result was GFP_KERNEL allocations
could fail resulting in OOM events, and only moments latter the
arc_reclaim thread would free unused memory from the ARC.
This change leaves the arc_thread in place to manage the fundamental
ARC behavior. But it adds a synchronous (direct) reclaim path for
the ARC which can be called when memory is badly needed. It also
adds an asynchronous (indirect) reclaim path which is called
much more frequently to prune the ARC slab caches.
The following useful values were missing the arcstats. This change
adds them in to provide greater visibility in to the arcs behavior.
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_meta_used 4 624774592
arc_meta_limit 4 400785408
arc_meta_max 4 625594176
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked. To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t. There are
cases such as replay when a dentry is not available. In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
Making distclean in module
make[1]: Entering directory `/zfs/module'
make -C SUBDIRS=`pwd` clean
make: Entering an unknown directory
make: *** SUBDIRS=/zfs/module: No such file or directory. Stop.
When using --with-config=user the 'distclean' target would fail
because it assumes the kernel configuration infrastrure is set up.
This is not the case, nor does it need to be, because the
'--with-config=user' option will prune the entire ./module subtree
from SUBDIRS. This prevents most build rules from operating in the
./module directory.
However, the 'dist*' rules will still traverse this directory
because it is listed in DIST_SUBDIRS. This is correct because we
need to ensure the dist rules package the directory contents
regardless of the configuration for the 'dist' rule. The correct
way to handle this is to only invoke the kernel build system as
part of the 'clean' rule when CONFIG_KERNEL_TRUE is set.
Initial fix provided by Darik Horn <dajhorn@vanadac.com>.
This commit is a slightly refined form of the original.
Kernel threads which sleep uninterruptibly on Linux are marked in the (D)
state. These threads are usually in the process of performing IO and are
thus counted against the load average. The txg_quiesce and txg_sync threads
were always sleeping uninterruptibly and thus inflating the load average.
This change makes them sleep interruptibly. Some care is required however
because these threads may now be woken early by signals. In this case the
callers are all careful to check that the required conditions are met after
waking up. If we're woken early due to a signal they will simply go back
to sleep. In this case these changes are safe.
Closes#175
The .freeze_fs/.unfreeze_fs hooks were not added until Linux 2.6.29
Since these hooks are currently unused they are being removed to
allow support of older kernels.
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct. Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks. These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously. Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.
This commit addresses a deadlock caused by inode eviction. A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held. Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock. To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.
This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated. This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.
However, this increases the posibility of deadlocking the system
on a zfs write thread. If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim. This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.
To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread. In this way forward progress in ensured. The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock. If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
Register the missing .remount_fs handler. This handler isn't strictly
required because the VFS does a pretty good job updating most of the
MS_* flags. However, there's no harm in using the hook to call the
registered zpl callback for various MS_* flags. Additionaly, this
allows us to lay the ground work for more complicated argument parsing
in the future.
Register the missing .sync_fs handler. This is a noop in most cases
because the usual requirement is that sync just be initiated. As part
of the DMU's normal transaction processing txgs will be frequently
synced. However, when the 'wait' flag is set the requirement is that
.sync_fs must not return until the data is safe on disk. With the
addition of the .sync_fs handler this is now properly implemented.
ZFS should only change the i/o scheduler for a disk when it has
ownership of the whole disk. This is basically the same logic as
adjusting the write cache behavior on a disk. This change updates
the vdev disk code to skip partitions when setting the i/o scheduler.
Closes#152
Due to an uninitialized variable files opened with O_APPEND may
overwrite the start of the file rather than append to it. This
was introduced accidentally when I removed the Solaris vnodes.
The zfs_range_lock_writer() function used to key off zf->z_vnode
to determine if a znode_t was for a zvol of zpl object. With
the removal of vnodes this was replaced by the flag zp->z_is_zvol.
This flag was used to control the append behavior for range locks.
Unfortunately, this value was never properly initialized after
the vnode removal. However, because most of memory is usually
zeros it happened to be set correctly most of the time making
the bug appear racy. Properly initializing zp->z_is_zvol to
zero completely resolves the problem with O_APPEND.
Closes#126
Move 'bulk' and 'xattr_bulk' from the stack to the heap to minimize
stack space usage. These two arrays consumed 448 bytes on the stack
and have been replaced by two 8 byte points for a total stack space
saving of 432 bytes. The zfs_setattr() path had been previously
observed to overrun the stack in certain circumstances.
The original range lock implementation had to be modified by commit
8926ab7 because it was unsafe on Linux. In particular, calling
cv_destroy() immediately after cv_broadcast() is dangerous because
the waiters may still be asleep. Thus the following cv_destroy()
will free memory which may still be in use.
This was fixed by updating cv_destroy() to block on waiters but
this in turn introduced a deadlock. The deadlock was resolved
with the use of a taskq to move the offending free outside the
range lock. This worked well but using the taskq for the free
resulted in a serious performace hit. This is somewhat ironic
because at the time I felt using the taskq might improve things
by making the free asynchronous.
This patch refines the original fix and moves the free from the
taskq to a private free list. Then items which must be free'd
are simply inserted in to the list. When the range lock is dropped
it's safe to free the items. The list is walked and all rl_t
entries are freed.
This change improves small cached read performance by 26x. This
was expected because for small reads the number of locking calls
goes up significantly. More surprisingly this change significantly
improves large cache read performance. This probably attributable
to better cpu/memory locality. Very likely the same processor
which allocated the memory is now freeing it.
bs ext3 zfs zfs+fix faster
----------------------------------------------
512 435 3 79 26x
1k 820 7 160 22x
2k 1536 14 305 21x
4k 2764 28 572 20x
8k 3788 50 1024 20x
16k 4300 86 1843 21x
32k 4505 138 2560 18x
64k 5324 252 3891 15x
128k 5427 276 4710 17x
256k 5427 413 5017 12x
512k 5427 497 5324 10x
1m 5427 521 5632 10x
Closes#142
In the original implementation the zfs_open()/zfs_close() hooks
were dropped for simplicity. This was functional but not 100%
correct with the expected ZFS sematics. Updating and re-adding the
zfs_open()/zfs_close() hooks resolves the following issues.
1) The ZFS_APPENDONLY file attribute is once again honored. While
there are still no Linux tools to set/clear these attributes once
there are it should behave correctly.
2) Minimal virus scan file attribute hooks were added. Once again
this support in disabled but the infrastructure is back in place.
3) Most importantly correctly handle assigning files which were
opened syncronously to the intent log. Without this change O_SYNC
modifications could be lost during a system crash even though they
were marked synchronous.
Filesystems like ZFS must use what the kernel calls an anonymous super
block. Basically, this is just a filesystem which is not backed by a
single block device. Normally this block device's dev_t is stored in
the super block. For anonymous super blocks a unique reserved dev_t
is assigned as part of get_sb().
This sb->s_dev must then be set in the returned stat structures as
stat->st_dev. This allows userspace utilities to easily detect the
boundries of a specific filesystem. Tools such as 'du' depend on this
for proper accounting.
Additionally, under OpenSolaris the statfs->f_fsid is set to the device
id. To preserve consistency with OpenSolaris we also set the fsid to
the device id. Other Linux filesystem (ext) set the fsid to a unique
value determined by the filesystems uuid. This value is unique but
maintains no relationship to the device id. This may be desirable
when exporting NFS filesystem because it minimizes to chance of a
client observing the same fsid from two different servers.
Closes#140
The AT_ versions of these macros are used on Solaris and while they
map to their Linux equivilants the code has been updated to use the
ATTR_ versions.
Move 'tmpxvattr' from the stack to the heap to minimize stack
space usage. This is enough to get us below the 1024 byte stack
frame warning. That however is still a large stack frame and it
should be further reduced by moving the 'bulk' and 'xattr_bulk'
sa_bulk_attr_t variables to the heap in a future patch.
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go. One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible. They
would be replaced with their Linux equivalents. This would not only
be good for performance, but for the general readability and health of
the code. The Solaris and Linux VFS are different beasts and should
be treated as such. Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.
This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t. There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR. But it didn't look that hard to come
back soon and replace it all with a native Linux type.
However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.
Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things. This commit reverts many of
my previous commits which removed xvattr related code. It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.
The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working. However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types. For now that's
a price I'm willing to pay. Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.
Closes#111
Print the supported zpool and filesystem versions at module load
time. This change removes an ambiguity and adds information that
system administrators care about. The phrase "ZFS pool version %s"
is the same as zpool upgrade -v so that the operator is familiar
with the message.
ZFS: Loaded module v0.6.0, ZFS pool version 28, ZFS filesystem version 5
ZFS: Unloaded module v0.6.0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
There were two cases when attempting to set the vdev block device
scheduler which would causes console warnings.
The first case was when the vdev used a loop, ram, dm, or other
such device which doesn't support a configurable scheduler. In
these cases attempting to set a scheduler is pointless and can
be safely skipped.
The secord case is slightly more troubling. We were seeing
transient cases where setting the elevator would return -EFAULT.
On retry everything is fine so there appears to be a small window
where this is possible. To handle that case we silently retry
up to three times before reporting the warning.
In all of the above cases the warning is harmless and at worse you
may see slightly different performance characteristics from one
or more of your vdevs.
This commit allows zvols with names longer than 32 characters, which
fixes issue on https://github.com/behlendorf/zfs/issues/#issue/102.
Changes include:
- use /dev/zd* device names for zvol, where * is the device minor
(include/sys/fs/zfs.h, module/zfs/zvol.c).
- add BLKZNAME ioctl to get dataset name from userland
(include/sys/fs/zfs.h, module/zfs/zvol.c, cmd/zvol_id).
- add udev rule to create /dev/zvol/[dataset_name] and the legacy
/dev/[dataset_name] symlink. For partitions on zvol, it will create
/dev/zvol/[dataset_name]-part* (etc/udev/rules.d/60-zvol.rules,
cmd/zvol_id).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove custom code to pack/unpack dev_t's. Under Linux all dev_t's
are an unsigned 32-bit value even on 64-bit platforms. The lower
20 bits are used for the minor number and the upper 12 for the major
number.
This means if your importing a pool from Solaris you may get strange
major/minor numbers. But it doesn't really matter because even if
we add compatibility code to translate the encoded Solaris major/minor
they won't do you any good under Linux. You will still need to
recreate the dev_t with a major/minor which maps to reserved major
numbers used under Linux.
Dropping this code also resolves 32-bit builds by removing the
offending 32-bit compatibility code.
ASSERT3P should be used instead of ASSERT3U when comparing
pointers. Using ASSERT3U with the cast causes a compiler
warning for 32-bit builds which is fatal with --enable-debug.
The underlying storage pool actually uses multiple block
size. Under Solaris frsize (fragment size) is reported as
the smallest block size we support, and bsize (block size)
as the filesystem's maximum block size. Unfortunately,
under Linux the fragment size and block size are often used
interchangeably. Thus we are forced to report both of them
as the filesystem's maximum block size.
Closes#112
Because the secpolicy_* macros are all currently defined to (0).
And because the caller of this function does not check the return
code. The compiler complains that this statement has no effect
which is correct and OK. To suppress the warning explictly cast
the result to (void).
Generally it's a good idea to use enums for switch statements,
but in this case it causes warning because the enum is really a
set of flags. These flags are OR'ed together in some cases
resulting in values which are not part of the original enum.
This causes compiler warning such as this about invalid cases.
error: case value ‘33’ not in enumerated type ‘zprop_source_t’
To handle this we simply case the enum to an int for the switch
statement. This leaves all other enum type checking in place
and effectively disabled these warnings.
For legacy reasons the zvol.c and vdev_disk.c Linux compatibility
code ended up in sys/blkdev.h and sys/vdev_disk.h headers. While
there are worse places for this code to live it should be in a
linux/blkdev_compat.h header. This change moves this block device
Linux compatibility code in to the linux/blkdev_compat.h header
and updates all the correct #include locations. This is not a
functional change or bug fix, it is just code cleanup.
When changing the uid/gid of a file via zfs_setattr() use the
Posix id passed in iattr->ia_uid/gid. While the zfs_fuid_create()
code already had the fuid support disabled for Linux it was
returning the uid/gid from the credential. With this change
the 'chown' command which relies on setxattr is now working
properly.
Also remove a little stray white space which was in front of
zfs_update_inode() call and the end of zfs_setattr().
Under Linux sys_symlink(2) should result in a inode being created
with one reference for the inode itself, and a second reference on
the inode which is held by the new dentry. Under Solaris this
appears not to be the case. Their zfs_symlink() handler drops
the inode reference before returning.
The result of this under Linux is that the reference count for
symlinks is always one smaller than it should have been. This
results in a BUG() when the symlink is unlinked. To handle this
the Linux port now keeps the inode reference which differs from
the Solaris behavior. This results in correct reference counts.
Closes#96
The zfs_readlink() function returns a Solaris positive error value
and that needs to be converted to a Linux negative error value.
While in this case nothing would actually go wrong, it's still
incorrect and should be fixed if for no other reason than clarity.
This patch addresses three issues related to symlinks.
1) Revert the zfs_follow_link() function to a modified version
of the original zfs_readlink(). The only changes from the
original OpenSolaris version relate to using Linux types.
For the moment this means no vnode's and no zfsvfs_t. The
caller zpl_follow_link() was also updated accordingly. This
change was reverted because it was slightly gratuitious.
2) Update zpl_follow_link() to use local variables for the
link buffer. I'd forgotten that iov.iov_base is updated by
uiomove() so after the call to zfs_readlink() it can not longer
be used. We need our own private copy of the link pointer.
3) Allocate MAXPATHLEN instead of MAXPATHLEN+1. By default
MAXPATHLEN is 4096 bytes which is a full page, adding one to
it pushes it slightly over a page. That means you'll likely
end up allocating 2 pages which is wasteful of memory and
possibly slightly slower.
This adds an API to wait for pending commit callbacks of already-synced
transactions to finish processing. This is needed by the DMU-OSD in
Lustre during device finalization when some callbacks may still not be
called, this leads to non-zero reference count errors. See lustre.org
bug 23931.
While the attr/xattr hooks were already in place for regular
files this hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported. This patch registers these additional callbacks
for both directory and special inode types.
Under Linux when creating a fifo or socket type device in the ZFS
filesystem it's critical that the rdev is stored in a SA. This
was already being correctly done for character and block devices,
but that logic needed to be extended to include FIFOs and sockets.
This patch takes care of device creation but a follow on patch
may still be required to verify that the dev_t is being correctly
packed/unpacked from the SA.
It was noticed that when you have zvols in multiple datasets
not all of the zvol devices are created at module load time.
Fajarnugraha did the leg work to identify that the root cause of
this bug is a non-zero return value from zvol_create_minors_cb().
Returning a non-zero value from the dmu_objset_find_spa() callback
function results in aborting processing the remaining children in
a dataset. Since we want to ensure that the callback in run on
all children regardless of error simply unconditionally return
zero from the zvol_create_minors_cb(). This callback function
is solely used for this purpose so surpressing the error is safe.
Closes#96
The new prefered inteface for evicting an inode from the inode cache
is the ->evict_inode() callback. It replaces both the ->delete_inode()
and ->clear_inode() callbacks which were previously used for this.
The xattr handler prototypes were sanitized with the idea being that
the same handlers could be used for multiple methods. The result of
this was the inode type was changes to a dentry, and both the get()
and set() hooks had a handler_flags argument added. The list()
callback was similiarly effected but no autoconf check was added
because we do not use the list() callback.
The fsync() callback in the file_operations structure used to take
3 arguments. The callback now only takes 2 arguments because the
dentry argument was determined to be unused by all consumers. To
handle this a compatibility prototype was added to ensure the right
prototype is used. Our implementation never used the dentry argument
either so it's just a matter of using the right prototype.
The const keyword was added to the 'struct xattr_handler' in the
generic Linux super_block structure. To handle this we define an
appropriate xattr_handler_t typedef which can be used. This was
the preferred solution because it keeps the code clean and readable.
Initial testing has shown the the right IO scheduler to use under Linux
is noop. This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization. While allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device. This yields the largest possible requests for the
device with the lowest total overhead.
While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option. You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory. In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
The following warning was observed under normal operation. It's
not fatal but it's something to be addressed long term. Flag the
offending allocation with KM_NODEBUG to suppress the warning and
flag the call site.
SPL: Showing stack for process 21761
Pid: 21761, comm: iozone Tainted: P ----------------
2.6.32-71.14.1.el6.x86_64 #1
Call Trace:
[<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl]
[<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl]
[<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs]
[<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs]
[<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs]
[<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs]
[<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs]
[<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs]
[<ffffffff8116d905>] vfs_read+0xb5/0x1a0
[<ffffffff8116da41>] sys_read+0x51/0x90
[<ffffffff81013172>] system_call_fastpath+0x16/0x1b
When performing a 'zfs rollback' it's critical to invalidate
the previous dcache and inode cache. If we don't there will
stale cache entries which when accessed will result in EIOs.
With the recent SPL change (d599e4fa) that forces cv_destroy()
to block until all waiters have been woken. It is now unsafe
to call cv_destroy() under the zp->z_range_lock() because it
is used as the condition variable mutex. If there are waiters
cv_destroy() will block until they wake up and aquire the mutex.
However, they will never aquire the mutex because cv_destroy()
will not return allowing it's caller to drop the lock. Deadlock.
To avoid this cv_destroy() is now run asynchronously in a taskq.
This solves two problems:
1) It is no longer run under the zp->z_range_lock so no deadlock.
2) Since cv_destroy() may now block we don't want this slowing
down zfs_range_unlock() and throttling the system.
This was not as much of an issue under OpenSolaris because their
cv_destroy() implementation does not do anything. They do however
risk a bad paging request if cv_destroy() returns, the memory holding
the condition variable is free'd, and then the waiters wake up and
try to reference it. It's a very small unlikely race, but it is
possible.
It's worth taking a moment to describe how mmap is implemented
for zfs because it differs considerably from other Linux filesystems.
However, this issue is handled the same way under OpenSolaris.
The issue is that by design zfs bypasses the Linux page cache and
leaves all caching up to the ARC. This has been shown to work
well for the common read(2)/write(2) case. However, mmap(2)
is problem because it relies on being tightly integrated with the
page cache. To handle this we cache mmap'ed files twice, once in
the ARC and a second time in the page cache. The code is careful
to keep both copies synchronized.
When a file with an mmap'ed region is written to using write(2)
both the data in the ARC and existing pages in the page cache
are updated. For a read(2) data will be read first from the page
cache then the ARC if needed. Neither a write(2) or read(2) will
will ever result in new pages being added to the page cache.
New pages are added to the page cache only via .readpage() which
is called when the vfs needs to read a page off disk to back the
virtual memory region. These pages may be modified without
notifying the ARC and will be written out periodically via
.writepage(). This will occur due to either a sync or the usual
page aging behavior. Note because a read(2) of a mmap'ed file
will always check the page cache first even when the ARC is out
of date correct data will still be returned.
While this implementation ensures correct behavior it does have
have some drawbacks. The most obvious of which is that it
increases the required memory footprint when access mmap'ed
files. It also adds additional complexity to the code keeping
both caches synchronized.
Longer term it may be possible to cleanly resolve this wart by
mapping page cache pages directly on to the ARC buffers. The
Linux address space operations are flexible enough to allow
selection of which pages back a particular index. The trick
would be working out the details of which subsystem is in
charge, the ARC, the page cache, or both. It may also prove
helpful to move the ARC buffers to a scatter-gather lists
rather than a vmalloc'ed region.
Additionally, zfs_write/read_common() were used in the readpage
and writepage hooks because it was fairly easy. However, it
would be better to update zfs_fillpage and zfs_putapage to be
Linux friendly and use them instead.
The Linux specific xattr operations have all been located in the
file zpl_xattr.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
The Linux specific super block operations have all been located in the
file zpl_super.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
The Linux specific inode operations have all been located in the
file zpl_inode.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
The Linux specific file operations have all been located in the
file zpl_file.c. These functions primarily rely on the reworked
zfs_* functions to do their job. They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
This first zpl_* commit also includes a common zpl.h header with
minimal entries to register the Linux specific hooks. In also
adds all the new zpl_* file to the Makefile.in. This is not a
standalone commit, you required the following zpl_* commits.
For the moment exactly how to handle xvattr is not clear. This
change largely consists of the code to comment out the offending
bits until something reasonable can be done.
A new flag is required for the zfs_rlock code to determine if
it is operation of the zvol of zpl dataset. This used to be
keyed off the zp->z_vnode, which was a hack to begin with, but
with the removal of vnodes we needed a dedicated flag.
I appologize in advance why to many things ended up in this commit.
When it could be seperated in to a whole series of commits teasing
that all apart now would take considerable time and I'm not sure
there's much merrit in it. As such I'll just summerize the intent
of the changes which are all (or partly) in this commit. Broadly
the intent is to remove as much Solaris specific code as possible
and replace it with native Linux equivilants. More specifically:
1) Replace all instances of zfsvfs_t with zfs_sb_t. While the
type is largely the same calling it private super block data
rather than a zfsvfs is more consistent with how Linux names
this. While non critical it makes the code easier to read when
your thinking in Linux friendly VFS terms.
2) Replace vnode_t with struct inode. The Linux VFS doesn't have
the notion of a vnode and there's absolutely no good reason to
create one. There are in fact several good reasons to remove it.
It just adds overhead on Linux if we were to manage one, it
conplicates the code, and it likely will lead to bugs so there's
a good change it will be out of date. The code has been updated
to remove all need for this type.
3) Replace all vtype_t's with umode types. Along with this shift
all uses of types to mode bits. The Solaris code would pass a
vtype which is redundant with the Linux mode. Just update all the
code to use the Linux mode macros and remove this redundancy.
4) Remove using of vn_* helpers and replace where needed with
inode helpers. The big example here is creating iput_aync to
replace vn_rele_async. Other vn helpers will be addressed as
needed but they should be be emulated. They are a Solaris VFS'ism
and should simply be replaced with Linux equivilants.
5) Update znode alloc/free code. Under Linux it's common to
embed the inode specific data with the inode itself. This removes
the need for an extra memory allocation. In zfs this information
is called a znode and it now embeds the inode with it. Allocators
have been updated accordingly.
6) Minimal integration with the vfs flags for setting up the
super block and handling mount options has been added this
code will need to be refined but functionally it's all there.
This will be the first and last of these to large to review commits.
For the moment we do not use dmu_write_pages() to write pages
directly in to a dmu object. It may be required at some point
in the future, but for now is simplest and cleanest to drop it.
It can be easily readded if/when needed.
For portability reasons it's handy to be able to create a root
znode and basic filesystem components without requiring the full
cooperation of the VFS. We are committing to this to simply the
filesystem creations code.
This code is used for snapshot and heavily leverages Solaris
functionality we do not want to reimplement. These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
Minor update to ensure zfs_sync() is disabled if a kernel oops/panic
is triggered. As the comment says 'data integrity is job one'. This
change could have been done by defining panicstr to oops_in_progress
in the SPL. But I felt it was better to use the native Linux API
here since to be clear.
This flag does not need to be support under Linux. As the comment
says it was only there to support fsflush() for old filesystem like
UFS. This is not needed under Linux.
Mount option parsing is still very Linux specific and will be
handled above this zfs filesystem layer. Honoring those mount
options once set if of course the responsibility of the lower
layers.
This variable was used to ensure that the ZFS module is never
removed while the filesystem is mounted. Once again the generic
Linux VFS handles this case for us so it can be removed.
The functions zfs_mount_label_policy(), zfs_mountroot(), zfs_mount()
will not be needed because most of what they do is already handled
by the generic Linux VFS layer. They all call zfs_domount() which
creates the actual dataset, the caller of this library call which
will be in the zpl layer is responsible for what's left.
Under Linux we don't need to reserve a major or minor number for
the filesystem. We can rely on the VFS to handle colisions without
this being handled by the lower ZFS layers.
Additionally, there is no need to keep a zfsfstype around. We are
not limited on Linux by the OpenSolaris infrastructure which needed
this. The upper zpl layer can specify the filesystem type.
The ZFS code is being restructured to act as a library and a stand
alone module. This allows us to leverage most of the existing code
with minimal modification. It also means we need to drop the Solaris
vfs/vnode functions they will be replaced by Linux equivilants and
updated to be Linux friendly.
For the moment we have left ZFS unchanged and it updates many values
as part of the znode. However, some of these values should be set
in the inode. For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.
This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode. At
which point zfs_update_inode() can be dropped entirely. Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure. This differs
significantly from Solaris.
Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS. This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode(). This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
Basic compilation of the bulk of zfs_znode.c has been enabled. After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces. The following commits
will systematically replace update the requiter interfaces. There
are of course pros and cons to this decision.
Pros:
* This simplifies intergration with Linux in the long term. There is
no longer any need to manage vnodes which are a foreign concept to
the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.
Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.
This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux. While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep. This code is expected to be short lived and
removed as soon as all the remaining uio based APIs and updated.
The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common(). Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static function to this file. This way I don't
need to reimplement this functionality from scratch in the SPL.
Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed. If
that turns out to be the case they can then be removed.
Remove unneeded bootfs functions. This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.
Certain NFS/SMB share functionality is not yet in place. These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled. I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL. So I'm just
renaming the wrapper here to HAVE_SHARE. They still won't be
compiled until all the share issues are worked through. Share
support is the last missing piece from zfs_ioctl.c.
The zfs_check_global_label() function is part of the HAVE_MLSLABEL
support which was previously commented out by a HAVE_ZPL check.
Since we're still deciding what to do about mls labels wrap it
with the preexisting macro to keep it compiled out.
Unlike Solaris the Linux implementation embeds the inode in the
znode, and has no use for a vnode. So while it's true that fragmention
of the znode cache may occur it should not be worse than any of the
other Linux FS inode caches. Until proven that this is a problem it's
just added complexity we don't need.