For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL. These include:
1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.
Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator. This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on. However, there are some things we need to be careful of:
1) The Linux slab allocator was never designed to work well with
large objects. Because the SPL slab must still handle this use
case a cut off limit was added to transition from Linux slab
backed objects to kmem or vmem backed slabs.
spl_kmem_cache_slab_limit - Objects less than or equal to this
size in bytes will be backed by the Linux slab. By default
this value is zero which disables the Linux slab functionality.
Reasonable values for this cut off limit are in the range of
4096-16386 bytes.
spl_kmem_cache_kmem_limit - Objects less than or equal to this
size in bytes will be backed by a kmem slab. Objects over this
size will be vmem backed instead. This value defaults to
1/8 a page, or 512 bytes on an x86_64 architecture.
2) Be aware that using the Linux slab may inadvertently introduce
new deadlocks. Care has been taken previously to ensure that
all allocations which occur in the write path use GFP_NOIO.
However, there may be internal allocations performed in the
Linux slab which do not honor these flags. If this is the case
a deadlock may occur.
The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us. And ideally
need to move completely away from using the SPLs slab for large
memory allocations. This patch is a first step.
NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
backed caches but it is not supported for kmem/vmem backed caches.
2) Regardless of the spl_kmem_cache_*_limit settings a cache may
be explicitly set to a given type by passed the KMC_KMEM,
KMC_VMEM, or KMC_SLAB flags during cache creation.
3) The constructors, destructors, and reclaim callbacks are all
functional and will be called regardless of the cache type.
4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
the issues involved in presenting correct object accounting.
Instead they will appear in /proc/slabinfo under the same names.
5) Several kmem SPLAT tests needed to be fixed because they relied
incorrectly on internal kmem slab accounting. With the updated
test cases all the SPLAT tests pass as expected.
6) An autoconf test was added to ensure that the __GFP_COMP flag
was correctly added to the default flags used when allocating
a slab. This is required to ensure all pages in higher order
slabs are properly refcounted, see ae16ed9.
7) When using the SLUB allocator there is no need to attempt to
set the __GFP_COMP flag. This has been the default behavior
for the SLUB since Linux 2.6.25.
8) When using the SLUB it may be desirable to set the slub_nomerge
kernel parameter to prevent caches from being merged.
Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#356
For consistency with disk vdevs honor the zfs_nocacheflush tunable.
This setting is available primarily for debugging and performance
analysis.
Signed-off-by: HC <mmttdebbcc@yahoo.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2336
In the case where a variable-sized SA overlaps the spill block pointer and
a new variable-sized SA is being added, the header size was improperly
calculated to include the to-be-moved SA. This problem could be
reproduced when xattr=sa enabled as follows:
ln -s $(perl -e 'print "x" x 120') blah
setfattr -n security.selinux -v blahblah -h blah
The symlink is large enough to interfere with the spill block pointer and
has a typical SA registration as follows (shown in modified "zdb -dddd"
<SA attr layout obj> format):
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ]
Adding the SA xattr will attempt to extend the registration to:
[ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ]
but since the ZPL_SYMLINK SA interferes with the spill block pointer, it
must also be moved to the spill block which will have a registration of:
[ ZPL_SYMLINK ZPL_DXATTR ]
This commit updates extra_hdrsize when this condition occurs, allowing
hdrsize to be subsequently decreased appropriately.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2214
Issue #2228
Issue #2316
Issue #2343
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.
This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.
The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.
Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2301
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.
We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2270
The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail. However, the NULL dereference at
clearly shows that under certain circumstances it is possible. Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.
BUG: unable to handle kernel NULL pointer dereference at 00000570
crash> struct -o vdev
struct vdev {
[0] uint64_t vdev_id;
... ...
[1376] uint64_t vdev_deflate_ratio;
Given that this can happen this patch add the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com>
Closes#1707Closes#1987Closes#1891
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When fetching property values of snapshots, a check against the head
dataset type must be performed. Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".
This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes. It also did not allow for the display of
"volsize" of a snapshot of a volume.
This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly. This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2265
A minor style issue was accidentally introduced by aa7d06a.
This change resolves that style problem.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Today the metaslab_debug logic performs two tasks:
- load all metaslabs on import/open
- don't unload metaslabs at the end of spa_sync
This change provides knobs for each of these independently.
References:
https://illumos.org/issues/4101https://github.com/illumos/illumos-gate/commit/0713e23
Notes:
1) This is a small piece of the metaslab improvement patch from
Illumos. It was worth bringing over before the rest, since it's
low risk and it can be useful on fragmented pools (e.g. Lustre
MDTs). metaslab_debug_unload would give the performance benefit
of the old metaslab_debug option without causing unwanted delay
during pool import.
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2227
When the system attributes (SAs) for an object exceed what can
can be stored in the bonus area of a dnode a spill block is
allocated. These spill blocks are currently considered data
blocks. However, they should be accounted for as meta data
because they are effectively an extension of the dnode.
While this may seem like a minor accounting issue it has broader
implications. The key thing to be aware of is that each spill
block will hold a reference on its parent dnode. The dnode in
turn holds a reference on its dbuf in the dnode object. This
means that a single 512 byte data buffer for a spill block can
pin over 16k of meta data. This is analogous to the small file
situation described in 2b13331 where a relatively small number
of data buffer can cause the ARC to exceed the meta limit.
However, unlike the small file case a spill block can legitimately
be considered meta data. By changing the spill block to meta data
they will now be dropped from the cache when the meta limit is
reached. This then allows the dnodes and dbufs which the spill
block was pinning to be released.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#2294
As described in the comment above dmu_tx_assign() this function must
only fail if the pool is out of space. If for some other reason the
TX cannot be assigned (such as memory pressure) ERESTART must be
returned. Alternately, EAGAIN could be returned to inject a delay
but that isn't required because the caller will block on the condition
variable waiting for the next TXG.
/*
* Assign tx to a transaction group. txg_how can be one of:
*
* (1) TXG_WAIT. If the current open txg is full, waits until there's
* a new one. This should be used when you're not holding locks.
* It will only fail if we're truly out of space (or over quota).
* ...
*/
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2287
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().
Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method. That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.
References:
https://bugs.gentoo.org/show_bug.cgi?id=483516
Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1691
Some network related block device uses tcp_sendpage, which doesn't
behave well when using 0-count page. Add assertion to catch them.
This has a runtime dependency on:
zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2277
The problem is described in commit aeeb4e0c0a.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.
The implementation based on assumption that task queues initialization
performed in serial manner.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#336
When using __get_free_pages to get high order memory, only the first page's
_count will set to 1, other's will be 0. When an internal page get passed into
rbd, it will eventully go into tcp_sendpage. There, it will be called with
get_page and put_page, and get freed erroneously when _count jump back to 0.
The solution to this problem is to use compound page. All pages in a
high order compound page share a single _count. So get_page and put_page in
tcp_sendpage will not cause _count jump to 0.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#251
Endianness detection in LZ4 is broken in user-space builds. This
bug corrupts compressed data and manifests itself in several ztest
failures. When LZ4 was originally ported to Illumos ZFS, the proper
checks for Linux were stripped out. The Linux port then inherited
the remaining detection code that works on Illumos but not on Linux.
The current LZ4 endianness check misuses the condition
defined(__BIG_ENDIAN) to indicate a big-endian system. On Linux
__BIG_ENDIAN is defined uncondtionally in the user-space header
/usr/include/endian.h, regardless of the endianness of the system.
The kernel does not use this header, so only user-space builds are
affected.
While we could fix this by restoring the upstream LZ4 endianness
detection code, reliable checks already exist in
libspl/include/sys/isa_defs.h. This change uses the libspl results
to replace the word-size and endianness checks in LZ4, simplifying
the code and reducing duplication.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#1963Fixes#1964Fixes#1965
Due to an asymmetry in the kmem accounting a memory leak was being
reported when it was only an accounting issue. All memory allocated
with kmem_alloc() must be released with kmem_free() or it will not
be properly accounted for.
In this case the code used strfree() to release the memory allocated
by kmem_alloc(). Presumably this was done because the size of the
memory region wasn't available when the memory needed to be freed.
To resolve this issue the code has been updated to use strdup() instead
of kmem_alloc() to allocate the memory. Like strfree(), strdup() is
not integrated with the memory accounting. This means we can use
strfree() to release it like Illumos.
SPL: kmem leaked 10/4368729 bytes
address size data func:line
ffff880067e9aa40 10 ZZZZZZZZZZ zfsdev_ioctl:5655
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#2262
This behavior is more consistent with the way memory reclaim
is expected to work under Linux.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#349
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2142
By default maximal number of objects in slab can't exceed (16*2 - 1) and slab
size can't exceed 32M.
Today's high end servers having couple hundreds of RAM available for ARC may
run into a trouble with virtual memory because of the restriction mentioned
above.
Problem:
Reasons for very high number of virtual memory allocations:
* Real slab size very small relative to the size of the entire RAM
* Slabs allocated on virtual memory and fill entire ARC
The result is very high number of allocated virtual memory ranges (hundreds of
ranges). When virtual memory subsystem manages high number of ranges its
performance become so poor that it freezes from time to time.
Solution:
Number of objects per slab should be increased taking into account maximal
slab size which can also be increased if needed.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#337
When comparing times gotten from ddi_get_lbolt, we have to take account of
wrap around of jiffies. Therefore, we cannot use 't1 < t2'. Instead we should
use 't1 - t2 < 0'.
This patch add ddi_time_after and friends to address this issue. They have
strict type restriction, clock_t for vanilla and int64_t for 64 version, to
prevent type conversion from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#335
There is plenty of compatibility code for a hw_hostid
that isn't used by anything. At the same time, there are apparently
issues with the current hostid logic. coredumb in #zfsonlinux on
freenode reported that Fedora 17 changes its hostid on every boot, which
required force importing his pool. A suggestion by wca was to adopt
FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does
not exist
Adopting FreeBSD's behavior permits us to eliminate plenty of code,
including a userland helper that invokes the system's hostid as a
fallback.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#224
The kernel's kthread_create() function is defined as "..." and there is
no va_list variant at the moment. The task name is pre-formatted into
a local buffer and passed to kthread_create() with fixed arguments.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#347
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
After the dmu_req_copy change, bi_io_vecs are not touched, so this is no
longer needed.
This reverts commit e26ade5101.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it
can continue in subsequent passes. However, after the immutable biovec changes
in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how
many bytes are already copied and it will skip to the right spot accordingly.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
The function was defined as a static inline with variable arguments
which causes gcc to generate errors on some distros.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#346
Provide spl_kthread_create() as a wrapper to the kernel's kthread_create()
to provide pre-3.13 semantics. Re-try if the call is interrupted or if it
would have returned -ENOMEM. Otherwise return NULL.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#339
Due to certain assumptions made in the the cred:groupmember test it
could result in false positives when run on specific distributions.
This was solely a bug in the test case and not in the groupmember()
function which the test case was validating.
To prevent future false positives the test case has been rewritten
to be both more rigerous and to make fewer assumptions about the
system.
Minor style cleanup was done to cr_groups_search() and groupmember()
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By setting __GFP_NORETRY the kernel memory reclaim logic was allowed to
abort early and dump a falled allocation stack to the console. Since
this was done in a tight loop to fill memory it could result in a large
number of stacks being dumped to the console. This in turn slowed down
the test sufficiently so it exceeded the time limit and failed.
To resolve this issue the __GFP_NORETRY flag is being removed. This is
how it should have been called originally to ensure we're simulating
the behavior of most callers which will use the GFP_KERNEL flag.
In addition, the reclaim granularity of 1000 objects was far to coarse
for this to be a realistic test. For kmem:slab_reclaim there might
only be a few thousand objects total in the cache. Therefore, the
SPLAT_KMEM_OBJ_RECLAIM constant for these tests was lowered. This
will cause the reclaim callback to run more frequently which makes
for a better test case.
The frequency of the cache reaping in kmem:slab_reap was increased
to accommodate the reduced number of objects released during the
reclaim.
These changes only impact the test cases and were done to remove
false positives caused by the test case itself.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted. If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources. Unfortunately, this introduced another deadlock.
INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
Tainted: P O 3.13.6 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u24:6 D ffff88107fcd2e80 0 891 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
Call Trace:
[<ffffffff813dd1d4>] schedule+0x24/0x70
[<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[<ffffffff811608c5>] ilookup+0x65/0xd0
[<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[<ffffffff811071ec>] do_writepages+0x1c/0x40
[<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[<ffffffff8106072c>] process_one_work+0x16c/0x490
[<ffffffff810613f3>] worker_thread+0x113/0x390
[<ffffffff81066edf>] kthread+0xdf/0x100
This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks. Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used. This gives us the information that ilookup() provided without
the risk of a deadlock.
Alternately, this race could be closed by registering an sops->drop_inode()
callback. The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.
To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev. This covers both io and checksum zevents but
may be used but other scripts.
In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:
vdev_spare_paths - List of vdev paths for all hot spares.
vdev_spare_guids - List of vdev guids for all hot spares.
vdev_read_errors - Read errors for the problematic vdev
vdev_write_errors - Write errors for the problematic vdev
vdev_cksum_errors - Checksum errors for the problematic vdev.
By default the required hot spare scripts are installed but this
functionality is disabled. To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.
These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system. It also requires that the
autoexpand policy be ported from Illumos, see:
https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c
Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location. When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor. Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor. When the file descriptor is
closed the kernel state tracking the cursor is destroyed.
This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID. When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents. By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2189
Remove the redundant call to zfs_unmount_snap() which was being
done after char array was freed,
This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@d09f25dc66, which
ported the Illumos 3744 changes.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2156
The spa_label_features nvlist is used in the sync context during pool
version upgrade.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2168
These are needed by consumers (i.e. Lustre) who wish to perform a
callback in the syncing context. Both a blocking and non-blocking
version are available to callers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call
dmu_tx_wait() themselves before retrying if the assignment fails.
The wait times for such callers are not accounted for in the
dmu_tx_assign kstat histogram, because the histogram only records
time spent in dmu_tx_assign(). This change moves the histogram
update to dmu_tx_wait() to properly account for all time spent there.
One downside of this approach is that it is possible to call
dmu_tx_wait() multiple times before successfully assigning a
transaction, in which case the cumulative wait time would not be
recorded. However, this case should not often arise in practice,
because most callers currently use one of these forms:
dmu_tx_assign(tx, TXG_WAIT);
dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
The first form should make just one call to dmu_tx_delay() inside of
dmu_tx_assign(). The second form retries with TXG_WAITED if the first
assignment fails and incurs a delay, in which case no further waiting
is performed. Therefore transaction delays normally occur in one
call to dmu_tx_wait() so the histogram should be fairly accurate.
Another possible downside of this approach is that the histogram will
no longer record overhead outside of dmu_tx_wait() such as in
dmu_tx_try_assign(). While I'm not aware of any reason for concern on
this point, it is conceivable that lock contention, long list
traversal, etc. could cause assignment delays that would not be
reflected in the histogram. Therefore the histogram should strictly
be used for visibility in to the normal delay mechanisms and not as a
profiling tool for code performance.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1915
The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007.
Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead. To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io(). This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io(). To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().
Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696
A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.
- The sync task restructuring commit
13fe019870 removed the code
that bumpted dmu_tx_quota. The counter is now bumped in two
cases, instead of just the one case as before (after the result
of dsl_dataset_check_quota call). The second case is where
we check the requested reservation against the actual pool size,
as this is an implicit quota of sorts.
- The write throttle restructuring commit
e8b96c6007 makes dmu_tx_how and
dmu_tx_inflight obsolete, so they are removed.
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1914