zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#370
Most of the code base already uses va_list, which is specified by
iso-c. gcc/glibc provides 'typedef __gnuc_va_list va_list'. and
when not using gcc/glibc we can't expect to find __gnuc_va_list.
Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2588
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks. When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#457
Linux kernel 3.17 removes the action function argument from
wait_on_bit(). Add autoconf test and compatibility macro to support
the new interface.
The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule(). This API change
was made to consolidate all of those redundant wait functions.
References: torvalds/linux@7431620
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#378
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage. Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.
To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure. All
access must be serialized through the vq_lock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2572
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4390https://github.com/illumos/illumos-gate/commit/7fd05ac
Porting notes:
Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code. This patch should reduce its
stack footprint a bit more.
The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.
The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global. Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2545
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Description from Matt Ahrens's bug report at Delphix:
Add a new zfs property, "redundant_metadata" which can have values
"all" or "most". The default will be "all", which is the current
behavior. Setting to "most" will cause us to only store 1 copy of
level-1 indirect blocks of user data files.
Additional notes:
The new man page section for this property states
"The exact behavior of which metadata blocks
are stored redundantly may change in future releases."
and:
"When set to most, ZFS stores an extra copy of most types of
metadata. This can improve performance of random writes,
because less metadata must be written."
The current implementation is as described above in Matt's blog.
It is controlled by a new global integer
"zfs_redundant_metadata_most_ditto_level", currently initialized
to 2. When "redundant_metadata" is set to "most", only indirect
blocks of the specified level and higher will have additional ditto
blocks created.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2542
The atomic_swap_32() function maps to atomic_xchg(), and
the atomic_swap_64() function maps to atomic64_xchg().
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#377
This loop in dmu_objset_write_ready():
for (i = 0; i < dnp->dn_nblkptr; i++)
bp->blk_fill += dnp->dn_blkptr[i].blk_fill;
invokes _undefined behavior_ for the (common) case of dn_nblkptr=3,
therefore, the compiler is free to do whatever it wants (such as
optimizing it away, or otherwise messing up your expections).
The fix is to be honest about the array size.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2511Closes#2010
Added highbit64() and howmany() which are used in recent upstream
code. Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#363
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot. If they do somehow get
dirtied an attempt will make made to write them to disk. In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2405
Spl's debugging and assertion macros macro used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.
The following macros have been converted to not use do/while(0):
PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL
PANIC has been converted to a wrapper around the new spl_PANIC() function.
The other macros have been converted to use the "&&" operator for the
branch-predicition conditional and also to use spl_PANIC().
The __ASSERT() macro was not touched. It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Some kernel definitions were buried inside the #if... #endif logic for
ACLs. When ACLs are not available these definitions get lost causing
the build to fail.
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2349
The correct behavior for all registered shrinkers is to return the
number of objects in their cache. In theory this allows the Linux
VM to balance memory reclaim across all registered caches.
In commit b9b3715 this behavior was disabled in favor of returning
-1 which notifies the VM that no additional objects are available
for reclaim. This was done as a workaround to resolve thrashing
in shrink_slabs() which could occur when memory was low and numerous
core where in reclaim. Unfortunately, this has been observed to
increase the likelihood of OOM events when SPL slab consumers are
responsible for consuming the majority of memory.
Therefore, this patch makes this behavior tunable. Setting the
spl_kmem_cache_reclaim module option to 0x1 will result in the
shrinker only being called once. This is the default behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes#358
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL. These include:
1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.
Therefore it makes sense to layer the SPLs slab allocator on top
of the Linux slab allocator. This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on. However, there are some things we need to be careful of:
1) The Linux slab allocator was never designed to work well with
large objects. Because the SPL slab must still handle this use
case a cut off limit was added to transition from Linux slab
backed objects to kmem or vmem backed slabs.
spl_kmem_cache_slab_limit - Objects less than or equal to this
size in bytes will be backed by the Linux slab. By default
this value is zero which disables the Linux slab functionality.
Reasonable values for this cut off limit are in the range of
4096-16386 bytes.
spl_kmem_cache_kmem_limit - Objects less than or equal to this
size in bytes will be backed by a kmem slab. Objects over this
size will be vmem backed instead. This value defaults to
1/8 a page, or 512 bytes on an x86_64 architecture.
2) Be aware that using the Linux slab may inadvertently introduce
new deadlocks. Care has been taken previously to ensure that
all allocations which occur in the write path use GFP_NOIO.
However, there may be internal allocations performed in the
Linux slab which do not honor these flags. If this is the case
a deadlock may occur.
The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us. And ideally
need to move completely away from using the SPLs slab for large
memory allocations. This patch is a first step.
NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
backed caches but it is not supported for kmem/vmem backed caches.
2) Regardless of the spl_kmem_cache_*_limit settings a cache may
be explicitly set to a given type by passed the KMC_KMEM,
KMC_VMEM, or KMC_SLAB flags during cache creation.
3) The constructors, destructors, and reclaim callbacks are all
functional and will be called regardless of the cache type.
4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
the issues involved in presenting correct object accounting.
Instead they will appear in /proc/slabinfo under the same names.
5) Several kmem SPLAT tests needed to be fixed because they relied
incorrectly on internal kmem slab accounting. With the updated
test cases all the SPLAT tests pass as expected.
6) An autoconf test was added to ensure that the __GFP_COMP flag
was correctly added to the default flags used when allocating
a slab. This is required to ensure all pages in higher order
slabs are properly refcounted, see ae16ed9.
7) When using the SLUB allocator there is no need to attempt to
set the __GFP_COMP flag. This has been the default behavior
for the SLUB since Linux 2.6.25.
8) When using the SLUB it may be desirable to set the slub_nomerge
kernel parameter to prevent caches from being merged.
Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#356
Restructure the zfsdev_state_list to allow for lock-free reading by
converting to a simple singly-linked list from which items are never
deleted and over which only forward iterations are performed. It depends
on, among other things, the atomicity of accessing the zs_minor integer
and zs_next pointer.
This fixes a lock inversion in which the zfsdev_state_lock is used by
both the sync task (txg_sync) and indirectly by any user program which
uses /dev/zfs; the zfsdev_release method uses the same lock and then
blocks on the sync task.
The most typical failure scenerio occurs when the sync task is cleaning
up a user hold while various concurrent "zfs" commands are in progress.
Neither Illumos nor Solaris are affected by this issue because they use
DDI interface which provides lock-free reading of device state via the
ddi_get_soft_state() function.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2301
Originally, vdev_file used system_taskq. This would cause a deadlock,
especially on system with few CPUs. The reason is that the prefetcher
threads, which are on system_taskq, will sometimes be blocked waiting
for I/O to finish. If the prefetcher threads consume all the tasks in
system_taskq, the I/O cannot be served and thus results in a deadlock.
We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2270
When fetching property values of snapshots, a check against the head
dataset type must be performed. Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".
This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes. It also did not allow for the display of
"volsize" of a snapshot of a volume.
This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly. This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2265
This implements a subset of the LWP rwlock interface by wrapping the
equivalent POSIX thread interface. It is a superset of the features
needed by ztest.
The missing bits are {,_}rw_read_held() and {,_}rw_write_held().
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1970
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().
Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method. That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.
References:
https://bugs.gentoo.org/show_bug.cgi?id=483516
Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1691
Using the ARM reference simulation (fast model foundation v8) I
cross compiled spl and zfs, to confirm it works on ARMv8 (64 bit
arm architecture, called aarch64 in Linux).
As it is based on previous ARM porting, the resulting patch is
disappointingly small, there was very little to do. The code fixes
the compile issues and has light testing done.
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#351
Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion
from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2142
When comparing times gotten from ddi_get_lbolt, we have to take account of
wrap around of jiffies. Therefore, we cannot use 't1 < t2'. Instead we should
use 't1 - t2 < 0'.
This patch add ddi_time_after and friends to address this issue. They have
strict type restriction, clock_t for vanilla and int64_t for 64 version, to
prevent type conversion from screwing things.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#335
This macro makes the compile to spit "mixed definition and code"
warning, I can't find a way to avoid it.
This patch lays some groundwork for the persistent l2arc feature.
See https://www.illumos.org/issues/3525.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#303
There is plenty of compatibility code for a hw_hostid
that isn't used by anything. At the same time, there are apparently
issues with the current hostid logic. coredumb in #zfsonlinux on
freenode reported that Fedora 17 changes its hostid on every boot, which
required force importing his pool. A suggestion by wca was to adopt
FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does
not exist
Adopting FreeBSD's behavior permits us to eliminate plenty of code,
including a userland helper that invokes the system's hostid as a
fallback.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#224
rq_for_each_segment changed from taking bio_vec * to taking bio_vec.
We provide rq_for_each_segment4 which takes both.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter.
This patch creates BIO_BI_*(bio) macros to hide the differences.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod}
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2124
The function was defined as a static inline with variable arguments
which causes gcc to generate errors on some distros.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#346
Provide spl_kthread_create() as a wrapper to the kernel's kthread_create()
to provide pre-3.13 semantics. Re-try if the call is interrupted or if it
would have returned -ENOMEM. Otherwise return NULL.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#339
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.
To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev. This covers both io and checksum zevents but
may be used but other scripts.
In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:
vdev_spare_paths - List of vdev paths for all hot spares.
vdev_spare_guids - List of vdev guids for all hot spares.
vdev_read_errors - Read errors for the problematic vdev
vdev_write_errors - Write errors for the problematic vdev
vdev_cksum_errors - Checksum errors for the problematic vdev.
By default the required hot spare scripts are installed but this
functionality is disabled. To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.
These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system. It also requires that the
autoexpand policy be ported from Illumos, see:
https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c
Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
zpool_events_next() can be called in blocking mode by specifying a
non-zero value for the "block" parameter. However, the design of
the ZFS Event Daemon (zed) requires additional functionality from
zpool_events_next(). Instead of adding additional arguments to the
function, it makes more sense to use flags that can be bitwise-or'd
together.
This commit replaces the zpool_events_next() int "block" parameter with
an unsigned bitwise "flags" parameter. It also defines ZEVENT_NONE
to specify the default behavior. Since non-blocking mode can be
specified with the existing ZEVENT_NONBLOCK flag, the default behavior
becomes blocking mode. This, in effect, inverts the previous use
of the "block" parameter. Existing callers of zpool_events_next()
have been modified to check for the ZEVENT_NONBLOCK flag.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID. When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Tagging each zevent with a unique monotonically increasing EID
(Event IDentifier) provides the required infrastructure for a user
space daemon to reliably process zevents. By writing the EID to
persistent storage the daemon can safely resume where it left off
in the event stream when it's restarted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2189
The nreserved column in the txgs kstat file always contains 0
following the write throttle restructuring of commit
e8b96c6007.
Prior to that commit, the nreserved column showed the number of bytes
temporarily reserved in the pool by a transaction group at sync time.
The new write throttle did away with temporary reservations and uses
the amount of dirty data instead. To approximate the old output of
the txgs kstat, the number of dirty bytes per-txg was passed in as
the nreserved value to spa_txg_history_set_io(). This approach did
not work as intended, because the per-txg dirty value is decremented
as data is written out to disk, so it is zero by the time we call
spa_txg_history_set_io(). To fix this, save the number of dirty
bytes before calling spa_sync(), and pass this value in to
spa_txg_history_set_io().
Also, since the name "nreserved" is now a misnomer, the column
heading is now labeled "ndirty".
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1696
A few counters in the dmu_tx kstats are obsolete or no longer
bumped properly.
- The sync task restructuring commit
13fe019870 removed the code
that bumpted dmu_tx_quota. The counter is now bumped in two
cases, instead of just the one case as before (after the result
of dsl_dataset_check_quota call). The second case is where
we check the requested reservation against the actual pool size,
as this is an implicit quota of sorts.
- The write throttle restructuring commit
e8b96c6007 makes dmu_tx_how and
dmu_tx_inflight obsolete, so they are removed.
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1914
Previously, the "data_size" field in the arcstats kstat contained the
amount of cached "metadata" and "data" in the ARC. The problem is this
then made it difficult to extract out just the "metadata" size, or just
the "data" size.
To make it easier to distinguish the two values, "data_size" has been
modified to count only buffers of type ARC_BUFC_DATA, and "meta_size"
was added to count only buffers of type ARC_BUFC_METADATA. If one wants
the old "data_size" value, simply sum the new "data_size" and
"meta_size" values.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
To maintain a strict limit on the metadata contained in the arc, while
preventing the arc buffer headers from completely consuming the
"arc_meta_used" space, we need to evict metadata buffers from the arc's
ghost lists along with the regular lists.
This change modifies arc_adjust_meta such that it more closely models
the adjustments made in arc_adjust. "arc_meta_used" is used similarly to
"arc_size", and "arc_meta_limit" is used similarly to "arc_c".
Testing metadata intensive workloads (e.g. creating, copying, and
removing millions of small files and/or directories) has shown this
change to make a dramatic improvement to the hit rate maintained in the
arc. While I think there is still room for improvement, this is a big
step in the right direction.
In addition, zpl_free_cached_objects was made into a no-op as I'm not
yet sure how to properly implement that function.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2110
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit. Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.
The additional field has been removed by the previous commit. However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied. This will return the pools to a state in which they
may be imported by other implementations.
The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted. A message similar to one of the following means your
pool must be scrubbed to restore compatibility.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #1 detected.
action: The pool can be imported using its name or numeric identifier,
however there is a compatibility issue which should be corrected
by running 'zpool scrub'
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
$ zpool status
pool: zol-0.6.2-173
state: ONLINE
scan: pool compatibility issue detected.
see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...
If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported. Further advice on how to proceed will be
provided by the error message as follows.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #2 detected.
action: The pool can not be imported with this version of ZFS due to an
active asynchronous destroy. Revert to an earlier version and
allow the destroy to complete before updating.
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:
zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w
Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.
The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys. Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.
This patch does have one limitation that should be mentioned. It will not
detect errata #2 on a pool unless errata #1 is also present. It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.
End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL. The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root. This will display a non-zero
value for any pool with an active asynchronous destroy.
Lastly, it is expected that no user data has been lost as a result of
this erratum.
Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool. These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate. The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation. Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.
To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.
To add a new errata the following changes must be made:
* A new errata identifier must be assigned by adding a new enum value
to the zpool_errata_t type. New enums must be added to the end to
preserve the existing ordering.
* Code must be added to detect the issue. This does not strictly
need to be done at pool import time but doing so will make the
errata visible in 'zpool import' as well as 'zpool status'. Once
detected the spa->spa_errata member should be set to the new enum.
* If possible code should be added to clear the spa->spa_errata member
once the errata has been resolved.
* The show_import() and status_callback() functions must be updated
to include an informational message describing the errata. This
should include an action message describing what an administrator
should do to address the errata.
* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
updated to describe the errata. This space can be used to provide
as much additional information as needed to fully describe the errata.
A link to this documentation will be automatically generated in the
output of 'zpool import' and 'zpool status'.
Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094